path: root/drivers/block
Commit message (Author, Age, Files, Lines)
* nbd: use the default discard granularity (Christoph Hellwig, 2023-12-29, 1 file, -5/+1)
    The discard granularity now defaults to a single sector, so don't set
    that value explicitly. Also don't bother clearing it, as a discard
    granularity without discard_sectors doesn't mean anything.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231228075545.362768-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* loop: don't abuse BLK_DEF_MAX_SECTORS (Christoph Hellwig, 2023-12-27, 1 file, -1/+2)
    BLK_DEF_MAX_SECTORS, despite the confusing name, is the default cap for
    the max_sectors limit. Don't use it to initialize max_hw_sectors, which
    is a hardware / driver capability.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231227092305.279567-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* aoe: don't abuse BLK_DEF_MAX_SECTORS (Christoph Hellwig, 2023-12-27, 1 file, -1/+2)
    BLK_DEF_MAX_SECTORS, despite the confusing name, is the default cap for
    the max_sectors limit. Don't use it to initialize max_hw_sectors, which
    is a hardware / driver capability.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231227092305.279567-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS (Christoph Hellwig, 2023-12-27, 1 file, -10/+2)
    null_blk has some rather odd capping of the max_hw_sectors value to
    BLK_DEF_MAX_SECTORS, which doesn't make sense: max_hw_sectors is the
    hardware limit, and BLK_DEF_MAX_SECTORS, despite the confusing name, is
    the default cap for the max_sectors field used for normal file system
    I/O.

    Remove all the capping, and simply leave it to the block layer or the
    user to decide how much of that limit to use for file system I/O.

    Fixes: ea17fd354ca8 ("null_blk: Allow controlling max_hw_sectors limit")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231227092305.279567-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* loop: don't update discard limits from loop_set_status (Christoph Hellwig, 2023-12-27, 1 file, -2/+0)
    loop_set_status doesn't change anything relevant to the discard and
    write_zeroes setting, so don't bother calling loop_config_discard.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231227082020.249427-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* drbd: actlog: fix kernel-doc warnings and spelling (Randy Dunlap, 2023-12-22, 1 file, -6/+10)
    Fix all kernel-doc warnings in drbd_actlog.c:

    drbd_actlog.c:963: warning: No description found for return value of 'drbd_rs_begin_io'
    drbd_actlog.c:1015: warning: Function parameter or member 'peer_device' not described in 'drbd_try_rs_begin_io'
    drbd_actlog.c:1015: warning: Excess function parameter 'device' description in 'drbd_try_rs_begin_io'
    drbd_actlog.c:1015: warning: No description found for return value of 'drbd_try_rs_begin_io'
    drbd_actlog.c:1197: warning: No description found for return value of 'drbd_rs_del_all'

    Fix one spelling error (s/ore/or/).

    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Cc: Philipp Reisner <philipp.reisner@linbit.com>
    Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
    Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
    Cc: <drbd-dev@lists.linbit.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: <linux-block@vger.kernel.org>
    Link: https://lore.kernel.org/r/20231222061909.8791-1-rdunlap@infradead.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: simplify disk_set_zoned (Christoph Hellwig, 2023-12-19, 3 files, -3/+3)
    Only use disk_set_zoned to actually enable zoned device support.
    For clearing it, call disk_clear_zoned, which is renamed from
    disk_clear_zone_settings and now directly clears the zoned flag as
    well.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove support for the host aware zone model (Christoph Hellwig, 2023-12-19, 3 files, -3/+3)
    When zones were first added to the SCSI and ATA specs, two different
    models were supported (in addition to the drive managed one that is
    invisible to the host):

     - host managed, where for non-conventional zones there is a strict
       requirement to write at the write pointer, or else an error is
       returned
     - host aware, where a write pointer is maintained if writes always
       happen at it; otherwise it is left in an under-defined state and
       the sequential write preferred zones behave like conventional
       zones (probably very badly performing ones, though)

    Not surprisingly this lukewarm model didn't prove to be very useful and
    was finally removed from the ZBC and SBC specs (NVMe never implemented
    it). Due to the easily disappearing write pointer, host software
    could never rely on the write pointer to actually be useful for, say,
    recovery.

    Fortunately only a few HDD prototypes shipped using this model, and
    they never made it to mass production. Drop the support before it is
    too late. Note that any such host aware prototype HDD can still be
    used with Linux, as we'll now treat it as a conventional HDD.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* virtio_blk: remove the broken zone revalidation support (Christoph Hellwig, 2023-12-19, 1 file, -26/+0)
    virtblk_revalidate_zones is called unconditionally from
    virtblk_config_changed_work from the virtio config_changed callback.

    virtblk_revalidate_zones is a bit odd in that it re-clears the zoned
    state for host aware or non-zoned devices, which isn't needed unless
    the zoned mode changed. A zone mode change to a host managed model,
    however, isn't handled at all, and virtio_blk also doesn't handle any
    config change other than a capacity change (and even if it did, the
    upper layers above virtio_blk wouldn't handle it very well).

    But even the useful case of a size change that would add or remove
    zones isn't handled properly, as blk_revalidate_disk_zones expects the
    device capacity to cover all zones, but the capacity is only updated
    after virtblk_revalidate_zones.

    As this code appears to be entirely untested and is getting in the way,
    remove it for now; it can be re-added in a fixed version with proper
    test coverage if needed.

    Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
    Fixes: f1ba4e674feb ("virtio-blk: fix to match virtio spec")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Link: https://lore.kernel.org/r/20231217165359.604246-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* virtio_blk: cleanup zoned device probing (Christoph Hellwig, 2023-12-19, 1 file, -28/+22)
    Move reading and checking the zoned model from
    virtblk_probe_zoned_device into the caller, leaving only the code to
    perform the actual setup for host managed zoned devices in
    virtblk_probe_zoned_device.

    This allows sharing the model reading and checking between builds
    with and without CONFIG_BLK_DEV_ZONED, and improves it for the
    !CONFIG_BLK_DEV_ZONED case.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Link: https://lore.kernel.org/r/20231217165359.604246-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block/rnbd-srv: Check for unlikely string overflow (Kees Cook, 2023-12-13, 1 file, -9/+10)
    Since "dev_search_path" can technically be as large as PATH_MAX,
    there was a risk of truncation when copying it and a second string
    into "full_path" since it was also PATH_MAX sized. The W=1 builds were
    reporting this warning:

    drivers/block/rnbd/rnbd-srv.c: In function 'process_msg_open.isra':
    drivers/block/rnbd/rnbd-srv.c:616:51: warning: '%s' directive output may be truncated writing up to 254 bytes into a region of size between 0 and 4095 [-Wformat-truncation=]
      616 |                 snprintf(full_path, PATH_MAX, "%s/%s",
          |                                                   ^~
    In function 'rnbd_srv_get_full_path',
        inlined from 'process_msg_open.isra' at drivers/block/rnbd/rnbd-srv.c:721:14:
    drivers/block/rnbd/rnbd-srv.c:616:17: note: 'snprintf' output between 2 and 4351 bytes into a destination of size 4096
      616 |                 snprintf(full_path, PATH_MAX, "%s/%s",
          |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      617 |                          dev_search_path, dev_name);
          |                          ~~~~~~~~~~~~~~~~~~~~~~~~~~

    To fix this, unconditionally check for truncation (as was already done
    for the case where "%SESSNAME%" was present).

    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202312100355.lHoJPgKy-lkp@intel.com/
    Cc: Md. Haris Iqbal <haris.iqbal@ionos.com>
    Cc: Jack Wang <jinpu.wang@ionos.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: <linux-block@vger.kernel.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
    Acked-by: Jack Wang <jinpu.wang@ionos.com>
    Link: https://lore.kernel.org/r/20231212214738.work.169-kees@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
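The truncation check described in the rnbd-srv commit above can be sketched in userspace C. This is only an illustration of the general pattern (compare the snprintf return value against the buffer size), not the actual rnbd-srv change; the helper name join_path_checked and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not the rnbd-srv code): join dir and name into dst
 * and report truncation instead of silently cutting the path short.
 * Returns 0 on success, -1 if the result did not fit or formatting failed.
 */
static int join_path_checked(char *dst, size_t dst_size,
                             const char *dir, const char *name)
{
    int len;

    assert(dst != NULL && dst_size > 0);

    /* snprintf returns the length it wanted to write, excluding the NUL. */
    len = snprintf(dst, dst_size, "%s/%s", dir, name);
    if (len < 0 || (size_t)len >= dst_size)
        return -1;

    return 0;
}
```

A caller would treat -1 as an error and refuse to use the (possibly cut-off) path, rather than opening whatever prefix happened to fit.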
* block/rnbd: use %pe to print errors (Supriti Singh, 2023-11-27, 2 files, -13/+13)
    When printing errors, replace %ld with %pe. %pe prints the symbolic
    error name, whereas %ld prints the numeric error code.

    Signed-off-by: Supriti Singh <supriti.singh@ionos.com>
    Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
    Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
    Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
    Link: https://lore.kernel.org/r/20231124213422.113449-3-haris.iqbal@ionos.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block/rnbd: add support for REQ_OP_WRITE_ZEROES (Santosh Pradhan, 2023-11-27, 3 files, -8/+18)
    Remove REQ_OP_WRITE_SAME in favour of REQ_OP_WRITE_ZEROES.

    Signed-off-by: Santosh Pradhan <santosh.pradhan@ionos.com>
    Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
    Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
    Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
    Link: https://lore.kernel.org/r/20231124213422.113449-2-haris.iqbal@ionos.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: pass nbd_sock to nbd_read_reply() instead of index (Li Nan, 2023-11-21, 1 file, -13/+22)
    If a socket is processing the 'NBD_SET_SOCK' ioctl, config->socks might
    be krealloc'ed in nbd_add_socket(); if a garbage request is received at
    that moment, a UAF may occur.

      T1                          T2
      nbd_ioctl
       __nbd_ioctl
        nbd_add_socket
         blk_mq_freeze_queue
                                  recv_work
                                   nbd_read_reply
                                    sock_xmit
         krealloc config->socks
                                     def config->socks

    Pass nbd_sock to nbd_read_reply(). And introduce a new function
    sock_xmit_recv(), which differs from sock_xmit only in the way it gets
    the socket.

    ==================================================================
    BUG: KASAN: use-after-free in sock_xmit+0x525/0x550
    Read of size 8 at addr ffff8880188ec428 by task kworker/u12:1/18779

    Workqueue: knbd4-recv recv_work
    Call Trace:
     __dump_stack
     dump_stack+0xbe/0xfd
     print_address_description.constprop.0+0x19/0x170
     __kasan_report.cold+0x6c/0x84
     kasan_report+0x3a/0x50
     sock_xmit+0x525/0x550
     nbd_read_reply+0xfe/0x2c0
     recv_work+0x1c2/0x750
     process_one_work+0x6b6/0xf10
     worker_thread+0xdd/0xd80
     kthread+0x30a/0x410
     ret_from_fork+0x22/0x30

    Allocated by task 18784:
     kasan_save_stack+0x1b/0x40
     kasan_set_track
     set_alloc_info
     __kasan_kmalloc
     __kasan_kmalloc.constprop.0+0xf0/0x130
     slab_post_alloc_hook
     slab_alloc_node
     slab_alloc
     __kmalloc_track_caller+0x157/0x550
     __do_krealloc
     krealloc+0x37/0xb0
     nbd_add_socket+0x2d3/0x880
     __nbd_ioctl
     nbd_ioctl+0x584/0x8e0
     __blkdev_driver_ioctl
     blkdev_ioctl+0x2a0/0x6e0
     block_ioctl+0xee/0x130
     vfs_ioctl
     __do_sys_ioctl
     __se_sys_ioctl+0x138/0x190
     do_syscall_64+0x33/0x40
     entry_SYSCALL_64_after_hwframe+0x61/0xc6

    Freed by task 18784:
     kasan_save_stack+0x1b/0x40
     kasan_set_track+0x1c/0x30
     kasan_set_free_info+0x20/0x40
     __kasan_slab_free.part.0+0x13f/0x1b0
     slab_free_hook
     slab_free_freelist_hook
     slab_free
     kfree+0xcb/0x6c0
     krealloc+0x56/0xb0
     nbd_add_socket+0x2d3/0x880
     __nbd_ioctl
     nbd_ioctl+0x584/0x8e0
     __blkdev_driver_ioctl
     blkdev_ioctl+0x2a0/0x6e0
     block_ioctl+0xee/0x130
     vfs_ioctl
     __do_sys_ioctl
     __se_sys_ioctl+0x138/0x190
     do_syscall_64+0x33/0x40
     entry_SYSCALL_64_after_hwframe+0x61/0xc6

    Signed-off-by: Li Nan <linan122@huawei.com>
    Reviewed-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20230911023308.3467802-1-linan666@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block/null_blk: Fix double blk_mq_start_request() warning (Chengming Zhou, 2023-11-20, 1 file, -12/+13)
    When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, null_queue_rq()
    would return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE for the request,
    which has been marked as MQ_RQ_IN_FLIGHT by blk_mq_start_request().

    null_queue_rqs() then puts these requests in the rqlist and returns
    them to the block layer core, which tries to queue them individually
    again, so the warning in blk_mq_start_request() is triggered.

    Fix it by splitting null_queue_rq() into two parts: the first is the
    preparation of the request, the second is the handling of the request.
    blk_mq_start_request() is now called after the preparation part, which
    may fail and return to the block layer core.

    The throttling also belongs to the preparation part, so move it before
    blk_mq_start_request(). And change the return type of null_handle_cmd()
    to void, since it always returns BLK_STS_OK now.

    Reported-by: <syzbot+fcc47ba2476570cbbeb0@syzkaller.appspotmail.com>
    Closes: https://lore.kernel.org/all/0000000000000e6aac06098aee0c@google.com/
    Fixes: d78bfa1346ab ("block/null_blk: add queue_rqs() support")
    Suggested-by: Bart Van Assche <bvanassche@acm.org>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Link: https://lore.kernel.org/r/20231120032521.1012037-1-chengming.zhou@linux.dev
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix null-ptr-dereference while accessing 'nbd->config' (Li Nan, 2023-11-20, 1 file, -1/+17)
    Memory reordering may occur in nbd_genl_connect(), causing config_refs
    to be set to 1 while nbd->config is still empty. Opening nbd at this
    time will cause null-ptr-dereference.

       T1                      T2
       nbd_open
        nbd_get_config_unlocked
                               nbd_genl_connect
                                nbd_alloc_and_init_config
                                 //memory reordered
                                 refcount_set(&nbd->config_refs, 1)  // 2
        nbd->config
         ->null point
                                 nbd->config = config  // 1

    Fix it by adding smp barrier to guarantee the execution sequence.

    Signed-off-by: Li Nan <linan122@huawei.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Link: https://lore.kernel.org/r/20231116162316.1740402-4-linan666@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
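The ordering requirement in the nbd commit above (the pointer must be visible before the reference count becomes non-zero) can be illustrated with C11 atomics. This is a userspace analogue under stated assumptions, not the kernel fix: the kernel uses smp barriers and refcount helpers, whereas this sketch substitutes a release store paired with an acquire load. The struct and function names are hypothetical, and the reader does not take a reference the way the real code does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct config {
    int num_connections;
};

struct device {
    struct config *config;   /* payload; must be visible before refs != 0 */
    atomic_int config_refs;  /* 0 means "no config published yet" */
};

/* Writer: store the pointer first, then publish it with release ordering. */
static void publish_config(struct device *dev, struct config *cfg)
{
    assert(cfg != NULL);
    dev->config = cfg;
    atomic_store_explicit(&dev->config_refs, 1, memory_order_release);
}

/* Reader: acquire-load the count; only then is dev->config safe to read. */
static struct config *get_config(struct device *dev)
{
    if (atomic_load_explicit(&dev->config_refs, memory_order_acquire) == 0)
        return NULL;
    return dev->config;
}
```

The release/acquire pair guarantees that a reader observing a non-zero count also observes the pointer store that preceded it, which is the property the commit message says was missing.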
* nbd: factor out a helper to get nbd_config without holding 'config_lock' (Li Nan, 2023-11-20, 1 file, -8/+19)
    No functional changes; this just makes the code cleaner and prepares
    for fixing the null-ptr-dereference while accessing 'nbd->config'.

    Signed-off-by: Li Nan <linan122@huawei.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Link: https://lore.kernel.org/r/20231116162316.1740402-3-linan666@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fold nbd config initialization into nbd_alloc_config() (Li Nan, 2023-11-20, 1 file, -22/+19)
    No functional changes; this makes the code cleaner and prepares for
    fixing the null-ptr-dereference while accessing 'nbd->config'.

    Signed-off-by: Li Nan <linan122@huawei.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Link: https://lore.kernel.org/r/20231116162316.1740402-2-linan666@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix uaf in nbd_open (Li Lingfeng, 2023-11-07, 1 file, -2/+9)
    Commit 4af5f2e03013 ("nbd: use blk_mq_alloc_disk and
    blk_cleanup_disk") cleans up the disk with blk_cleanup_disk(), which no
    longer sets disk->private_data to NULL as before. A UAF may be
    triggered in nbd_open() if someone tries to open the nbd device right
    after nbd_put(), since nbd has already been freed in nbd_dev_remove().

    Fix this by implementing ->free_disk and freeing the private data in
    it.

    Fixes: 4af5f2e03013 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk")
    Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Link: https://lore.kernel.org/r/20231107103435.2074904-1-lilingfeng@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost (Linus Torvalds, 2023-11-05, 1 file, -1/+3)
|\
| | Pull virtio updates from Michael Tsirkin:
| |  "vhost,virtio,vdpa: features, fixes, cleanups.
| |
| |   vdpa/mlx5:
| |    - VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
| |    - new maintainer
| |
| |   vdpa:
| |    - support for vq descriptor mappings
| |    - decouple reset of iotlb mapping from device reset
| |
| |   and fixes, cleanups all over the place"
| |
| | * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits)
| |   vdpa_sim: implement .reset_map support
| |   vdpa/mlx5: implement .reset_map driver op
| |   vhost-vdpa: clean iotlb map during reset for older userspace
| |   vdpa: introduce .compat_reset operation callback
| |   vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
| |   vhost-vdpa: reset vendor specific mapping to initial state in .release
| |   vdpa: introduce .reset_map operation callback
| |   virtio_pci: add check for common cfg size
| |   virtio-blk: fix implicit overflow on virtio_max_dma_size
| |   virtio_pci: add build offset check for the new common cfg items
| |   virtio: add definition of VIRTIO_F_NOTIF_CONFIG_DATA feature bit
| |   vduse: make vduse_class constant
| |   vhost-scsi: Spelling s/preceeding/preceding/g
| |   virtio: kdoc for struct virtio_pci_modern_device
| |   vdpa: Update sysfs ABI documentation
| |   MAINTAINERS: Add myself as mlx5_vdpa driver
| |   virtio-balloon: correct the comment of virtballoon_migratepage()
| |   mlx5_vdpa: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
| |   vdpa/mlx5: Update cvq iotlb mapping on ASID change
| |   vdpa/mlx5: Make iotlb helper functions more generic
| |   ...
| * virtio-blk: fix implicit overflow on virtio_max_dma_size (zhenwei pi, 2023-11-01, 1 file, -1/+3)
      The following code has an implicit conversion from size_t to u32:

      (u32)max_size = (size_t)virtio_max_dma_size(vdev);

      This may lead to overflow, e.g. (size_t)4G -> (u32)0. If
      virtio_max_dma_size() returns a size larger than U32_MAX, use U32_MAX
      instead.

      Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
      Message-Id: <20230904061045.510460-1-pizhenwei@bytedance.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
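The size_t to u32 narrowing problem from the virtio-blk commit above can be shown with a small saturating conversion. This is a minimal userspace sketch of the idea (clamp before narrowing), not the driver change itself; clamp_size_to_u32 is a name invented for this example, and uint32_t and UINT32_MAX stand in for the kernel's u32 and U32_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: narrow a size_t to 32 bits by saturating.
 * A plain (uint32_t)size cast would wrap instead, so 4 GiB becomes 0.
 */
static uint32_t clamp_size_to_u32(size_t size)
{
    if (size > UINT32_MAX)
        return UINT32_MAX;
    return (uint32_t)size;
}
```

Saturating keeps the limit as large as the narrower type can express, whereas the wrapped value could end up as a tiny or zero limit.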
* | Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux (Linus Torvalds, 2023-11-01, 4 files, -132/+237)
|\ \
| | | Pull block updates from Jens Axboe:
| | |
| | |  - Improvements to the queue_rqs() support, and adding null_blk support
| | |    for that as well (Chengming)
| | |  - Series improving badblocks support (Coly)
| | |  - Key store support for sed-opal (Greg)
| | |  - IBM partition string handling improvements (Jan)
| | |  - Make number of ublk devices supported configurable (Mike)
| | |  - Cancelation improvements for ublk (Ming)
| | |  - MD pull requests via Song:
| | |      - Handle timeout in md-cluster, by Denis Plotnikov
| | |      - Cleanup pers->prepare_suspend, by Yu Kuai
| | |      - Rewrite mddev_suspend(), by Yu Kuai
| | |      - Simplify md_seq_ops, by Yu Kuai
| | |      - Reduce unnecessary locking array_state_store(), by Mariusz Tkaczyk
| | |      - Make rdev add/remove independent from daemon thread, by Yu Kuai
| | |      - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai
| | |  - NVMe pull request via Keith:
| | |      - nvme-auth updates (Mark)
| | |      - nvme-tcp tls (Hannes)
| | |      - nvme-fc annotations (Kees)
| | |  - Misc cleanups and improvements (Jiapeng, Joel)
| | |
| | | * tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits)
| | |   block: ublk_drv: Remove unused function
| | |   md: cleanup pers->prepare_suspend()
| | |   nvme-auth: allow mixing of secret and hash lengths
| | |   nvme-auth: use transformed key size to create resp
| | |   nvme-auth: alloc nvme_dhchap_key as single buffer
| | |   nvmet-tcp: use 'spin_lock_bh' for state_lock()
| | |   powerpc/pseries: PLPKS SED Opal keystore support
| | |   block: sed-opal: keystore access for SED Opal keys
| | |   block:sed-opal: SED Opal keystore
| | |   ublk: simplify aborting request
| | |   ublk: replace monitor with cancelable uring_cmd
| | |   ublk: quiesce request queue when aborting queue
| | |   ublk: rename mm_lock as lock
| | |   ublk: move ublk_cancel_dev() out of ub->mutex
| | |   ublk: make sure io cmd handled in submitter task context
| | |   ublk: don't get ublk device reference in ublk_abort_queue()
| | |   ublk: Make ublks_max configurable
| | |   ublk: Limit dev_id/ub_number values
| | |   md-cluster: check for timeout while a new disk adding
| | |   nvme: rework NVME_AUTH Kconfig selection
| | |   ...
| * | block: ublk_drv: Remove unused function (Jiapeng Chong, 2023-10-19, 1 file, -9/+0)
      The function is defined in ublk_drv.c but not called anywhere, so
      delete it.

      drivers/block/ublk_drv.c:1211:20: warning: unused function 'ublk_abort_io_cmds'.

      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6938
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Fixes: b4e1353f4651 ("ublk: simplify aborting request")
      Reviewed-by: Ming Lei <ming.lei@rehdat.com>
      Link: https://lore.kernel.org/r/20231019030444.53680-1-jiapeng.chong@linux.alibaba.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: simplify aborting request (Ming Lei, 2023-10-17, 1 file, -30/+11)
      ublk_abort_queue() now runs exclusively with ublk_queue_rq() and the
      ubq_daemon task, so simplify aborting requests:

      - set UBLK_IO_FLAG_ABORTED in ublk_abort_queue() just for aborting
        this request

      - abort the request in ublk_queue_rq() if ubq->canceling is set

      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231009093324.957829-8-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: replace monitor with cancelable uring_cmd (Ming Lei, 2023-10-17, 1 file, -89/+119)
      Monitor work introduces an extra context for handling aborts, which
      makes races easy to trigger and also adds delay when aborting.

      Now that cancelable uring_cmd is supported, use it instead:

      1) this cancel callback is either run from the uring cmd submission
         task context or called after the io_uring context has exited, so
         the callback runs exclusively with ublk_ch_uring_cmd() and
         __ublk_rq_task_work().

      2) the previous patch freezes the request queue when calling
         ublk_abort_queue(), which is now completely exclusive with
         ublk_queue_rq() and ublk_ch_uring_cmd()/__ublk_rq_task_work().

      3) in the timeout handler, if all IOs are in flight, then all uring
         commands are completed and uring command canceling can't help us
         make forward progress any more, so call ublk_abort_requests() in
         the timeout handler.

      This simplifies aborting the queue and helps when adding new
      features, such as relaxing the limit of using a single task to handle
      one queue.

      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231009093324.957829-7-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: quiesce request queue when aborting queue (Ming Lei, 2023-10-17, 1 file, -9/+50)
      So far, aborting the queue ends requests when the ubq daemon is
      exiting, and it can run concurrently with ublk_queue_rq(). This is
      fragile, and we depend on the tricky usage of UBLK_IO_FLAG_ABORTED to
      avoid such a race.

      Quiesce the queue when aborting it, so that the two code paths run
      completely exclusively. It then becomes easier to add new ublk
      features, such as relaxing the single same task limit for each queue.

      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231009093324.957829-6-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: rename mm_lock as lock (Ming Lei, 2023-10-17, 1 file, -4/+4)
      Rename the mm_lock field of ublk_device to lock, so that this lock
      can be reused for protecting access to ub->ub_disk, which will be used
      for simplifying ublk_abort_queue() by quiescing the queue in the next
      patch.

      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231009093324.957829-5-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: move ublk_cancel_dev() out of ub->mutex (Ming Lei, 2023-10-17, 1 file, -17/+23)
      ublk_cancel_dev() just calls ublk_cancel_queue() to cancel all pending
      io commands after the ublk request queue is idle. The only protection
      needed is for the read & write of ubq->nr_io_ready and for avoiding
      duplicated command cancel, so add one per-queue lock with a cancel
      flag to provide this protection, and move ublk_cancel_dev() out of
      ub->mutex.

      Then we no longer need to call io_uring_cmd_complete_in_task() to
      cancel pending commands. The same cancel logic will be re-used for
      cancelable uring commands.

      This patch basically reverts commit ac5902f84bb5 ("ublk: fix AB-BA
      lockdep warning").

      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231009093324.957829-4-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: make sure io cmd handled in submitter task context (Ming Lei, 2023-10-17, 1 file, -1/+21)
      In a well-behaved ublk server implementation, a ublk io command won't
      be linked into any link chain. Commands are also always handled in
      no-wait style, so basically the io cmd is always handled in submitter
      task context.

      However, the server may set IOSQE_ASYNC, or an io command may be
      linked to a chain by mistake; then we may still run in io-wq context,
      where ctx->uring_lock isn't held.

      So in case of IO_URING_F_UNLOCKED, schedule this command with
      io_uring_cmd_complete_in_task to force running it in the submitter
      task. Then ublk_ch_uring_cmd_local() is guaranteed to run with
      uring_lock held, and we no longer need to worry about synchronization
      among submission code paths.

      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231009093324.957829-3-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: don't get ublk device reference in ublk_abort_queue() (Ming Lei, 2023-10-17, 1 file, -4/+0)
      ublk_abort_queue() is called in ublk_daemon_monitor_work(), where the
      device is guaranteed to be live because the monitor work is canceled
      when removing the device, so there is no need to get the device
      reference.

      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231009093324.957829-2-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: Make ublks_max configurable (Mike Christie, 2023-10-17, 1 file, -1/+16)
      We are converting tcmu applications to ublk, but have systems with up
      to 1k devices. This patch allows us to configure the ublks_max from
      userspace with the ublks_max modparam.

      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231012150600.6198-3-michael.christie@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ublk: Limit dev_id/ub_number values (Mike Christie, 2023-10-17, 1 file, -1/+9)
      The dev_id/ub_number is used for the ublk dev's char device's minor
      number so it has to fit into MINORMASK. This patch adds checks to
      prevent userspace from passing a number that's too large and limits
      what can be allocated by the ublk_index_idr for the case where
      userspace has the kernel allocate the dev_id/ub_number.

      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231012150600.6198-2-michael.christie@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | aoe: replace strncpy with strscpy (Justin Stitt, 2023-10-03, 1 file, -2/+1)
      `strncpy` is deprecated for use on NUL-terminated destination strings
      [1].

      `aoe_iflist` is expected to be NUL-terminated, which is evident from
      its use with string APIs later on, like `strspn`:
      |   p = aoe_iflist + strspn(aoe_iflist, WHITESPACE);

      It also seems `aoe_iflist` does not need to be NUL-padded, which means
      `strscpy` [2] is a suitable replacement, since it guarantees
      NUL-termination of the destination buffer without unnecessarily
      NUL-padding.

      Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
      Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
      Link: https://github.com/KSPP/linux/issues/90
      Cc: linux-hardening@vger.kernel.org
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Xu Panda <xu.panda@zte.com.cn>
      Cc: Yang Yang <yang.yang29@zte.com>
      Signed-off-by: Justin Stitt <justinstitt@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20230919-strncpy-drivers-block-aoe-aoenet-c-v2-1-3d5d158410e9@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | null_blk: replace strncpy with strscpyJustin Stitt2023-10-031-1/+1
    `strncpy` is deprecated for use on NUL-terminated destination strings
    [1].

    We should favor a more robust and less ambiguous interface.

    We expect that both `nullb->disk_name` and `disk->disk_name` be
    NUL-terminated:
    |       snprintf(nullb->disk_name, sizeof(nullb->disk_name),
    |                "%s", config_item_name(&dev->group.cg_item));
    ...
    |       pr_info("disk %s created\n", nullb->disk_name);

    It seems like NUL-padding may be required due to __assign_disk_name()
    utilizing a memcpy as opposed to a `str*cpy` api.
    | static inline void __assign_disk_name(char *name, struct gendisk *disk)
    | {
    |       if (disk)
    |               memcpy(name, disk->disk_name, DISK_NAME_LEN);
    |       else
    |               memset(name, 0, DISK_NAME_LEN);
    | }

    Then we go and print it with `__print_disk_name` which wraps
    `nullb_trace_disk_name()`.
    | #define __print_disk_name(name) nullb_trace_disk_name(p, name)

    This function obviously expects a NUL-terminated string.
    | const char *nullb_trace_disk_name(struct trace_seq *p, char *name)
    | {
    |       const char *ret = trace_seq_buffer_ptr(p);
    |
    |       if (name && *name)
    |               trace_seq_printf(p, "disk=%s, ", name);
    |       trace_seq_putc(p, 0);
    |
    |       return ret;
    | }

    From the above, we need both 1) a NUL-terminated string and 2) a
    NUL-padded string. So, let's use strscpy_pad() as per Kees' suggestion
    from v1.

    Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
    Link: https://github.com/KSPP/linux/issues/90
    Cc: linux-hardening@vger.kernel.org
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Signed-off-by: Justin Stitt <justinstitt@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20230919-strncpy-drivers-block-null_blk-main-c-v3-1-10cf0a87a2c3@google.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
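Why padding matters when the name is later copied with a fixed-length `memcpy` can be sketched in userspace as follows. `sketch_strscpy_pad()` is a hypothetical, simplified stand-in (it omits the return value of the kernel's `strscpy_pad()`), and `SKETCH_NAME_LEN` is an assumed constant that merely plays the role of `DISK_NAME_LEN` here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NAME_LEN 32	/* assumed size, stands in for DISK_NAME_LEN */

/* Hypothetical stand-in: copy, NUL-terminate, then zero the whole tail. */
static void sketch_strscpy_pad(char *dst, const char *src, size_t size)
{
	size_t len = 0;

	assert(size > 0);
	while (len < size - 1 && src[len] != '\0') {
		dst[len] = src[len];
		len++;
	}
	memset(dst + len, 0, size - len);	/* terminator plus padding */
}

/* Counts non-zero bytes that follow the first NUL in a fixed-size buffer. */
static size_t stale_bytes_after_nul(const char *buf, size_t size)
{
	size_t i = 0, stale = 0;

	while (i < size && buf[i] != '\0')
		i++;
	for (; i < size; i++)
		if (buf[i] != '\0')
			stale++;
	return stale;
}
```

A consumer that copies all `SKETCH_NAME_LEN` bytes, as `__assign_disk_name()` does with `DISK_NAME_LEN`, then receives no leftover bytes from whatever the buffer held before, and the copy is still a valid C string for `%s`.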
| * | block/null_blk: add queue_rqs() supportChengming Zhou2023-09-221-0/+20
    Add batched mq_ops.queue_rqs() support in null_blk for testing. The
    implementation is quite easy since null_blk doesn't have commit_rqs().

    We simply handle each request one by one; if errors are encountered,
    we leave the failed requests in the passed-in list and return.

    There is about 3.6% improvement in IOPS of fio/t/io_uring on null_blk
    with hw_queue_depth=256 on my test VM, from 1.09M to 1.13M.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20230913151616.3164338-6-chengming.zhou@linux.dev
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
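The batching contract described in this commit (process every request, hand the failures back through the same list) can be modeled in a few lines of userspace C. All names below are hypothetical and the request type is a plain linked-list node, not the block layer's `struct request`; this is a sketch of the idea, not the null_blk code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request: a singly linked list node. */
struct sketch_rq {
	int id;
	int should_fail;	/* simulates a driver error for this request */
	int done;
	struct sketch_rq *next;
};

/* Stand-in for the per-request handler; returns 0 on success. */
static int sketch_queue_rq(struct sketch_rq *rq)
{
	if (rq->should_fail)
		return -1;
	rq->done = 1;
	return 0;
}

/*
 * Handle each request in *list one by one. Requests that fail are kept,
 * in their original order, and handed back to the caller through *list.
 */
static void sketch_queue_rqs(struct sketch_rq **list)
{
	struct sketch_rq *requeue = NULL;
	struct sketch_rq **tail = &requeue;
	struct sketch_rq *rq;

	assert(list != NULL);
	rq = *list;
	while (rq) {
		struct sketch_rq *next = rq->next;

		rq->next = NULL;
		if (sketch_queue_rq(rq) != 0) {
			*tail = rq;		/* append to the leftover list */
			tail = &rq->next;
		}
		rq = next;
	}
	*list = requeue;
}
```

After the call, an empty list means the whole batch was accepted; anything still on the list is left for the caller to handle.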
| * | blk-mq: update driver tags request table when start requestChengming Zhou2023-09-221-2/+0
    Now we update the driver tags request table in
    blk_mq_get_driver_tag(), so drivers that support queue_rqs() have to
    update that inflight table themselves.

    Move it to blk_mq_start_request(), which is a better place: it is
    where we set up the deadline for the request timeout check, and it is
    exactly where the request becomes inflight.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'hardening-v6.7-rc1' of ↵Linus Torvalds2023-10-301-1/+1
|\ \ \
    git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

    Pull hardening updates from Kees Cook:
     "One of the more voluminous set of changes is for adding the new
      __counted_by annotation[1] to gain run-time bounds checking of
      dynamically sized arrays with UBSan.

       - Add LKDTM test for stuck CPUs (Mark Rutland)
       - Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
       - Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
         Silva)
       - Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
         Shaikh)
       - Convert group_info.usage to refcount_t (Elena Reshetova)
       - Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
       - Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
         Bulwahn)
       - Fix randstruct GCC plugin performance mode to stay in groups (Kees
         Cook)
       - Fix strtomem() compile-time check for small sources (Kees Cook)"

    * tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
      hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
      reset: Annotate struct reset_control_array with __counted_by
      kexec: Annotate struct crash_mem with __counted_by
      virtio_console: Annotate struct port_buffer with __counted_by
      ima: Add __counted_by for struct modsig and use struct_size()
      MAINTAINERS: Include stackleak paths in hardening entry
      string: Adjust strtomem() logic to allow for smaller sources
      hardening: x86: drop reference to removed config AMD_IOMMU_V2
      randstruct: Fix gcc-plugin performance mode to stay in group
      mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
      drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
      irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
      KVM: Annotate struct kvm_irq_routing_table with __counted_by
      virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
      hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
      sparc: Annotate struct cpuinfo_tree with __counted_by
      isdn: kcapi: replace deprecated strncpy with strscpy_pad
      isdn: replace deprecated strncpy with strscpy
      NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
      nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
      ...
| * | | drbd: Annotate struct fifo_buffer with __counted_byKees Cook2023-10-021-1/+1
| |/ /
    Prepare for the coming implementation by GCC and Clang of the
    __counted_by attribute. Flexible array members annotated with
    __counted_by can have their accesses bounds-checked at run-time via
    CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE
    (for strcpy/memcpy-family functions).

    As found with Coccinelle[1], add __counted_by for struct fifo_buffer.

    [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

    Cc: Philipp Reisner <philipp.reisner@linbit.com>
    Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
    Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: drbd-dev@lists.linbit.com
    Cc: linux-block@vger.kernel.org
    Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
    Link: https://lore.kernel.org/r/20230915200316.never.707-kees@kernel.org
    Signed-off-by: Kees Cook <keescook@chromium.org>
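For readers unfamiliar with the annotation, here is a minimal userspace sketch of a flexible array member tied to its length field. `SKETCH_COUNTED_BY` and `struct sketch_fifo` are hypothetical names for this example; the struct is a simplified layout, not a copy of DRBD's `struct fifo_buffer`, and the macro falls back to a no-op on compilers without the `counted_by` attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Use the attribute when the compiler has it, otherwise expand to nothing. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define SKETCH_COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef SKETCH_COUNTED_BY
# define SKETCH_COUNTED_BY(member)
#endif

/* Hypothetical, simplified layout for illustration only. */
struct sketch_fifo {
	unsigned int size;	/* number of elements in values[] */
	int values[] SKETCH_COUNTED_BY(size);
};

static struct sketch_fifo *sketch_fifo_alloc(unsigned int size)
{
	struct sketch_fifo *fb;

	fb = calloc(1, sizeof(*fb) + (size_t)size * sizeof(int));
	if (!fb)
		return NULL;
	fb->size = size;	/* set the count before touching values[] */
	return fb;
}

static int sketch_fifo_sum(const struct sketch_fifo *fb)
{
	int sum = 0;

	assert(fb != NULL);
	for (unsigned int i = 0; i < fb->size; i++)
		sum += fb->values[i];
	return sum;
}
```

The annotation itself changes no behavior in this sketch; it only tells a bounds-checking build which field holds the element count, which is why the count has to be assigned before the array is first accessed.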
* | | block: move bdev_mark_dead out of disk_check_media_changeChristoph Hellwig2023-10-282-2/+6
    disk_check_media_change is mostly called from ->open where it makes
    little sense to mark the file system on the device as dead, as we
    are just opening it.  So instead of calling bdev_mark_dead from
    disk_check_media_change move it into the few callers that are not
    in an open instance.  This avoids calling into bdev_mark_dead and
    thus taking s_umount with open_mutex held.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231017184823.1383356-4-hch@lst.de
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | zram: Convert to use bdev_open_by_dev()Jan Kara2023-10-282-18/+15
    Convert zram to use bdev_open_by_dev() and pass the handle around.

    CC: Minchan Kim <minchan@kernel.org>
    CC: Sergey Senozhatsky <senozhatsky@chromium.org>
    Acked-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230927093442.25915-8-jack@suse.cz
    Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | xen/blkback: Convert to bdev_open_by_dev()Jan Kara2023-10-283-23/+25
    Convert xen/blkback to use bdev_open_by_dev() and pass the handle
    around.

    CC: xen-devel@lists.xenproject.org
    Acked-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230927093442.25915-7-jack@suse.cz
    Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | rnbd-srv: Convert to use bdev_open_by_path()Jan Kara2023-10-282-14/+15
    Convert rnbd-srv to use bdev_open_by_path() and pass the handle
    around.

    CC: Jack Wang <jinpu.wang@ionos.com>
    CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
    Acked-by: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
    Acked-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230927093442.25915-6-jack@suse.cz
    Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | pktcdvd: Convert to bdev_open_by_dev()Jan Kara2023-10-281-35/+41
    Convert pktcdvd to use bdev_open_by_dev().

    Acked-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230927093442.25915-5-jack@suse.cz
    Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | drdb: Convert to use bdev_open_by_path()Jan Kara2023-10-282-33/+34
| |/
|/|
    Convert drdb to use bdev_open_by_path().

    CC: drbd-dev@lists.linbit.com
    Acked-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230927093442.25915-4-jack@suse.cz
    Signed-off-by: Christian Brauner <brauner@kernel.org>
* | Merge tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linuxLinus Torvalds2023-10-061-1/+2
|\ \
    Pull block fixes from Jens Axboe:
     "Just two minor fixes, for nbd and md"

    * tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linux:
      nbd: don't call blk_mark_disk_dead nbd_clear_sock_ioctl
      md/raid5: release batch_last before waiting for another stripe_head
| * | nbd: don't call blk_mark_disk_dead nbd_clear_sock_ioctlChristoph Hellwig2023-10-031-1/+2
| |/
    blk_mark_disk_dead is the proper interface to shut down a block
    device, but it also makes the disk unusable forever.

    nbd_clear_sock_ioctl on the other hand wants to shut down the file
    system, but allow the block device to be used again when connected to
    another socket.  Switch nbd to use disk_force_media_change and
    nbd_bdev_reset to go back to the behavior of the old
    __invalidate_device call, with the added benefit of incrementing the
    device generation as there is no guarantee the old content comes back
    when the device is reconnected.

    Reported-by: Samuel Holland <samuel.holland@sifive.com>
    Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Fixes: 0c1c9a27ce90 ("nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Samuel Holland <samuel.holland@sifive.com>
    Link: https://lore.kernel.org/r/20231003153106.1331363-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | rbd: take header_rwsem in rbd_dev_refresh() only when updatingIlya Dryomov2023-09-261-11/+11
    rbd_dev_refresh() has been holding header_rwsem across header and
    parent info read-in unnecessarily for ages.  With commit 870611e4877e
    ("rbd: get snapshot context after exclusive lock is ensured to be
    held"), the potential for deadlocks became much more real owing to
    a) header_rwsem now nesting inside lock_rwsem and b) rw_semaphores
    not allowing new readers after a writer is registered.

    For example, assuming that I/O request 1, I/O request 2 and header
    read-in request all target the same OSD:

    1. I/O request 1 comes in and gets submitted
    2. watch error occurs
    3. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
       releases lock_rwsem
    4. after reestablishing the watch, rbd_reregister_watch() calls
       rbd_dev_refresh() which takes header_rwsem for write and submits
       a header read-in request
    5. I/O request 2 comes in: after taking lock_rwsem for read in
       __rbd_img_handle_request(), it blocks trying to take header_rwsem
       for read in rbd_img_object_requests()
    6. another watch error occurs
    7. rbd_watch_errcb() blocks trying to take lock_rwsem for write
    8. I/O request 1 completion is received by the messenger but can't be
       processed because lock_rwsem won't be granted anymore
    9. header read-in request completion can't be received, let alone
       processed, because the messenger is stranded

    Change rbd_dev_refresh() to take header_rwsem only for actually
    updating rbd_dev->header.  Header and parent info read-in don't need
    any locking.

    Cc: stable@vger.kernel.org # 0b035401c570: rbd: move rbd_dev_refresh() definition
    Cc: stable@vger.kernel.org # 510a7330c82a: rbd: decouple header read-in from updating rbd_dev->header
    Cc: stable@vger.kernel.org # c10311776f0a: rbd: decouple parent info read-in from updating rbd_dev
    Cc: stable@vger.kernel.org
    Fixes: 870611e4877e ("rbd: get snapshot context after exclusive lock is ensured to be held")
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
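The shape of the fix (do the blocking read-in with no lock held, then take the lock for write only around the update) can be sketched in userspace with a POSIX rwlock. Every name here is hypothetical: `header_lock` merely plays the role of `header_rwsem`, and `sketch_read_header()` stands in for the round trip to an OSD. It is an illustration of the pattern, not the rbd code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Hypothetical, much-reduced image header. */
struct sketch_header {
	unsigned long long image_size;
};

struct sketch_dev {
	pthread_rwlock_t header_lock;	/* plays the role of header_rwsem */
	struct sketch_header header;
};

/* Stand-in for the slow read-in, which may block for a long time. */
static int sketch_read_header(struct sketch_header *out)
{
	out->image_size = 4096;
	return 0;
}

static int sketch_dev_refresh(struct sketch_dev *dev)
{
	struct sketch_header fresh;
	int ret;

	/* Read into a private copy; no lock is held while this can block. */
	ret = sketch_read_header(&fresh);
	if (ret)
		return ret;

	/* Take the lock for write only around the actual update. */
	ret = pthread_rwlock_wrlock(&dev->header_lock);
	assert(ret == 0);
	dev->header = fresh;
	pthread_rwlock_unlock(&dev->header_lock);
	return 0;
}
```

Because the writer no longer waits on I/O while holding the lock, readers of the header are only ever delayed by a short memory update, which removes the dependency on the messenger that steps 4 to 9 above rely on.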
* | rbd: decouple parent info read-in from updating rbd_devIlya Dryomov2023-09-261-62/+80
    Unlike header read-in, parent info read-in is already decoupled in
    get_parent_info(), but it's buried in rbd_dev_v2_parent_info() along
    with the processing logic.

    Separate the initial read-in and update read-in logic into
    rbd_dev_setup_parent() and rbd_dev_update_parent() respectively and
    have rbd_dev_v2_parent_info() just populate struct parent_image_info
    (i.e. what get_parent_info() did).  Some existing QoI issues, like
    flatten of a standalone clone being disregarded on refresh, remain.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
* | rbd: decouple header read-in from updating rbd_dev->headerIlya Dryomov2023-09-261-92/+114
    Make rbd_dev_header_info() populate a passed struct rbd_image_header
    instead of rbd_dev->header and introduce rbd_dev_update_header() for
    updating mutable fields in rbd_dev->header upon refresh.  The initial
    read-in of both mutable and immutable fields in rbd_dev_image_probe()
    passes in rbd_dev->header so no update step is required there.

    rbd_init_layout() is now called directly from rbd_dev_image_probe()
    instead of individually in format 1 and format 2 implementations.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
* | rbd: move rbd_dev_refresh() definitionIlya Dryomov2023-09-261-35/+33
|/
    Move rbd_dev_refresh() definition further down to avoid having to
    move struct parent_image_info definition in the next commit.  This
    spares some forward declarations too.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>