summaryrefslogtreecommitdiffstats
path: root/drivers/block/nbd.c
Commit message (Collapse)AuthorAgeFilesLines
* nbd: use the atomic queue limits API in nbd_set_sizeChristoph Hellwig2024-03-011-4/+11
| | | | | | | | | Use queue_limits_start_update / queue_limits_commit_update to update all the limits in one go and with proper sanity checking. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240229143846.1047223-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: freeze the queue for queue limits updatesChristoph Hellwig2024-03-011-1/+13
| | | | | | | | | | | nbd currently updates the logical and physical block sizes as well as the discard_sectors on a live queue. Freeze the queue first to make sure there are not commands in flight that can see torn or inconsistent limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240229143846.1047223-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: don't clear discard_sectors in nbd_config_putChristoph Hellwig2024-03-011-1/+2
| | | | | | | | | | | | | nbd_config_put currently clears discard_sectors when unusing a device. This is pretty odd behavior and different from the sector size configuration which is simply left in places and then reconfigured when nbd_set_size is as part of configuring the device. Change nbd_set_size to clear discard_sectors if discard is not supported so that all the queue limits changes are handled in one place. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240229143846.1047223-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: pass queue_limits to blk_mq_alloc_diskChristoph Hellwig2024-02-191-6/+7
| | | | | | | | | Pass the few limits nbd imposes directly to blk_mq_alloc_disk instead of setting them one at a time. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240215070300.2200308-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: null check for nla_nest_startNavid Emamdoost2024-02-181-0/+6
| | | | | | | | | | | | nla_nest_start() may fail and return NULL. Insert a check and set errno based on other call sites within the same source code. Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Fixes: 47d902b90a32 ("nbd: add a status netlink command") Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240218042534.it.206-kees@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: pass a queue_limits argument to blk_mq_alloc_diskChristoph Hellwig2024-02-131-1/+1
| | | | | | | | | | | | | | | | | Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: always initialize struct msghdr completelyEric Dumazet2024-01-171-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot complains that msg->msg_get_inq value can be uninitialized [1] struct msghdr got many new fields recently, we should always make sure their values is zero by default. [1] BUG: KMSAN: uninit-value in tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571 tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571 inet_recvmsg+0x131/0x580 net/ipv4/af_inet.c:879 sock_recvmsg_nosec net/socket.c:1044 [inline] sock_recvmsg+0x12b/0x1e0 net/socket.c:1066 __sock_xmit+0x236/0x5c0 drivers/block/nbd.c:538 nbd_read_reply drivers/block/nbd.c:732 [inline] recv_work+0x262/0x3100 drivers/block/nbd.c:863 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x104e/0x1e70 kernel/workqueue.c:2700 worker_thread+0xf45/0x1490 kernel/workqueue.c:2781 kthread+0x3ed/0x540 kernel/kthread.c:388 ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 Local variable msg created at: __sock_xmit+0x4c/0x5c0 drivers/block/nbd.c:513 nbd_read_reply drivers/block/nbd.c:732 [inline] recv_work+0x262/0x3100 drivers/block/nbd.c:863 CPU: 1 PID: 7465 Comm: kworker/u5:1 Not tainted 6.7.0-rc7-syzkaller-00041-gf016f7547aee #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 Workqueue: nbd5-recv recv_work Fixes: f94fd25cb0aa ("tcp: pass back data left in socket after receive") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: stable@vger.kernel.org Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: nbd@other.debian.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240112132657.647112-1-edumazet@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: use the default discard granularityChristoph Hellwig2023-12-291-5/+1
| | | | | | | | | | The discard granularity now defaults to a single sector, so don't set that value explicitly. Also don't bother clearing it as a discard granularity without discard_sectors doesn't mean anything. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: pass nbd_sock to nbd_read_reply() instead of indexLi Nan2023-11-211-13/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a socket is processing ioctl 'NBD_SET_SOCK', config->socks might be krealloc in nbd_add_socket(), and a garbage request is received now, a UAF may occurs. T1 nbd_ioctl __nbd_ioctl nbd_add_socket blk_mq_freeze_queue T2 recv_work nbd_read_reply sock_xmit krealloc config->socks def config->socks Pass nbd_sock to nbd_read_reply(). And introduce a new function sock_xmit_recv(), which differs from sock_xmit only in the way it get socket. ================================================================== BUG: KASAN: use-after-free in sock_xmit+0x525/0x550 Read of size 8 at addr ffff8880188ec428 by task kworker/u12:1/18779 Workqueue: knbd4-recv recv_work Call Trace: __dump_stack dump_stack+0xbe/0xfd print_address_description.constprop.0+0x19/0x170 __kasan_report.cold+0x6c/0x84 kasan_report+0x3a/0x50 sock_xmit+0x525/0x550 nbd_read_reply+0xfe/0x2c0 recv_work+0x1c2/0x750 process_one_work+0x6b6/0xf10 worker_thread+0xdd/0xd80 kthread+0x30a/0x410 ret_from_fork+0x22/0x30 Allocated by task 18784: kasan_save_stack+0x1b/0x40 kasan_set_track set_alloc_info __kasan_kmalloc __kasan_kmalloc.constprop.0+0xf0/0x130 slab_post_alloc_hook slab_alloc_node slab_alloc __kmalloc_track_caller+0x157/0x550 __do_krealloc krealloc+0x37/0xb0 nbd_add_socket +0x2d3/0x880 __nbd_ioctl nbd_ioctl+0x584/0x8e0 __blkdev_driver_ioctl blkdev_ioctl+0x2a0/0x6e0 block_ioctl+0xee/0x130 vfs_ioctl __do_sys_ioctl __se_sys_ioctl+0x138/0x190 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Freed by task 18784: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x40 __kasan_slab_free.part.0+0x13f/0x1b0 slab_free_hook slab_free_freelist_hook slab_free kfree+0xcb/0x6c0 krealloc+0x56/0xb0 nbd_add_socket+0x2d3/0x880 __nbd_ioctl nbd_ioctl+0x584/0x8e0 __blkdev_driver_ioctl blkdev_ioctl+0x2a0/0x6e0 block_ioctl+0xee/0x130 vfs_ioctl __do_sys_ioctl __se_sys_ioctl+0x138/0x190 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230911023308.3467802-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix null-ptr-dereference while accessing 'nbd->config'Li Nan2023-11-201-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | Memory reordering may occur in nbd_genl_connect(), causing config_refs to be set to 1 while nbd->config is still empty. Opening nbd at this time will cause null-ptr-dereference. T1 T2 nbd_open nbd_get_config_unlocked nbd_genl_connect nbd_alloc_and_init_config //memory reordered refcount_set(&nbd->config_refs, 1) // 2 nbd->config ->null point nbd->config = config // 1 Fix it by adding smp barrier to guarantee the execution sequence. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20231116162316.1740402-4-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: factor out a helper to get nbd_config without holding 'config_lock'Li Nan2023-11-201-8/+19
| | | | | | | | | | There are no functional changes, just to make code cleaner and prepare to fix null-ptr-dereference while accessing 'nbd->config'. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20231116162316.1740402-3-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fold nbd config initialization into nbd_alloc_config()Li Nan2023-11-201-22/+19
| | | | | | | | | | There are no functional changes, make the code cleaner and prepare to fix null-ptr-dereference while accessing 'nbd->config'. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20231116162316.1740402-2-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix uaf in nbd_openLi Lingfeng2023-11-071-2/+9
| | | | | | | | | | | | | | | | Commit 4af5f2e03013 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk") cleans up disk by blk_cleanup_disk() and it won't set disk->private_data as NULL as before. UAF may be triggered in nbd_open() if someone tries to open nbd device right after nbd_put() since nbd has been free in nbd_dev_remove(). Fix this by implementing ->free_disk and free private data in it. Fixes: 4af5f2e03013 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20231107103435.2074904-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: don't call blk_mark_disk_dead nbd_clear_sock_ioctlChristoph Hellwig2023-10-031-1/+2
| | | | | | | | | | | | | | | | | | | | | blk_mark_disk_dead is the proper interface to shut down a block device, but it also makes the disk unusable forever. nbd_clear_sock_ioctl on the other hand wants to shut down the file system, but allow the block device to be used again when when connected to another socket. Switch nbd to use disk_force_media_change and nbd_bdev_reset to go back to a behavior of the old __invalidate_device call, with the added benefit of incrementing the device generation as there is no guarantee the old content comes back when the device is reconnected. Reported-by: Samuel Holland <samuel.holland@sifive.com> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 0c1c9a27ce90 ("nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl") Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Samuel Holland <samuel.holland@sifive.com> Link: https://lore.kernel.org/r/20231003153106.1331363-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds2023-08-291-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
| * nbd: automatically load module on genl accessThomas Weißschuh2023-07-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a module alias to nbd.ko that allows the generic netlink core to automatically load the module when netlink messages for nbd are received. This frees the user from manually having to load the module before using nbd functionality via netlink. If the system policy allows it this can even be used to load the nbd module from containers which would otherwise not have access to the necessary module files to do a normal "modprobe nbd". For example this avoids the following error when using nbd-client: $ nbd-client localhost 10809 /dev/nbd0 ... Error: Couldn't resolve the nbd netlink family, make sure the nbd module is loaded and your nbd driver supports the netlink interface. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Josef Bacik <josef@toxicpadna.com> Link: https://lore.kernel.org/r/20230713-b4-nbd-genl-v3-1-226cbddba04b@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctlChristoph Hellwig2023-08-211-5/+3
|/ | | | | | | | | | | | | nbd_clear_sock_ioctl kills the socket and with that the block device. Instead of just invalidating file system buffers, mark the device as dead, which will also invalidate the buffers as part of the proper shutdown sequence. This also includes invalidating partitions if there are any. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-8-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
* block: replace fmode_t with a block-specific type for block open flagsChristoph Hellwig2023-06-121-2/+2
| | | | | | | | | | | | | | The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove the unused mode argument to ->releaseChristoph Hellwig2023-06-121-1/+1
| | | | | | | | | | | | The mode argument to the ->release block_device_operation is never used, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: pass a gendisk to ->openChristoph Hellwig2023-06-121-4/+4
| | | | | | | | | | | | ->open is only called on the whole device. Make that explicit by passing a gendisk instead of the block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: Add the maximum limit of allocated index in nbd_dev_addZhong Jinghua2023-06-071-1/+2
| | | | | | | | | | | | | If the index allocated by idr_alloc greater than MINORMASK >> part_shift, the device number will overflow, resulting in failure to create a block device. Fix it by imiting the size of the max allocation. Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230605122159.2134384-1-zhongjinghua@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: Fix debugfs_create_dir error checkingIvan Orlov2023-05-121-2/+2
| | | | | | | | | | The debugfs_create_dir function returns ERR_PTR in case of error, and the only correct way to check if an error occurred is 'IS_ERR' inline function. This patch will replace the null-comparison with IS_ERR. Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Link: https://lore.kernel.org/r/20230512130533.98709-1-ivan.orlov0322@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block nbd: use req.cookie instead of req.handleEric Blake2023-04-271-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The NBD spec was recently changed [1] to refer to the opaque client identifier as a 'cookie' rather than a 'handle', but has for a much longer time listed it as a 64-bit value, and declares that all values in the NBD protocol are sent in network byte order (big-endian). Because the value is opaque to the server, it doesn't usually matter what endianness we send as the client - as long as we are consistent that either we byte-swap on both write and read, or on neither, then we can match server replies back to our requests. That said, our internal use of the cookie is as a 64-bit number (well, as two 32-bit numbers concatenated together), rather than as 8 individual bytes; so prior to this commit, we ARE leaking the native endianness of our internals as a client out to the server. We don't know of any server that will actually inspect the opaque value and behave differently depending on whether a little-endian or big-endian client is sending requests, but since we DO log the cookie value, a wireshark capture of the network traffic is easier to correlate back to the kernel traffic of a big-endian host (where the u64 and char[8] representations are the same) than of a little-endian host (where if wireshark honors the NBD spec and displays a u64 in network byte order, it is byte-swapped from what the kernel logged). The fix in this patch is thus two-part: it now consistently uses network byte order for the opaque value (no difference to a big-endian machine, but an extra byteswap on a little-endian machine; probably in the noise compared to the overhead of network traffic in general), and now uses a 64-bit integer instead of char[8] as its preferred access to the opaque value (direct assignment instead of memcpy()). Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230410180611.1051618-4-eblake@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block/drivers: remove dead clear of random flagChaitanya Kulkarni2023-04-251-1/+0
| | | | | | | | | | | | | QUEUE_FLAG_ADD_RANDOM is not set before we clear it for "null_blk", "brd", "nbd", "zram", and "bcache" since by default we don't set "QUEUE_FLAG_ADD_RANDOM" to MQ ops. Remove dead clear of QUEUE_FLAG_ADD_RANDOM in above listed drivers. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> #zram Link: https://lore.kernel.org/r/20230424234628.45544-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix incomplete validation of ioctl argZhong Jinghua2023-04-201-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We tested and found an alarm caused by nbd_ioctl arg without verification. The UBSAN warning calltrace like below: UBSAN: Undefined behaviour in fs/buffer.c:1709:35 signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long long int' CPU: 3 PID: 2523 Comm: syz-executor.0 Not tainted 4.19.90 #1 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x170/0x1dc lib/dump_stack.c:118 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161 handle_overflow+0x188/0x1dc lib/ubsan.c:192 __ubsan_handle_sub_overflow+0x34/0x44 lib/ubsan.c:206 __block_write_full_page+0x94c/0xa20 fs/buffer.c:1709 block_write_full_page+0x1f0/0x280 fs/buffer.c:2934 blkdev_writepage+0x34/0x40 fs/block_dev.c:607 __writepage+0x68/0xe8 mm/page-writeback.c:2305 write_cache_pages+0x44c/0xc70 mm/page-writeback.c:2240 generic_writepages+0xdc/0x148 mm/page-writeback.c:2329 blkdev_writepages+0x2c/0x38 fs/block_dev.c:2114 do_writepages+0xd4/0x250 mm/page-writeback.c:2344 The reason for triggering this warning is __block_write_full_page() -> i_size_read(inode) - 1 overflow. inode->i_size is assigned in __nbd_ioctl() -> nbd_set_size() -> bytesize. We think it is necessary to limit the size of arg to prevent errors. Moreover, __nbd_ioctl() -> nbd_add_socket(), arg will be cast to int. Assuming the value of arg is 0x80000000000000001) (on a 64-bit machine), it will become 1 after the coercion, which will return unexpected results. Fix it by adding checks to prevent passing in too large numbers. Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230206145805.2645671-1-zhongjinghua@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: use the structured req attr checkJakub Kicinski2023-03-141-4/+4
| | | | | | | | | | | | Use the macro for checking presence of required attributes. It has the advantage of reporting to the user which attr was missing in a machine-readable format (extack). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230224021301.1630703-2-kuba@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: allow genl access outside init_netJakub Kicinski2023-03-141-0/+1
| | | | | | | | | | | NBD doesn't have much to do with networking, allow users outside init_net to access the family. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230224021301.1630703-1-kuba@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Treewide: Stop corrupting socket's task_fragBenjamin Coddington2022-12-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the GFP_NOIO flag on sk_allocation which the networking system uses to decide when it is safe to use current->task_frag. The results of this are unexpected corruption in task_frag when SUNRPC is involved in memory reclaim. The corruption can be seen in crashes, but the root cause is often difficult to ascertain as a crashing machine's stack trace will have no evidence of being near NFS or SUNRPC code. I believe this problem to be much more pervasive than reports to the community may indicate. Fix this by having kernel users of sockets that may corrupt task_frag due to reclaim set sk_use_task_frag = false. Preemptively correcting this situation for users that still set sk_allocation allows them to convert to memalloc_nofs_save/restore without the same unexpected corruptions that are sure to follow, unlikely to show up in testing, and difficult to bisect. CC: Philipp Reisner <philipp.reisner@linbit.com> CC: Lars Ellenberg <lars.ellenberg@linbit.com> CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com> CC: Jens Axboe <axboe@kernel.dk> CC: Josef Bacik <josef@toxicpanda.com> CC: Keith Busch <kbusch@kernel.org> CC: Christoph Hellwig <hch@lst.de> CC: Sagi Grimberg <sagi@grimberg.me> CC: Lee Duncan <lduncan@suse.com> CC: Chris Leech <cleech@redhat.com> CC: Mike Christie <michael.christie@oracle.com> CC: "James E.J. Bottomley" <jejb@linux.ibm.com> CC: "Martin K. Petersen" <martin.petersen@oracle.com> CC: Valentina Manea <valentina.manea.m@gmail.com> CC: Shuah Khan <shuah@kernel.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: David Howells <dhowells@redhat.com> CC: Marc Dionne <marc.dionne@auristor.com> CC: Steve French <sfrench@samba.org> CC: Christine Caulfield <ccaulfie@redhat.com> CC: David Teigland <teigland@redhat.com> CC: Mark Fasheh <mark@fasheh.com> CC: Joel Becker <jlbec@evilplan.org> CC: Joseph Qi <joseph.qi@linux.alibaba.com> CC: Eric Van Hensbergen <ericvh@gmail.com> CC: Latchesar Ionkov <lucho@ionkov.net> CC: Dominique Martinet <asmadeus@codewreck.org> CC: Ilya Dryomov <idryomov@gmail.com> CC: Xiubo Li <xiubli@redhat.com> CC: Chuck Lever <chuck.lever@oracle.com> CC: Jeff Layton <jlayton@kernel.org> CC: Trond Myklebust <trond.myklebust@hammerspace.com> CC: Anna Schumaker <anna@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* use less confusing names for iov_iter direction initializersAl Viro2022-11-251-5/+5
| | | | | | | | | | | | | READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge tag 'driver-core-6.1-rc1' of ↵Linus Torvalds2022-10-071-2/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big set of driver core and debug printk changes for 6.1-rc1. Included in here is: - dynamic debug updates for the core and the drm subsystem. The drm changes have all been acked by the relevant maintainers - kernfs fixes for syzbot reported problems - kernfs refactors and updates for cgroup requirements - magic number cleanups and removals from the kernel tree (they were not being used and they really did not actually do anything) - other tiny cleanups All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits) docs: filesystems: sysfs: Make text and code for ->show() consistent Documentation: NBD_REQUEST_MAGIC isn't a magic number a.out: restore CMAGIC device property: Add const qualifier to device_get_match_data() parameter drm_print: add _ddebug descriptor to drm_*dbg prototypes drm_print: prefer bare printk KERN_DEBUG on generic fn drm_print: optimize drm_debug_enabled for jump-label drm-print: add drm_dbg_driver to improve namespace symmetry drm-print.h: include dyndbg header drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro drm_print: interpose drm_*dbg with forwarding macros drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers. drm_print: condense enum drm_debug_category debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs() Documentation: ENI155_MAGIC isn't a magic number Documentation: NBD_REPLY_MAGIC isn't a magic number nbd: remove define-only NBD_MAGIC, previously magic number Documentation: FW_HEADER_MAGIC isn't a magic number Documentation: EEPROM_MAGIC_VALUE isn't a magic number ...
| * nbd: remove define-only NBD_MAGIC, previously magic numberнаб2022-09-241-2/+0
| | | | | | | | | | | | | | | | | | | | commit f4507164e779 ("nbd: rename the nbd_device variable from lo to nbd") renamed LO_MAGIC to NBD_MAGIC; commit 5ea8d10802ec ("nbd: separate out the config information") removed the last users of that Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Link: https://lore.kernel.org/r/10a80681c5966fed1a1afc696e3db114f481514c.1663280877.git.nabijaczleweli@nabijaczleweli.xyz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds2022-10-071-2/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
| * | nbd: Fix hung when signal interrupts nbd_start_device_ioctl()Shigeru Yoshida2022-09-081-2/+4
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reported hung task [1]. The following program is a simplified version of the reproducer: int main(void) { int sv[2], fd; if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return 1; if ((fd = open("/dev/nbd0", 0)) < 0) return 1; if (ioctl(fd, NBD_SET_SIZE_BLOCKS, 0x81) < 0) return 1; if (ioctl(fd, NBD_SET_SOCK, sv[0]) < 0) return 1; if (ioctl(fd, NBD_DO_IT) < 0) return 1; return 0; } When signal interrupt nbd_start_device_ioctl() waiting the condition atomic_read(&config->recv_threads) == 0, the task can hung because it waits the completion of the inflight IOs. This patch fixes the issue by clearing queue, not just shutdown, when signal interrupt nbd_start_device_ioctl(). Link: https://syzkaller.appspot.com/bug?id=7d89a3ffacd2b83fdd39549bc4d8e0a89ef21239 [1] Reported-by: syzbot+38e6c55d4969a14c1534@syzkaller.appspotmail.com Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220907163502.577561-1-syoshida@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* / genetlink: start to validate reserved header bytesJakub Kicinski2022-08-291-0/+1
|/ | | | | | | | | | | | | | | | | We had historically not checked that genlmsghdr.reserved is 0 on input which prevents us from using those precious bytes in the future. One use case would be to extend the cmd field, which is currently just 8 bits wide and 256 is not a lot of commands for some core families. To make sure that new families do the right thing by default put the onus of opting out of validation on existing families. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel) Signed-off-by: David S. Miller <davem@davemloft.net>
* nbd: add missing definition of pr_fmtYu Kuai2022-08-021-2/+4
| | | | | | | | | | | | | | commit 1243172d5894 ("nbd: use pr_err to output error message") tries to define pr_fmt and use short pr_err() to output error message, however, the definition is missed. This patch also remove existing "nbd:" inside pr_err(). Fixes: 1243172d5894 ("nbd: use pr_err to output error message") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220723082427.3890655-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Drop 'reserved' arg of busy_tag_iter_fnJohn Garry2022-07-061-1/+1
| | | | | | | | | | | | | | We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter function so it may be dropped. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Drop blk_mq_ops.timeout 'reserved' argJohn Garry2022-07-061-2/+1
| | | | | | | | | | | | | | | | | | | With new API blk_mq_is_reserved_rq() we can tell if a request is from the reserved pool, so stop passing 'reserved' arg. There is actually only a single user of that arg for all the callback implementations, which can use blk_mq_is_reserved_rq() instead. This will also allow us to stop passing the same 'reserved' around the blk-mq iter functions next. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove blk_cleanup_diskChristoph Hellwig2022-06-281-2/+2
| | | | | | | | | | blk_cleanup_disk is nothing but a trivial wrapper for put_disk now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: use pr_err to output error messageYu Kuai2022-05-271-22/+18
| | | | | | | | | | | | | | | | | | Instead of using the long printk(KERN_ERR "nbd: ...") to output error message, defining pr_fmt and using the short pr_err("") to do that. The replacemen is done by using the following command: sed -i 's/printk(KERN_ERR "nbd: /pr_err("/g' \ drivers/block/nbd.c This patch also rewrap to 80 columns where possible. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220521073749.3146892-7-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix possible overflow on 'first_minor' in nbd_dev_add()Zhang Wensheng2022-05-271-11/+12
| | | | | | | | | | | | | | | | When 'index' is a big numbers, it may become negative which forced to 'int'. then 'index << part_shift' might overflow to a positive value that is not greater than '0xfffff', then sysfs might complains about duplicate creation. Because of this, move the 'index' judgment to the front will fix it and be better. Fixes: b0d9111a2d53 ("nbd: use an idr to keep track of nbd devices") Fixes: 940c264984fd ("nbd: fix possible overflow for 'first_minor' in nbd_dev_add()") Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220521073749.3146892-6-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix io hung while disconnecting deviceYu Kuai2022-05-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In our tests, "qemu-nbd" triggers a io hung: INFO: task qemu-nbd:11445 blocked for more than 368 seconds. Not tainted 5.18.0-rc3-next-20220422-00003-g2176915513ca #884 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:qemu-nbd state:D stack: 0 pid:11445 ppid: 1 flags:0x00000000 Call Trace: <TASK> __schedule+0x480/0x1050 ? _raw_spin_lock_irqsave+0x3e/0xb0 schedule+0x9c/0x1b0 blk_mq_freeze_queue_wait+0x9d/0xf0 ? ipi_rseq+0x70/0x70 blk_mq_freeze_queue+0x2b/0x40 nbd_add_socket+0x6b/0x270 [nbd] nbd_ioctl+0x383/0x510 [nbd] blkdev_ioctl+0x18e/0x3e0 __x64_sys_ioctl+0xac/0x120 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fd8ff706577 RSP: 002b:00007fd8fcdfebf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000040000000 RCX: 00007fd8ff706577 RDX: 000000000000000d RSI: 000000000000ab00 RDI: 000000000000000f RBP: 000000000000000f R08: 000000000000fbe8 R09: 000055fe497c62b0 R10: 00000002aff20000 R11: 0000000000000246 R12: 000000000000006d R13: 0000000000000000 R14: 00007ffe82dc5e70 R15: 00007fd8fcdff9c0 "qemu-ndb -d" will call ioctl 'NBD_DISCONNECT' first, however, following message was found: block nbd0: Send disconnect failed -32 Which indicate that something is wrong with the server. Then, "qemu-nbd -d" will call ioctl 'NBD_CLEAR_SOCK', however ioctl can't clear requests after commit 2516ab1543fd("nbd: only clear the queue on device teardown"). And in the meantime, request can't complete through timeout because nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', which means such request will never be completed in this situation. Now that the flag 'NBD_CMD_INFLIGHT' can make sure requests won't complete multiple times, switch back to call nbd_clear_sock() in nbd_clear_sock_ioctl(), so that inflight requests can be cleared. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220521073749.3146892-5-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completedYu Kuai2022-05-271-4/+14
| | | | | | | | | | Otherwise io will hung because request will only be completed if the cmd has the flag 'NBD_CMD_INFLIGHT'. Fixes: 07175cb1baf4 ("nbd: make sure request completion won't concurrent") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20220521073749.3146892-4-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix race between nbd_alloc_config() and module removalYu Kuai2022-05-271-9/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When nbd module is being removing, nbd_alloc_config() may be called concurrently by nbd_genl_connect(), although try_module_get() will return false, but nbd_alloc_config() doesn't handle it. The race may lead to the leak of nbd_config and its related resources (e.g, recv_workq) and oops in nbd_read_stat() due to the unload of nbd module as shown below: BUG: kernel NULL pointer dereference, address: 0000000000000040 Oops: 0000 [#1] SMP PTI CPU: 5 PID: 13840 Comm: kworker/u17:33 Not tainted 5.14.0+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: knbd16-recv recv_work [nbd] RIP: 0010:nbd_read_stat.cold+0x130/0x1a4 [nbd] Call Trace: recv_work+0x3b/0xb0 [nbd] process_one_work+0x1ed/0x390 worker_thread+0x4a/0x3d0 kthread+0x12a/0x150 ret_from_fork+0x22/0x30 Fixing it by checking the return value of try_module_get() in nbd_alloc_config(). As nbd_alloc_config() may return ERR_PTR(-ENODEV), assign nbd->config only when nbd_alloc_config() succeeds to ensure the value of nbd->config is binary (valid or NULL). Also adding a debug message to check the reference counter of nbd_config during module removal. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220521073749.3146892-3-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: call genl_unregister_family() first in nbd_cleanup()Yu Kuai2022-05-271-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise there may be race between module removal and the handling of netlink command, which can lead to the oops as shown below: BUG: kernel NULL pointer dereference, address: 0000000000000098 Oops: 0002 [#1] SMP PTI CPU: 1 PID: 31299 Comm: nbd-client Tainted: G E 5.14.0-rc4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:down_write+0x1a/0x50 Call Trace: start_creating+0x89/0x130 debugfs_create_dir+0x1b/0x130 nbd_start_device+0x13d/0x390 [nbd] nbd_genl_connect+0x42f/0x748 [nbd] genl_family_rcv_msg_doit.isra.0+0xec/0x150 genl_rcv_msg+0xe5/0x1e0 netlink_rcv_skb+0x55/0x100 genl_rcv+0x29/0x40 netlink_unicast+0x1a8/0x250 netlink_sendmsg+0x21b/0x430 ____sys_sendmsg+0x2a4/0x2d0 ___sys_sendmsg+0x81/0xc0 __sys_sendmsg+0x62/0xb0 __x64_sys_sendmsg+0x1f/0x30 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Modules linked in: nbd(E-) Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220521073749.3146892-2-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: Fix hung on disconnect request if socket is closed beforeXie Yongji2022-05-161-4/+9
| | | | | | | | | | | | | | | | | When userspace closes the socket before sending a disconnect request, the following I/O requests will be blocked in wait_for_reconnect() until dead timeout. This will cause the following disconnect request also hung on blk_mq_quiesce_queue(). That means we have no way to disconnect a nbd device if there are some I/O requests waiting for reconnecting until dead timeout. It's not expected. So let's wake up the thread waiting for reconnecting directly when a disconnect request is sent. Reported-by: Xu Jianhai <zero.xu@bytedance.com> Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220322080639.142-1-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: don't set the discard_alignment queue limitChristoph Hellwig2022-05-031-3/+0
| | | | | | | | | | | | The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by nbd is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: add a disk_openers helperChristoph Hellwig2022-04-181-2/+2
| | | | | | | | | | | Add a helper that returns the openers for a given gendisk to avoid having drivers poke into disk->part0 to get at this information in a somewhat cumbersome way. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220330052917.2566582-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: use the correct block_device in nbd_bdev_resetChristoph Hellwig2022-04-181-7/+7
| | | | | | | | | | | | The bdev parameter to ->ioctl contains the block device that the ioctl is called on, which can be the partition. But the openers check in nbd_bdev_reset really needs to check use the whole device, so switch to using that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220330052917.2566582-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove QUEUE_FLAG_DISCARDChristoph Hellwig2022-04-171-4/+1
| | | | | | | | | | | | | | | | | | | Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"Jens Axboe2022-04-021-12/+12
| | | | | | | | | | | | | | | | | This reverts commit 6d35d04a9e18990040e87d2bbf72689252669d54. Both Gabriel and Borislav report that this commit casues a regression with nbd: sysfs: cannot create duplicate filename '/dev/block/43:0' Revert it before 5.18-rc1 and we'll investigage this separately in due time. Link: https://lore.kernel.org/all/YkiJTnFOt9bTv6A2@zn.tnic/ Reported-by: Gabriel L. Somlo <somlo@cmu.edu> Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>