summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-5.15/libata-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-307-214/+243
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull libata updates from Jens Axboe: "libata changes for the 5.15 release: - NCQ priority improvements (Damien, Niklas) - coccinelle warning fix (Jing) - dwc_460ex phy fix (Andy)" * tag 'for-5.15/libata-2021-08-30' of git://git.kernel.dk/linux-block: include:libata: fix boolreturn.cocci warnings docs: sysfs-block-device: document ncq_prio_supported docs: sysfs-block-device: improve ncq_prio_enable documentation libata: Introduce ncq_prio_supported sysfs sttribute libata: print feature list on device scan libata: fix ata_read_log_page() warning libata: cleanup NCQ priority handling libata: cleanup ata_dev_configure() libata: cleanup device sleep capability detection libata: simplify ata_scsi_rbuf_fill() libata: fix ata_host_start() ata: sata_dwc_460ex: No need to call phy_exit() befre phy_init()
| * include:libata: fix boolreturn.cocci warningsJing Yangyang2021-08-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ./include/linux/libata.h:1462:8-9:WARNING: return of 0/1 in function 'ata_is_host_link' with return type bool Return statements in functions returning bool should use true/false instead of 1/0. Generated by: scripts/coccinelle/misc/boolreturn.cocci Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Jing Yangyang <jing.yangyang@zte.com.cn> Link: https://lore.kernel.org/r/20210824060702.59006-1-deng.changcheng@zte.com.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * docs: sysfs-block-device: document ncq_prio_supportedDamien Le Moal2021-08-181-1/+24
| | | | | | | | | | | | | | | | | | | | Add documentation for the new device attribute file ncq_prio_supported, and its SAS HBA equivalent sas_ncq_prio_supported. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-12-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * docs: sysfs-block-device: improve ncq_prio_enable documentationNiklas Cassel2021-08-181-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NCQ priority is an optional feature of the NCQ feature set and should not be confused with the NCQ feature set itself. Clarify the description of the ncq_prio_enable attribute to avoid this confusion. Also add the missing documentation for the equivalent sas_ncq_prio_enable attribute. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-11-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * libata: Introduce ncq_prio_supported sysfs sttributeDamien Le Moal2021-08-183-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the only way a user can determine if a SATA device supports NCQ priority is to try to enable the use of this feature using the ncq_prio_enable sysfs device attribute. If enabling the feature fails, it is because the device does not support NCQ priority. Otherwise, the feature is enabled and success indicates that the device supports NCQ priority. Improve this odd interface by introducing the read-only ncq_prio_supported sysfs device attribute to indicate if a SATA device supports NCQ priority. The value of this attribute reflects the status of device flag ATA_DFLAG_NCQ_PRIO, which is set only for devices supporting NCQ priority. Add this new sysfs attribute to the device attributes group of libahci and libata-sata. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Link: https://lore.kernel.org/r/20210816014456.2191776-10-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * libata: print feature list on device scanDamien Le Moal2021-08-182-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | Print a list of features supported by a drive when it is configured in ata_dev_configure() using the new function ata_dev_print_features(). The features printed are not already advertized and are: trusted send-recev support, device attention support, device sleep support, NCQ send-recv support and NCQ priority support. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-9-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * libata: fix ata_read_log_page() warningDamien Le Moal2021-08-181-34/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support for the READ LOG PAGE DMA EXT command is indicated by words 119 and 120 of a device identify data. This is tested in ata_read_log_page() with ata_id_has_read_log_dma_ext() and the READ LOG PAGE DMA command used if the device reports supports for it. However, some devices lie about this support and using the DMA version of the command fails, generating the warning message "READ LOG DMA EXT failed, trying PIO". Since READ LOG PAGE DMA EXT is an optional command, this warning is not at all important but may be scary for the user. Change ata_read_log_page() to suppres this warning and to print an error message if both DMA and PIO attempts failed. With this change, there is no need to print again an error message when ata_read_log_page() returns an error. So simplify the users of this function. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-8-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * libata: cleanup NCQ priority handlingDamien Le Moal2021-08-182-43/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ata device flag ATA_DFLAG_NCQ_PRIO indicates if a device supports the NCQ Priority feature while the ATA_DFLAG_NCQ_PRIO_ENABLE device flag indicates if the feature is enabled. Enabling NCQ priority use is controlled by the user through the device sysfs attribute ncq_prio_enable. As a result, the ATA_DFLAG_NCQ_PRIO flag should not be cleared when ATA_DFLAG_NCQ_PRIO_ENABLE is not set as the device still supports the feature even after the user disables it. This leads to the following cleanups: - In ata_build_rw_tf(), set a command high priority bit based on the ATA_DFLAG_NCQ_PRIO_ENABLE flag, not on the ATA_DFLAG_NCQ flag. That is, set a command high priority only if the user enabled NCQ priority use. - In ata_dev_config_ncq_prio(), ATA_DFLAG_NCQ_PRIO should not be cleared if ATA_DFLAG_NCQ_PRIO_ENABLE is not set. If the device does not support NCQ priority, both ATA_DFLAG_NCQ_PRIO and ATA_DFLAG_NCQ_PRIO_ENABLE must be cleared. With the above ata_dev_config_ncq_prio() change, ATA_DFLAG_NCQ_PRIO flag is set on device scan and revalidation. There is no need to trigger a device revalidation in ata_ncq_prio_enable_store() when the user enables the use of NCQ priority. Remove the revalidation code from that funciton to simplify it. Also change the return value from -EIO to -EINVAL when a user tries to enable NCQ priority for a device that does not support this feature. While at it, also simplify ata_ncq_prio_enable_show(). Overall, there is no functional change introduced by this patch. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-7-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * libata: cleanup ata_dev_configure()Damien Le Moal2021-08-181-56/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the helper functions ata_dev_config_lba() and ata_dev_config_chs() to configure the addressing capabilities of a device. To control message printing in these new helpers, as well as in ata_dev_configure() and in ata_hpa_resize(), add the helper function ata_dev_print_info() to avoid open coding for the eh context ATA_EHI_PRINTINFO flag in multiple functions. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-6-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * libata: cleanup device sleep capability detectionDamien Le Moal2021-08-181-23/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | Move the code to retrieve the device sleep capability and timings out of ata_dev_configure() into the helper function ata_dev_config_devslp(). While at it, mark the device as supporting the device sleep capability only if the sata settings page was retrieved successfully to ensure that the timing information is correctly initialized. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-5-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * libata: simplify ata_scsi_rbuf_fill()Damien Le Moal2021-08-181-51/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sparse complains about context imbalance in ata_scsi_rbuf_get() and ata_scsi_rbuf_put() due to these functions respectively only taking and releasing the ata_scsi_rbuf_lock spinlock. Since these functions are only called from ata_scsi_rbuf_fill() with ata_scsi_rbuf_get() being called with a copy_in argument always false, the code can be simplified and ata_scsi_rbuf_{get|put} removed. This change both simplifies the code and fixes the sparse warning. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-4-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * libata: fix ata_host_start()Damien Le Moal2021-08-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The loop on entry of ata_host_start() may not initialize host->ops to a non NULL value. The test on the host_stop field of host->ops must then be preceded by a check that host->ops is not NULL. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210816014456.2191776-3-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: sata_dwc_460ex: No need to call phy_exit() befre phy_init()Andy Shevchenko2021-07-271-8/+4
| | | | | | | | | | | | | | | | | | | | | | Last change to device managed APIs cleaned up error path to simple phy_exit() call, which in some cases has been executed with NULL parameter. This per se is not a problem, but rather logical misconception: no need to free resource when it's for sure has not been allocated yet. Fix the driver accordingly. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20210727125130.19977-1-andriy.shevchenko@linux.intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-3052-13926/+462
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "Sitting on top of the core block changes, here are the driver changes for the 5.15 merge window: - NVMe updates via Christoph: - suspend improvements for devices with an HMB (Keith Busch) - handle double completions more gacefull (Sagi Grimberg) - cleanup the selects for the nvme core code a bit (Sagi Grimberg) - don't update queue count when failing to set io queues (Ruozhu Li) - various nvmet connect fixes (Amit Engel) - cleanup lightnvm leftovers (Keith Busch, me) - small cleanups (Colin Ian King, Hou Pu) - add tracing for the Set Features command (Hou Pu) - CMB sysfs cleanups (Keith Busch) - add a mutex_destroy call (Keith Busch) - remove lightnvm subsystem. It's served its purpose and ultimately led to zoned nvme support, we no longer need it (Christoph) - revert floppy O_NDELAY fix (Denis) - nbd fixes (Hou, Pavel, Baokun) - nbd locking fixes (Tetsuo) - nbd device removal fixes (Christoph) - raid10 rcu warning fix (Xiao) - raid1 write behind fix (Guoqing) - rnbd fixes (Gioh, Md Haris) - misc fixes (Colin)" * tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits) Revert "floppy: reintroduce O_NDELAY fix" raid1: ensure write behind bio has less than BIO_MAX_VECS sectors md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard nbd: remove nbd->destroy_complete nbd: only return usable devices from nbd_find_unused nbd: set nbd->index before releasing nbd_index_mutex nbd: prevent IDR lookups from finding partially initialized devices nbd: reset NBD to NULL when restarting in nbd_genl_connect nbd: add missing locking to the nbd_dev_add error path nvme: remove the unused NVME_NS_* enum nvme: remove nvm_ndev from ns nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers block: nbd: add sanity check for first_minor nvmet: check that host sqsize does not exceed ctrl MQES nvmet: avoid duplicate qid in connect cmd nvmet: pass back cntlid on successful completion nvme-rdma: don't update queue count when failing to set io queues nvme-tcp: don't update queue count when failing to set io queues nvme-tcp: pair send_mutex init with destroy nvme: allow user toggling hmb usage ...
| * \ Merge tag 'floppy-for-5.15' of https://github.com/evdenis/linux-floppy into ↵Jens Axboe2021-08-291-15/+15
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for-5.15/drivers Pull floppy fix from Denis: "Bring back O_NDELAY for floppy Only one commit this time with revert of O_NDELAY removal for the floppy. Users reported that the commit breaks userspace utils and known floppy workflow patterns. We already reverted the same commit back in 2016 presumably for the same reason. Completely drop O_NDELAY for floppy seems excessive to solve problems it introduces. I started to write basic selftests for the floppy to prevent this kind of userspace breaks in the future. Signed-off-by: Denis Efremov <efremov@linux.com>" * tag 'floppy-for-5.15' of https://github.com/evdenis/linux-floppy: Revert "floppy: reintroduce O_NDELAY fix"
| | * | Revert "floppy: reintroduce O_NDELAY fix"Denis Efremov2021-08-281-15/+15
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch breaks userspace implementations (e.g. fdutils) and introduces regressions in behaviour. Previously, it was possible to O_NDELAY open a floppy device with no media inserted or with write protected media without an error. Some userspace tools use this particular behavior for probing. It's not the first time when we revert this patch. Previous revert is in commit f2791e7eadf4 (Revert "floppy: refactor open() flags handling"). This reverts commit 8a0c014cd20516ade9654fc13b51345ec58e7be8. Link: https://lore.kernel.org/linux-block/de10cb47-34d1-5a88-7751-225ca380f735@compro.net/ Reported-by: Mark Hounschell <markh@compro.net> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Wim Osterholt <wim@djo.tudelft.nl> Cc: Kurt Garloff <kurt@garloff.de> Cc: <stable@vger.kernel.org> Signed-off-by: Denis Efremov <efremov@linux.com>
| * | Merge branch 'md-next' of ↵Jens Axboe2021-08-272-4/+29
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.15/drivers Pull MD changes from Song. * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: raid1: ensure write behind bio has less than BIO_MAX_VECS sectors md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard
| | * | raid1: ensure write behind bio has less than BIO_MAX_VECS sectorsGuoqing Jiang2021-08-271-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can't split write behind bio with more than BIO_MAX_VECS sectors, otherwise the below call trace was triggered because we could allocate oversized write behind bio later. [ 8.097936] bvec_alloc+0x90/0xc0 [ 8.098934] bio_alloc_bioset+0x1b3/0x260 [ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1] [ 8.100988] ? __bio_clone_fast+0xa8/0xe0 [ 8.102008] md_handle_request+0x158/0x1d0 [md_mod] [ 8.103050] md_submit_bio+0xcd/0x110 [md_mod] [ 8.104084] submit_bio_noacct+0x139/0x530 [ 8.105127] submit_bio+0x78/0x1d0 [ 8.106163] ext4_io_submit+0x48/0x60 [ext4] [ 8.107242] ext4_writepages+0x652/0x1170 [ext4] [ 8.108300] ? do_writepages+0x41/0x100 [ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4] [ 8.110406] do_writepages+0x41/0x100 [ 8.111450] __filemap_fdatawrite_range+0xc5/0x100 [ 8.112513] file_write_and_wait_range+0x61/0xb0 [ 8.113564] ext4_sync_file+0x73/0x370 [ext4] [ 8.114607] __x64_sys_fsync+0x33/0x60 [ 8.115635] do_syscall_64+0x33/0x40 [ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae Thanks for the comment from Christoph. [1]. https://bugs.archlinux.org/task/70992 Cc: stable@vger.kernel.org # v5.12+ Reported-by: Jens Stutte <jens@chianterastutte.eu> Tested-by: Jens Stutte <jens@chianterastutte.eu> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discardXiao Ni2021-08-261-4/+10
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are seeing the following warning in raid10_handle_discard. [ 695.110751] ============================= [ 695.131439] WARNING: suspicious RCU usage [ 695.151389] 4.18.0-319.el8.x86_64+debug #1 Not tainted [ 695.174413] ----------------------------- [ 695.192603] drivers/md/raid10.c:1776 suspicious rcu_dereference_check() usage! [ 695.225107] other info that might help us debug this: [ 695.260940] rcu_scheduler_active = 2, debug_locks = 1 [ 695.290157] no locks held by mkfs.xfs/10186. In the first loop of function raid10_handle_discard. It already determines which disk need to handle discard request and add the rdev reference count rdev->nr_pending. So the conf->mirrors will not change until all bios come back from underlayer disks. It doesn't need to use rcu_dereference to get rdev. Cc: stable@vger.kernel.org Fixes: d30588b2731f ('md/raid10: improve raid10 discard request') Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com>
| * | nbd: remove nbd->destroy_completeChristoph Hellwig2021-08-251-38/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nbd->destroy_complete pointer is not really needed. For creating a device without a specific index we now simplify skip devices marked NBD_DESTROY_ON_DISCONNECT as there is not much point to reuse them. For device creation with a specific index there is no real need to treat the case of a requested but not finished disconnect different than any other device that is being shutdown, i.e. we can just return an error, as a slightly different race window would anyway. Fixes: 6e4df4c64881 ("nbd: reduce the nbd_index_mutex scope") Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot+2c98885bcd769f56b6d6@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-7-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nbd: only return usable devices from nbd_find_unusedChristoph Hellwig2021-08-251-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Device marked as NBD_DESTROY_ON_DISCONNECT can and should be skipped given that they won't survive the disconnect. So skip them and try to grab a reference directly and just continue if the the devices is being torn down or created and thus has a zero refcount. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-6-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nbd: set nbd->index before releasing nbd_index_mutexTetsuo Handa2021-08-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set nbd->index before releasing nbd_index_mutex, as populate_nbd_status() might access nbd->index as soon as nbd_index_mutex is released. Fixes: 6e4df4c64881 ("nbd: reduce the nbd_index_mutex scope") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-5-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nbd: prevent IDR lookups from finding partially initialized devicesTetsuo Handa2021-08-251-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously nbd_index_mutex was held during whole add/remove/lookup operations in order to guarantee that partially initialized devices are not reachable via idr_find() or idr_for_each(). But now that partially initialized devices become reachable as soon as idr_alloc() succeeds, we need to skip partially initialized devices. Since it seems that all functions use refcount_inc_not_zero(&nbd->refs) in order to skip destroying devices, update nbd->refs from zero to non-zero as the last step of device initialization in order to also skip partially initialized devices. Fixes: 6e4df4c64881 ("nbd: reduce the nbd_index_mutex scope") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [hch: split from a larger patch, added comments] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-4-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nbd: reset NBD to NULL when restarting in nbd_genl_connectChristoph Hellwig2021-08-251-14/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When nbd_genl_connect restarts to wait for a disconnecting device, nbd needs to be reset to NULL. Do that by facoring out a helper to find an unused device. Fixes: 6177b56c96ff ("nbd: refactor device search and allocation in nbd_genl_connect") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-3-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nbd: add missing locking to the nbd_dev_add error pathTetsuo Handa2021-08-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | idr_remove needs external synchronization. Fixes: 6e4df4c64881 ("nbd: reduce the nbd_index_mutex scope") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-2-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | Merge tag 'nvme-5.15-2021-08-18' of git://git.infradead.org/nvme into ↵Jens Axboe2021-08-1817-123/+295
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for-5.15/drivers Pull NVMe updates from Christoph: "nvme updates for Linux 5.15. - suspend improvements for devices with an HMB (Keith Busch) - handle double completions more gacefull (Sagi Grimberg) - cleanup the selects for the nvme core code a bit (Sagi Grimberg) - don't update queue count when failing to set io queues (Ruozhu Li) - various nvmet connect fixes (Amit Engel) - cleanup lightnvm leftovers (Keith Busch, me) - small cleanups (Colin Ian King, Hou Pu) - add tracing for the Set Features command (Hou Pu) - CMB sysfs cleanups (Keith Busch) - add a mutex_destroy call (Keith Busch)" * tag 'nvme-5.15-2021-08-18' of git://git.infradead.org/nvme: (21 commits) nvme: remove the unused NVME_NS_* enum nvme: remove nvm_ndev from ns nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers nvmet: check that host sqsize does not exceed ctrl MQES nvmet: avoid duplicate qid in connect cmd nvmet: pass back cntlid on successful completion nvme-rdma: don't update queue count when failing to set io queues nvme-tcp: don't update queue count when failing to set io queues nvme-tcp: pair send_mutex init with destroy nvme: allow user toggling hmb usage nvme-pci: disable hmb on idle suspend nvmet: remove redundant assignments of variable status nvmet: add set feature tracing support nvme: add set feature tracing support nvme-fabrics: remove superfluous nvmf_host_put in nvmf_parse_options nvme-pci: cmb sysfs: one file, one value nvme-pci: use attribute group for cmb sysfs nvme: code command_id with a genctr for use-after-free validation nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data nvme-pci: limit maximum queue depth to 4095 ...
| | * | nvme: remove the unused NVME_NS_* enumChristoph Hellwig2021-08-171-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | These values are unused now that the lightnvm support is gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
| | * | nvme: remove nvm_ndev from nsKeith Busch2021-08-161-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that the lightnvm driver is removed, we don't need a pointer to it's now non-existent struct. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: Have NVME_FABRICS select NVME_CORE instead of transport driversSagi Grimberg2021-08-162-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Transport drivers need both core and fabrics modules, instead of selecting both, have the selection transitive such that NVME_FABRICS selects NVME_CORE and transport drivers select NVME_FABRICS. Suggested-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvmet: check that host sqsize does not exceed ctrl MQESAmit Engel2021-08-161-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check that host sqsize is not greater-than Maximum Queue Entries Supported (MQES) value supported by the controller. Signed-off-by: Amit Engel <amit.engel@dell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvmet: avoid duplicate qid in connect cmdAmit Engel2021-08-162-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to the NVMe specification, if the host sends a Connect command specifying a queue id which has already been created, a status value of NVME_SC_CMD_SEQ_ERROR is returned. Signed-off-by: Amit Engel <amit.engel@dell.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvmet: pass back cntlid on successful completionAmit Engel2021-08-161-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to the NVMe specification, the response dword 0 value of the Connect command is based on status code: return cntlid for successful compeltion return IPO and IATTR for connect invalid parameters. Fix a missing error information for a zero sized queue, and return the cntlid also for I/O queue Connect commands. Signed-off-by: Amit Engel <amit.engel@dell.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-rdma: don't update queue count when failing to set io queuesRuozhu Li2021-08-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We update ctrl->queue_count and schedule another reconnect when io queue count is zero.But we will never try to create any io queue in next reco- nnection, because ctrl->queue_count already set to zero.We will end up having an admin-only session in Live state, which is exactly what we try to avoid in the original patch. Update ctrl->queue_count after queue_count zero checking to fix it. Signed-off-by: Ruozhu Li <liruozhu@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-tcp: don't update queue count when failing to set io queuesRuozhu Li2021-08-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We update ctrl->queue_count and schedule another reconnect when io queue count is zero.But we will never try to create any io queue in next reco- nnection, because ctrl->queue_count already set to zero.We will end up having an admin-only session in Live state, which is exactly what we try to avoid in the original patch. Update ctrl->queue_count after queue_count zero checking to fix it. Signed-off-by: Ruozhu Li <liruozhu@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-tcp: pair send_mutex init with destroyKeith Busch2021-08-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each mutex_init() should have a corresponding mutex_destroy(). Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: allow user toggling hmb usageKeith Busch2021-08-161-1/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The NVMe host memory buffer may consume a non-negligable amount of memory. Controllers are required to function without the host memory buffer enabled, but with possibly degraded performance. Export a sysfs property to toggle this feature on a per-device granularity so users may choose to reclaim memory at the expense of storage performance. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-pci: disable hmb on idle suspendKeith Busch2021-08-161-7/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An idle suspend may or may not disable host memory access from devices placed in low power mode. Either way, it should always be safe to disable the host memory buffer prior to entering the low power mode, and this should also always be faster than a full device shutdown. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvmet: remove redundant assignments of variable statusColin Ian King2021-08-161-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two occurrances where variable status is being assigned a value that is never read and it is being re-assigned a new value almost immediately afterwards on an error exit path. The assignments are redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvmet: add set feature tracing supportHou Pu2021-08-161-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A nvme connect command produces following trace from the target side. Before: kworker/0:1H-56 [000] .... 9012.155139: nvmet_req_init: nvmet1: qid=0, cmdid=16, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) kworker/0:1H-56 [000] .... 9012.872272: nvmet_req_init: nvmet1: qid=0, cmdid=13, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) cmdline:/sys/kernel/debug/tracing# cat trace | grep feature kworker/0:1H-56 [000] .... 203.493914: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0x7, sv=0x0, cdw11=0x70007) kworker/0:1H-56 [000] .... 204.197079: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0xb, sv=0x0, cdw11=0x900) Using ',' to separate different field like others in nvmet_trace_admin_get_features. Signed-off-by: Hou Pu <houpu.main@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: add set feature tracing supportHou Pu2021-08-161-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A nvme connect command produces following trace. Before: /sys/kernel/debug/tracing# cat trace | grep feature kworker/5:1H-98 [005] .... 3221.294844: nvme_setup_cmd: nvme0: qid=0, cmdid=25, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) kworker/4:1H-124 [004] .... 3222.009186: nvme_setup_cmd: nvme0: qid=0, cmdid=17, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) After: /sys/kernel/debug/tracing# cat trace | grep feature kworker/0:1H-253 [000] .... 196.060509: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0x7, sv=0x0, cdw11=0x70007) kworker/0:1H-253 [000] .... 196.763947: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0xb, sv=0x0, cdw11=0x900) Using ',' to separate different field like others in nvmet_trace_admin_get_features. Signed-off-by: Hou Pu <houpu.main@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-fabrics: remove superfluous nvmf_host_put in nvmf_parse_optionsHou Pu2021-08-161-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Opts->host is NULL there. It is checked just before. So remove nvmf_host_put. It is introduced by commit 59a2f3f00fd7 ("nvme: fix potential memory leak in option parsing"). Signed-off-by: Hou Pu <houpu.main@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-pci: cmb sysfs: one file, one valueKeith Busch2021-08-161-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An attribute should only be exporting one value as recommended in Documentation/filesystems/sysfs.rst. Implement CMB attributes this way. The old attribute will remain for backward compatibility. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-pci: use attribute group for cmb sysfsKeith Busch2021-08-161-26/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Appending sysfs files to the controller kobject is a bit clunky and becomes a maintenance problem as more attributes are added. The attribute group infrastructure handles this better, so use that. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: code command_id with a genctr for use-after-free validationSagi Grimberg2021-08-166-20/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cannot detect a (perhaps buggy) controller that is sending us a completion for a request that was already completed (for example sending a completion twice), this phenomenon was seen in the wild a few times. So to protect against this, we use the upper 4 msbits of the nvme sqe command_id to use as a 4-bit generation counter and verify it matches the existing request generation that is incrementing on every execution. The 16-bit command_id structure now is constructed by: | xxxx | xxxxxxxxxxxx | gen request tag This means that we are giving up some possible queue depth as 12 bits allow for a maximum queue depth of 4095 instead of 65536, however we never create such long queues anyways so no real harm done. Suggested-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu dataSagi Grimberg2021-08-161-11/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already validate it when receiving the c2hdata pdu header and this is not changing so this is a redundant check. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-pci: limit maximum queue depth to 4095Sagi Grimberg2021-08-161-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are going to use the upper 4-bits of the command_id for a generation counter, so enforce the new queue depth upper limit. As we enforce both min and max queue depth, use param_set_uint_minmax istead of open coding it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | params: lift param_set_uint_minmax to common codeSagi Grimberg2021-08-163-18/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is a useful helper hence move it to common code so others can enjoy it. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | block: nbd: add sanity check for first_minorPavel Skripkin2021-08-161-0/+10
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot hit WARNING in internal_create_group(). The problem was in too big disk->first_minor. disk->first_minor is initialized by value, which comes from userspace and there wasn't any sanity checks about value correctness. It can cause duplicate creation of sysfs files/links, because disk->first_minor will be passed to MKDEV() which causes truncation to byte. Since maximum minor value is 0xff, let's check if first_minor is correct minor number. NOTE: the root case of the reported warning was in wrong error handling in register_disk(), but we can avoid passing knowingly wrong values to sysfs API, because sysfs error messages can confuse users. For example: user passed 1048576 as index, but sysfs complains about duplicate creation of /dev/block/43:0. It's not obvious how 1048576 becomes 0. Log and reproducer for above example can be found on syzkaller bug report page. Link: https://syzkaller.appspot.com/bug?id=03c2ae9146416edf811958d5fd7acfab75b143d1 Fixes: b0d9111a2d53 ("nbd: use an idr to keep track of nbd devices") Reported-by: syzbot+9937dc42271cd87d4b98@syzkaller.appspotmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | remove the lightnvm subsystemChristoph Hellwig2021-08-1430-13678/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lightnvm supports the OCSSD 1.x and 2.0 specs which were early attempts to produce Open Channel SSDs and never made it into the NVMe spec proper. They have since been superceeded by NVMe enhancements such as ZNS support. Remove the support per the deprecation schedule. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210812132308.38486-1-hch@lst.de Reviewed-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nbd: reduce the nbd_index_mutex scopeChristoph Hellwig2021-08-131-27/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nbd_index_mutex is currently held over add_disk and inside ->open, which leads to lock order reversals. Refactor the device creation code path so that nbd_dev_add is called without nbd_index_mutex lock held and only takes it for the IDR insertation. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210811124428.2368491-7-hch@lst.de [axboe: fix whitespace] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>