summaryrefslogtreecommitdiffstats
path: root/drivers/nvme
Commit message (Collapse)AuthorAgeFilesLines
* nvme-tcp: fix link failure for TCP authArnd Bergmann2024-10-101-1/+1
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2d5a333e09c388189238291577e443221baacba0 ] The nvme fabric driver calls the nvme_tls_key_lookup() function from nvmf_parse_key() when the keyring is enabled, but this is broken in a configuration with CONFIG_NVME_FABRICS=y and CONFIG_NVME_TCP=m because this leads to the function definition being in a loadable module: x86_64-linux-ld: vmlinux.o: in function `nvmf_parse_key': fabrics.c:(.text+0xb1bdec): undefined reference to `nvme_tls_key_lookup' Move the 'select' up to CONFIG_NVME_FABRICS itself to force this part to be built-in as well if needed. Fixes: 5bc46b49c828 ("nvme-tcp: check for invalidated or revoked key") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-tcp: check for invalidated or revoked keyHannes Reinecke2024-10-104-2/+25
| | | | | | | | | | | | | [ Upstream commit 5bc46b49c828a6dfaab80b71ecb63fe76a1096d2 ] key_lookup() will always return a key, even if that key is revoked or invalidated. So check for invalid keys before continuing. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-tcp: sanitize TLS key handlingHannes Reinecke2024-10-104-17/+43
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 363895767fbfa05891b0b4d9e06ebde7a10c6a07 ] There is a difference between TLS configured (ie the user has provisioned/requested a key) and TLS enabled (ie the connection is encrypted with TLS). This becomes important for secure concatenation, where the initial authentication is run on an unencrypted connection (ie with TLS configured, but not enabled), and then the queue is reset to run over TLS (ie TLS configured _and_ enabled). So to differentiate between those two states store the generated key in opts->tls_key (as we're using the same TLS key for all queues), the key serial of the resulting TLS handshake in ctrl->tls_pskid (to signal that TLS on the admin queue is enabled), and a simple flag for the queues to indicated that TLS has been enabled. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-keyring: restrict match length for version '1' identifiersHannes Reinecke2024-10-101-10/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 79559c75332458985ab8a21f11b08bf7c9b833b0 ] TP8018 introduced a new TLS PSK identifier version (version 1), which appended a PSK hash value to the existing identifier (cf NVMe TCP specification v1.1, section 3.6.1.3 'TLS PSK and PSK Identity Derivation'). An original (version 0) identifier has the form: NVMe0<type><hmac> <hostnqn> <subsysnqn> and a version 1 identifier has the form: NVMe1<type><hmac> <hostnqn> <subsysnqn> <hash> This patch modifies the lookup algorthm to compare only the first part of the identifier (excluding the hash value) to handle both version 0 and version 1 identifiers. And the spec declares 'version 0' identifiers obsolete, so the lookup algorithm is modified to prever v1 identifiers. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-multipath: system fails to create generic nvme deviceHannes Reinecke2024-10-041-1/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit 63bcf9014e95a7d279d10d8e2caa5d88db2b1855 ] NVME_NSHEAD_DISK_LIVE is a flag for struct nvme_ns_head, not nvme_ns. The current code has a typo causing NVME_NSHEAD_DISK_LIVE never to be cleared once device_add_disk_fails, causing the system never to create the 'generic' character device. Even several rescan attempts will change the situation and the system has to be rebooted to fix the issue. Fixes: 11384580e332 ("nvme-multipath: add error handling support for add_disk()") Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-pci: qdepth 1 quirkKeith Busch2024-09-302-9/+14
| | | | | | | | | | | | | | | | | | | commit 83bdfcbdbe5d901c5fa432decf12e1725a840a56 upstream. Another device has been reported to be unreliable if we have more than one outstanding command. In this new case, data corruption may occur. Since we have two devices now needing this quirky behavior, make a generic quirk flag. The same Apple quirk is clearly not "temporary", so update the comment while moving it. Link: https://lore.kernel.org/linux-nvme/191d810a4e3.fcc6066c765804.973611676137075390@collabora.com/ Reported-by: Robert Beckett <bob.beckett@collabora.com> Reviewed-by: Christoph Hellwig hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Cc: "Gagniuc, Alexandru" <alexandru.gagniuc@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nvmet: Identify-Active Namespace ID List command should reject invalid nsidMaurizio Lombardi2024-09-121-0/+10
| | | | | | | | | | | | | | [ Upstream commit 899d2e5a4e3d36689e8938e152f4b69a4bcc6b4d ] nsid values of 0xFFFFFFFE and 0XFFFFFFFF should be rejected with a status code of "Invalid Namespace or Format". See NVMe Base Specification, Active Namespace ID list (CNS 02h). Fixes: a07b4970f464 ("nvmet: add a generic NVMe target") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme: rename CDR/MORE/DNR to NVME_STATUS_*Weiwen Hu2024-09-1215-122/+122
| | | | | | | | | | | | | | | [ Upstream commit dd0b0a4a2c5d7209457dc172997d1243ad269cfa ] CDR/MORE/DNR fields are not belonging to SC in the NVMe spec, rename them to NVME_STATUS_* to avoid confusion. Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: 899d2e5a4e3d ("nvmet: Identify-Active Namespace ID List command should reject invalid nsid") Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme: fix status magic numbersWeiwen Hu2024-09-125-14/+14
| | | | | | | | | | | | | | [ Upstream commit d89a5c6705998ddc42b104f8eabd3c4b9e8fde08 ] Replaced some magic numbers about SC and SCT with enum and macro. Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: 899d2e5a4e3d ("nvmet: Identify-Active Namespace ID List command should reject invalid nsid") Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_errWeiwen Hu2024-09-121-5/+5
| | | | | | | | | | | | | | | | [ Upstream commit 22f19a584d7045e0509f103dbc5c0acfd6415163 ] This should better match its semantic. "sc" is used in the NVMe spec to specifically refer to the last 8 bits in the status field. We should not reuse "sc" here. Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: 899d2e5a4e3d ("nvmet: Identify-Active Namespace ID List command should reject invalid nsid") Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-pci: allocate tagset on reset if necessaryKeith Busch2024-09-121-0/+6
| | | | | | | | | | | | | | | [ Upstream commit 6f01bdbfef3b62955cf6503a8425d527b3a5cf94 ] If a drive is unable to create IO queues on the initial probe, a subsequent reset will need to allocate the tagset if IO queue creation is successful. Without this, blk_mq_update_nr_hw_queues will crash on a bad pointer due to the invalid tagset. Fixes: eac3ef262941f62 ("nvme-pci: split the initial probe from the rest path") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvmet-tcp: fix kernel crash if commands allocation failsMaurizio Lombardi2024-09-121-1/+3
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5572a55a6f830ee3f3a994b6b962a5c327d28cb3 ] If the commands allocation fails in nvmet_tcp_alloc_cmds() the kernel crashes in nvmet_tcp_release_queue_work() because of a NULL pointer dereference. nvmet: failed to install queue 0 cntlid 1 ret 6 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 Fix the bug by setting queue->nr_cmds to zero in case nvmet_tcp_alloc_cmd() fails. Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-pci: Add sleep quirk for Samsung 990 EvoGeorg Gottleuber2024-09-121-0/+11
| | | | | | | | | | | | | | | | commit 61aa894e7a2fda4ee026523b01d07e83ce2abb72 upstream. On some TUXEDO platforms, a Samsung 990 Evo NVMe leads to a high power consumption in s2idle sleep (2-3 watts). This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with a lower power consumption, typically around 0.5 watts. Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com> Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Cc: <stable@vger.kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nvme: move stopping keep-alive into nvme_uninit_ctrl()Ming Lei2024-08-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a54a93d0e3599b05856971734e15418ac551a14c ] Commit 4733b65d82bd ("nvme: start keep-alive after admin queue setup") moves starting keep-alive from nvme_start_ctrl() into nvme_init_ctrl_finish(), but don't move stopping keep-alive into nvme_uninit_ctrl(), so keep-alive work can be started and keep pending after failing to start controller, finally use-after-free is triggered if nvme host driver is unloaded. This patch fixes kernel panic when running nvme/004 in case that connection failure is triggered, by moving stopping keep-alive into nvme_uninit_ctrl(). This way is reasonable because keep-alive is now started in nvme_init_ctrl_finish(). Fixes: 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position") Cc: Hannes Reinecke <hare@suse.de> Cc: Mark O'Donovan <shiftee@posteo.net> Reported-by: Changhui Zhong <czhong@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme/pci: Add APST quirk for Lenovo N60z laptopWangYuli2024-08-191-0/+7
| | | | | | | | | | | | | | | commit ab091ec536cb7b271983c0c063b17f62f3591583 upstream. There is a hardware power-saving problem with the Lenovo N60z board. When turn it on and leave it for 10 hours, there is a 20% chance that a nvme disk will not wake up until reboot. Link: https://lore.kernel.org/all/2B5581C46AC6E335+9c7a81f1-05fb-4fd0-9fbb-108757c21628@uniontech.com Signed-off-by: hmy <huanglin@uniontech.com> Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* block: change rq_integrity_vec to respect the iteratorMikulas Patocka2024-08-141-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit cf546dd289e0f6d2594c25e2fb4e19ee67c6d988 ] If we allocate a bio that is larger than NVMe maximum request size, attach integrity metadata to it and send it to the NVMe subsystem, the integrity metadata will be corrupted. Splitting the bio works correctly. The function bio_split will clone the bio, trim the iterator of the first bio and advance the iterator of the second bio. However, the function rq_integrity_vec has a bug - it returns the first vector of the bio's metadata and completely disregards the metadata iterator that was advanced when the bio was split. Thus, the second bio uses the same metadata as the first bio and this leads to metadata corruption. This commit changes rq_integrity_vec, so that it calls mp_bvec_iter_bvec instead of returning the first vector. mp_bvec_iter_bvec reads the iterator and uses it to build a bvec for the current position in the iterator. The "queue_max_integrity_segments(rq->q) > 1" check was removed, because the updated rq_integrity_vec function works correctly with multiple segments. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/49d1afaa-f934-6ed2-a678-e0d428c63a65@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme: apple: fix device reference countingKeith Busch2024-08-141-5/+22
| | | | | | | | | | | | | | [ Upstream commit b9ecbfa45516182cd062fecd286db7907ba84210 ] Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl. Split the allocation side out to make the error handling boundary easier to navigate. The apple driver had been doing this wrong, leaking the controller device memory on a tagset failure. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-pci: add missing condition check for existence of mapped dataLeon Romanovsky2024-08-031-1/+2
| | | | | | | | | | | | | | [ Upstream commit c31fad1470389666ac7169fe43aa65bf5b7e2cfd ] nvme_map_data() is called when request has physical segments, hence the nvme_unmap_data() should have same condition to avoid dereference. Fixes: 4aedb705437f ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvme-pci: Fix the instructions for disabling power managementBart Van Assche2024-08-031-1/+1
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 92fc2c469eb26060384e9b2cd4cb0cc228aba582 ] pcie_aspm=off tells the kernel not to modify the ASPM configuration. This setting does not guarantee that ASPM (Active State Power Management) is disabled. Hence add pcie_port_pm=off. This disables power management for all PCIe ports. This patch has been tested on a workstation with a Samsung SSD 970 EVO Plus NVMe SSD. Fixes: 4641a8e6e145 ("nvme-pci: add trouble shooting steps for timeouts") Cc: Keith Busch <kbusch@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvmet-auth: fix nvmet_auth hash error handlingGaosheng Cui2024-08-031-6/+8
| | | | | | | | | | | | | | | [ Upstream commit 89f58f96d1e2357601c092d85b40a2109cf25ef3 ] If we fail to call nvme_auth_augmented_challenge, or fail to kmalloc for shash, we should free the memory allocation for challenge, so add err path out_free_challenge to fix the memory leak. Fixes: 7a277c37d352 ("nvmet-auth: Diffie-Hellman key exchange support") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nvmet-fc: Remove __counted_by from nvmet_fc_tgt_queue.fod[]Nathan Chancellor2024-06-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Work for __counted_by on generic pointers in structures (not just flexible array members) has started landing in Clang 19 (current tip of tree). During the development of this feature, a restriction was added to __counted_by to prevent the flexible array member's element type from including a flexible array member itself such as: struct foo { int count; char buf[]; }; struct bar { int count; struct foo data[] __counted_by(count); }; because the size of data cannot be calculated with the standard array size formula: sizeof(struct foo) * count This restriction was downgraded to a warning but due to CONFIG_WERROR, it can still break the build. The application of __counted_by on the fod member of 'struct nvmet_fc_tgt_queue' triggers this restriction, resulting in: drivers/nvme/target/fc.c:151:2: error: 'counted_by' should not be applied to an array with element of unknown size because 'struct nvmet_fc_fcp_iod' is a struct type with a flexible array member. This will be an error in a future compiler version [-Werror,-Wbounds-safety-counted-by-elt-type-unknown-size] 151 | struct nvmet_fc_fcp_iod fod[] __counted_by(sqsize); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. Remove this use of __counted_by to fix the warning/error. However, rather than remove it altogether, leave it commented, as it may be possible to support this in future compiler releases. Cc: stable@vger.kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/2027 Fixes: ccd3129aca28 ("nvmet-fc: Annotate struct nvmet_fc_tgt_queue with __counted_by") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvmet: make 'tsas' attribute idempotent for RDMAHannes Reinecke2024-06-211-9/+30
| | | | | | | | | | | | The RDMA transport defines values for TSAS, but it cannot be changed as we only support the 'connected' mode. So to avoid errors during reconfiguration we should allow to write the current value. Fixes: 3f123494db72 ("nvmet: make TCP sectype settable via configfs") Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme-apple: add missing MODULE_DESCRIPTION()Jeff Johnson2024-06-201-0/+1
| | | | | | | | | | | | | make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-apple.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Reviewed-by: Eric Curtin <ecurtin@redhat.com> Reviewed-by: Sven Peter <sven@svenpeter.dev> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvmet: do not return 'reserved' for empty TSAS valuesHannes Reinecke2024-06-171-1/+1
| | | | | | | | | | | | | The 'TSAS' value is only defined for TCP and RDMA, but returning 'reserved' for undefined values tricked nvmetcli to try to write 'reserved' when restoring from a config file. This caused an error and the configuration would not be applied. Fixes: 3f123494db72 ("nvmet: make TCP sectype settable via configfs") Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme: fix NVME_NS_DEAC may incorrectly identifying the disk as EXT_LBA.Boyang Yu2024-06-171-1/+1
| | | | | | | | | | | The value of NVME_NS_DEAC is 3, which means NVME_NS_METADATA_SUPPORTED | NVME_NS_EXT_LBAS. Provide a unique value for this feature flag. Fixes 1b96f862eccc ("nvme: implement the DEAC bit for the Write Zeroes command") Signed-off-by: Boyang Yu <yuboyang@dapustor.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
* Merge tag 'nvme-6.10-2024-06-13' of git://git.infradead.org/nvme into block-6.10Jens Axboe2024-06-135-16/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe fixes from Keith: "nvme fixes for Linux 6.10 - Discard double free on error conditions (Chunguang) - Target Fixes (Daniel) - Namespace detachment regression fix (Keith)" * tag 'nvme-6.10-2024-06-13' of git://git.infradead.org/nvme: nvme: fix namespace removal list nvmet: always initialize cqe.result nvmet-passthru: propagate status from id override functions nvme: avoid double free special payload
| * nvme: fix namespace removal listKeith Busch2024-06-131-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | This function wants to move a subset of a list from one element to the tail into another list. It also needs to use the srcu synchronize instead of the regular rcu version. Do this one element at a time because that's the only to do it. Fixes: be647e2c76b27f4 ("nvme: use srcu for iterating namespace list") Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvmet: always initialize cqe.resultDaniel Wagner2024-06-123-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The spec doesn't mandate that the first two double words (aka results) for the command queue entry need to be set to 0 when they are not used (not specified). Though, the target implemention returns 0 for TCP and FC but not for RDMA. Let's make RDMA behave the same and thus explicitly initializing the result field. This prevents leaking any data from the stack. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvmet-passthru: propagate status from id override functionsDaniel Wagner2024-06-121-3/+3
| | | | | | | | | | | | | | | | | | | | | | The id override functions return a status which is not propagated to the caller. Fixes: c1fef73f793b ("nvmet: add passthru code to process commands") Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: avoid double free special payloadChunguang Xu2024-06-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | If a discard request needs to be retried, and that retry may fail before a new special payload is added, a double free will result. Clear the RQF_SPECIAL_LOAD when the request is cleaned. Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
* | block: unmap and free user mapped integrity via submitterAnuj Gupta2024-06-121-4/+11
|/ | | | | | | | | | | | | | | The user mapped intergity is copied back and unpinned by bio_integrity_free which is a low-level routine. Do it via the submitter rather than doing it in the low-level block layer code, to split the submitter side from the consumer side of the bio. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240610111144.14647-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: fix nvme_pr_* status code parsingWeiwen Hu2024-05-311-1/+1
| | | | | | | | Fix the parsing if extra status bits (e.g. MORE) is present. Fixes: 7fb42780d06c ("nvme: Convert NVMe errors to PR errors") Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme-fabrics: use reserved tag for reg read/write commandChunguang Xu2024-05-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | In some scenarios, if too many commands are issued by nvme command in the same time by user tasks, this may exhaust all tags of admin_q. If a reset (nvme reset or IO timeout) occurs before these commands finish, reconnect routine may fail to update nvme regs due to insufficient tags, which will cause kernel hang forever. In order to workaround this issue, maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved tags. This maybe safe for nvmf: 1. For the disable ctrl path, we will not issue connect command 2. For the enable ctrl / fw activate path, since connect and reg_xx() are called serially. So the reserved tags may still be enough while reg_xx() use reserved tags. Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvmet: fix a possible leak when destroy a ctrl during qp establishmentSagi Grimberg2024-05-281-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In nvmet_sq_destroy we capture sq->ctrl early and if it is non-NULL we know that a ctrl was allocated (in the admin connect request handler) and we need to release pending AERs, clear ctrl->sqs and sq->ctrl (for nvme-loop primarily), and drop the final reference on the ctrl. However, a small window is possible where nvmet_sq_destroy starts (as a result of the client giving up and disconnecting) concurrently with the nvme admin connect cmd (which may be in an early stage). But *before* kill_and_confirm of sq->ref (i.e. the admin connect managed to get an sq live reference). In this case, sq->ctrl was allocated however after it was captured in a local variable in nvmet_sq_destroy. This prevented the final reference drop on the ctrl. Solve this by re-capturing the sq->ctrl after all inflight request has completed, where for sure sq->ctrl reference is final, and move forward based on that. This issue was observed in an environment with many hosts connecting multiple ctrls simoutanuosly, creating a delay in allocating a ctrl leading up to this race window. Reported-by: Alex Turin <alex@vastdata.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme: use srcu for iterating namespace listKeith Busch2024-05-284-56/+83
| | | | | | | | | | | | | | | | | | | | | | | | The nvme pci driver synchronizes with all the namespace queues during a reset to ensure that there's no pending timeout work. Meanwhile the timeout work potentially iterates those same namespaces to freeze their queues. Each of those namespace iterations use the same read lock. If a write lock should somehow get between the synchronize and freeze steps, then forward progress is deadlocked. We had been relying on the nvme controller state machine to ensure the reset work wouldn't conflict with timeout work. That guarantee may be a bit fragile to rely on, so iterate the namespace lists without taking potentially circular locks, as reported by lockdep. Link: https://lore.kernel.org/all/20220930001943.zdbvolc3gkekfmcv@shindev/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offsetKundan Kumar2024-05-241-1/+2
| | | | | | | | | | | | | | | | | bio_vec start offset may be relatively large particularly when large folio gets added to the bio. A bigger offset will result in avoiding the single-segment mapping optimization and end up using expensive mempool_alloc further. Rather than using absolute value, adjust bv_offset by NVME_CTRL_PAGE_SIZE while checking if segment can be fitted into one/two PRP entries. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme: remove sgs and swsKanchan Joshi2024-05-241-2/+0
| | | | | | | | | sgs/sws are unused, so remove these from nvme_ns_head structure. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvmet: fix ns enable/disable possible hangSagi Grimberg2024-05-231-0/+8
| | | | | | | | | | | | | | | | | | | | | | When disabling an nvmet namespace, there is a period where the subsys->lock is released, as the ns disable waits for backend IO to complete, and the ns percpu ref to be properly killed. The original intent was to avoid taking the subsystem lock for a prolong period as other processes may need to acquire it (for example new incoming connections). However, it opens up a window where another process may come in and enable the ns, (re)intiailizing the ns percpu_ref, causing the disable sequence to hang. Solve this by taking the global nvmet_config_sem over the entire configfs enable/disable sequence. Fixes: a07b4970f464 ("nvmet: add a generic NVMe target") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme-multipath: fix io accounting on failoverKeith Busch2024-05-233-2/+4
| | | | | | | | | | | There are io stats accounting that needs to be handled, so don't call blk_mq_end_request() directly. Use the existing nvme_end_req() helper that already handles everything. Fixes: d4d957b53d91ee ("nvme-multipath: support io stats on the mpath device") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme: fix multipath batched completion accountingKeith Busch2024-05-231-5/+10
| | | | | | | | | | | | | Batched completions were missing the io stats accounting and bio trace events. Move the common code to a helper and call it from the batched and non-batched functions. Fixes: d4d957b53d91ee ("nvme-multipath: support io stats on the mpath device") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* nvme-multipath: find NUMA path only for online numa-nodeNilay Shroff2024-05-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | In current native multipath design when a shared namespace is created, we loop through each possible numa-node, calculate the NUMA distance of that node from each nvme controller and then cache the optimal IO path for future reference while sending IO. The issue with this design is that we may refer to the NUMA distance table for an offline node which may not be populated at the time and so we may inadvertently end up finding and caching a non-optimal path for IO. Then latter when the corresponding numa-node becomes online and hence the NUMA distance table entry for that node is created, ideally we should re-calculate the multipath node distance for the newly added node however that doesn't happen unless we rescan/reset the controller. So essentially, we may keep using non-optimal IO path for a node which is made online after namespace is created. This patch helps fix this issue ensuring that when a shared namespace is created, we calculate the multipath node distance for each online numa-node instead of each possible numa-node. Then latter when a node becomes online and we receive any IO on that newly added node, we would calculate the multipath node distance for newly added node but this time NUMA distance table would have been already populated for newly added node. Hence we would be able to correctly calculate the multipath node distance and choose the optimal path for the IO. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* Merge tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme into block-6.10Jens Axboe2024-05-1414-111/+141
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe updates and fixes from Keith: "nvme updates for Linux 6.10 - Fabrics connection retries (Daniel, Hannes) - Fabrics logging enhancements (Tokunori) - RDMA delete optimization (Sagi)" * tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme: nvme-rdma, nvme-tcp: include max reconnects for reconnect logging nvmet-rdma: Avoid o(n^2) loop in delete_ctrl nvme: do not retry authentication failures nvme-fabrics: short-circuit reconnect retries nvme: return kernel error codes for admin queue connect nvmet: return DHCHAP status codes from nvmet_setup_auth() nvmet: lock config semaphore when accessing DH-HMAC-CHAP key
| * nvme-rdma, nvme-tcp: include max reconnects for reconnect loggingTokunori Ikegami2024-05-072-6/+6
| | | | | | | | | | | | | | Makes clear max reconnects translated by ctrl loss tmo and reconnect delay. Signed-off-by: Tokunori Ikegami <ikegami.t@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvmet-rdma: Avoid o(n^2) loop in delete_ctrlSagi Grimberg2024-05-071-10/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When deleting a nvmet-rdma ctrl, we essentially loop over all queues that belong to the controller and schedule a removal of each. Instead of restarting the loop every time a queue is found, do a simple safe list traversal. This addresses an unneeded time spent scheduling queue removal in cases there a lot of queues. Signed-off-by: Sagi Grimberg <sagi.grimberg@vastdata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: do not retry authentication failuresDaniel Wagner2024-05-012-3/+9
| | | | | | | | | | | | | | | | When the key is invalid there is no point in retrying. Because the auth code returns kernel error codes only, we can't test on the DNR bit. Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme-fabrics: short-circuit reconnect retriesHannes Reinecke2024-05-015-20/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | Returning a nvme status from nvme_tcp_setup_ctrl() indicates that the association was established and we have received a status from the controller; consequently we should honour the DNR bit. If not any future reconnect attempts will just return the same error, so we can short-circuit the reconnect attempts and fail the connection directly. Signed-off-by: Hannes Reinecke <hare@suse.de> [dwagner: - extended nvme_should_reconnect] Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: return kernel error codes for admin queue connectHannes Reinecke2024-05-013-22/+17
| | | | | | | | | | | | | | | | | | | | | | | | nvmf_connect_admin_queue returns NVMe error status codes and kernel error codes. This mixes the different domains which makes maintainability difficult. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvmet: return DHCHAP status codes from nvmet_setup_auth()Hannes Reinecke2024-05-014-45/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A failure in nvmet_setup_auth() does not mean that the NVMe authentication command failed, so we should rather return a protocol error with a 'failure1' response than an NVMe status. Also update the type used for dhchap_step and dhchap_status to u8 to avoid confusions with nvme status. Furthermore, split dhchap_status and nvme status so we don't accidentally mix these return values. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hannes Reinecke <hare@suse.de> [dwagner: - use u8 as type for dhchap_{step|status} - separate nvme status from dhcap_status] Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvmet: lock config semaphore when accessing DH-HMAC-CHAP keyHannes Reinecke2024-05-012-5/+19
| | | | | | | | | | | | | | | | | | | | | | When the DH-HMAC-CHAP key is accessed via configfs we need to take the config semaphore as a reconnect might be running at the same time. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* | Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linuxLinus Torvalds2024-05-132-8/+4
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...