summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
...
| * RDMA/mlx5: Set correct kernel-doc identifierLeon Romanovsky2021-03-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The W=1 allmodconfig build produces the following warning: drivers/infiniband/hw/mlx5/odp.c:1086: warning: wrong kernel-doc identifier on line: * Parse a series of data segments for page fault handling. Fix it by changing /** to be /* as it is written in kernel-doc documentation. Fixes: 5e769e444d26 ("RDMA/hw/mlx5/odp: Fix formatting and add missing descriptions in 'pagefault_data_segments()'") Link: https://lore.kernel.org/r/20210302074214.1054299-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
| * IB/mlx5: Add missing error codeYueHaibing2021-03-011-1/+3
| | | | | | | | | | | | | | | | | | | | Set err to -ENOMEM if kzalloc fails instead of 0. Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX") Link: https://lore.kernel.org/r/20210222122343.19720-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
| * RDMA/rxe: Fix missing kconfig dependency on CRYPTOJulian Braha2021-03-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When RDMA_RXE is enabled and CRYPTO is disabled, Kbuild gives the following warning: WARNING: unmet direct dependencies detected for CRYPTO_CRC32 Depends on [n]: CRYPTO [=n] Selected by [y]: - RDMA_RXE [=y] && (INFINIBAND_USER_ACCESS [=y] || !INFINIBAND_USER_ACCESS [=y]) && INET [=y] && PCI [=y] && INFINIBAND [=y] && INFINIBAND_VIRT_DMA [=y] This is because RDMA_RXE selects CRYPTO_CRC32, without depending on or selecting CRYPTO, despite that config option being subordinate to CRYPTO. Fixes: cee2688e3cd6 ("IB/rxe: Offload CRC calculation when possible") Signed-off-by: Julian Braha <julianbraha@gmail.com> Link: https://lore.kernel.org/r/21525878.NYvzQUHefP@ubuntu-mate-laptop Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
| * RDMA/cm: Fix IRQ restore in ib_send_cm_sidr_repSaeed Mahameed2021-03-011-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ib_send_cm_sidr_rep() { spin_lock_irqsave() cm_send_sidr_rep_locked() { ... spin_lock_irq() .... spin_unlock_irq() <--- this will enable interrupts } spin_unlock_irqrestore() } spin_unlock_irqrestore() expects interrupts to be disabled but the internal spin_unlock_irq() will always enable hard interrupts. Fix this by replacing the internal spin_{lock,unlock}_irq() with irqsave/restore variants. It fixes the following kernel trace: raw_local_irq_restore() called with IRQs enabled WARNING: CPU: 2 PID: 20001 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x1d/0x20 Call Trace: _raw_spin_unlock_irqrestore+0x4e/0x50 ib_send_cm_sidr_rep+0x3a/0x50 [ib_cm] cma_send_sidr_rep+0xa1/0x160 [rdma_cm] rdma_accept+0x25e/0x350 [rdma_cm] ucma_accept+0x132/0x1cc [rdma_ucm] ucma_write+0xbf/0x140 [rdma_ucm] vfs_write+0xc1/0x340 ksys_write+0xb3/0xe0 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: 87c4c774cbef ("RDMA/cm: Protect access to remote_sidr_table") Link: https://lore.kernel.org/r/20210301081844.445823-1-leon@kernel.org Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
* | Merge tag 'gcc-plugins-v5.12-rc2' of ↵Linus Torvalds2021-03-052-3/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull gcc-plugins fixes from Kees Cook: "Tiny gcc-plugin fixes for v5.12-rc2. These issues are small but have been reported a couple times now by static analyzers, so best to get them fixed to reduce the noise. :) - Fix coding style issues (Jason Yan)" * tag 'gcc-plugins-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: latent_entropy: remove unneeded semicolon gcc-plugins: structleak: remove unneeded variable 'ret'
| * | gcc-plugins: latent_entropy: remove unneeded semicolonJason Yan2021-03-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following coccicheck warning: scripts/gcc-plugins/latent_entropy_plugin.c:539:2-3: Unneeded semicolon Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200418070521.10931-1-yanaijie@huawei.com
| * | gcc-plugins: structleak: remove unneeded variable 'ret'Jason Yan2021-03-011-2/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | Fix the following coccicheck warning: scripts/gcc-plugins/structleak_plugin.c:177:14-17: Unneeded variable: "ret". Return "0" on line 207 Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200418070505.10715-1-yanaijie@huawei.com
* | Merge tag 'pstore-v5.12-rc2' of ↵Linus Torvalds2021-03-052-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore fixes from Kees Cook: - Rate-limit ECC warnings (Dmitry Osipenko) - Fix error path check for NULL (Tetsuo Handa) * tag 'pstore-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Rate-limit "uncorrectable error in header" message pstore: Fix warning in pstore_kill_sb()
| * | pstore/ram: Rate-limit "uncorrectable error in header" messageDmitry Osipenko2021-03-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a quite huge "uncorrectable error in header" flood in KMSG on a clean system boot since there is no pstore buffer saved in RAM. Let's silence the redundant noisy messages by rate-limiting the printk message. Now there are maximum 10 messages printed repeatedly instead of 35+. Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210302095850.30894-1-digetx@gmail.com
| * | pstore: Fix warning in pstore_kill_sb()Tetsuo Handa2021-02-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot is hitting WARN_ON(pstore_sb != sb) at pstore_kill_sb() [1], for the assumption that pstore_sb != NULL is wrong because pstore_fill_super() will not assign pstore_sb = sb when new_inode() for d_make_root() returned NULL (due to memory allocation fault injection). Since mount_single() calls pstore_kill_sb() when pstore_fill_super() failed, pstore_kill_sb() needs to be aware of such failure path. [1] https://syzkaller.appspot.com/bug?id=6abacb8da5137cb47a416f2bef95719ed60508a0 Reported-by: syzbot <syzbot+d0cf0ad6513e9a1da5df@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210214031307.57903-1-penguin-kernel@I-love.SAKURA.ne.jp
* | | Merge tag 'for-5.12/dm-fixes' of ↵Linus Torvalds2021-03-052-11/+16
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Fix DM verity target's optional Forward Error Correction (FEC) for Reed-Solomon roots that are unaligned to block size" * tag 'for-5.12/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity: fix FEC for RS roots unaligned to block size dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size
| * | | dm verity: fix FEC for RS roots unaligned to block sizeMilan Broz2021-03-041-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optional Forward Error Correction (FEC) code in dm-verity uses Reed-Solomon code and should support roots from 2 to 24. The error correction parity bytes (of roots lengths per RS block) are stored on a separate device in sequence without any padding. Currently, to access FEC device, the dm-verity-fec code uses dm-bufio client with block size set to verity data block (usually 4096 or 512 bytes). Because this block size is not divisible by some (most!) of the roots supported lengths, data repair cannot work for partially stored parity bytes. This fix changes FEC device dm-bufio block size to "roots << SECTOR_SHIFT" where we can be sure that the full parity data is always available. (There cannot be partial FEC blocks because parity must cover whole sectors.) Because the optional FEC starting offset could be unaligned to this new block size, we have to use dm_bufio_set_sector_offset() to configure it. The problem is easily reproduced using veritysetup, e.g. for roots=13: # create verity device with RS FEC dd if=/dev/urandom of=data.img bs=4096 count=8 status=none veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash # create an erasure that should be always repairable with this roots setting dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none # try to read it through dm-verity veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash) dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer # wait for possible recursive recovery in kernel udevadm settle veritysetup close test With this fix, errors are properly repaired. device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors ... Without it, FEC code usually ends on unrecoverable failure in RS decoder: device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74 ... This problem is present in all kernels since the FEC code's introduction (kernel 4.5). It is thought that this problem is not visible in Android ecosystem because it always uses a default RS roots=2. Depends-on: a14e5ec66a7a ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size") Signed-off-by: Milan Broz <gmazyland@gmail.com> Tested-by: Jérôme Carretero <cJ-ko@zougloub.eu> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Cc: stable@vger.kernel.org # 4.5+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | dm bufio: subtract the number of initial sectors in dm_bufio_get_device_sizeMikulas Patocka2021-03-041-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dm_bufio_get_device_size returns the device size in blocks. Before returning the value, we must subtract the nubmer of starting sectors. The number of starting sectors may not be divisible by block size. Note that currently, no target is using dm_bufio_set_sector_offset and dm_bufio_get_device_size simultaneously, so this change has no effect. However, an upcoming dm-verity-fec fix needs this change. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | | | Merge tag 'block-5.12-2021-03-05' of git://git.kernel.dk/linux-blockLinus Torvalds2021-03-0516-69/+75
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: - NVMe fixes: - more device quirks (Julian Einwag, Zoltán Böszörményi, Pascal Terjan) - fix a hwmon error return (Daniel Wagner) - fix the keep alive timeout initialization (Martin George) - ensure the model_number can't be changed on a used subsystem (Max Gurtovoy) - rsxx missing -EFAULT on copy_to_user() failure (Dan) - rsxx remove unused linux.h include (Tian) - kill unused RQF_SORTED (Jean) - updated outdated BFQ comments (Joseph) - revert work-around commit for bd_size_lock, since we removed the offending user in this merge window (Damien) * tag 'block-5.12-2021-03-05' of git://git.kernel.dk/linux-block: nvmet: model_number must be immutable once set nvme-fabrics: fix kato initialization nvme-hwmon: Return error code when registration fails nvme-pci: add quirks for Lexar 256GB SSD nvme-pci: mark Kingston SKC2000 as not supporting the deepest power state nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST. rsxx: Return -EFAULT if copy_to_user() fails block/bfq: update comments and default value in docs for fifo_expire rsxx: remove unused including <linux/version.h> block: Drop leftover references to RQF_SORTED block: revert "block: fix bd_size_lock use"
| * \ \ \ Merge tag 'nvme-5.12-2021-03-05' of git://git.infradead.org/nvme into block-5.12Jens Axboe2021-03-057-47/+62
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe fixes from Christoph: "nvme fixes for 5.12: - more device quirks (Julian Einwag, Zoltán Böszörményi, Pascal Terjan) - fix a hwmon error return (Daniel Wagner) - fix the keep alive timeout initialization (Martin George) - ensure the model_number can't be changed on a used subsystem (Max Gurtovoy)" * tag 'nvme-5.12-2021-03-05' of git://git.infradead.org/nvme: nvmet: model_number must be immutable once set nvme-fabrics: fix kato initialization nvme-hwmon: Return error code when registration fails nvme-pci: add quirks for Lexar 256GB SSD nvme-pci: mark Kingston SKC2000 as not supporting the deepest power state nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST.
| | * | | | nvmet: model_number must be immutable once setMax Gurtovoy2021-03-054-45/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case we have already established connection to nvmf target, it shouldn't be allowed to change the model_number. E.g. if someone will identify ctrl and get model_number of "my_model" later on will change the model_numbel via configfs to "my_new_model" this will break the NVMe specification for "Get Log Page – Persistent Event Log" that refers to Model Number as: "This field contains the same value as reported in the Model Number field of the Identify Controller data structure, bytes 63:24." Although it doesn't mentioned explicitly that this field can't be changed, we can assume it. So allow setting this field only once: using configfs or in the first identify ctrl operation. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | nvme-fabrics: fix kato initializationMartin George2021-03-051-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently kato is initialized to NVME_DEFAULT_KATO for both discovery & i/o controllers. This is a problem specifically for non-persistent discovery controllers since it always ends up with a non-zero kato value. Fix this by initializing kato to zero instead, and ensuring various controllers are assigned appropriate kato values as follows: non-persistent controllers - kato set to zero persistent controllers - kato set to NVMF_DEV_DISC_TMO (or any positive int via nvme-cli) i/o controllers - kato set to NVME_DEFAULT_KATO (or any positive int via nvme-cli) Signed-off-by: Martin George <marting@netapp.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | nvme-hwmon: Return error code when registration failsDaniel Wagner2021-03-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hwmon pointer wont be NULL if the registration fails. Though the exit code path will assign it to ctrl->hwmon_device. Later nvme_hwmon_exit() will try to free the invalid pointer. Avoid this by returning the error code from hwmon_device_register_with_info(). Fixes: ed7770f66286 ("nvme/hwmon: rework to avoid devm allocation") Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | nvme-pci: add quirks for Lexar 256GB SSDPascal Terjan2021-03-051-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the NVME_QUIRK_NO_NS_DESC_LIST and NVME_QUIRK_IGNORE_DEV_SUBNQN quirks for this buggy device. Reported and tested in https://bugs.mageia.org/show_bug.cgi?id=28417 Signed-off-by: Pascal Terjan <pterjan@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | nvme-pci: mark Kingston SKC2000 as not supporting the deepest power stateZoltán Böszörményi2021-03-051-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My 2TB SKC2000 showed the exact same symptoms that were provided in 538e4a8c57 ("nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs"), i.e. a complete NVME lockup that needed cold boot to get it back. According to some sources, the A2000 is simply a rebadged SKC2000 with a slightly optimized firmware. Adding the SKC2000 PCI ID to the quirk list with the same workaround as the A2000 made my laptop survive a 5 hours long Yocto bootstrap buildfest which reliably triggered the SSD lockup previously. Signed-off-by: Zoltán Böszörményi <zboszor@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST.Julian Einwag2021-03-051-1/+2
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel fails to fully detect these SSDs, only the character devices are present: [ 10.785605] nvme nvme0: pci function 0000:04:00.0 [ 10.876787] nvme nvme1: pci function 0000:81:00.0 [ 13.198614] nvme nvme0: missing or invalid SUBNQN field. [ 13.198658] nvme nvme1: missing or invalid SUBNQN field. [ 13.206896] nvme nvme0: Shutdown timeout set to 20 seconds [ 13.215035] nvme nvme1: Shutdown timeout set to 20 seconds [ 13.225407] nvme nvme0: 16/0/0 default/read/poll queues [ 13.233602] nvme nvme1: 16/0/0 default/read/poll queues [ 13.239627] nvme nvme0: Identify Descriptors failed (8194) [ 13.246315] nvme nvme1: Identify Descriptors failed (8194) Adding the NVME_QUIRK_NO_NS_DESC_LIST fixes this problem. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=205679 Signed-off-by: Julian Einwag <jeinwag-nvme@marcapo.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
| * | | | rsxx: Return -EFAULT if copy_to_user() failsDan Carpenter2021-03-031-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The copy_to_user() function returns the number of bytes remaining but we want to return -EFAULT to the user if it can't complete the copy. The "st" variable only holds zero on success or negative error codes on failure so the type should be int. Fixes: 36f988e978f8 ("rsxx: Adding in debugfs entries.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block/bfq: update comments and default value in docs for fifo_expireJoseph Qi2021-03-022-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Correct the comments since bfq_fifo_expire[0] is for async request, while bfq_fifo_expire[1] is for sync request. Also update docs, according the source code, the default fifo_expire_async is 250ms, and fifo_expire_sync is 125ms. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | rsxx: remove unused including <linux/version.h>Tian Tao2021-03-021-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove including <linux/version.h> that don't need it. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block: Drop leftover references to RQF_SORTEDJean Delvare2021-03-013-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit a1ce35fa49852db60fc6e268038530be533c5b15 ("block: remove dead elevator code") removed all users of RQF_SORTED. However it is still defined, and there is one reference left to it (which in effect is dead code). Clear it all up. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block: revert "block: fix bd_size_lock use"Damien Le Moal2021-02-282-7/+4
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the removal of the skd driver, using IRQ safe locking of a bdev bd_size_lock spinlock to protect the bdev inode size is not necessary anymore as there is no other known driver using this lock under an IRQ disabled context (e.g. calling set_capacity() with IRQ disabled). Revert commit 0fe37724f8e7 ("block: fix bd_size_lock use") which introduced the IRQ safe change. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge tag 'io_uring-5.12-2021-03-05' of git://git.kernel.dk/linux-blockLinus Torvalds2021-03-056-439/+361
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: "A bit of a mix between fallout from the worker change, cleanups and reductions now possible from that change, and fixes in general. In detail: - Fully serialize manager and worker creation, fixing races due to that. - Clean up some naming that had gone stale. - SQPOLL fixes. - Fix race condition around task_work rework that went into this merge window. - Implement unshare. Used for when the original task does unshare(2) or setuid/seteuid and friends, drops the original workers and forks new ones. - Drop the only remaining piece of state shuffling we had left, which was cred. Move it into issue instead, and we can drop all of that code too. - Kill f_op->flush() usage. That was such a nasty hack that we had out of necessity, we no longer need it. - Following from ->flush() removal, we can also drop various bits of ctx state related to SQPOLL and cancelations. - Fix an issue with IOPOLL retry, which originally was fallout from a filemap change (removing iov_iter_revert()), but uncovered an issue with iovec re-import too late. - Fix an issue with system suspend. - Use xchg() for fallback work, instead of cmpxchg(). - Properly destroy io-wq on exec. - Add create_io_thread() core helper, and use that in io-wq and io_uring. This allows us to remove various silly completion events related to thread setup. - A few error handling fixes. This should be the grunt of fixes necessary for the new workers, next week should be quieter. We've got a pending series from Pavel on cancelations, and how tasks and rings are indexed. Outside of that, should just be minor fixes. Even with these fixes, we're still killing a net ~80 lines" * tag 'io_uring-5.12-2021-03-05' of git://git.kernel.dk/linux-block: (41 commits) io_uring: don't restrict issue_flags for io_openat io_uring: make SQPOLL thread parking saner io-wq: kill hashed waitqueue before manager exits io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED return io_uring: don't keep looping for more events if we can't flush overflow io_uring: move to using create_io_thread() kernel: provide create_io_thread() helper io_uring: reliably cancel linked timeouts io_uring: cancel-match based on flags io-wq: ensure all pending work is canceled on exit io_uring: ensure that threads freeze on suspend io_uring: remove extra in_idle wake up io_uring: inline __io_queue_async_work() io_uring: inline io_req_clean_work() io_uring: choose right tctx->io_wq for try cancel io_uring: fix -EAGAIN retry with IOPOLL io-wq: fix error path leak of buffered write hash map io_uring: remove sqo_task io_uring: kill sqo_dead and sqo submission halting io_uring: ignore double poll add on the same waitqueue head ...
| * | | | io_uring: don't restrict issue_flags for io_openatPavel Begunkov2021-03-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 45d189c606292 ("io_uring: replace force_nonblock with flags") did something strange for io_openat() slicing all issue_flags but IO_URING_F_NONBLOCK. Not a bug for now, but better to just forward the flags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: make SQPOLL thread parking sanerJens Axboe2021-03-051-35/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have this weird true/false return from parking, and then some of the callers decide to look at that. It can lead to unbalanced parks and sqd locking. Have the callers check the thread status once it's parked. We know we have the lock at that point, so it's either valid or it's NULL. Fix race with parking on thread exit. We need to be careful here with ordering of the sdq->lock and the IO_SQ_THREAD_SHOULD_PARK bit. Rename sqd->completion to sqd->parked to reflect that this is the only thing this completion event doesn. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io-wq: kill hashed waitqueue before manager exitsJens Axboe2021-03-051-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we race with shutting down the io-wq context and someone queueing a hashed entry, then we can exit the manager with it armed. If it then triggers after the manager has exited, we can have a use-after-free where io_wqe_hash_wake() attempts to wake a now gone manager process. Move the killing of the hashed write queue into the manager itself, so that we know we've killed it before the task exits. Fixes: e941894eae31 ("io-wq: make buffered file write hashed work map per-ctx") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED returnJens Axboe2021-03-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The callback can only be armed, if we get -EIOCBQUEUED returned. It's important that we clear the WAITQ bit for other cases, otherwise we can queue for async retry and filemap will assume that we're armed and return -EAGAIN instead of just blocking for the IO. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: don't keep looping for more events if we can't flush overflowJens Axboe2021-03-051-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It doesn't make sense to wait for more events to come in, if we can't even flush the overflow we already have to the ring. Return -EBUSY for that condition, just like we do for attempts to submit with overflow pending. Cc: stable@vger.kernel.org # 5.11 Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: move to using create_io_thread()Jens Axboe2021-03-053-109/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows us to do task creation and setup without needing to use completions to try and synchronize with the starting thread. Get rid of the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup completion events - we can now do setup before the task is running. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | kernel: provide create_io_thread() helperJens Axboe2021-03-042-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a generic helper for setting up an io_uring worker. Returns a task_struct so that the caller can do whatever setup is needed, then call wake_up_new_task() to kick it into gear. Add a kernel_clone_args member, io_thread, which tells copy_process() to mark the task with PF_IO_WORKER. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: reliably cancel linked timeoutsPavel Begunkov2021-03-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linked timeouts are fired asynchronously (i.e. soft-irq), and use generic cancellation paths to do its stuff, including poking into io-wq. The problem is that it's racy to access tctx->io_wq, as io_uring_task_cancel() and others may be happening at this exact moment. Mark linked timeouts with REQ_F_INLIFGHT for now, making sure there are no timeouts before io-wq destraction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: cancel-match based on flagsPavel Begunkov2021-03-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of going into request internals, like checking req->file->f_op, do match them based on REQ_F_INFLIGHT, it's set only when we want it to be reliably cancelled. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io-wq: ensure all pending work is canceled on exitJens Axboe2021-03-041-9/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we race on shutting down the io-wq, then we should ensure that any work that was queued after workers shutdown is canceled. Harden the add work check a bit too, checking for IO_WQ_BIT_EXIT and cancel if it's set. Add a WARN_ON() for having any work before we kill the io-wq context. Reported-by: syzbot+91b4b56ead187d35c9d3@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: ensure that threads freeze on suspendJens Axboe2021-03-042-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alex reports that his system fails to suspend using 5.12-rc1, with the following dump: [ 240.650300] PM: suspend entry (deep) [ 240.650748] Filesystems sync: 0.000 seconds [ 240.725605] Freezing user space processes ... [ 260.739483] Freezing of tasks failed after 20.013 seconds (3 tasks refusing to freeze, wq_busy=0): [ 260.739497] task:iou-mgr-446 state:S stack: 0 pid: 516 ppid: 439 flags:0x00004224 [ 260.739504] Call Trace: [ 260.739507] ? sysvec_apic_timer_interrupt+0xb/0x81 [ 260.739515] ? pick_next_task_fair+0x197/0x1cde [ 260.739519] ? sysvec_reschedule_ipi+0x2f/0x6a [ 260.739522] ? asm_sysvec_reschedule_ipi+0x12/0x20 [ 260.739525] ? __schedule+0x57/0x6d6 [ 260.739529] ? del_timer_sync+0xb9/0x115 [ 260.739533] ? schedule+0x63/0xd5 [ 260.739536] ? schedule_timeout+0x219/0x356 [ 260.739540] ? __next_timer_interrupt+0xf1/0xf1 [ 260.739544] ? io_wq_manager+0x73/0xb1 [ 260.739549] ? io_wq_create+0x262/0x262 [ 260.739553] ? ret_from_fork+0x22/0x30 [ 260.739557] task:iou-mgr-517 state:S stack: 0 pid: 522 ppid: 439 flags:0x00004224 [ 260.739561] Call Trace: [ 260.739563] ? sysvec_apic_timer_interrupt+0xb/0x81 [ 260.739566] ? pick_next_task_fair+0x16f/0x1cde [ 260.739569] ? sysvec_apic_timer_interrupt+0xb/0x81 [ 260.739571] ? asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 260.739574] ? __schedule+0x5b7/0x6d6 [ 260.739578] ? del_timer_sync+0x70/0x115 [ 260.739581] ? schedule_timeout+0x211/0x356 [ 260.739585] ? __next_timer_interrupt+0xf1/0xf1 [ 260.739588] ? io_wq_check_workers+0x15/0x11f [ 260.739592] ? io_wq_manager+0x69/0xb1 [ 260.739596] ? io_wq_create+0x262/0x262 [ 260.739600] ? ret_from_fork+0x22/0x30 [ 260.739603] task:iou-wrk-517 state:S stack: 0 pid: 523 ppid: 439 flags:0x00004224 [ 260.739607] Call Trace: [ 260.739609] ? __schedule+0x5b7/0x6d6 [ 260.739614] ? schedule+0x63/0xd5 [ 260.739617] ? schedule_timeout+0x219/0x356 [ 260.739621] ? __next_timer_interrupt+0xf1/0xf1 [ 260.739624] ? task_thread.isra.0+0x148/0x3af [ 260.739628] ? task_thread_unbound+0xa/0xa [ 260.739632] ? task_thread_bound+0x7/0x7 [ 260.739636] ? ret_from_fork+0x22/0x30 [ 260.739647] OOM killer enabled. [ 260.739648] Restarting tasks ... done. [ 260.740077] PM: suspend exit Play nice and ensure that any thread we create will call try_to_freeze() at an opportune time so that memory suspend can proceed. For the io-wq worker threads, mark them as PF_NOFREEZE. They could potentially be blocked for a long time. Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: remove extra in_idle wake upPavel Begunkov2021-03-041-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_dismantle_req() is always followed by io_put_task(), which already do proper in_idle wake ups, so we can skip waking the owner task in io_dismantle_req(). The rules are simpler now, do io_put_task() shortly after ending a request, and it will be fine. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: inline __io_queue_async_work()Pavel Begunkov2021-03-041-11/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __io_queue_async_work() is only called from io_queue_async_work(), inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: inline io_req_clean_work()Pavel Begunkov2021-03-041-17/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inline io_req_clean_work(), less code and easier to analyse tctx dependencies and refs usage. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: choose right tctx->io_wq for try cancelPavel Begunkov2021-03-041-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we cancel SQPOLL, @task in io_uring_try_cancel_requests() will differ from current. Use the right tctx from passed in @task, and don't forget that it can be NULL when the io_uring ctx exits. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: fix -EAGAIN retry with IOPOLLJens Axboe2021-03-041-5/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We no longer revert the iovec on -EIOCBQUEUED, see commit ab2125df921d, and this started causing issues for IOPOLL on devies that run out of request slots. Turns out what outside of needing a revert for those, we also had a bug where we didn't properly setup retry inside the submission path. That could cause re-import of the iovec, if any, and that could lead to spurious results if the application had those allocated on the stack. Catch -EAGAIN retry and make the iovec stable for IOPOLL, just like we do for !IOPOLL retries. Cc: <stable@vger.kernel.org> # 5.9+ Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reported-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io-wq: fix error path leak of buffered write hash mapJens Axboe2021-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'err' path should include the hash put, we already grabbed a reference once we get that far. Fixes: e941894eae31 ("io-wq: make buffered file write hashed work map per-ctx") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: remove sqo_taskPavel Begunkov2021-03-041-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now, sqo_task is used only for a warning that is not interesting anymore since sqo_dead is gone, remove all of that including ctx->sqo_task. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: kill sqo_dead and sqo submission haltingPavel Begunkov2021-03-041-37/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As SQPOLL task doesn't poke into ->sqo_task anymore, there is no need to kill the sqo when the master task exits. Before it was necessary to avoid races accessing sqo_task->files with removing them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: don't forget to enable SQPOLL before exit, if started disabled] Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: ignore double poll add on the same waitqueue headJens Axboe2021-03-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reports a deadlock, attempting to lock the same spinlock twice: ============================================ WARNING: possible recursive locking detected 5.11.0-syzkaller #0 Not tainted -------------------------------------------- swapper/1/0 is trying to acquire lock: ffff88801b2b1130 (&runtime->sleep){..-.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] ffff88801b2b1130 (&runtime->sleep){..-.}-{2:2}, at: io_poll_double_wake+0x25f/0x6a0 fs/io_uring.c:4960 but task is already holding lock: ffff88801b2b3130 (&runtime->sleep){..-.}-{2:2}, at: __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:137 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&runtime->sleep); lock(&runtime->sleep); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by swapper/1/0: #0: ffff888147474908 (&group->lock){..-.}-{2:2}, at: _snd_pcm_stream_lock_irqsave+0x9f/0xd0 sound/core/pcm_native.c:170 #1: ffff88801b2b3130 (&runtime->sleep){..-.}-{2:2}, at: __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:137 stack backtrace: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.11.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0xfa/0x151 lib/dump_stack.c:120 print_deadlock_bug kernel/locking/lockdep.c:2829 [inline] check_deadlock kernel/locking/lockdep.c:2872 [inline] validate_chain kernel/locking/lockdep.c:3661 [inline] __lock_acquire.cold+0x14c/0x3b4 kernel/locking/lockdep.c:4900 lock_acquire kernel/locking/lockdep.c:5510 [inline] lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] io_poll_double_wake+0x25f/0x6a0 fs/io_uring.c:4960 __wake_up_common+0x147/0x650 kernel/sched/wait.c:108 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:138 snd_pcm_update_state+0x46a/0x540 sound/core/pcm_lib.c:203 snd_pcm_update_hw_ptr0+0xa75/0x1a50 sound/core/pcm_lib.c:464 snd_pcm_period_elapsed+0x160/0x250 sound/core/pcm_lib.c:1805 dummy_hrtimer_callback+0x94/0x1b0 sound/drivers/dummy.c:378 __run_hrtimer kernel/time/hrtimer.c:1519 [inline] __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583 hrtimer_run_softirq+0x17b/0x360 kernel/time/hrtimer.c:1600 __do_softirq+0x29b/0x9f6 kernel/softirq.c:345 invoke_softirq kernel/softirq.c:221 [inline] __irq_exit_rcu kernel/softirq.c:422 [inline] irq_exit_rcu+0x134/0x200 kernel/softirq.c:434 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1100 </IRQ> asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632 RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline] RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:70 [inline] RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:137 [inline] RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline] RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516 Code: dd 38 6e f8 84 db 75 ac e8 54 32 6e f8 e8 0f 1c 74 f8 e9 0c 00 00 00 e8 45 32 6e f8 0f 00 2d 4e 4a c5 00 e8 39 32 6e f8 fb f4 <9c> 5b 81 e3 00 02 00 00 fa 31 ff 48 89 de e8 14 3a 6e f8 48 85 db RSP: 0018:ffffc90000d47d18 EFLAGS: 00000293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff8880115c3780 RSI: ffffffff89052537 RDI: 0000000000000000 RBP: ffff888141127064 R08: 0000000000000001 R09: 0000000000000001 R10: ffffffff81794168 R11: 0000000000000000 R12: 0000000000000001 R13: ffff888141127000 R14: ffff888141127064 R15: ffff888143331804 acpi_idle_enter+0x361/0x500 drivers/acpi/processor_idle.c:647 cpuidle_enter_state+0x1b1/0xc80 drivers/cpuidle/cpuidle.c:237 cpuidle_enter+0x4a/0xa0 drivers/cpuidle/cpuidle.c:351 call_cpuidle kernel/sched/idle.c:158 [inline] cpuidle_idle_call kernel/sched/idle.c:239 [inline] do_idle+0x3e1/0x590 kernel/sched/idle.c:300 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:397 start_secondary+0x274/0x350 arch/x86/kernel/smpboot.c:272 secondary_startup_64_no_verify+0xb0/0xbb which is due to the driver doing poll_wait() twice on the same wait_queue_head. That is perfectly valid, but from checking the rest of the kernel tree, it's the only driver that does this. We can handle this just fine, we just need to ignore the second addition as we'll get woken just fine on the first one. Cc: stable@vger.kernel.org # 5.8+ Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users") Reported-by: syzbot+28abd693db9e92c160d8@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: ensure that SQPOLL thread is started for exitJens Axboe2021-03-041-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we create it in a disabled state because IORING_SETUP_R_DISABLED is set on ring creation, we need to ensure that we've kicked the thread if we're exiting before it's been explicitly disabled. Otherwise we can run into a deadlock where exit is waiting go park the SQPOLL thread, but the SQPOLL thread itself is waiting to get a signal to start. That results in the below trace of both tasks hung, waiting on each other: INFO: task syz-executor458:8401 blocked for more than 143 seconds. Not tainted 5.11.0-next-20210226-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor458 state:D stack:27536 pid: 8401 ppid: 8400 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4324 [inline] __schedule+0x90c/0x21a0 kernel/sched/core.c:5075 schedule+0xcf/0x270 kernel/sched/core.c:5154 schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x168/0x270 kernel/sched/completion.c:138 io_sq_thread_park fs/io_uring.c:7115 [inline] io_sq_thread_park+0xd5/0x130 fs/io_uring.c:7103 io_uring_cancel_task_requests+0x24c/0xd90 fs/io_uring.c:8745 __io_uring_files_cancel+0x110/0x230 fs/io_uring.c:8840 io_uring_files_cancel include/linux/io_uring.h:47 [inline] do_exit+0x299/0x2a60 kernel/exit.c:780 do_group_exit+0x125/0x310 kernel/exit.c:922 __do_sys_exit_group kernel/exit.c:933 [inline] __se_sys_exit_group kernel/exit.c:931 [inline] __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x43e899 RSP: 002b:00007ffe89376d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00000000004af2f0 RCX: 000000000043e899 RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffffffffffffffc0 R09: 0000000010000000 R10: 0000000000008011 R11: 0000000000000246 R12: 00000000004af2f0 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 INFO: task iou-sqp-8401:8402 can't die for more than 143 seconds. task:iou-sqp-8401 state:D stack:30272 pid: 8402 ppid: 8400 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4324 [inline] __schedule+0x90c/0x21a0 kernel/sched/core.c:5075 schedule+0xcf/0x270 kernel/sched/core.c:5154 schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x168/0x270 kernel/sched/completion.c:138 io_sq_thread+0x27d/0x1ae0 fs/io_uring.c:6717 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 INFO: task iou-sqp-8401:8402 blocked for more than 143 seconds. Reported-by: syzbot+fb5458330b4442f2090d@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: replace cmpxchg in fallback with xchgPavel Begunkov2021-03-041-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_run_ctx_fallback() can use xchg() instead of cmpxchg(). It's simpler and faster. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: fix __tctx_task_work() ctx racePavel Begunkov2021-03-041-17/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an unlikely but possible race using a freed context. That's because req->task_work.func() can free a request, but we won't necessarily find a completion in submit_state.comp and so all ctx refs may be put by the time we do mutex_lock(&ctx->uring_ctx); There are several reasons why it can miss going through submit_state.comp: 1) req->task_work.func() didn't complete it itself, but punted to iowq (e.g. reissue) and it got freed later, or a similar situation with it overflowing and getting flushed by someone else, or being submitted to IRQ completion, 2) As we don't hold the uring_lock, someone else can do io_submit_flush_completions() and put our ref. 3) Bugs and code obscurities, e.g. failing to propagate issue_flags properly. One example is as follows CPU1 | CPU2 ======================================================================= @req->task_work.func() | -> @req overflwed, | so submit_state.comp,nr==0 | | flush overflows, and free @req | ctx refs == 0, free it ctx is dead, but we do | lock + flush + unlock | So take a ctx reference for each new ctx we see in __tctx_task_work(), and do release it until we do all our flushing. Fixes: 65453d1efbd2 ("io_uring: enable req cache for task_work items") Reported-by: syzbot+a157ac7c03a56397f553@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fold in my one-liner and fix ref mismatch] Signed-off-by: Jens Axboe <axboe@kernel.dk>