path: root/fs/io_uring.c
Commit message | Author | Age | Files | Lines

* io_uring: Fix current->fs handling in io_sq_wq_submit_work() | Nicolai Stange | 2021-01-30 | 1 | -2/+3

No upstream commit, this is a fix to a stable 5.4 specific backport.

The intention of backport commit cac68d12c531 ("io_uring: grab ->fs as part
of async offload") as found in the stable 5.4 tree was to make
io_sq_wq_submit_work() switch the workqueue task's ->fs over to the
submitting task's one for the duration of the IO operation. However, due to
a small logic error, this change turned out to not have any actual effect.

From a high level, the relevant code in io_sq_wq_submit_work() looks like

  old_fs_struct = current->fs;
  do {
          ...
          if (req->fs != current->fs && current->fs != old_fs_struct) {
                  task_lock(current);
                  if (req->fs)
                          current->fs = req->fs;
                  else
                          current->fs = old_fs_struct;
                  task_unlock(current);
          }
          ...
  } while (req);

The if condition is supposed to cover the case that current->fs doesn't
match what's needed for processing the request, but observe how it fails to
ever evaluate to true due to the second clause: current->fs != old_fs_struct
will be false in the first iteration as per the initialization of
old_fs_struct, and because this prevents current->fs from getting replaced,
the same follows inductively for all subsequent iterations.

Fix said if condition such that
- if req->fs is set and doesn't match current->fs, the latter will be
  switched to the former,
- or, if req->fs is unset, the switch back to the initial old_fs_struct
  will be made, if necessary.

While at it, also correct the condition for the ->fs related cleanup right
before the return of io_sq_wq_submit_work(): currently, old_fs_struct is
restored only if it's non-NULL. It is always non-NULL though, and thus the
if condition is redundant. Supposedly, the motivation had been to optimize
and avoid switching current->fs back to the initial old_fs_struct in case
it is found to have the desired value already. Make it so.

Cc: stable@vger.kernel.org # v5.4
Fixes: cac68d12c531 ("io_uring: grab ->fs as part of async offload")
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
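
A minimal sketch of what the corrected per-request switch amounts to,
derived from the description above (illustrative only; the actual patch may
arrange the branches differently):

  if (req->fs && req->fs != current->fs) {
          /* request carries its own ->fs: switch over to it */
          task_lock(current);
          current->fs = req->fs;
          task_unlock(current);
  } else if (!req->fs && current->fs != old_fs_struct) {
          /* no per-request ->fs: restore the worker's original one,
           * but only if a previous request actually replaced it */
          task_lock(current);
          current->fs = old_fs_struct;
          task_unlock(current);
  }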

* io_uring: Fix double list add in io_queue_async_work() | Muchun Song | 2020-10-14 | 1 | -5/+8

If we queue work in io_poll_wake(), it will lead to a double list add. So
we should only add to the list when the callback function is
io_sq_wq_submit_work().

The following oops was seen:

list_add double add: new=ffff9ca6a8f1b0e0, prev=ffff9ca62001cee8,
next=ffff9ca6a8f1b0e0.
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:31!
Call Trace:
 <IRQ>
 io_poll_wake+0xf3/0x230
 __wake_up_common+0x91/0x170
 __wake_up_common_lock+0x7a/0xc0
 io_commit_cqring+0xea/0x280
 ? blkcg_iolatency_done_bio+0x2b/0x610
 io_cqring_add_event+0x3e/0x60
 io_complete_rw+0x58/0x80
 dio_complete+0x106/0x250
 blk_update_request+0xa0/0x3b0
 blk_mq_end_request+0x1a/0x110
 blk_mq_complete_request+0xd0/0xe0
 nvme_irq+0x129/0x270 [nvme]
 __handle_irq_event_percpu+0x7b/0x190
 handle_irq_event_percpu+0x30/0x80
 handle_irq_event+0x3c/0x60
 handle_edge_irq+0x91/0x1e0
 do_IRQ+0x4d/0xd0
 common_interrupt+0xf/0xf

Fixes: 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit")
Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
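
A sketch of the guard this implies in io_queue_async_work(), using the
task_list plumbing named in the related backport commits (field names
assumed, not verified against the exact patch):

  /* Only requests that will run through io_sq_wq_submit_work() are
   * tracked on ctx->task_list; poll wakeups must not re-add them. */
  if (req->work.func == io_sq_wq_submit_work) {
          spin_lock_irqsave(&ctx->task_lock, flags);
          list_add(&req->task_list, &ctx->task_list);
          req->work_task = NULL;
          spin_unlock_irqrestore(&ctx->task_lock, flags);
  }
  queue_work(ctx->sqo_wq, &req->work);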

* io_uring: Fix removing irrelevant req from the task_list | Muchun Song | 2020-10-14 | 1 | -11/+10

Suppose process 0 has completed io_uring initialization and then forks
process 1. If process 1 exits, all reqs are deleted from the task_list. If
we then kill process 0, we will not send a SIGINT signal to the kworker, so
we cannot remove the req from the task_list there; io_sq_wq_submit_work()
can do that for us.

Fixes: 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* io_uring: Fix missing smp_mb() in io_cancel_async_work() | Muchun Song | 2020-10-14 | 1 | -1/+15

The store to req->flags and the load of req->work_task must not be
reordered in io_cancel_async_work(). We have to make sure that either we
store the REQ_F_CANCEL flag to req->flags, or we see req->work_task set in
io_sq_wq_submit_work().

Fixes: 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
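
The required pairing, sketched with the field names used by the 5.4
backport (illustrative; the real patch may place the barriers slightly
differently):

  /* canceling side, io_cancel_async_work() */
  req->flags |= REQ_F_CANCEL;
  smp_mb();                       /* pairs with the barrier below */
  if (req->work_task)
          send_sig(SIGINT, req->work_task, 1);

  /* worker side, io_sq_wq_submit_work() */
  req->work_task = current;
  smp_mb();                       /* order the store vs. the flag load */
  if (req->flags & REQ_F_CANCEL)
          goto done;              /* canceled before we got to run */

Whichever side runs second is guaranteed to observe the other side's store,
so a cancelation can no longer slip between the two checks.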

* io_uring: Fix resource leak when killing the process | Yinyin Zhu | 2020-10-14 | 1 | -5/+7

Commit 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on
exit") doesn't solve the resource leak problem completely. When a kworker
is doing an io task for io_uring and the process which submitted the task
receives a SIGKILL signal from the user, io_cancel_async_work() could have
sent a SIGINT signal to the kworker, but the judging condition is wrong, so
it doesn't send the SIGINT signal, which causes the resource leak.

Why is the judging condition wrong? The process is multi-threaded; call the
thread which submitted the io task Thread1. req->task is then the current
macro of Thread1. When all the threads of the process go through the exit
procedure, the last thread calls io_cancel_async_work(), but that last
thread may not be Thread1, so the task comparison fails and no SIGINT
signal is sent.

To fix this bug, we replace the task attribute of the req with a struct
files_struct, and check the files instead.

Fixes: 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit")
Signed-off-by: Yinyin Zhu <zhuyinyin@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
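
A sketch of the changed cancelation check (hypothetical field names,
following the description above):

  /* before: only the exact submitting thread could match */
  if (req->task == current)
          send_sig(SIGINT, req->work_task, 1);

  /* after: any thread of the submitting process matches, since all
   * of its threads share the same files_struct */
  if (req->files == files)
          send_sig(SIGINT, req->work_task, 1);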

* io_uring: Fix NULL pointer dereference in io_sq_wq_submit_work() | Xin Yin | 2020-09-03 | 1 | -0/+9

Commit 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on
exit") caused a crash in io_sq_wq_submit_work(): when io_ring-wq gets a req
from the async_list which has not been added to the task_list, trying to
delete that req from the task_list causes a NULL pointer dereference.
Ensure we add a req to the async_list and the task_list at the same time.

The crash log looks like this:

[95995.973638] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[95995.979123] pgd = c20c8964
[95995.981803] [00000000] *pgd=1c72d831, *pte=00000000, *ppte=00000000
[95995.988043] Internal error: Oops: 817 [#1] SMP ARM
[95995.992814] Modules linked in: bpfilter(-)
[95995.996898] CPU: 1 PID: 15661 Comm: kworker/u8:5 Not tainted 5.4.56 #2
[95996.003406] Hardware name: Amlogic Meson platform
[95996.008108] Workqueue: io_ring-wq io_sq_wq_submit_work
[95996.013224] PC is at io_sq_wq_submit_work+0x1f4/0x5c4
[95996.018261] LR is at walk_stackframe+0x24/0x40
[95996.022685] pc : [<c059b898>] lr : [<c030da7c>] psr: 600f0093
[95996.028936] sp : dc6f7e88 ip : dc6f7df0 fp : dc6f7ef4
[95996.034148] r10: deff9800 r9 : dc1d1694 r8 : dda58b80
[95996.039358] r7 : dc6f6000 r6 : dc6f7ebc r5 : dc1d1600 r4 : deff99c0
[95996.045871] r3 : 0000cb5d r2 : 00000000 r1 : ef6b9b80 r0 : c059b88c
[95996.052385] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
[95996.059593] Control: 10c5387d Table: 22be804a DAC: 00000055
[95996.065325] Process kworker/u8:5 (pid: 15661, stack limit = 0x78013c69)
[95996.071923] Stack: (0xdc6f7e88 to 0xdc6f8000)
[95996.076268] 7e80: dc6f7ecc dc6f7e98 00000000 c1f06c08 de9dc800 deff9a04
[95996.084431] 7ea0: 00000000 dc6f7f7c 00000000 c1f65808 0000080c dc677a00 c1ee9bd0 dc6f7ebc
[95996.092594] 7ec0: dc6f7ebc d085c8f6 c0445a90 dc1d1e00 e008f300 c0288400 e4ef7100 00000000
[95996.100757] 7ee0: c20d45b0 e4ef7115 dc6f7f34 dc6f7ef8 c03725f0 c059b6b0 c0288400 c0288400
[95996.108921] 7f00: c0288400 00000001 c0288418 e008f300 c0288400 e008f314 00000088 c0288418
[95996.117083] 7f20: c1f03d00 dc6f6038 dc6f7f7c dc6f7f38 c0372df8 c037246c dc6f7f5c 00000000
[95996.125245] 7f40: c1f03d00 c1f03d00 c20d3cbe c0288400 dc6f7f7c e1c43880 e4fa7980 00000000
[95996.133409] 7f60: e008f300 c0372d9c e48bbe74 e1c4389c dc6f7fac dc6f7f80 c0379244 c0372da8
[95996.141570] 7f80: 600f0093 e4fa7980 c0379108 00000000 00000000 00000000 00000000 00000000
[95996.149734] 7fa0: 00000000 dc6f7fb0 c03010ac c0379114 00000000 00000000 00000000 00000000
[95996.157897] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[95996.166060] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[95996.174217] Backtrace:
[95996.176662] [<c059b6a4>] (io_sq_wq_submit_work) from [<c03725f0>] (process_one_work+0x190/0x4c0)
[95996.185425] r10:e4ef7115 r9:c20d45b0 r8:00000000 r7:e4ef7100 r6:c0288400 r5:e008f300
[95996.193237] r4:dc1d1e00
[95996.195760] [<c0372460>] (process_one_work) from [<c0372df8>] (worker_thread+0x5c/0x5bc)
[95996.203836] r10:dc6f6038 r9:c1f03d00 r8:c0288418 r7:00000088 r6:e008f314 r5:c0288400
[95996.211647] r4:e008f300
[95996.214173] [<c0372d9c>] (worker_thread) from [<c0379244>] (kthread+0x13c/0x168)
[95996.221554] r10:e1c4389c r9:e48bbe74 r8:c0372d9c r7:e008f300 r6:00000000 r5:e4fa7980
[95996.229363] r4:e1c43880
[95996.231888] [<c0379108>] (kthread) from [<c03010ac>] (ret_from_fork+0x14/0x28)
[95996.239088] Exception stack(0xdc6f7fb0 to 0xdc6f7ff8)
[95996.244127] 7fa0: 00000000 00000000 00000000 00000000
[95996.252291] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[95996.260453] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[95996.267054] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0379108
[95996.274866] r4:e4fa7980 r3:600f0093
[95996.278430] Code: eb3a59e1 e5952098 e5951094 e5812004 (e5821000)

Signed-off-by: Xin Yin <yinxin_1989@aliyun.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
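
A sketch of the fix's intent, with field names borrowed from the
surrounding backport commits (assumptions, not the verbatim patch): queue
the req on both lists under the same lock, so the worker never finds it on
the async_list alone.

  spin_lock_irqsave(&ctx->task_lock, flags);
  list_add(&req->task_list, &ctx->task_list);
  req->work_task = NULL;
  spin_unlock_irqrestore(&ctx->task_lock, flags);
  list_add_tail(&req->list, &async_list->list);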

* io_uring: Fix NULL pointer dereference in loop_rw_iter() | Guoyu Huang | 2020-08-19 | 1 | -2/+6

commit 2dd2111d0d383df104b144e0d1f6b5a00cb7cd88 upstream.

loop_rw_iter() does not check whether the file has a read or write
function. This can lead to a NULL pointer dereference when the user passes
in a file descriptor that does not have a read or write function.

The crash log looks like this:

[ 99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 99.835364] #PF: supervisor instruction fetch in kernel mode
[ 99.836522] #PF: error_code(0x0010) - not-present page
[ 99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
[ 99.839649] Oops: 0010 [#2] SMP PTI
[ 99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G      D 5.8.0 #2
[ 99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 99.845140] RIP: 0010:0x0
[ 99.845840] Code: Bad RIP value.
[ 99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[ 99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
[ 99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
[ 99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[ 99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[ 99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
[ 99.856914] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[ 99.858651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
[ 99.861979] Call Trace:
[ 99.862617]  loop_rw_iter.part.0+0xad/0x110
[ 99.863838]  io_write+0x2ae/0x380
[ 99.864644]  ? kvm_sched_clock_read+0x11/0x20
[ 99.865595]  ? sched_clock+0x9/0x10
[ 99.866453]  ? sched_clock_cpu+0x11/0xb0
[ 99.867326]  ? newidle_balance+0x1d4/0x3c0
[ 99.868283]  io_issue_sqe+0xd8f/0x1340
[ 99.869216]  ? __switch_to+0x7f/0x450
[ 99.870280]  ? __switch_to_asm+0x42/0x70
[ 99.871254]  ? __switch_to_asm+0x36/0x70
[ 99.872133]  ? lock_timer_base+0x72/0xa0
[ 99.873155]  ? switch_mm_irqs_off+0x1bf/0x420
[ 99.874152]  io_wq_submit_work+0x64/0x180
[ 99.875192]  ? kthread_use_mm+0x71/0x100
[ 99.876132]  io_worker_handle_work+0x267/0x440
[ 99.877233]  io_wqe_worker+0x297/0x350
[ 99.878145]  kthread+0x112/0x150
[ 99.878849]  ? __io_worker_unuse+0x100/0x100
[ 99.879935]  ? kthread_park+0x90/0x90
[ 99.880874]  ret_from_fork+0x22/0x30
[ 99.881679] Modules linked in:
[ 99.882493] CR2: 0000000000000000
[ 99.883324] ---[ end trace 4453745f4673190b ]---
[ 99.884289] RIP: 0010:0x0
[ 99.884837] Code: Bad RIP value.
[ 99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[ 99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
[ 99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
[ 99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[ 99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[ 99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
[ 99.896079] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[ 99.898017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0

Fixes: 32960613b7c3 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
Cc: stable@vger.kernel.org
Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
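
The upstream fix adds a guard at prep time; roughly (a sketch based on the
description, not the verbatim diff):

  if (rw == READ && !(file->f_op->read_iter || file->f_op->read))
          return -EINVAL;
  else if (rw == WRITE && !(file->f_op->write_iter || file->f_op->write))
          return -EINVAL;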

* io_uring: set ctx sq/cq entry count earlier | Jens Axboe | 2020-08-19 | 1 | -2/+4

commit bd74048108c179cea0ff52979506164c80f29da7 upstream.

If we hit an earlier error path in io_uring_create(), then we will have
accounted memory, but not set ctx->{sq,cq}_entries yet. Then when the ring
is torn down in error, we use those values to unaccount the memory. Ensure
we set the ctx entries before we're able to hit a potential error path.

Cc: stable@vger.kernel.org
Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Tested-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
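
The resulting ordering, sketched (illustrative, not the verbatim diff):

  ctx->sq_entries = p->sq_entries;   /* record the sizes first...     */
  ctx->cq_entries = p->cq_entries;

  ret = io_allocate_scq_urings(ctx, p);
  if (ret)
          goto err;                  /* ...so error teardown unaccounts
                                      * the right amount of memory    */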

* io_uring: fix sq array offset calculation | Dmitry Vyukov | 2020-08-19 | 1 | -3/+3

[ Upstream commit b36200f543ff07a1cb346aa582349141df2c8068 ]

rings_size() sets sq_offset to the total size of the rings (the returned
value which is used for memory allocation). This is wrong: the sq array
should be located within the rings, not after them. Set sq_offset to where
it should be.

Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hristo Venev <hristo@venev.name>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
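
A sketch of the corrected rings_size() math (variable names abbreviated;
close to, but not guaranteed identical with, the upstream change):

  off = struct_size(rings, cqes, cq_entries); /* io_rings + CQE array */
  off = ALIGN(off, SMP_CACHE_BYTES);
  if (sq_offset)
          *sq_offset = off;                   /* sq array sits *inside*
                                               * the rings allocation  */
  off += sq_entries * sizeof(u32);            /* and is counted in the
                                               * total size            */
  return off;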

* fs/io_uring.c: Fix uninitialized variable referenced in io_submit_sqe() | Liu Yong | 2020-08-19 | 1 | -0/+1

Commit a4d61e66ee4a ("io_uring: prevent re-read of sqe->opcode") caused
another vulnerability: after io_get_req(), the sqe_submit struct in req is
not initialized, but the following code assumes that req->submit.opcode is
available.

Signed-off-by: Liu Yong <pkfxxxing@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

* io_uring: Fix use-after-free in io_sq_wq_submit_work() | Guoyu Huang | 2020-08-11 | 1 | -0/+1

When ctx->sqo_mm is zero, io_sq_wq_submit_work() frees 'req' without
deleting it from the 'task_list'. After that, 'req' is accessed in
io_ring_ctx_wait_and_kill(), which leads to a use-after-free.

Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* io_uring: prevent re-read of sqe->opcode | Jens Axboe | 2020-08-11 | 1 | -35/+24

Liu reports that he can trigger a NULL pointer dereference with
IORING_OP_SENDMSG, by changing the sqe->opcode after we've validated that
the previous opcode didn't need a file and didn't assign one. Ensure we
validate and read the opcode only once.

Reported-by: Liu Yong <pkfxxxing@gmail.com>
Tested-by: Liu Yong <pkfxxxing@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
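
The SQ ring is shared, userspace-writable memory, so the sqe has to be
treated as untrusted input: read each field exactly once and decide
everything from the snapshot. A sketch of the pattern:

  const u8 opcode = READ_ONCE(sqe->opcode);  /* single racy read */

  /* Every subsequent decision (does this op need a file? which handler
   * runs?) uses the local snapshot, never sqe->opcode again, so a
   * concurrent userspace rewrite can no longer desync validation from
   * execution. */
  if (!io_op_needs_file(opcode))
          return 0;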

* io_uring: make sure async workqueue is canceled on exit | Jens Axboe | 2020-07-09 | 1 | -0/+63

Track async work items that we queue, so we can safely cancel them if the
ring is closed or the process exits. Newer kernels handle this
automatically with io-wq, but the old workqueue based setup needs a bit of
special help to get there.

There's no upstream variant of this, as that would require backporting all
the io-wq changes from 5.5 and on. Hence I made a one-off that ensures that
we don't leak memory if we have async work items that need active
cancelation (like socket IO).

Reported-by: Agarwal, Anchal <anchalag@amazon.com>
Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>

* io_uring: use kvfree() in io_sqe_buffer_register() | Denis Efremov | 2020-06-17 | 1 | -2/+2

commit a8c73c1a614f6da6c0b04c393f87447e28cb6de4 upstream.

Use kvfree() to free the pages and vmas, since they are allocated by
kvmalloc_array() in a loop.

Fixes: d4ef647510b1 ("io_uring: avoid page allocation warnings")
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
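
kvmalloc_array() transparently falls back from kmalloc to vmalloc, so only
kvfree() is safe for the result; kfree() on a vmalloc'ed pointer corrupts
memory. The pairing, in short:

  pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
  vmas  = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *),
                         GFP_KERNEL);
  /* ... pin pages and register the buffers ... */
  kvfree(pages);   /* was kfree(), wrong for the vmalloc fallback */
  kvfree(vmas);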

* io_uring: initialize ctx->sqo_wait earlier | Jens Axboe | 2020-06-07 | 1 | -1/+1

[ Upstream commit 583863ed918136412ddf14de2e12534f17cfdc6f ]

Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated,
instead of deferring it to the offload setup.

This fixes a syzbot reported lockdep complaint, which is really due to
trying to wake_up on an uninitialized wait queue:

RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319
RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b
RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260
R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 assign_lock_key kernel/locking/lockdep.c:913 [inline]
 register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
 __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234
 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
 __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122
 io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160
 io_poll_remove_all fs/io_uring.c:4357 [inline]
 io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305
 io_uring_create fs/io_uring.c:7843 [inline]
 io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x441319
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9

Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>

* io_uring: honor original task RLIMIT_FSIZE | Jens Axboe | 2020-04-17 | 1 | -0/+11

commit 4ed734b0d0913e566a9d871e15d24eb240f269f7 upstream.

With the previous fixes for number of files open checking, I added some
debug code to see if we had other spots where we're checking rlimit()
against the async io-wq workers. The only one I found was file size
checking, which we should also honor.

During write and fallocate prep, store the max file size and override that
for the current task if we're in io-wq worker context.

Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
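
The mechanism, sketched with an assumed field name (req->fsize; the actual
patch and the 5.4 backport may differ):

  /* at prep time, remember the submitter's limit */
  req->fsize = rlimit(RLIMIT_FSIZE);

  /* in the async worker, which otherwise runs with RLIM_INFINITY */
  current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize;
  /* ... issue the write/fallocate ... */
  current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;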

* io_uring: remove bogus RLIMIT_NOFILE check in file registration | Jens Axboe | 2020-04-17 | 1 | -7/+0

commit c336e992cb1cb1db9ee608dfb30342ae781057ab upstream.

We already checked this limit when the file was opened, and we keep it
open in the file table. Hence when we added unit_inflight to the count we
want to register, we're doubly accounting these files. This results in
-EMFILE for file registration, if we're at half the limit.

Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* io_uring: fix 32-bit compatibility with sendmsg/recvmsg | Jens Axboe | 2020-03-05 | 1 | -0/+5

commit d876836204897b6d7d911f942084f69a1e9d5c4d upstream.

We must set MSG_CMSG_COMPAT if we're in compatibility mode, otherwise the
iovec import for these commands will not do the right thing and will fail
the command with -EINVAL.

Found by running the test suite compiled as 32-bit.

Cc: stable@vger.kernel.org
Fixes: aa1fa28fc73e ("io_uring: add support for recvmsg()")
Fixes: 0fa03c624d8f ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
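
The shape of the fix (a sketch per the description above; exact placement
in the sendmsg/recvmsg paths may differ):

  #ifdef CONFIG_COMPAT
          if (req->ctx->compat)
                  flags |= MSG_CMSG_COMPAT;  /* make the iovec import use
                                              * the 32-bit layouts */
  #endif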

* io_uring: grab ->fs as part of async offload | Jens Axboe | 2020-03-05 | 1 | -0/+46

[ Upstream commits 9392a27d88b9 and ff002b30181d ]

Ensure that the async work grabs ->fs from the queueing task if the punted
commands need to do lookups.

We don't have these two commits in 5.4-stable:

  ff002b30181d30cdfbca316dadd099c3ca0d739c
  9392a27d88b9707145d713654eb26f0c29789e50

because they don't apply with the rework that was done in how io_uring
handles offload. Since there's no io-wq in 5.4, it doesn't make sense to do
two patches. I'm attaching my port of the two for 5.4-stable; it's been
tested. Please queue it up for the next 5.4-stable, thanks!

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>

* io_uring: prevent sq_thread from spinning when it should stop | Stefano Garzarella | 2020-02-28 | 1 | -10/+10

commit 7143b5ac5750f404ff3a594b34fdf3fc2f99f828 upstream.

This patch drops 'cur_mm' before calling cond_resched(), to prevent the
sq_thread from spinning even when the user process is finished.

Before this patch, if the user process ended without closing the io_uring
fd, the sq_thread continued to spin until the 'sq_thread_idle' timeout
ended. In the worst case, where the 'sq_thread_idle' parameter is bigger
than INT_MAX, the sq_thread will spin forever.

Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
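
The gist of the change in io_sq_thread() (a sketch; the actual patch
restructures the surrounding loop):

  /* drop the borrowed mm *before* going idle, so an exited user
   * process cannot pin the thread in its spin loop */
  if (cur_mm) {
          unuse_mm(cur_mm);
          mmput(cur_mm);
          cur_mm = NULL;
  }
  cond_resched();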

* io_uring: fix __io_iopoll_check deadlock in io_sq_thread | Xiaoguang Wang | 2020-02-28 | 1 | -18/+9

commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d upstream.

Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have
CQEs pending"), if we already have events pending, we won't enter the poll
loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if the app
has been terminated and doesn't reap the pending events which are already
in the cq ring, and there are some reqs in the poll_list, io_sq_thread will
enter __io_iopoll_check(), find pending events, then return; this loop will
never have a chance to exit.

I have seen this issue in fio stress tests. To fix it, let io_sq_thread
call io_iopoll_getevents() with argument 'min' being zero, and remove
__io_iopoll_check().

Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* Revert "io_uring: only allow submit from owning task" | Jens Axboe | 2020-01-29 | 1 | -6/+0

commit 73e08e711d9c1d79fae01daed4b0e1fee5f8a275 upstream.

This ends up being too restrictive for tasks that willingly fork and share
the ring between forks. Andres reports that this breaks his postgresql
work. Since we're close to the 5.5 release, revert this change for now.

Cc: stable@vger.kernel.org
Fixes: 44d282796f81 ("io_uring: only allow submit from owning task")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* io_uring: only allow submit from owning task | Jens Axboe | 2020-01-23 | 1 | -0/+6

commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.

If the credentials or the mm doesn't match, don't allow the task to submit
anything on behalf of this ring. The task that owns the ring can pass the
file descriptor to another task, but we don't want to allow that task to
submit an SQE that then assumes the ring mm and creds if it needs to go
async.

Cc: stable@vger.kernel.org
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* io_uring: don't wait when under-submitting | Pavel Begunkov | 2020-01-12 | 1 | -0/+4

[ Upstream commit 7c504e65206a4379ff38fe41d21b32b6c2c3e53e ]

There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error). Then
it will wait for not-yet-submitted requests, deadlocking the user in most
cases. Don't wait/poll if we can't submit all sqes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>

* io_uring: use current task creds instead of allocating a new one | Jens Axboe | 2020-01-09 | 1 | -2/+2

commit 0b8c0ec7eedcd8f9f1a1f238d87f9b512b09e71a upstream.

syzbot reports:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
 kthread+0x361/0x430 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Modules linked in:
---[ end trace f2e1a4307fbe2245 ]---
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is caused by slab fault injection triggering a failure in
prepare_creds(). We don't actually need to create a copy of the creds as
we're not modifying it, we just need a reference on the current task
creds. This avoids the failure case as well, and propagates the const
throughout the stack.

Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[ only use the io_uring.c portion of the patch - gregkh]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
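
The fix boils down to replacing a fallible allocation with a plain
reference grab (sketch):

  ctx->creds = get_current_cred();   /* was: prepare_creds() */

get_current_cred() just takes a reference on the (const) current task
creds, so there is no failure case left for fault injection to trip.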

* io_uring: io_allocate_scq_urings() should return a sane state | Jens Axboe | 2020-01-04 | 1 | -2/+8

[ Upstream commit eb065d301e8c83643367bdb0898becc364046bda ]

We currently rely on the ring destroy to clean things up in case of
failure, but io_allocate_scq_urings() can leave things half initialized if
only parts of it fail. Be nice and return with either everything set up on
success, or return an error with things nicely cleaned up.

Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>

* io_uring: ensure req->submit is copied when req is deferred | Jens Axboe | 2019-12-13 | 1 | -4/+5

There's an issue with deferred requests through drain, where if we do need
to defer, we're not copying over the sqe_submit state correctly. This can
result in using uninitialized data when we then later go and submit the
deferred request, like this check in __io_submit_sqe():

    if (unlikely(s->index >= ctx->sq_entries))
            return -EINVAL;

With 's' being uninitialized, we can randomly fail this check. Fix this by
copying the sqe_submit state when we defer a request.

Mainline doesn't need this, because the issue was fixed there as part of a
cleanup series before anyone realized it existed. That series removed the
separate states of ->index vs ->submit.sqe, and it is not something I was
comfortable putting into stable; hence the much simpler addition here.
Here's the patch in the series that fixes the same issue:

    commit cf6fd4bd559ee61a4454b161863c8de6f30f8dca
    Author: Pavel Begunkov <asml.silence@gmail.com>
    Date:   Mon Nov 25 23:14:39 2019 +0300

        io_uring: inline struct sqe_submit

Reported-by: Andres Freund <andres@anarazel.de>
Reported-by: Tomáš Chaloupka
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* io_uring: fix missing kmap() declaration on powerpc | Jens Axboe | 2019-12-13 | 1 | -0/+1

commit aa4c3967756c6c576a38a23ac511be211462a6b7 upstream.

Christophe reports that current master fails building on powerpc with this
error:

    CC      fs/io_uring.o
  fs/io_uring.c: In function ‘loop_rw_iter’:
  fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
  [-Werror=implicit-function-declaration]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                      ^
  fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
  without a cast [-Wint-conversion]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                    ^
  fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
  [-Werror=implicit-function-declaration]
     kunmap(iter->bvec->bv_page);
     ^

which is caused by a missing highmem.h include. Fix it by including it.

Fixes: 311ae9e159d8 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR | Jens Axboe | 2019-12-13 | 1 | -0/+2

commit 441cdbd5449b4923cd413d3ba748124f91388be9 upstream.

We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.

Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
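
The translation is a two-liner at completion time (sketch):

  if (ret == -ERESTARTSYS)
          ret = -EINTR;   /* -ERESTARTSYS must never reach userspace */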

* io_uring: fix dead-hung for non-iter fixed rw | Pavel Begunkov | 2019-12-13 | 1 | -1/+14

commit 311ae9e159d81a1ec1cf645daf40b39ae5a0bd84 upstream.

Read/write requests to devices without implemented read/write_iter, using
fixed buffers, can cause a general protection fault which totally hangs a
machine.

io_import_fixed() initialises the iov_iter with a bvec, but loop_rw_iter()
accesses it as an iovec, dereferencing a random address. Map the pages with
kmap() one by one in this case.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
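
A sketch of the bvec-aware path in loop_rw_iter(), close to what the
powerpc build error above echoes back (not the verbatim patch):

  if (!iov_iter_is_bvec(iter)) {
          iovec = iov_iter_iovec(iter);
  } else {
          /* fixed buffers carry bvecs; map the current page instead
           * of misreading the iter as an iovec array */
          iovec.iov_base = kmap(iter->bvec->bv_page) + iter->iov_offset;
          iovec.iov_len = min_t(size_t, iter->count,
                                iter->bvec->bv_len - iter->iov_offset);
  }

  /* ... issue file->f_op->read()/->write() on iovec ... */

  if (iov_iter_is_bvec(iter))
          kunmap(iter->bvec->bv_page);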

* io_uring: async workers should inherit the user creds | Jens Axboe | 2019-12-04 | 1 | -1/+22

[ Upstream commit 181e448d8709e517c9c7b523fcd209f24eb38ca7 ]

If we don't inherit the original task creds, then we can confuse users
like fuse that pass creds in the request header. See the link below on the
identical aio issue.

Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>

* io_uring: ensure registered buffer import returns the IO length | Jens Axboe | 2019-11-13 | 1 | -1/+1

A test case was reported where two linked reads with registered buffers
always failed the second link. This is because we set the expected value
of a request in req->result, and if we don't get this result, we fail the
dependent links. For some reason the registered buffer import returned
-ERROR/0, while the normal import returns -ERROR/length; this broke linked
commands with registered buffers.

Fix this by making io_import_fixed() correctly return the mapped length.

Cc: stable@vger.kernel.org # v5.3
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: Fix getting file for timeout | Pavel Begunkov | 2019-11-13 | 1 | -0/+1

For timeout requests io_uring tries to grab a file with the specified fd,
which is usually stdin/fd=0. Update io_op_needs_file() so timeout requests
don't take a file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: make timeout sequence == 0 mean no sequence | Jens Axboe | 2019-11-12 | 1 | -7/+22

Currently we make sequence == 0 be the same as sequence == 1, but that's
not super useful if the intent is really to have a timeout that's just a
pure timeout. If the user passes in sqe->off == 0, then don't apply any
sequence logic to the request, let it purely be driven by the timeout
specified.

Reported-by: 李通洲 <carter.li@eoitek.com>
Reviewed-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
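
Sketched against the upstream change, which introduced a dedicated flag
for this (flag name per upstream; assumed, not verified here):

  if (!count) {
          /* pure timer: never completed by request progress */
          req->flags |= REQ_F_TIMEOUT_NOSEQ;
  } else {
          req->sequence = ctx->cached_sq_head + count - 1;
  }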

* io_uring: ensure we clear io_kiocb->result before each issue | Jens Axboe | 2019-10-30 | 1 | -0/+1

We use io_kiocb->result == -EAGAIN as a way to know if we need to
re-submit a polled request, as -EAGAIN reporting happens out-of-line for
IO submission failures. This field is cleared when we originally allocate
the request, but it isn't reset when we retry the submission from async
context. This can cause issues where we think something needs a re-issue,
but we're really just reading stale data.

Reset ->result whenever we re-prep a request for polled submission.

Cc: stable@vger.kernel.org
Fixes: 9e645e1105ca ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: don't touch ctx in setup after ring fd install | Jens Axboe | 2019-10-28 | 1 | -4/+8

syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free of reading the ctx ->sq_array and ->rings value
right after having installed the ring fd in the process file table.

Defer ring fd installation until after we're done reading those values.

Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: Fix leaked shadow_req | Pavel Begunkov | 2019-10-27 | 1 | -0/+1

io_queue_link_head() owns shadow_req after taking it as an argument. If it
is not freed in case of an error, the request leaks along with the taken
ctx->refs.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD | Jens Axboe | 2019-10-25 | 1 | -12/+32

We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some requests
simply got errored and already completed, they won't be available for
polling.

For the case of IO polling and SQTHREAD usage, look at the pending poll
list. If it ever hits empty then we know that we don't have any more
pollable requests inflight. For that case, simply reset the inflight count
to zero.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
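
The reset described above, in sketch form (field name per the related
iopoll commits):

  if (inflight && list_empty(&ctx->poll_list))
          inflight = 0;   /* nothing pollable left in flight */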

* io_uring: use cached copies of sq->dropped and cq->overflow | Jens Axboe | 2019-10-25 | 1 | -5/+8

We currently use the ring values directly, but that can lead to issues if
the application is malicious and changes these values on our behalf.
Create in-kernel cached versions of them, and just overwrite the user side
when we update them. This is similar to how we treat the sq/cq ring
tail/head updates.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
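
The pattern, using sq->dropped as the example (a sketch; field names
assumed from the description):

  /* the kernel-private counter is authoritative... */
  ctx->cached_sq_dropped++;
  /* ...the shared ring copy is write-only from our side */
  WRITE_ONCE(ctx->rings->sq_dropped, ctx->cached_sq_dropped);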

* io_uring: Fix race for sqes with userspace | Pavel Begunkov | 2019-10-25 | 1 | -1/+2

io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to userspace,
2. then a call to io_queue_link_head(), accessing the released head's sqe.

Reorder them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: Fix broken links with offloading | Pavel Begunkov | 2019-10-25 | 1 | -29/+33

io_sq_thread() processes sqes by 8 without considering links. As a result,
links will be randomly subdivided. The easiest way to fix it is to call
io_get_sqring() inside io_submit_sqes(), as io_ring_submit() does.

Downsides:
1. This removes the optimisation of not grabbing mm_struct for fixed files.
2. It submits all sqes in one go, without finer-grained scheduling with cq
   processing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: Fix corrupted user_data | Pavel Begunkov | 2019-10-25 | 1 | -0/+2

There is a bug where failed linked requests are returned not with the
specified @user_data, but with garbage from the kernel stack. The reason
is that io_fail_links() uses req->user_data, which is uninitialised when
called from io_queue_sqe() on the fail path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
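
The fix amounts to copying user_data out of the sqe before any failure
path can run (a sketch, using the sqe_submit naming seen elsewhere in this
tree):

  req->user_data = READ_ONCE(s->sqe->user_data);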

* io_uring: correct timeout req sequence when inserting a new entry | zhangyi (F) | 2019-10-23 | 1 | -1/+10

The sequence number of a timeout req (req->sequence) indicates the
expected completion request. Because each timeout req consumes a sequence
number, the sequence of each timeout req on the timeout list shouldn't be
the same. But now we may get the same (also incorrect) number if we insert
a new entry before the last one, such as when submitting these two timeout
reqs on a fresh ring instance:

                      req->sequence
  req_1 (count = 2):  2
  req_2 (count = 1):  2

Then, if we submit a nop req, req_2 will still time out even after the nop
req finishes. This patch fixes the problem by adjusting the sequence
number of each reordered req when inserting a new entry.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: correct timeout req sequence when waiting timeout | zhangyi (F) | 2019-10-23 | 1 | -1/+10

The sequence numbers of reqs on the timeout_list before the timeout req
should be adjusted in io_timeout_fn(), because the current timeout req
consumes a slot in the cq_ring and the cq_tail pointer will be increased;
otherwise other timeout reqs may return in advance without waiting for
enough wait_nr.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* io_uring: revert "io_uring: optimize submit_and_wait API" | Jens Axboe | 2019-10-23 | 1 | -46/+17

There are cases where it isn't always safe to block for submission, even
if the caller asked to wait for events as well. Revert the previous
optimization of doing that.

This reverts two commits:

  bf7ec93c644cb
  c576666863b78

Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block | Linus Torvalds | 2019-10-18 | 1 | -24/+60

Pull block fixes from Jens Axboe:

 - NVMe pull request from Keith that addresses deadlocks, double resets,
   memory leaks, and other regressions.
 - Fixup elv_support_iosched() for bio based devices (Damien)
 - Fixup for the ahci PCS quirk (Dan)
 - Socket O_NONBLOCK handling fix for io_uring (me)
 - Timeout sequence io_uring fixes (yangerkun)
 - MD warning fix for parameter default_layout (Song)
 - blkcg activation fixes (Tejun)
 - blk-rq-qos node deletion fix (Tejun)

* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
  nvme-pci: Set the prp2 correctly when using more than 4k page
  io_uring: fix logic error in io_timeout
  io_uring: fix up O_NONBLOCK handling for sockets
  md/raid0: fix warning message for parameter default_layout
  libata/ahci: Fix PCS quirk application
  blk-rq-qos: fix first node deletion of rq_qos_del()
  blkcg: Fix multiple bugs in blkcg_activate_policy()
  io_uring: consider the overflow of sequence for timeout req
  nvme-tcp: fix possible leakage during error flow
  nvmet-loop: fix possible leakage during error flow
  block: Fix elv_support_iosched()
  nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
  nvme: Wait for reset state when required
  nvme: Prevent resets during paused controller state
  nvme: Restart request timers in resetting state
  nvme: Remove ADMIN_ONLY state
  nvme-pci: Free tagset if no IO queues
  nvme: retain split access workaround for capability reads
  nvme: fix possible deadlock when nvme_update_formats fails

| * io_uring: fix logic error in io_timeout | yangerkun | 2019-10-17 | 1 | -1/+1

If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
tmp_nxt.

Fixes: 5da0fb1ab34c ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
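
In sketch form, with the comparison values derived as in 5da0fb1ab34c
(variable derivations abbreviated):

  if (ctx->cached_sq_head < nxt_sq_head)
          tmp += UINT_MAX;        /* the wraparound compensation belongs
                                   * to tmp, not tmp_nxt */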

| * io_uring: fix up O_NONBLOCK handling for sockets | Jens Axboe | 2019-10-17 | 1 | -18/+39

We've got two issues with the non-regular file handling for non-blocking
IO:

1) We don't want to re-do a short read in full for a non-regular file, as
   we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts, we
   need to punt to async context even if the file is opened as
   non-blocking. Otherwise the caller always gets -EAGAIN.

Add two new request flags to handle these cases. One is just a cache of
the inode S_ISREG() status, the other tells io_uring that we always need
to punt this request to async context, even if REQ_F_NOWAIT is set (see
the sketch below).

Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
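
A sketch of the two flags, assuming the upstream names REQ_F_ISREG and
REQ_F_MUST_PUNT and the io_file_supports_async() helper (assumptions, not
the verbatim patch):

  if (S_ISREG(file_inode(file)->i_mode))
          req->flags |= REQ_F_ISREG;      /* short IO is safely retryable */

  if (force_nonblock && !io_file_supports_async(file))
          req->flags |= REQ_F_MUST_PUNT;  /* punt to async context even
                                           * though the fd is O_NONBLOCK */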

| * io_uring: consider the overflow of sequence for timeout req | yangerkun | 2019-10-15 | 1 | -6/+21

Now we recalculate the sequence of a timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', and judge the right place to insert it
into the timeout_list by comparing the number of requests we still expect
to complete. But we have not considered overflow:

1. ctx->cached_sq_head + count - 1 may overflow, and a bigger count for
   the new timeout req can yield a small req->sequence.
2. The current cached_sq_head may overflow compared with a previous req's,
   which will leave the timeout req with a small req->sequence.

This overflow mis-orders the timeout_list, which can lead to the wrong
completion order of the timeout_list. Fix it by reusing
req->submit.sequence to store the count, and changing the logic of the
insertion sort in io_timeout().

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* Merge tag 'for-linus-20191012' of git://git.kernel.dk/linux-block | Linus Torvalds | 2019-10-13 | 1 | -17/+20

Pull io_uring fix from Jens Axboe:
"Single small fix for a regression in the sequence logic for linked
commands"

* tag 'for-linus-20191012' of git://git.kernel.dk/linux-block:
  io_uring: fix sequence logic for timeout requests