summaryrefslogtreecommitdiffstats
path: root/io_uring
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2023-10-271-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | Pull misc filesystem fixes from Al Viro: "Assorted fixes all over the place: literally nothing in common, could have been three separate pull requests. All are simple regression fixes, but not for anything from this cycle" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed sparc32: fix a braino in fault handling in csum_and_copy_..._user()
| * io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() ↵Al Viro2023-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | failed ->ki_pos value is unreliable in such cases. For an obvious example, consider O_DSYNC write - we feed the data to page cache and start IO, then we make sure it's completed. Update of ->ki_pos is dealt with by the first part; failure in the second ends up with negative value returned _and_ ->ki_pos left advanced as if sync had been successful. In the same situation write(2) does not advance the file position at all. Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | io_uring/rw: disable IOCB_DIO_CALLER_COMPJens Axboe2023-10-251-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an application does O_DIRECT writes with io_uring and the file system supports IOCB_DIO_CALLER_COMP, then completions of the dio write side is done from the task_work that will post the completion event for said write as well. Whenever a dio write is done against a file, the inode i_dio_count is elevated. This enables other callers to use inode_dio_wait() to wait for previous writes to complete. If we defer the full dio completion to task_work, we are dependent on that task_work being run before the inode i_dio_count can be decremented. If the same task that issues io_uring dio writes with IOCB_DIO_CALLER_COMP performs a synchronous system call that calls inode_dio_wait(), then we can deadlock as we're blocked sleeping on the event to become true, but not processing the completions that will result in the inode i_dio_count being decremented. Until we can guarantee that this is the case, then disable the deferred caller completions. Fixes: 099ada2c8726 ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pidJens Axboe2023-10-251-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | We could race with SQ thread exit, and if we do, we'll hit a NULL pointer dereference when the thread is cleared. Grab the SQPOLL data lock before attempting to get the task cpu and pid for fdinfo, this ensures we have a stable view of it. Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=218032 Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | io_uring: fix crash with IORING_SETUP_NO_MMAP and invalid SQ ring addressJens Axboe2023-10-181-0/+6
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we specify a valid CQ ring address but an invalid SQ ring address, we'll correctly spot this and free the allocated pages and clear them to NULL. However, we don't clear the ring page count, and hence will attempt to free the pages again. We've already cleared the address of the page array when freeing them, but we don't check for that. This causes the following crash: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 Oops [#1] Modules linked in: CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-rc5-dirty #56 Hardware name: ucbbar,riscvemu-bare (DT) Workqueue: events_unbound io_ring_exit_work epc : io_pages_free+0x2a/0x58 ra : io_rings_free+0x3a/0x50 epc : ffffffff808811a2 ra : ffffffff80881406 sp : ffff8f80000c3cd0 status: 0000000200000121 badaddr: 0000000000000000 cause: 000000000000000d [<ffffffff808811a2>] io_pages_free+0x2a/0x58 [<ffffffff80881406>] io_rings_free+0x3a/0x50 [<ffffffff80882176>] io_ring_exit_work+0x37e/0x424 [<ffffffff80027234>] process_one_work+0x10c/0x1f4 [<ffffffff8002756e>] worker_thread+0x252/0x31c [<ffffffff8002f5e4>] kthread+0xc4/0xe0 [<ffffffff8000332a>] ret_from_fork+0xa/0x1c Check for a NULL array in io_pages_free(), but also clear the page counts when we free them to be on the safer side. Reported-by: rtm@csail.mit.edu Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Cc: stable@vger.kernel.org Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()Jeff Moyer2023-10-051-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I received a bug report with the following signature: [ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8 [ 1759.944564] #PF: supervisor read access in kernel mode [ 1759.949732] #PF: error_code(0x0000) - not-present page [ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0 [ 1759.960596] Oops: 0000 1 PREEMPT SMP PTI [ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X ------- — 5.14.0-362.3.1.el9_3.x86_64 #1 [ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018 [ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0 [ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48 [ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286 [ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000 [ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400 [ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000 [ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50 [ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548 [ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000 [ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0 [ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1760.086325] PKRU: 55555554 [ 1760.089044] Call Trace: [ 1760.091501] <TASK> [ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df [ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df [ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0 [ 1760.106584] ? __die_body.cold+0x8/0xd [ 1760.110356] ? page_fault_oops+0x134/0x170 [ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110 [ 1760.119298] ? exc_page_fault+0xa8/0x150 [ 1760.123247] ? asm_exc_page_fault+0x22/0x30 [ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10 [ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10 [ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0 [ 1760.142527] __io_wq_cpu_online+0x54/0xb0 [ 1760.146558] cpuhp_invoke_callback+0x109/0x460 [ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10 [ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 1760.160320] cpuhp_thread_fun+0x8d/0x140 [ 1760.164266] smpboot_thread_fn+0xd3/0x1a0 [ 1760.168297] kthread+0xdd/0x100 [ 1760.171457] ? __pfx_kthread+0x10/0x10 [ 1760.175225] ret_from_fork+0x29/0x50 [ 1760.178826] </TASK> [ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod [ 1760.248623] CR2: ffffffffffffffe8 A cpu hotplug callback was issued before wq->all_list was initialized. This results in a null pointer dereference. The fix is to fully setup the io_wq before calling cpuhp_state_add_instance_nocalls(). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/x49y1ghnecs.fsf@segfault.boston.devel.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pagesJens Axboe2023-10-031-1/+15
| | | | | | | | | | | On at least arm32, but presumably any arch with highmem, if the application passes in memory that resides in highmem for the rings, then we should fail that ring creation. We fail it with -EINVAL, which is what kernels that don't support IORING_SETUP_NO_MMAP will do as well. Cc: stable@vger.kernel.org Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: ensure io_lockdep_assert_cq_locked() handles disabled ringsJens Axboe2023-10-031-14/+27
| | | | | | | | | | | | | | | | | | | | | io_lockdep_assert_cq_locked() checks that locking is correctly done when a CQE is posted. If the ring is setup in a disabled state with IORING_SETUP_R_DISABLED, then ctx->submitter_task isn't assigned until the ring is later enabled. We generally don't post CQEs in this state, as no SQEs can be submitted. However it is possible to generate a CQE if tagged resources are being updated. If this happens and PROVE_LOCKING is enabled, then the locking check helper will dereference ctx->submitter_task, which hasn't been set yet. Fixup io_lockdep_assert_cq_locked() to handle this case correctly. While at it, convert it to a static inline as well, so that generated line offsets will actually reflect which condition failed, rather than just the line offset for io_lockdep_assert_cq_locked() itself. Reported-and-tested-by: syzbot+efc45d4e7ba6ab4ef1eb@syzkaller.appspotmail.com Fixes: f26cc9593581 ("io_uring: lockdep annotate CQ locking") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/kbuf: don't allow registered buffer rings on highmem pagesJens Axboe2023-10-031-8/+19
| | | | | | | | | | | | | | | | | | | | syzbot reports that registering a mapped buffer ring on arm32 can trigger an OOPS. Registered buffer rings have two modes, one of them is the application passing in the memory that the buffer ring should reside in. Once those pages are mapped, we use page_address() to get a virtual address. This will obviously fail on highmem pages, which aren't mapped. Add a check if we have any highmem pages after mapping, and fail the attempt to register a provided buffer ring if we do. This will return the same error as kernels that don't support provided buffer rings to begin with. Link: https://lore.kernel.org/io-uring/000000000000af635c0606bcb889@google.com/ Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Cc: stable@vger.kernel.org Reported-by: syzbot+2113e61b8848fa7951d8@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/fs: remove sqe->rw_flags checking from LINKATJens Axboe2023-09-291-1/+1
| | | | | | | | | | | | This is unionized with the actual link flags, so they can of course be set and they will be evaluated further down. If not we fail any LINKAT that has to set option flags. Fixes: cf30da90bc3a ("io_uring: add support for IORING_OP_LINKAT") Cc: stable@vger.kernel.org Reported-by: Thomas Leonard <talex5@gmail.com> Link: https://github.com/axboe/liburing/issues/955 Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: fix iter retargeting for selected bufPavel Begunkov2023-09-141-0/+5
| | | | | | | | | | | | | | When using selected buffer feature, io_uring delays data iter setup until later. If io_setup_async_msg() is called before that it might see not correctly setup iterator. Pre-init nr_segs and judge from its state whether we repointing. Cc: stable@vger.kernel.org Reported-by: syzbot+a4c6e5ef999b68b26ed1@syzkaller.appspotmail.com Fixes: 0455d4ccec548 ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0000000000002770be06053c7757@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Revert "io_uring: fix IO hang in io_wq_put_and_exit from do_exit()"Jens Axboe2023-09-071-32/+0
| | | | | | | | | | | | | | | | | This reverts commit b484a40dc1f16edb58e5430105a021e1916e6f27. This commit cancels all requests with io-wq, not just the ones from the originating task. This breaks use cases that have thread pools, or just multiple tasks issuing requests on the same ring. The liburing regression test for this also shows that problem: $ test/thread-exit.t cqe->res=-125, Expected 512 where an IO thread gets its request canceled rather than complete successfully. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: fix unprotected iopoll overflowPavel Begunkov2023-09-071-2/+2
| | | | | | | | | | | | | | | | | | | [ 71.490669] WARNING: CPU: 3 PID: 17070 at io_uring/io_uring.c:769 io_cqring_event_overflow+0x47b/0x6b0 [ 71.498381] Call Trace: [ 71.498590] <TASK> [ 71.501858] io_req_cqe_overflow+0x105/0x1e0 [ 71.502194] __io_submit_flush_completions+0x9f9/0x1090 [ 71.503537] io_submit_sqes+0xebd/0x1f00 [ 71.503879] __do_sys_io_uring_enter+0x8c5/0x2380 [ 71.507360] do_syscall_64+0x39/0x80 We decoupled CQ locking from ->task_complete but haven't fixed up places forcing locking for CQ overflows. Fixes: ec26c225f06f5 ("io_uring: merge iopoll and normal completion paths") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: break out of iowq iopoll on teardownPavel Begunkov2023-09-073-0/+13
| | | | | | | | | | | | | io-wq will retry iopoll even when it failed with -EAGAIN. If that races with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers, such workers might potentially infinitely spin retrying iopoll again and again and each time failing on some allocation / waiting / etc. Don't keep spinning if io-wq is dying. Fixes: 561fb04a6a225 ("io_uring: replace workqueue usage with io-wq") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: add a sysctl to disable io_uring system-wideMatteo Rizzo2023-09-051-0/+50
| | | | | | | | | | | | | | | Introduce a new sysctl (io_uring_disabled) which can be either 0, 1, or 2. When 0 (the default), all processes are allowed to create io_uring instances, which is the current behavior. When 1, io_uring creation is disabled (io_uring_setup() will fail with -EPERM) for unprivileged processes not in the kernel.io_uring_group group. When 2, calls to io_uring_setup() fail with -EPERM regardless of privilege. Signed-off-by: Matteo Rizzo <matteorizzo@google.com> [JEM: modified to add io_uring_group] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/x49y1i42j1z.fsf@segfault.boston.devel.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/fdinfo: only print ->sq_array[] if it's thereJens Axboe2023-09-011-0/+2
| | | | | | | | | | | If a ring is setup with IORING_SETUP_NO_SQARRAY, then we don't have the SQ array. Don't try to dump info from it through fdinfo if that is the case. Reported-by: syzbot+216e2ea6e0bf4a0acdd7@syzkaller.appspotmail.com Fixes: 2af89abda7d9 ("io_uring: add option to remove SQ indirection") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: fix IO hang in io_wq_put_and_exit from do_exit()Ming Lei2023-09-011-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | io_wq_put_and_exit() is called from do_exit(), but all FIXED_FILE requests in io_wq aren't canceled in io_uring_cancel_generic() called from do_exit(). Meantime io_wq IO code path may share resource with normal iopoll code path. So if any HIPRI request is submittd via io_wq, this request may not get resouce for moving on, given iopoll isn't possible in io_wq_put_and_exit(). The issue can be triggered when terminating 't/io_uring -n4 /dev/nullb0' with default null_blk parameters. Fix it by always cancelling all requests in io_wq by adding helper of io_uring_cancel_wq(), and this way is reasonable because io_wq destroying follows canceling requests immediately. Closes: https://lore.kernel.org/linux-block/3893581.1691785261@warthog.procyon.org.uk/ Reported-by: David Howells <dhowells@redhat.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230901134916.2415386-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: Don't set affinity on a dying sqpoll threadGabriel Krisman Bertazi2023-08-301-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reported a null-ptr-deref of sqd->thread inside io_sqpoll_wq_cpu_affinity. It turns out the sqd->thread can go away from under us during io_uring_register, in case the process gets a fatal signal during io_uring_register. It is not particularly hard to hit the race, and while I am not sure this is the exact case hit by syzbot, it solves it. Finally, checking ->thread is enough to close the race because we locked sqd while "parking" the thread, thus preventing it from going away. I reproduced it fairly consistently with a program that does: int main(void) { ... io_uring_queue_init(RING_LEN, &ring1, IORING_SETUP_SQPOLL); while (1) { io_uring_register_iowq_aff(ring, 1, &mask); } } Executed in a loop with timeout to trigger SIGTERM: while true; do timeout 1 /a.out ; done This will hit the following BUG() in very few attempts. BUG: kernel NULL pointer dereference, address: 00000000000007a8 PGD 800000010e949067 P4D 800000010e949067 PUD 10e46e067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 15715 Comm: dead-sqpoll Not tainted 6.5.0-rc7-next-20230825-g193296236fa0-dirty #23 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:io_sqpoll_wq_cpu_affinity+0x27/0x70 Code: 90 90 90 0f 1f 44 00 00 55 53 48 8b 9f 98 03 00 00 48 85 db 74 4f 48 89 df 48 89 f5 e8 e2 f8 ff ff 48 8b 43 38 48 85 c0 74 22 <48> 8b b8 a8 07 00 00 48 89 ee e8 ba b1 00 00 48 89 df 89 c5 e8 70 RSP: 0018:ffffb04040ea7e70 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff93c010749e40 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffffffa7653331 RDI: 00000000ffffffff RBP: ffffb04040ea7eb8 R08: 0000000000000000 R09: c0000000ffffdfff R10: ffff93c01141b600 R11: ffffb04040ea7d18 R12: ffff93c00ea74840 R13: 0000000000000011 R14: 0000000000000000 R15: ffff93c00ea74800 FS: 00007fb7c276ab80(0000) GS:ffff93c36f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007a8 CR3: 0000000111634003 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? do_user_addr_fault+0x174/0x7b0 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? io_sqpoll_wq_cpu_affinity+0x27/0x70 __io_register_iowq_aff+0x2b/0x60 __io_uring_register+0x614/0xa70 __x64_sys_io_uring_register+0xaa/0x1a0 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7fb7c226fec9 Code: 2e 00 b8 ca 00 00 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 7f 2d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe2c0674f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb7c226fec9 RDX: 00007ffe2c067530 RSI: 0000000000000011 RDI: 0000000000000003 RBP: 00007ffe2c0675d0 R08: 00007ffe2c067550 R09: 00007ffe2c067550 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe2c067750 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: CR2: 00000000000007a8 ---[ end trace 0000000000000000 ]--- Reported-by: syzbot+c74fea926a78b8a91042@syzkaller.appspotmail.com Fixes: ebdfefc09c6d ("io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/87v8cybuo6.fsf@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds2023-08-2917-261/+339
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring updates from Jens Axboe: "Fairly quiet round in terms of features, mostly just improvements all over the map for existing code. In detail: - Initial support for socket operations through io_uring. Latter half of this will likely land with the 6.7 kernel, then allowing things like get/setsockopt (Breno) - Cleanup of the cancel code, and then adding support for canceling requests with the opcode as the key (me) - Improvements for the io-wq locking (me) - Fix affinity setting for SQPOLL based io-wq (me) - Remove the io_uring userspace code. These were added initially as copies from liburing, but all of them have since bitrotted and are way out of date at this point. Rather than attempt to keep them in sync, just get rid of them. People will have liburing available anyway for these examples. (Pavel) - Series improving the CQ/SQ ring caching (Pavel) - Misc fixes and cleanups (Pavel, Yue, me)" * tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linux: (47 commits) io_uring: move iopoll ctx fields around io_uring: move multishot cqe cache in ctx io_uring: separate task_work/waiting cache line io_uring: banish non-hot data to end of io_ring_ctx io_uring: move non aligned field to the end io_uring: add option to remove SQ indirection io_uring: compact SQ/CQ heads/tails io_uring: force inline io_fill_cqe_req io_uring: merge iopoll and normal completion paths io_uring: reorder cqring_flush and wakeups io_uring: optimise extra io_get_cqe null check io_uring: refactor __io_get_cqe() io_uring: simplify big_cqe handling io_uring: cqe init hardening io_uring: improve cqe !tracing hot path io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_by io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used io_uring: simplify io_run_task_work_sig return io_uring/rsrc: keep one global dummy_ubuf io_uring: never overflow io_aux_cqe ...
| * io_uring: move multishot cqe cache in ctxPavel Begunkov2023-08-241-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | We cache multishot CQEs before flushing them to the CQ in submit_state.cqe. It's a 16 entry cache totalling 256 bytes in the middle of the io_submit_state structure. Move it out of there, it should help with CPU caches for the submission state, and shouldn't affect cached CQEs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dbe1f39c043ee23da918836be44fcec252ce6711.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: add option to remove SQ indirectionPavel Begunkov2023-08-241-20/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Not many aware, but io_uring submission queue has two levels. The first level usually appears as sq_array and stores indexes into the actual SQ. To my knowledge, no one has ever seriously used it, nor liburing exposes it to users. Add IORING_SETUP_NO_SQARRAY, when set we don't bother creating and using the sq_array and SQ heads/tails will be pointing directly into the SQ. Improves memory footprint, in term of both allocations as well as cache usage, and also should make io_get_sqe() less branchy in the end. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ffa3268a5ef61d326201ff43a233315c96312e0.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: force inline io_fill_cqe_reqPavel Begunkov2023-08-241-1/+2
| | | | | | | | | | | | | | | | | | There are only 2 callers of io_fill_cqe_req left, and one of them is extremely hot. Force inline the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ffce4fc5e3521966def848a4d930586dfe33ae11.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: merge iopoll and normal completion pathsPavel Begunkov2023-08-243-26/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_do_iopoll() and io_submit_flush_completions() are pretty similar, both filling CQEs and then free a list of requests. Don't duplicate it and make iopoll use __io_submit_flush_completions(), which also helps with inlining and other optimisations. For that, we need to first find all completed iopoll requests and splice them from the iopoll list and then pass it down. This adds one extra list traversal, which should be fine as requests will stay hot in cache. CQ locking is already conditional, introduce ->lockless_cq and skip locking for IOPOLL as it's protected by ->uring_lock. We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(), so it works just like io_cqring_ev_posted_iopoll(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: reorder cqring_flush and wakeupsPavel Begunkov2023-08-242-12/+4
| | | | | | | | | | | | | | | | | | | | | | Unlike in the past, io_commit_cqring_flush() doesn't do anything that may need io_cqring_wake() to be issued after, all requests it completes will go via task_work. Do io_commit_cqring_flush() after io_cqring_wake() to clean up __io_cq_unlock_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ed32dcfeec47e6c97bd6b18c152ddce5b218403f.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: optimise extra io_get_cqe null checkPavel Begunkov2023-08-242-15/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If the cached cqe check passes in io_get_cqe*() it already means that the cqe we return is valid and non-zero, however the compiler is unable to optimise null checks like in io_fill_cqe_req(). Do a bit of trickery, return success/fail boolean from io_get_cqe*() and store cqe in the cqe parameter. That makes it do the right thing, erasing the check together with the introduced indirection. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/322ea4d3377d3d4efd8ae90ab8ed28a99f518210.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: refactor __io_get_cqe()Pavel Begunkov2023-08-242-20/+16
| | | | | | | | | | | | | | | | | | | | Make __io_get_cqe simpler by not grabbing the cqe from refilled cached, but letting io_get_cqe() do it for us. That's cleaner and removes some duplication. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/74dc8fdf2657e438b2e05e1d478a3596924604e9.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: simplify big_cqe handlingPavel Begunkov2023-08-243-20/+8
| | | | | | | | | | | | | | | | | | | | Don't keep big_cqe bits of req in a union with hash_node, find a separate space for it. It's bit safer, but also if we keep it always initialised, we can get rid of ugly REQ_F_CQE32_INIT handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/447aa1b2968978c99e655ba88db536e903df0fe9.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: cqe init hardeningPavel Begunkov2023-08-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | io_kiocb::cqe stores the completion info which we'll memcpy to userspace, and we rely on callbacks and other later steps to populate it with right values. We have never had problems with that, but it would still be safer to zero it on allocation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b16a3b64dde678686460d3c3792c3ba6d3d1bc7a.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: improve cqe !tracing hot pathPavel Begunkov2023-08-241-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While looking at io_fill_cqe_req()'s asm I stumbled on our trace points turning into the chunk below: trace_io_uring_complete(req->ctx, req, req->cqe.user_data, req->cqe.res, req->cqe.flags, req->extra1, req->extra2); io_uring/io_uring.c:898: trace_io_uring_complete(req->ctx, req, req->cqe.user_data, movq 232(%rbx), %rdi # req_44(D)->big_cqe.extra2, _5 movq 224(%rbx), %rdx # req_44(D)->big_cqe.extra1, _6 movl 84(%rbx), %r9d # req_44(D)->cqe.D.81184.flags, _7 movl 80(%rbx), %r8d # req_44(D)->cqe.res, _8 movq 72(%rbx), %rcx # req_44(D)->cqe.user_data, _9 movq 88(%rbx), %rsi # req_44(D)->ctx, _10 ./arch/x86/include/asm/jump_label.h:27: asm_volatile_goto("1:" 1:jmp .L1772 # objtool NOPs this # ... It does a jump_label for actual tracing, but those 6 moves will stay there in the hottest io_uring path. As an optimisation, add a trace_io_uring_complete_enabled() check, which is also uses jump_labels, it tricks the compiler into behaving. It removes the junk without changing anything else int the hot path. Note: apparently, it's not only me noticing it, and people are also working it around. We should remove the check when it's solved generically or rework tracing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/555d8312644b3776f4be7e23f9b92943875c4bc7.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_byKees Cook2023-08-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct io_mapped_ubuf. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230817212146.never.853-kees@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is usedJens Axboe2023-08-165-15/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we setup the ring with SQPOLL, then that polling thread has its own io-wq setup. This means that if the application uses IORING_REGISTER_IOWQ_AFF to set the io-wq affinity, we should not be setting it for the invoking task, but rather the sqpoll task. Add an sqpoll helper that parks the thread and updates the affinity, and use that one if we're using SQPOLL. Fixes: fe76421d1da1 ("io_uring: allow user configurable IO thread CPU affinity") Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/discussions/884 Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: simplify io_run_task_work_sig returnPavel Begunkov2023-08-111-2/+2
| | | | | | | | | | | | | | | | | | Nobody cares about io_run_task_work_sig returning 1, we only check for negative errors. Simplify by keeping to 0/-error returns. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3aec8a532c003d6e50739b969a82989402696170.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/rsrc: keep one global dummy_ubufPavel Begunkov2023-08-112-13/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We set empty registered buffers to dummy_ubuf as an optimisation. Currently, we allocate the dummy entry for each ring, whenever we can simply have one global instance. We're casting out const on assignment, it's fine as we're not going to change the content of the dummy, the constness gives us an extra layer of protection if sth ever goes wrong. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4a96dda35ab755914bc43f6781bba0df97ac489.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: never overflow io_aux_cqePavel Begunkov2023-08-115-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now all callers of io_aux_cqe() set allow_overflow to false, remove the parameter and not allow overflowing auxilary multishot cqes. When CQ is full the function callers and all multishot requests in general are expected to complete the request. That prevents indefinite in-background grows of the overflow list and let's the userspace to handle the backlog at its own pace. Resubmitting a request should also be faster than accounting a bunch of overflows, so it should be better for perf when it happens, but a well behaving userspace should be trying to avoid overflows in any case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb20d14d708ea174721e58bb53786b0521e4dd6d.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: remove return from io_req_cqe_overflow()Pavel Begunkov2023-08-112-5/+5
| | | | | | | | | | | | | | | | Nobody checks io_req_cqe_overflow()'s return, make it return void. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8f2029ad0c22f73451664172d834372608ee0a77.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: open code io_fill_cqe_req()Pavel Begunkov2023-08-113-14/+7
| | | | | | | | | | | | | | | | | | io_fill_cqe_req() is only called from one place, open code it, and rename __io_fill_cqe_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f432ce75bb1c94cadf0bd2add4d6aa510bd1fb36.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/net: don't overflow multishot recvPavel Begunkov2023-08-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | Don't allow overflowing multishot recv CQEs, it might get out of hand, hurt performance, and in the worst case scenario OOM the task. Cc: stable@vger.kernel.org Fixes: b3fdea6ecb55c ("io_uring: multishot recv") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0b295634e8f1b71aa764c984608c22d85f88f75c.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/net: don't overflow multishot acceptPavel Begunkov2023-08-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | Don't allow overflowing multishot accept CQEs, we want to limit the grows of the overflow list. Cc: stable@vger.kernel.org Fixes: 4e86a2c980137 ("io_uring: implement multishot mode for accept") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7d0d749649244873772623dd7747966f516fe6e2.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/io-wq: don't gate worker wake up success on wake_up_process()Jens Axboe2023-08-111-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All we really care about is finding a free worker. If said worker is already running, it's either starting new work already or it's just finishing up existing work. For the latter, we'll be finding this work item next anyway, and for the former, if the worker does go to sleep, it'll create a new worker anyway as we have pending items. This reduces try_to_wake_up() overhead considerably: 23.16% -10.46% [kernel.kallsyms] [k] try_to_wake_up Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/io-wq: reduce frequency of acct->lock acquisitionsJens Axboe2023-08-111-13/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we check if we have work to run, we grab the acct lock, check, drop it, and then return the result. If we do have work to run, then running the work will again grab acct->lock and get the work item. This causes us to grab acct->lock more frequently than we need to. If we have work to do, have io_acct_run_queue() return with the acct lock still acquired. io_worker_handle_work() is then always invoked with the acct lock already held. In a simple test cases that stats files (IORING_OP_STATX always hits io-wq), we see a nice reduction in locking overhead with this change: 19.32% -12.55% [kernel.kallsyms] [k] __cmpwait_case_32 20.90% -12.07% [kernel.kallsyms] [k] queued_spin_lock_slowpath Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/io-wq: don't grab wq->lock for worker activationJens Axboe2023-08-111-3/+0
| | | | | | | | | | | | | | | | | | The worker free list is RCU protected, and checks for workers going away when iterating it. There's no need to hold the wq->lock around the lookup. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: remove unnecessary forward declarationJens Axboe2023-08-101-1/+0
| | | | | | | | | | | | | | We never use io_move_task_work_from_local() before it's defined in the file anyway, so kill the forward declaration. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: have io_file_put() take an io_kiocb rather than the fileJens Axboe2023-08-102-7/+5
| | | | | | | | | | | | | | No functional changes in this patch, just a prep patch for needing the request in io_file_put(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/splice: use fput() directlyJens Axboe2023-08-101-2/+2
| | | | | | | | | | | | | | No point in using io_file_put() here, as we need to check if it's a fixed file in the caller anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/fdinfo: get rid of ref trygetJens Axboe2023-08-101-12/+6
| | | | | | | | | | | | | | | | The caller holds a reference to the ring itself, so by definition the ring cannot go away. There's no need to play games with tryget for the reference, as we don't need an extra reference at all. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: cleanup 'ret' handling in io_iopoll_check()Jens Axboe2023-08-091-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | We return 0 for success, or -error when there's an error. Move the 'ret' variable into the loop where we are actually using it, to make it clearer that we don't carry this variable forward for return outside of the loop. While at it, also move the need_resched() break condition out of the while check itself, keeping it with the signal pending check. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: break iopolling on signalPavel Begunkov2023-08-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | Don't keep spinning iopoll with a signal set. It'll eventually return back, e.g. by virtue of need_resched(), but it's not a nice user experience. Cc: stable@vger.kernel.org Fixes: def596e9557c9 ("io_uring: support for IO polling") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/eeba551e82cad12af30c3220125eb6cb244cc94c.1691594339.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: fix false positive KASAN warningsPavel Begunkov2023-08-092-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_req_local_work_add() peeks into the work list, which can be executed in the meanwhile. It's completely fine without KASAN as we're in an RCU read section and it's SLAB_TYPESAFE_BY_RCU. With KASAN though it may trigger a false positive warning because internal io_uring caches are sanitised. Remove sanitisation from the io_uring request cache for now. Cc: stable@vger.kernel.org Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c6fbf7a82a341e66a0007c76eefd9d57f2d3ba51.1691541473.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: fix drain stalls by invalid SQEPavel Begunkov2023-08-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | cq_extra is protected by ->completion_lock, which io_get_sqe() misses. The bug is harmless as it doesn't happen in real life, requires invalid SQ index array and racing with submission, and only messes up the userspace, i.e. stall requests execution but will be cleaned up on ring destruction. Fixes: 15641e427070f ("io_uring: don't cache number of dropped SQEs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66096d54651b1a60534bb2023f2947f09f50ef73.1691538547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/rsrc: Remove unused declaration io_rsrc_put_tw()Yue Haibing2023-08-091-1/+0
| | | | | | | | | | | | | | | | | | Commit 36b9818a5a84 ("io_uring/rsrc: don't offload node free") removed the implementation but leave declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20230808151058.4572-1-yuehaibing@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>