summaryrefslogtreecommitdiffstats
path: root/io_uring/io_uring.c
Commit message (Collapse)AuthorAgeFilesLines
* io_uring/napi: pass ktime to io_napi_adjust_timeoutPavel Begunkov2024-07-261-2/+4
| | | | | | | | | | Pass the waiting time for __io_napi_adjust_timeout as ktime and get rid of all timespec64 conversions. It's especially simpler since the caller already have a ktime. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f5b8e8eed4f53a1879e031a6712b25381adc23d.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: align iowq and task request error handlingPavel Begunkov2024-07-241-1/+1
| | | | | | | | | | | | | There is a difference in how io_queue_sqe and io_wq_submit_work treat error codes they get from io_issue_sqe. The first one fails anything unknown but latter only fails when the code is negative. It doesn't make sense to have this discrepancy, align them to the io_queue_sqe behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c550e152bf4a290187f91a4322ddcb5d6d1f2c73.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: tighten task exit cancellationsPavel Begunkov2024-07-241-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | io_uring_cancel_generic() should retry if any state changes like a request is completed, however in case of a task exit it only goes for another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT) request got completed. Let's assume we have a non-tracked request executing in iowq and a tracked request linked to it. Let's also assume io_uring_cancel_generic() fails to find and cancel the request, i.e. via io_run_local_work(), which may happen as io-wq has gaps. Next, the request logically completes, io-wq still hold a ref but queues it for completion via tw, which happens in io_uring_try_cancel_requests(). After, right before prepare_to_wait() io-wq puts the request, grabs the linked one and tries executes it, e.g. arms polling. Finally the cancellation loop calls prepare_to_wait(), there are no tw to run, no tracked request was completed, so the tctx_inflight() check passes and the task is put to indefinite sleep. Cc: stable@vger.kernel.org Fixes: 3f48cf18f886c ("io_uring: unify files and task cancel") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/acac7311f4e02ce3c43293f8f1fda9c705d158f1.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linuxLinus Torvalds2024-07-151-95/+55
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring updates from Jens Axboe: "Here are the io_uring updates queued up for 6.11. Nothing major this time around, various minor improvements and cleanups/fixes. This contains: - Add bind/listen opcodes. Main motivation is to support direct descriptors, to avoid needing a regular fd just for doing these two operations (Gabriel) - Probe fixes (Gabriel) - Treat io-wq work flags as atomics. Not fixing a real issue, but may as well and it silences a KCSAN warning (me) - Cleanup of rsrc __set_current_state() usage (me) - Add 64-bit for {m,f}advise operations (me) - Improve performance of data ring messages (me) - Fix for ring message overflow posting (Pavel) - Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added for faster task_work signaling for io_uring, bundling it with this pull (Pavel) - Add Pavel as a co-maintainer - Various cleanups (me, Thorsten)" * tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits) io_uring/net: check socket is valid in io_bind()/io_listen() kernel: rerun task_work while freezing in get_signal() io_uring/io-wq: limit retrying worker initialisation io_uring/napi: Remove unnecessary s64 cast io_uring/net: cleanup io_recv_finish() bundle handling io_uring/msg_ring: fix overflow posting MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer io_uring/msg_ring: use kmem_cache_free() to free request io_uring/msg_ring: check for dead submitter task io_uring/msg_ring: add an alloc cache for io_kiocb entries io_uring/msg_ring: improve handling of target CQE posting io_uring: add io_add_aux_cqe() helper io_uring: add remote task_work execution helper io_uring/msg_ring: tighten requirement for remote posting io_uring: Allocate only necessary memory in io_probe io_uring: Fix probe of disabled operations io_uring: Introduce IORING_OP_LISTEN io_uring: Introduce IORING_OP_BIND net: Split a __sys_listen helper for io_uring net: Split a __sys_bind helper for io_uring ...
| * io_uring/msg_ring: fix overflow postingPavel Begunkov2024-07-021-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The caller of io_cqring_event_overflow() should be holding the completion_lock, which is violated by io_msg_tw_complete. There is only one caller of io_add_aux_cqe(), so just add locking there for now. WARNING: CPU: 0 PID: 5145 at io_uring/io_uring.c:703 io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703 RIP: 0010:io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703 <TASK> __io_post_aux_cqe io_uring/io_uring.c:816 [inline] io_add_aux_cqe+0x27c/0x320 io_uring/io_uring.c:837 io_msg_tw_complete+0x9d/0x4d0 io_uring/msg_ring.c:78 io_fallback_req_func+0xce/0x1c0 io_uring/io_uring.c:256 process_one_work kernel/workqueue.c:3224 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3305 worker_thread+0x86d/0xd40 kernel/workqueue.c:3383 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:144 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fixes: f33096a3c99c0 ("io_uring: add io_add_aux_cqe() helper") Reported-by: syzbot+f7f9c893345c5c615d34@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c7350d07fefe8cce32b50f57665edbb6355ea8c1.1719927398.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/msg_ring: add an alloc cache for io_kiocb entriesJens Axboe2024-06-241-0/+6
| | | | | | | | | | | | | | | | | | | | | | With slab accounting, allocating and freeing memory has considerable overhead. Add a basic alloc cache for the io_kiocb allocations that msg_ring needs to do. Unlike other caches, this one is used by the sender, grabbing it from the remote ring. When the remote ring gets the posted completion, it'll free it locally. Hence it is separately locked, using ctx->msg_lock. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: add io_add_aux_cqe() helperJens Axboe2024-06-241-2/+21
| | | | | | | | | | | | | | | | This helper will post a CQE, and can be called from task_work where we now that the ctx is already properly locked and that deferred completions will get flushed later on. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: add remote task_work execution helperJens Axboe2024-06-241-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | All our task_work handling is targeted at the state in the io_kiocb itself, which is what it is being used for. However, MSG_RING rolls its own task_work handling, ignoring how that is usually done. In preparation for switching MSG_RING to be able to use the normal task_work handling, add io_req_task_work_add_remote() which allows the caller to pass in the target io_ring_ctx. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/io-wq: make io_wq_work flags atomicJens Axboe2024-06-161-6/+6
| | | | | | | | | | | | | | | | | | | | The work flags can be set/accessed from different tasks, both the originator of the request, and the io-wq workers. While modifications aren't concurrent, it still makes KMSAN unhappy. There's no real downside to just making the flag reading/manipulation use proper atomics here. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: use 'state' consistentlyJens Axboe2024-06-161-1/+1
| | | | | | | | | | | | | | | | __io_submit_flush_completions() assigns ctx->submit_state to a local variable and uses it in all but one spot, switch that forgotten statement to using 'state' as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/eventfd: move eventfd handling to separate fileJens Axboe2024-06-161-81/+1
| | | | | | | | | | | | | | | | This is pretty nicely abstracted already, but let's move it to a separate file rather than have it in the main io_uring file. With that, we can also move the io_ev_fd struct and enum out of global scope. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/eventfd: move to more idiomatic RCU free usageJens Axboe2024-06-161-23/+26
| | | | | | | | | | | | | | | | | | | | | | In some ways, it just "happens to work" currently with using the ops field for both the free and signaling bit. But it depends on ordering of operations in terms of freeing and signaling. Clean it up and use the usual refs == 0 under RCU read side lock to determine if the ev_fd is still valid, and use the reference to gate the freeing as well. Fixes: 21a091b970cd ("io_uring: signal registered eventfd to process deferred task work") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | io_uring: signal SQPOLL task_work with TWA_SIGNAL_NO_IPIJens Axboe2024-06-241-2/+2
|/ | | | | | | | | | | Before SQPOLL was transitioned to managing its own task_work, the core used TWA_SIGNAL_NO_IPI to ensure that task_work was processed. If not, we can't be sure that all task_work is processed at SQPOLL thread exit time. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: fix cancellation overwriting req->flagsPavel Begunkov2024-06-131-0/+1
| | | | | | | | | | | | | | | | | | Only the current owner of a request is allowed to write into req->flags. Hence, the cancellation path should never touch it. Add a new field instead of the flag, move it into the 3rd cache line because it should always be initialised. poll_refs can move further as polling is an involved process anyway. It's a minimal patch, in the future we can and should find a better place for it and remove now unused REQ_F_CANCEL_SEQ. Fixes: 521223d7c229f ("io_uring/cancel: don't default to setting req->work.cancel_seq") Cc: stable@vger.kernel.org Reported-by: Li Shi <sl1589472800@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6827b129f8f0ad76fa9d1f0a773de938b240ffab.1718323430.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: remove checks for NULL 'sq_offset'Jens Axboe2024-05-221-4/+2
| | | | | | | | Since the 5.12 kernel release, nobody has been passing NULL as the sq_offset pointer. Remove the checks for it being NULL or not, it will always be valid. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linuxLinus Torvalds2024-05-131-544/+121
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring updates from Jens Axboe: - Greatly improve send zerocopy performance, by enabling coalescing of sent buffers. MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well. This feature relies on a shared branch with net-next, which was pulled into both branches. - Unification of how async preparation is done across opcodes. Previously, opcodes that required extra memory for async retry would allocate that as needed, using on-stack state until that was the case. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory. This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side. - Move away from using remap_pfn_range() for mapping the rings. This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contigious, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point. Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory. Hard to see in the diffstat as we're adding a few features as well, but this kills about ~400 lines of code from the codebase as well. - Add support for bundles for send/recv. When used with provided buffers, bundles support sending or receiving more than one buffer at the time, improving the efficiency by only needing to call into the networking stack once for multiple sends or receives. - Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already. - Make the task_work ctx locking unconditional. We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not. The state struct still exists as an empty type, can go away in the future. - Add support for specifying NOP completion values, allowing it to be used for error handling testing. - Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning. - Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably. - Misc fixes, cleanups, and improvements * tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ...
| * io_uring/rw: reinstate thread check for retriesJens Axboe2024-04-251-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allowing retries for everything is arguably the right thing to do, now that every command type is async read from the start. But it's exposed a few issues around missing check for a retry (which cca6571381a0 exposed), and the fixup commit for that isn't necessarily 100% sound in terms of iov_iter state. For now, just revert these two commits. This unfortunately then re-opens the fact that -EAGAIN can get bubbled to userspace for some cases where the kernel very well could just sanely retry them. But until we have all the conditions covered around that, we cannot safely enable that. This reverts commit df604d2ad480fcf7b39767280c9093e13b1de952. This reverts commit cca6571381a0bdc88021a1f7a4c2349df21279f7. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/net: support bundles for recvJens Axboe2024-04-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If IORING_OP_RECV is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs buffers available and receives into them, posting a single completion for all of it. This can be used with multishot receive as well, or without it. Now that both send and receive support bundles, add a feature flag for it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the ring, then the kernel supports bundles for recv and send. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/rw: ensure retry condition isn't lostJens Axboe2024-04-171-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A previous commit removed the checking on whether or not it was possible to retry a request, since it's now possible to retry any of them. This would previously have caused the request to have been ended with an error, but now the retry condition can simply get lost instead. Cleanup the retry handling and always just punt it to task_work, which will queue it with io-wq appropriately. Reported-by: Changhui Zhong <czhong@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Fixes: cca6571381a0 ("io_uring/rw: cleanup retry path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: ensure overflow entries are dropped when ring is exitingJens Axboe2024-04-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | A previous consolidation cleanup missed handling the case where the ring is dying, and __io_cqring_overflow_flush() doesn't flush entries if the CQ ring is already full. This is fine for the normal CQE overflow flushing, but if the ring is going away, we need to flush everything, even if it means simply freeing the overflown entries. Fixes: 6c948ec44b29 ("io_uring: consolidate overflow flushing") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: consolidate overflow flushingPavel Begunkov2024-04-151-25/+15
| | | | | | | | | | | | | | | | | | | | Consolidate __io_cqring_overflow_flush and io_cqring_overflow_kill() into a single function as it once was, it's easier to work with it this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/986b42c35e76a6be7aa0cdcda0a236a2222da3a7.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: always lock __io_cqring_overflow_flushPavel Begunkov2024-04-151-5/+8
| | | | | | | | | | | | | | | | | | | | | | Conditional locking is never great, in case of __io_cqring_overflow_flush(), which is a slow path, it's not justified. Don't handle IOPOLL separately, always grab uring_lock for overflow flushing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: open code io_cqring_overflow_flush()Pavel Begunkov2024-04-151-8/+3
| | | | | | | | | | | | | | | | There is only one caller of io_cqring_overflow_flush(), open code it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a1fecd56d9dba923ed8d4d159727fa939d3baa2a.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: remove extra SQPOLL overflow flushPavel Begunkov2024-04-151-2/+0
| | | | | | | | | | | | | | | | | | | | | | c1edbf5f081be ("io_uring: flag SQPOLL busy condition to userspace") added an extra overflowed CQE flush in the SQPOLL submission path due to backpressure, which was later removed. Remove the flush and let io_cqring_wait() / iopoll handle it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a83b0724ca6ca9d16c7d79a51b77c81876b2e39.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: unexport io_req_cqe_overflow()Pavel Begunkov2024-04-151-1/+1
| | | | | | | | | | | | | | | | | | There are no users of io_req_cqe_overflow() apart from io_uring.c, make it static. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f4295eb2f9eb98d5db38c0578f57f0b86bfe0d8c.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: return void from io_put_kbuf_comp()Ming Lei2024-04-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | The only caller doesn't handle the return value of io_put_kbuf_comp(), so change its return type into void. Also follow Jens's suggestion to rename it as io_put_kbuf_drop(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240407132759.4056167-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: remove io_req_put_rsrc_locked()Pavel Begunkov2024-04-151-3/+2
| | | | | | | | | | | | | | | | | | | | | | io_req_put_rsrc_locked() is a weird shim function around io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding ->uring_lock, so we can just use it directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: remove async request cachePavel Begunkov2024-04-151-22/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_req_complete_post() was a sole user of ->locked_free_list, but since we just gutted the function, the cache is not used anymore and can be removed. ->locked_free_list served as an asynhronous counterpart of the main request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq. Now they're all forced to be completed into the main cache directly, off of the normal completion path or via io_free_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: turn implicit assumptions into a warningPavel Begunkov2024-04-151-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | io_req_complete_post() is now io-wq only and shouldn't be used outside of it, i.e. it relies that io-wq holds a ref for the request as explained in a comment below. Let's add a warning to enforce the assumption and make sure nobody would try to do anything weird. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: kill dead code in io_req_complete_postMing Lei2024-04-151-35/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"), io_req_complete_post() is only called from io-wq submit work, where the request reference is guaranteed to be grabbed and won't drop to zero in io_req_complete_post(). Kill the dead code, meantime add req_ref_put() to put the reference. Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: move mapping/allocation helpers to a separate fileJens Axboe2024-04-151-325/+2
| | | | | | | | | | | | | | | | Move the related code from io_uring.c into memmap.c. No functional changes in this patch, just cleaning it up a bit now that the full transition is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: use unpin_user_pages() where appropriateJens Axboe2024-04-151-3/+1
| | | | | | | | | | | | | | There are a few cases of open-rolled loops around unpin_user_page(), use the generic helper instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ringJens Axboe2024-04-151-42/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_page() and have it Just Work. This requires a bit of effort on the mmap lookup side, as the ctx uring_lock isn't held, which otherwise protects buffer_lists from being torn down, and it's not safe to grab from mmap context that would introduce an ABBA deadlock between the mmap lock and the ctx uring_lock. Instead, lookup the buffer_list under RCU, as the the list is RCU freed already. Use the existing reference count to determine whether it's possible to safely grab a reference to it (eg if it's not zero already), and drop that reference when done with the mapping. If the mmap reference is the last one, the buffer_list and the associated memory can go away, since the vma insertion has references to the inserted pages at that point. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: unify io_pin_pages()Jens Axboe2024-04-151-19/+42
| | | | | | | | | | | | | | Move it into io_uring.c where it belongs, and use it in there as well rather than have two implementations of this. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: use vmap() for ring mappingJens Axboe2024-04-151-29/+9
| | | | | | | | | | | | | | This is the last holdout which does odd page checking, convert it to vmap just like what is done for the non-mmap path. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: get rid of remap_pfn_range() for mapping rings/sqesJens Axboe2024-04-151-8/+131
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_pages() and have it Just Work. If possible, allocate a single compound page that covers the range that is needed. If that works, then we can just use page_address() on that page. If we fail to get a compound page, allocate single pages and use vmap() to map them into the kernel virtual address space. This just covers the rings/sqes, the other remaining user of the mmap remap_pfn_range() user will be converted separately. Once that is done, we can kill the old alloc/free code. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados2024-04-151-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from kernel_io_uring_disabled_table Signed-off-by: Joel Granados <j.granados@samsung.com> Link: https://lore.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-6-47c1463b3af2@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: Remove unused functionJiapeng Chong2024-04-151-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | The function are defined in the io_uring.c file, but not called elsewhere, so delete the unused function. io_uring/io_uring.c:646:20: warning: unused function '__io_cq_unlock'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8660 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240328022324.78029-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: refill request cache in memory orderJens Axboe2024-04-151-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The allocator will generally return memory in order, but __io_alloc_req_refill() then adds them to a stack and we'll extract them in the opposite order. This obviously isn't a huge deal, but: 1) it makes debugging easier when they are in order 2) keeping them in-order is the right thing to do 3) reduces the code for adding them to the stack Just add them in reverse to the stack. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/poll: shrink alloc cache size to 32Jens Axboe2024-04-151-1/+1
| | | | | | | | | | | | | | This should be plenty, rather than the default of 128, and matches what we have on the rsrc and futex side as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/alloc_cache: switch to array based cachingJens Axboe2024-04-151-15/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently lists are being used to manage this, but best practice is usually to have these in an array instead as that it cheaper to manage. Outside of that detail, games are also played with KASAN as the list is inside the cached entry itself. Finally, all users of this need a struct io_cache_entry embedded in their struct, which is union'ized with something else in there that isn't used across the free -> realloc cycle. Get rid of all of that, and simply have it be an array. This will not change the memory used, as we're just trading an 8-byte member entry for the per-elem array size. This reduces the overhead of the recycled allocations, and it reduces the amount of code code needed to support recycling to about half of what it currently is. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: drop ->prep_async()Jens Axboe2024-04-151-32/+4
| | | | | | | | | | | | | | | | | | | | It's now unused, drop the code related to it. This includes the io_issue_defs->manual alloc field. While in there, and since ->async_size is now being used a bit more frequently and in the issue path, move it to io_issue_defs[]. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/uring_cmd: switch to always allocating async dataJens Axboe2024-04-151-0/+3
| | | | | | | | | | | | | | | | | | Basic conversion ensuring async_data is allocated off the prep path. Adds a basic alloc cache as well, as passthrough IO can be quite high in rate. Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/rw: always setup io_async_rw for read/write requestsJens Axboe2024-04-151-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | read/write requests try to put everything on the stack, and then alloc and copy if a retry is needed. This necessitates a bunch of nasty code that deals with intermediate state. Get rid of this, and have the prep side setup everything that is needed upfront, which greatly simplifies the opcode handlers. This includes adding an alloc cache for io_async_rw, to make it cheap to handle. In terms of cost, this should be basically free and transparent. For the worst case of {READ,WRITE}_FIXED which didn't need it before, performance is unaffected in the normal peak workload that is being used to test that. Still runs at 122M IOPS. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: remove timeout/poll specific cancelationsJens Axboe2024-04-151-9/+0
| | | | | | | | | | | | | | | | | | | | For historical reasons these were special cased, as they were the only ones that needed cancelation. But now we handle cancelations generally, and hence there's no need to check for these in io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles both these and the rest as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: flush delayed fallback task_work in cancelationJens Axboe2024-04-151-0/+2
| | | | | | | | | | | | | | Just like we run the inline task_work, ensure we also factor in and run the fallback task_work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: refactor io_req_complete_post()Pavel Begunkov2024-04-151-18/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests to task_work, it's much cleaner and should normally happen. We couldn't do it before because there was a possibility of looping in complete_post() -> tw -> complete_post() -> ... Also, unexport the function and inline __io_req_complete_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: remove current check from complete_postPavel Begunkov2024-04-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | task_work execution is now always locked, and we shouldn't get into io_req_complete_post() from them. That means that complete_post() is always called out of the original task context and we don't even need to check current. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/24ec27f27db0d8f58c974d8118dca1d345314ddc.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: get rid of intermediate aux cqe cachesPavel Begunkov2024-04-151-49/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_post_aux_cqe(), which is used for multishot requests, delays completions by putting CQEs into a temporary array for the purpose completion lock/flush batching. DEFER_TASKRUN doesn't need any locking, so for it we can put completions directly into the CQ and defer post completion handling with a flag. That leaves !DEFER_TASKRUN, which is not that interesting / hot for multishot requests, so have conditional locking with deferred flush for them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: refactor io_fill_cqe_req_auxPavel Begunkov2024-04-151-13/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The restriction on multishot execution context disallowing io-wq is driven by rules of io_fill_cqe_req_aux(), it should only be called in the master task context, either from the syscall path or in task_work. Since task_work now always takes the ctx lock implying IO_URING_F_COMPLETE_DEFER, we can just assume that the function is always called with its defer argument set to true. Kill the argument. Also rename the function for more consistency as "fill" in CQE related functions was usually meant for raw interfaces only copying data into the CQ without any locking, waking the user and other accounting "post" functions take care of. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>