summaryrefslogtreecommitdiffstats
path: root/io_uring
Commit message (Collapse)AuthorAgeFilesLines
* io_uring: always prep_async for drain requestsDylan Yudaken2023-01-271-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | Drain requests all go through io_drain_req, which has a quick exit in case there is nothing pending (ie the drain is not useful). In that case it can run the issue the request immediately. However for safety it queues it through task work. The problem is that in this case the request is run asynchronously, but the async work has not been prepared through io_req_prep_async. This has not been a problem up to now, as the task work always would run before returning to userspace, and so the user would not have a chance to race with it. However - with IORING_SETUP_DEFER_TASKRUN - this is no longer the case and the work might be defered, giving userspace a chance to change data being referred to in the request. Instead _always_ prep_async for drain requests, which is simpler anyway and removes this issue. Cc: stable@vger.kernel.org Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20230127105911.2420061-1-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: cache provided buffer group value for multishot receivesJens Axboe2023-01-231-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | If we're using ring provided buffers with multishot receive, and we end up doing an io-wq based issue at some points that also needs to select a buffer, we'll lose the initially assigned buffer group as io_ring_buffer_select() correctly clears the buffer group list as the issue isn't serialized by the ctx uring_lock. This is fine for normal receives as the request puts the buffer and finishes, but for multishot, we will re-arm and do further receives. On the next trigger for this multishot receive, the receive will try and pick from a buffer group whose value is the same as the buffer ID of the las receive. That is obviously incorrect, and will result in a premature -ENOUFS error for the receive even if we had available buffers in the correct group. Cache the buffer group value at prep time, so we can restore it for future receives. This only needs doing for the above mentioned case, but just do it by default to keep it easier to read. Cc: stable@vger.kernel.org Fixes: b3fdea6ecb55 ("io_uring: multishot recv") Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Cc: Dylan Yudaken <dylany@meta.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/poll: don't reissue in case of poll race on multishot requestJens Axboe2023-01-201-1/+5
| | | | | | | | | | | | | | | | | | A previous commit fixed a poll race that can occur, but it's only applicable for multishot requests. For a multishot request, we can safely ignore a spurious wakeup, as we never leave the waitqueue to begin with. A blunt reissue of a multishot armed request can cause us to leak a buffer, if they are ring provided. While this seems like a bug in itself, it's not really defined behavior to reissue a multishot request directly. It's less efficient to do so as well, and not required to rearm anything like it is for singleshot poll requests. Cc: stable@vger.kernel.org Fixes: 6e5aedb9324a ("io_uring/poll: attempt request issue after racy poll wakeup") Reported-and-tested-by: Olivier Langlois <olivier@trillion01.com> Link: https://github.com/axboe/liburing/issues/778 Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/msg_ring: fix remote queue to disabled ringPavel Begunkov2023-01-202-2/+10
| | | | | | | | | | | | IORING_SETUP_R_DISABLED rings don't have the submitter task set, so it's not always safe to use ->submitter_task. Disallow posting msg_ring messaged to disabled rings. Also add task NULL check for loosy sync around testing for IORING_SETUP_R_DISABLED. Cc: stable@vger.kernel.org Fixes: 6d043ee1164ca ("io_uring: do msg_ring in target task via tw") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/msg_ring: fix flagging remote executionPavel Begunkov2023-01-201-17/+23
| | | | | | | | | | | | | | | There is a couple of problems with queueing a tw in io_msg_ring_data() for remote execution. First, once we queue it the target ring can go away and so setting IORING_SQ_TASKRUN there is not safe. Secondly, the userspace might not expect IORING_SQ_TASKRUN. Extract a helper and uniformly use TWA_SIGNAL without TWA_SIGNAL_NO_IPI tricks for now, just as it was done in the original patch. Cc: stable@vger.kernel.org Fixes: 6d043ee1164ca ("io_uring: do msg_ring in target task via tw") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/msg_ring: fix missing lock on overflow for IOPOLLJens Axboe2023-01-191-9/+30
| | | | | | | | | | | If the target ring is configured with IOPOLL, then we always need to hold the target ring uring_lock before posting CQEs. We could just grab it unconditionally, but since we don't expect many target rings to be of this type, make grabbing the uring_lock conditional on the ring type. Link: https://lore.kernel.org/io-uring/Y8krlYa52%2F0YGqkg@ip-172-31-85-199.ec2.internal/ Reported-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/msg_ring: move double lock/unlock helpers higher upJens Axboe2023-01-191-24/+23
| | | | | | | | | In preparation for needing them somewhere else, move them and get rid of the unused 'issue_flags' for the unlock side. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: lock overflowing for IOPOLLPavel Begunkov2023-01-131-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reports an issue with overflow filling for IOPOLL: WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734 CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0 Workqueue: events_unbound io_ring_exit_work Call trace:  io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734  io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773  io_fill_cqe_req io_uring/io_uring.h:168 [inline]  io_do_iopoll+0x474/0x62c io_uring/rw.c:1065  io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513  io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056  io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869  process_one_work+0x2d8/0x504 kernel/workqueue.c:2289  worker_thread+0x340/0x610 kernel/workqueue.c:2436  kthread+0x12c/0x158 kernel/kthread.c:376  ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863 There is no real problem for normal IOPOLL as flush is also called with uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL, for which __io_cqring_overflow_flush() happens from the CQ waiting path. Reported-and-tested-by: syzbot+6805087452d72929404e@syzkaller.appspotmail.com Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/poll: attempt request issue after racy poll wakeupJens Axboe2023-01-121-11/+20
| | | | | | | | | | | | | | | | | | | | | | If we have multiple requests waiting on the same target poll waitqueue, then it's quite possible to get a request triggered and get disappointed in not being able to make any progress with it. If we race in doing so, we'll potentially leave the poll request on the internal tables, but removed from the waitqueue. That means that any subsequent trigger of the poll waitqueue will not kick that request into action, causing an application to potentially wait for completion of a request that will never happen. Fix this by adding a new poll return state, IOU_POLL_REISSUE. Rather than have complicated logic for how to re-arm a given type of request, just punt it for a reissue. While in there, move the 'ret' variable to the only section where it gets used. This avoids confusion the scope of it. Cc: stable@vger.kernel.org Fixes: eb0089d629ba ("io_uring: single shot poll removal optimisation") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/fdinfo: include locked hash table in fdinfo outputJens Axboe2023-01-101-2/+10
| | | | | | | | | | | | A previous commit split the hash table for polled requests into two parts, but didn't get the fdinfo output updated. This means that it's less useful for debugging, as we may think a given request is not pending poll. Fix this up by dumping the locked hash table contents too. Fixes: 9ca9fb24d5fe ("io_uring: mutex locked poll hashing") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/poll: add hash if ready poll request can't complete inlineJens Axboe2023-01-091-5/+12
| | | | | | | | | | | | | If we don't, then we may lose access to it completely, leading to a request leak. This will eventually stall the ring exit process as well. Cc: stable@vger.kernel.org Fixes: 49f1c68e048f ("io_uring: optimise submission side poll_refs") Reported-and-tested-by: syzbot+6c95df01470a47fc3af4@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/0000000000009f829805f1ce87b2@google.com/ Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/io-wq: only free worker if it was allocated for creationJens Axboe2023-01-081-1/+6
| | | | | | | | | | | We have two types of task_work based creation, one is using an existing worker to setup a new one (eg when going to sleep and we have no free workers), and the other is allocating a new worker. Only the latter should be freed when we cancel task_work creation for a new worker. Fixes: af82425c6a2d ("io_uring/io-wq: free worker if task_work creation is canceled") Reported-by: syzbot+d56ec896af3637bdb7e4@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: fix CQ waiting timeout handlingPavel Begunkov2023-01-051-3/+3
| | | | | | | | | | | | | | Jiffy to ktime CQ waiting conversion broke how we treat timeouts, in particular we rearm it anew every time we get into io_cqring_wait_schedule() without adjusting the timeout. Waiting for 2 CQEs and getting a task_work in the middle may double the timeout value, or even worse in some cases task may wait indefinitely. Cc: stable@vger.kernel.org Fixes: 228339662b398 ("io_uring: don't convert to jiffies for waiting on timeouts") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f7bffddd71b08f28a877d44d37ac953ddb01590d.1672915663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: lockdep annotate CQ lockingPavel Begunkov2023-01-032-3/+17
| | | | | | | | | | Locking around CQE posting is complex and depends on options the ring is created with, add more thorough lockdep annotations checking all invariants. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/aa3770b4eacae3915d782cc2ab2f395a99b4b232.1672795976.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: pin context while queueing deferred twPavel Begunkov2023-01-031-1/+7
| | | | | | | | | | | | | Unlike normal tw, nothing prevents deferred tw to be executed right after an tw item added to ->work_llist in io_req_local_work_add(). For instance, the waiting task may get waken up by CQ posting or a normal tw. Thus we need to pin the ring for the rest of io_req_local_work_add() Cc: stable@vger.kernel.org Fixes: c0e0d6ba25f18 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1a79362b9c10b8523ef70b061d96523650a23344.1672795998.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/io-wq: free worker if task_work creation is canceledJens Axboe2023-01-021-0/+1
| | | | | | | | | | If we cancel the task_work, the worker will never come into existance. As this is the last reference to it, ensure that we get it freed appropriately. Cc: stable@vger.kernel.org Reported-by: 진호 <wnwlsgh98@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: check for valid register opcode earlierJens Axboe2022-12-231-2/+3
| | | | | | | | We only check the register opcode value inside the restricted ring section, move it into the main io_uring_register() function instead and check it up front. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/cancel: re-grab ctx mutex after finishing waitJens Axboe2022-12-211-5/+4
| | | | | | | | | | | | | | | If we have a signal pending during cancelations, it'll cause the task_work run to return an error. Since we didn't run task_work, the current task is left in TASK_INTERRUPTIBLE state when we need to re-grab the ctx mutex, and the kernel will rightfully complain about that. Move the lock grabbing for the error cases outside the loop to avoid that issue. Reported-by: syzbot+7df055631cd1be4586fd@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/0000000000003a14a905f05050b0@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: finish waiting before flushing overflow entriesJens Axboe2022-12-211-9/+16
| | | | | | | | | | | | | | | If we have overflow entries being generated after we've done the initial flush in io_cqring_wait(), then we could be flushing them in the main wait loop as well. If that's done after having added ourselves to the cq_wait waitqueue, then the task state can be != TASK_RUNNING when we enter the overflow flush. Check for the need to overflow flush, and finish our wait cycle first if we have to do so. Reported-and-tested-by: syzbot+cf6ea1d6bb30a4ce10b2@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/000000000000cb143a05f04eee15@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: fix cleanup after recyclePavel Begunkov2022-12-191-1/+1
| | | | | | | | | | Don't access io_async_msghdr io_netmsg_recycle(), it may be reallocated. Cc: stable@vger.kernel.org Fixes: 9bb66906f23e5 ("io_uring: support multishot in recvmsg") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9e326f4ad4046ddadf15bf34bf3fa58c6372f6b5.1671461985.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: ensure compat import handlers clear free_iovJens Axboe2022-12-191-0/+1
| | | | | | | | | | | If we're not allocating the vectors because the count is below UIO_FASTIOV, we still do need to properly clear ->free_iov to prevent an erronous free of on-stack data. Reported-by: Jiri Slaby <jirislaby@gmail.com> Fixes: 4c17a496a7a0 ("io_uring/net: fix cleanup double free free_iov init") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: include task_work run after scheduling in wait for eventsJens Axboe2022-12-171-1/+10
| | | | | | | | | | | | | | It's quite possible that we got woken up because task_work was queued, and we need to process this task_work to generate the events waited for. If we return to the wait loop without running task_work, we'll end up adding the task to the waitqueue again, only to call io_cqring_wait_schedule() again which will run the task_work. This is less efficient than it could be, as it requires adding to the cq_wait queue again. It also triggers the wakeup path for completions as cq_wait is now non-empty with the task itself, and it'll require another lock grab and deletion to remove ourselves from the waitqueue. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: don't use TIF_NOTIFY_SIGNAL to test for availability of task_workJens Axboe2022-12-171-2/+1
| | | | | | | | | Use task_work_pending() as a better test for whether we have task_work or not, TIF_NOTIFY_SIGNAL is only valid if the any of the task_work items had been queued with TWA_SIGNAL as the notification mechanism. Hence task_work_pending() is a more reliable check. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: use call_rcu_hurry if signaling an eventfdDylan Yudaken2022-12-151-1/+1
| | | | | | | | | | | | | | | | io_uring uses call_rcu in the case it needs to signal an eventfd as a result of an eventfd signal, since recursing eventfd signals are not allowed. This should be calling the new call_rcu_hurry API to not delay the signal. Signed-off-by: Dylan Yudaken <dylany@meta.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Paul E. McKenney <paulmck@kernel.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20221215184138.795576-1-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: fix overflow handling regressionPavel Begunkov2022-12-152-2/+2
| | | | | | | | | | | | Because the single task locking series got reordered ahead of the timeout and completion lock changes, two hunks inadvertently ended up using __io_fill_cqe_req() rather than io_fill_cqe_req(). This meant that we dropped overflow handling in those two spots. Reinstate the correct CQE filling helper. Fixes: f66f73421f0a ("io_uring: skip spinlocking for ->task_complete") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: ease timeout flush locking requirementsPavel Begunkov2022-12-142-7/+4
| | | | | | | | We don't need completion_lock for timeout flushing, don't take it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e3dc657975ac445b80e7bdc40050db783a5935a.1670002973.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: revise completion_lock lockingPavel Begunkov2022-12-143-15/+20
| | | | | | | | | | | io_kill_timeouts() doesn't post any events but queues everything to task_work. Locking there is needed for protecting linked requests traversing, we should grab completion_lock directly instead of using io_cq_[un]lock helpers. Same goes for __io_req_find_next_prep(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/88e75d481a65dc295cb59722bb1cf76402d1c06b.1670002973.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: protect cq_timeouts with timeout_lockPavel Begunkov2022-12-141-1/+3
| | | | | | | | | | | | Read cq_timeouts in io_flush_timeouts() only after taking the timeout_lock, as it's protected by it. There are many places where we also grab ->completion_lock, but for instance io_timeout_fn() doesn't and still modifies cq_timeouts. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9c79544dd6cf5c4018cb1bab99cf481a93ea46ef.1670002973.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds2022-12-131-1/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - Refactor PCIe probing and reset (Christoph Hellwig) - Various fabrics authentication fixes and improvements (Sagi Grimberg) - Avoid fallback to sequential scan due to transient issues (Uday Shankar) - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - Force reconnect when number of queue changes in nvmet (Daniel Wagner) - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET) - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni) - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig) - Cleanup the nvme-pci removal path (Christoph Hellwig) - Use kstrtobool() instead of strtobool (Christophe JAILLET) - Allow unprivileged passthrough of Identify Controller (Joel Granados) - Support io stats on the mpath device (Sagi Grimberg) - Minor nvmet cleanup (Sagi Grimberg) - MD pull requests via Song: - Code cleanups (Christoph) - Various fixes - Floppy pull request from Denis: - Fix a memory leak in the init error path (Yuan) - Series fixing some batch wakeup issues with sbitmap (Gabriel) - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg) - Fix for partition read on an exclusively opened bdev (Jan) - Series of elevator API cleanups (Jinlong, Christoph) - Series of fixes and cleanups for blk-iocost (Kemeng) - Series of fixes and cleanups for blk-throttle (Kemeng) - Series adding concurrent support for sync queues in BFQ (Yu) - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp) - Misc drbd fixes (Wang) - blk-wbt fixes and tweaks for enable/disable (Yu) - Fixes for mq-deadline for zoned devices (Damien) - Add support for read-only and offline zones for null_blk (Shin'ichiro) - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph) - Series enabling bio alloc caching for IRQ based IO (Pavel) - Series enabling userspace peer-to-peer DMA (Logan) - BFQ waker fixes (Khazhismel) - Series fixing elevator refcount issues (Christoph, Jinlong) - Series cleaning up references around queue destruction (Christoph) - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao) - Series untangling the queue kobject and queue references (Christoph) - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph) * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits) blktrace: Fix output non-blktrace event when blk_classic option enabled block: sed-opal: Don't include <linux/kernel.h> sed-opal: allow using IOC_OPAL_SAVE for locking too blk-cgroup: Fix typo in comment block: remove bio_set_op_attrs nvmet: don't open-code NVME_NS_ATTR_RO enumeration nvme-pci: use the tagset alloc/free helpers nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers nvme: consolidate setting the tagset flags nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set block: bio_copy_data_iter nvme-pci: split out a nvme_pci_ctrl_is_dead helper nvme-pci: return early on ctrl state mismatch in nvme_reset_work nvme-pci: rename nvme_disable_io_queues nvme-pci: cleanup nvme_suspend_queue nvme-pci: remove nvme_pci_disable nvme-pci: remove nvme_disable_admin_queue nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl nvme: use nvme_wait_ready in nvme_shutdown_ctrl ...
| * io_uring/rw: enable bio caches for IRQ rwPavel Begunkov2022-11-161-1/+2
| | | | | | | | | | | | | | | | | | Now we can use IOCB_ALLOC_CACHE not only for iopoll'ed reads/write but also for normal IRQ driven I/O. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fb8bd092ed5a4a3b037e84e4777074d07aa5639a.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds2022-12-1310-190/+359
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring updates part two from Jens Axboe: - Misc fixes (me, Lin) - Series from Pavel extending the single task exclusive ring mode, yielding nice improvements for the common case of having a single ring per thread (Pavel) - Cleanup for MSG_RING, removing our IOPOLL hack (Pavel) - Further poll cleanups and fixes (Pavel) - Misc cleanups and fixes (Pavel) * tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux: (22 commits) io_uring/msg_ring: flag target ring as having task_work, if needed io_uring: skip spinlocking for ->task_complete io_uring: do msg_ring in target task via tw io_uring: extract a io_msg_install_complete helper io_uring: get rid of double locking io_uring: never run tw and fallback in parallel io_uring: use tw for putting rsrc io_uring: force multishot CQEs into task context io_uring: complete all requests in task context io_uring: don't check overflow flush failures io_uring: skip overflow CQE posting for dying ring io_uring: improve io_double_lock_ctx fail handling io_uring: dont remove file from msg_ring reqs io_uring: reshuffle issue_flags io_uring: don't reinstall quiesce node for each tw io_uring: improve rsrc quiesce refs checks io_uring: don't raw spin unlock to match cq_lock io_uring: combine poll tw handlers io_uring: improve poll warning handling io_uring: remove ctx variable in io_poll_check_events ...
| * | io_uring/msg_ring: flag target ring as having task_work, if neededJens Axboe2022-12-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before the recent change, we didn't even wake the targeted task when posting the cqe remotely. Now we do wake it, but we do want to ensure that the recipient knows there's potential work there that needs to get processed to get the CQE posted. OR in IORING_SQ_TASKRUN for that purpose. Link: https://lore.kernel.org/io-uring/2843c6b4-ba9a-b67d-e0f4-957f42098489@kernel.dk/ Fixes: 6d043ee1164c ("io_uring: do msg_ring in target task via tw") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: skip spinlocking for ->task_completePavel Begunkov2022-12-072-20/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ->task_complete was added to serialised CQE posting by doing it from the task context only (or fallback wq when the task is dead), and now we can use that to avoid taking ->completion_lock while filling CQ entries. The patch skips spinlocking only in two spots, __io_submit_flush_completions() and flushing in io_aux_cqe, it's safer and covers all cases we care about. Extra care is taken to force taking the lock while queueing overflow entries. It fundamentally relies on SINGLE_ISSUER to have only one task posting events. It also need to take into account overflowed CQEs, flushing of which happens in the cq wait path, and so this implementation also needs DEFER_TASKRUN to limit waiters. For the same reason we disable it for SQPOLL, and for IOPOLL as it won't benefit from it in any case. DEFER_TASKRUN, SQPOLL and IOPOLL requirement may be relaxed in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a8c91fd82cfcdcc1d2e5bac7051fe2c183bda73.1670384893.git.asml.silence@gmail.com [axboe: modify to apply] Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: do msg_ring in target task via twPavel Begunkov2022-12-071-3/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While executing in a context of one io_uring instance msg_ring manipulates another ring. We're trying to keep CQEs posting contained in the context of the ring-owner task, use task_work to send the request to the target ring's task when we're modifying its CQ or trying to install a file. Note, we can't safely use io_uring task_work infra and have to use task_work directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d76c7b28ed5d71b520de4482fbb7f660f21cd80.1670384893.git.asml.silence@gmail.com [axboe: use TWA_SIGNAL_NO_IPI] Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: extract a io_msg_install_complete helperPavel Begunkov2022-12-071-13/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | Extract a helper called io_msg_install_complete() from io_msg_send_fd(), will be used later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1500ca1054cc4286a3ee1c60aacead57fcdfa02a.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: get rid of double lockingPavel Begunkov2022-12-073-36/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need to take both uring_locks at once, msg_ring can be split in two parts, first getting a file from the filetable of the first ring and then installing it into the second one. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a80ecc2bc99c3b3f2cf20015d618b7c51419a797.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: never run tw and fallback in parallelPavel Begunkov2022-12-071-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once we fallback a tw we want all requests to that task to be given to the fallback wq so we dont run it in parallel with the last, i.e. post PF_EXITING, tw run of the task. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/96f4987265c4312f376f206511c6af3e77aaf5ac.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: use tw for putting rsrcPavel Begunkov2022-12-073-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use task_work for completing rsrc removals, it'll be needed later for spinlock optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cbba5d53a11ee6fc2194dacea262c1d733c8b529.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: force multishot CQEs into task contextPavel Begunkov2022-12-071-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multishot are posting CQEs outside of the normal request completion path, which is usually done from within a task work handler. However, it might be not the case when it's yet to be polled but has been punted to io-wq. Make it abide ->task_complete and push it to the polling path when executed by io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d7714aaff583096769a0f26e8e747759e556feb1.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: complete all requests in task contextPavel Begunkov2022-12-071-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds ctx->task_complete flag. If set, we'll complete all requests in the context of the original task. Note, this extends to completion CQE posting only but not io_kiocb cleanup / free, e.g. io-wq may free the requests in the free calllback. This flag will be used later for optimisations purposes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/21ece72953f76bb2e77659a72a14326227ab6460.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: don't check overflow flush failuresPavel Begunkov2022-12-071-18/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only way to fail overflowed CQEs flush is for CQ to be fully packed. There is one place checking for flush failures, i.e. io_cqring_wait(), but we limit the number to be waited for by the CQ size, so getting a failure automatically means that we're done with waiting. Don't check for failures, rarely but they might spuriously fail CQ waiting with -EBUSY. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6b720a45c03345655517f8202cbd0bece2848fb2.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: skip overflow CQE posting for dying ringPavel Begunkov2022-12-071-14/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After io_ring_ctx_wait_and_kill() is called there should be no users poking into rings and so there is no need to post CQEs. So, instead of trying to post overflowed CQEs into the CQ, drop them. Also, do it in io_ring_exit_work() in a loop to reduce the number of contexts it can be executed from and even when it struggles to quiesce the ring we won't be leaving memory allocated for longer than needed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/26d13751155a735a3029e24f8d9ca992f810419d.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: improve io_double_lock_ctx fail handlingPavel Begunkov2022-12-071-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | msg_ring will fail the request if it can't lock rings, instead punt it to io-wq as was originally intended. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4697f05afcc37df5c8f89e2fe6d9c7c19f0241f9.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: dont remove file from msg_ring reqsPavel Begunkov2022-12-074-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should not be messing with req->file outside of core paths. Clearing it makes msg_ring non reentrant, i.e. luckily io_msg_send_fd() fails the request on failed io_double_lock_ctx() but clearly was originally intended to do retries instead. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e5ac9edadb574fe33f6d727cb8f14ce68262a684.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: don't reinstall quiesce node for each twPavel Begunkov2022-11-301-20/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to reinit data and install a new rsrc node every time we get a task_work, it's detrimental, just execute it and conitnue waiting. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3895d3344164cd9b3a0bbb24a6e357e20a13434b.1669821213.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: improve rsrc quiesce refs checksPavel Begunkov2022-11-301-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do a little bit of refactoring of io_rsrc_ref_quiesce(), flatten the data refs checks and so get rid of a conditional weird unlock-else-break construct. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d21283e9f88a77612c746ed526d86fe3bfb58a70.1669821213.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: don't raw spin unlock to match cq_lockPavel Begunkov2022-11-302-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is one newly added place when we lock ring with io_cq_lock() but unlocking is hand coded calling spin_unlock directly. It's ugly and troublesome in the long run. Make it consistent with the other completion locking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4ca4f0564492b90214a190cd5b2a6c76522de138.1669821213.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: combine poll tw handlersPavel Begunkov2022-11-301-33/+21
| | | | | | | | | | | | | | | | | | | | | | | | Merge apoll and regular poll tw handlers, it will help with inlining. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/482e59edb9fc81bd275fdbf486837330fb27120a.1669821213.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: improve poll warning handlingPavel Begunkov2022-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't try to complete requests if their refs are broken and we've got a warning, it's much better to drop them and potentially leaking than double freeing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/31edf9f96f05d03ab62c114508a231a2dce434cb.1669821213.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: remove ctx variable in io_poll_check_eventsPavel Begunkov2022-11-301-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | ctx is only used by io_poll_check_events() for multishot poll CQE posting, don't save it on stack in advance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/552c1771f8a0e7688afdb4f538ead245f53e80e7.1669821213.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>