path: root/fs
Commit message | Author | Age | Files | Lines
* Merge tag 'for-5.19/io_uring-xattr-2022-05-22' of git://git.kernel.dk/linux-block | Linus Torvalds | 2022-05-23 | 3 | -72/+422

    Pull io_uring xattr support from Jens Axboe:
     "Support for the xattr variants"

    * tag 'for-5.19/io_uring-xattr-2022-05-22' of git://git.kernel.dk/linux-block:
      io_uring: cleanup error-handling around io_req_complete
      io_uring: fix trace for reduced sqe padding
      io_uring: add fgetxattr and getxattr support
      io_uring: add fsetxattr and setxattr support
      fs: split off do_getxattr from getxattr
      fs: split off setxattr_copy and do_setxattr function from setxattr
| * io_uring: cleanup error-handling around io_req_complete | Kanchan Joshi | 2022-04-24 | 1 | -29/+5

    Move common error-handling to io_req_complete, so that various callers
    avoid repeating that. A few callers (io_tee, io_splice) require
    slightly different handling; these are changed to use
    __io_req_complete instead.

    Suggested-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220422101048.419942-1-joshi.k@samsung.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: add fgetxattr and getxattr support | Stefan Roesch | 2022-04-24 | 1 | -0/+129

    This adds support to io_uring for the fgetxattr and getxattr API.

    Signed-off-by: Stefan Roesch <shr@fb.com>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/r/20220323154420.3301504-5-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: add fsetxattr and setxattr support | Stefan Roesch | 2022-04-24 | 1 | -0/+165

    This adds support to io_uring for the fsetxattr and setxattr API.

    Signed-off-by: Stefan Roesch <shr@fb.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20220323154420.3301504-4-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
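    Illustration (not part of the commit): a minimal userspace sketch of
    driving the new xattr opcodes, assuming a liburing release that ships
    the xattr prep helpers (2.2 or later). The path "./file.txt" and
    attribute name "user.comment" are placeholders.

        /* Sketch: async getxattr via io_uring; error handling trimmed. */
        #include <liburing.h>
        #include <stdio.h>

        int main(void)
        {
            struct io_uring ring;
            struct io_uring_cqe *cqe;
            char value[256];

            if (io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            /* placeholder attribute name and path */
            io_uring_prep_getxattr(sqe, "user.comment", value,
                                   "./file.txt", sizeof(value));
            io_uring_submit(&ring);

            if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                if (cqe->res >= 0)   /* res is the value length, as in getxattr(2) */
                    printf("xattr value (%d bytes): %.*s\n",
                           cqe->res, cqe->res, value);
                io_uring_cqe_seen(&ring, cqe);
            }
            io_uring_queue_exit(&ring);
            return 0;
        }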
| * fs: split off do_getxattr from getxattr | Stefan Roesch | 2022-04-24 | 2 | -21/+43

    This splits off the do_getxattr function from the getxattr function.
    This will allow io_uring to call it from its io worker.

    Signed-off-by: Stefan Roesch <shr@fb.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20220323154420.3301504-3-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * fs: split off setxattr_copy and do_setxattr function from setxattr | Stefan Roesch | 2022-04-24 | 2 | -25/+83

    This splits off the setup part of the setxattr function into its own
    dedicated function called setxattr_copy. In addition it also exposes a
    new function called do_setxattr for making the setxattr call. This
    makes it possible to call these two functions from io_uring in the
    processing of an xattr request.

    Signed-off-by: Stefan Roesch <shr@fb.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20220323154420.3301504-2-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block | Linus Torvalds | 2022-05-23 | 3 | -1017/+1643

    Pull io_uring updates from Jens Axboe:
     "Here are the main io_uring changes for 5.19. This contains:

      - Fixes for sparse type warnings (Christoph, Vasily)
      - Support for multi-shot accept (Hao)
      - Support for io_uring managed fixed files, rather than always
        needing the application to manage the indices (me)
      - Fix for a spurious poll wakeup (Dylan)
      - CQE overflow fixes (Dylan)
      - Support more types of cancelations (me)
      - Support for co-operative task_work signaling, rather than always
        forcing an IPI (me)
      - Support for doing poll first when appropriate, rather than always
        attempting a transfer first (me)
      - Provided buffer cleanups and support for mapped buffers (me)
      - Improve how io_uring handles inflight SCM files (Pavel)
      - Speedups for registered files (Pavel, me)
      - Organize the completion data in a struct in io_kiocb rather than
        keep it in separate spots (Pavel)
      - task_work improvements (Pavel)
      - Cleanup and optimize the submission path, in general and for
        handling links (Pavel)
      - Speedups for registered resource handling (Pavel)
      - Support sparse buffers and file maps (Pavel, me)
      - Various fixes and cleanups (Almog, Pavel, me)"

    * tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block: (111 commits)
      io_uring: fix incorrect __kernel_rwf_t cast
      io_uring: disallow mixed provided buffer group registrations
      io_uring: initialize io_buffer_list head when shared ring is unregistered
      io_uring: add fully sparse buffer registration
      io_uring: use rcu_dereference in io_close
      io_uring: consistently use the EPOLL* defines
      io_uring: make apoll_events a __poll_t
      io_uring: drop a spurious inline on a forward declaration
      io_uring: don't use ERR_PTR for user pointers
      io_uring: use a rwf_t for io_rw.flags
      io_uring: add support for ring mapped supplied buffers
      io_uring: add io_pin_pages() helper
      io_uring: add buffer selection support to IORING_OP_NOP
      io_uring: fix locking state for empty buffer group
      io_uring: implement multishot mode for accept
      io_uring: let fast poll support multishot
      io_uring: add REQ_F_APOLL_MULTISHOT for requests
      io_uring: add IORING_ACCEPT_MULTISHOT for accept
      io_uring: only wake when the correct events are set
      io_uring: avoid io-wq -EAGAIN looping for !IOPOLL
      ...
| * | io_uring: disallow mixed provided buffer group registrations | Jens Axboe | 2022-05-18 | 1 | -3/+5

    It's nonsensical to register a provided buffer ring if a classic
    provided buffer group with the same ID exists. Depending on the order
    in which we decide what type to pick, the other type will never get
    used. Explicitly disallow it and return an error if this is attempted.

    Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: initialize io_buffer_list head when shared ring is unregistered | Jens Axboe | 2022-05-18 | 1 | -0/+3

    We use ->buf_pages != 0 to tell if this is a shared buffer ring or a
    classic provided buffer group. If we unregister the shared ring and
    then attempt to use it, buf_pages is zero yet the classic list head
    isn't properly initialized. This causes io_buffer_select() to think
    that we have classic buffers available, but then we crash when we try
    and get one from the list.

    Just initialize the list if we unregister a shared buffer ring,
    leaving it in a sane state for either re-registration or for
    attempting to use it. And do the same for the initial setup from the
    classic path.

    Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add fully sparse buffer registration | Pavel Begunkov | 2022-05-18 | 1 | -7/+15

    Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but fixed
    buffers as well. It makes the rsrc API more consistent.

    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
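    Illustration (not part of the commit): at the liburing level this
    surfaces as a sparse-registration helper, a sketch assuming liburing
    2.2 or later with an already-initialized ring:

        /* Sketch: reserve 1024 empty fixed-buffer slots up front, without
         * building and passing a full iovec array; slots are filled in
         * later with buffer update operations. */
        io_uring_register_buffers_sparse(&ring, 1024);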
| * | io_uring: use rcu_dereference in io_close | Christoph Hellwig | 2022-05-18 | 1 | -1/+2

    Accessing the file table needs a rcu_dereference_protected().

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220518084005.3255380-7-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: consistently use the EPOLL* defines | Christoph Hellwig | 2022-05-18 | 1 | -4/+4

    POLL* are unannotated values for the userspace ABI, while everything
    in-kernel should use EPOLL* and the __poll_t type.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: make apoll_events a __poll_t | Christoph Hellwig | 2022-05-18 | 1 | -3/+4

    apoll_events is fed to vfs_poll and the poll tables, so it should be a
    __poll_t.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: drop a spurious inline on a forward declaration | Christoph Hellwig | 2022-05-18 | 1 | -1/+1

    io_file_get_normal isn't marked inline, so don't claim it as such in
    the forward declaration.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: don't use ERR_PTR for user pointers | Christoph Hellwig | 2022-05-18 | 1 | -46/+37

    ERR_PTR abuses the high bits of a pointer to transport error
    information. This is only safe for kernel pointers and not user
    pointers. Fix io_buffer_select and its helpers to just return NULL for
    failure and get rid of this abuse.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: use a rwf_t for io_rw.flags | Christoph Hellwig | 2022-05-18 | 1 | -1/+1

    Use the proper type.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add support for ring mapped supplied buffers | Jens Axboe | 2022-05-18 | 1 | -12/+222

    Provided buffers allow an application to supply io_uring with buffers
    that can then be grabbed for a read/receive request, when the data
    source is ready to deliver data. The existing scheme relies on using
    IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
    in real world applications. It's pretty efficient if the application
    is able to supply back batches of provided buffers when they have been
    consumed and the application is ready to recycle them, but if
    fragmentation occurs in the buffer space, it can become difficult to
    supply enough buffers at a time. This hurts efficiency.

    Add a register op, IORING_REGISTER_PBUF_RING, which allows an
    application to setup a shared queue for each buffer group of provided
    buffers. The application can then supply buffers simply by adding them
    to this ring, and the kernel can consume them just as easily. The ring
    shares the head with the application, the tail remains private in the
    kernel.

    Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
    IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to
    the ring, they must use the mapped ring. Mapped provided buffer rings
    can co-exist with normal provided buffers, just not within the same
    group ID.

    To gauge overhead of the existing scheme and evaluate the mapped ring
    approach, a simple NOP benchmark was written. It uses a ring of 128
    entries, and submits/completes 32 at a time. 'Replenish' is how many
    buffers are provided back at a time after they have been consumed:

    Test                    Replenish    NOPs/sec
    =============================================
    No provided buffers     NA           ~30M
    Provided buffers        32           ~16M
    Provided buffers        1            ~10M
    Ring buffers            32           ~27M
    Ring buffers            1            ~27M

    The ring mapped buffers perform almost as well as not using provided
    buffers at all, and they don't care if you provided 1 or more back at
    the same time. This means applications can just replenish as they go,
    rather than need to batch and compact, further reducing overhead in
    the application. The NOP benchmark above doesn't need to do any
    compaction, so that overhead isn't even reflected in the above test.

    Co-developed-by: Dylan Yudaken <dylany@fb.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
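    Illustration (not part of the commit): a hedged sketch of the
    registration flow through the liburing 2.2 helpers. The group ID,
    entry count and buffer size are arbitrary, and error handling is
    trimmed:

        /* Sketch: set up a ring mapped provided buffer ring, group ID 0. */
        #include <liburing.h>
        #include <stdlib.h>

        #define ENTRIES  128    /* must be a power of 2 */
        #define BUF_SIZE 4096

        static struct io_uring_buf_ring *setup_pbuf_ring(struct io_uring *ring,
                                                         char bufs[ENTRIES][BUF_SIZE])
        {
            struct io_uring_buf_ring *br;
            struct io_uring_buf_reg reg = { };
            int i;

            /* the application allocates the ring memory itself */
            if (posix_memalign((void **)&br, 4096,
                               ENTRIES * sizeof(struct io_uring_buf)))
                return NULL;

            reg.ring_addr = (unsigned long)br;
            reg.ring_entries = ENTRIES;
            reg.bgid = 0;    /* buffer group ID */
            if (io_uring_register_buf_ring(ring, &reg, 0))
                return NULL;

            /* hand every buffer to the kernel by appending it to the ring */
            io_uring_buf_ring_init(br);
            for (i = 0; i < ENTRIES; i++)
                io_uring_buf_ring_add(br, bufs[i], BUF_SIZE, i,
                                      io_uring_buf_ring_mask(ENTRIES), i);
            io_uring_buf_ring_advance(br, ENTRIES);
            return br;
        }

    Requests then select from the group exactly as with classic provided
    buffers (IOSQE_BUFFER_SELECT plus sqe->buf_group); replenishing is
    just another io_uring_buf_ring_add() and advance, no extra SQE.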
| * | io_uring: add io_pin_pages() helper | Jens Axboe | 2022-05-18 | 1 | -27/+50

    Abstract this out from io_sqe_buffer_register() so we can use it
    elsewhere too without duplicating this code.

    No intended functional changes in this patch.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add buffer selection support to IORING_OP_NOP | Jens Axboe | 2022-05-18 | 1 | -1/+12

    Obviously not really useful since it's not transferring data, but it
    is helpful in benchmarking overhead of provided buffers.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
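    Illustration (not part of the commit): what such a benchmark request
    looks like, a sketch assuming a ring with buffers registered under
    group 0:

        /* Sketch: NOP with buffer selection, purely to exercise the
         * provided-buffer path. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_nop(sqe);
        sqe->flags |= IOSQE_BUFFER_SELECT;   /* pick a buffer at issue time */
        sqe->buf_group = 0;                  /* group to select from */
        io_uring_submit(&ring);
        /* on completion, cqe->flags >> IORING_CQE_BUFFER_SHIFT is the
         * selected buffer ID */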
| * | io_uring: fix locking state for empty buffer group | Jens Axboe | 2022-05-18 | 1 | -11/+14

    io_provided_buffer_select() must drop the submit lock, if needed, even
    in the error handling case. Failure to do so will leave us with the
    ctx->uring_lock held, causing spew like:

      ====================================
      WARNING: iou-wrk-366/368 still has locks held!
      5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
      ------------------------------------
      1 lock held by iou-wrk-366/368:
       #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48

      stack backtrace:
      CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace.part.0+0xa4/0xd4
       show_stack+0x14/0x5c
       dump_stack_lvl+0x88/0xb0
       dump_stack+0x14/0x2c
       debug_check_no_locks_held+0x84/0x90
       try_to_freeze.isra.0+0x18/0x44
       get_signal+0x94/0x6ec
       io_wqe_worker+0x1d8/0x2b4
       ret_from_fork+0x10/0x20

    and triggering later hangs off get_signal() because we attempt to
    re-grab the lock.

    Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
    Fixes: 149c69b04a90 ("io_uring: abstract out provided buffer list selection")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: implement multishot mode for accept | Hao Xu | 2022-05-14 | 1 | -6/+48

    Refactor io_accept() to support multishot mode.

    Theoretical analysis:
    1) when connections come in fast
       - singleshot:
             add accept sqe(userspace) --> accept inline
                             ^                 |
                             |-----------------|
       - multishot:
             add accept sqe(userspace) --> accept inline
                                              ^     |
                                              |--*--|
         we do accept repeatedly in * place until we get EAGAIN
    2) when connections come in at a low pressure
       similar thing like 1), we reduce a lot of userspace-kernel context
       switches and useless vfs_poll()

    tests:
    Did some tests, which go in this way:

        server    client(multiple)
        accept    connect
        read      write
        write     read
        close     close

    Basically, raise up a number of clients (on the same machine as the
    server) to connect to the server, and then write some data to it. The
    server will write those data back to the client after it receives
    them, and then close the connection after the write returns. Then the
    client will read the data and then close the connection. Here I test
    10000 clients connecting to one server, data size 128 bytes. And each
    client has a go routine for it, so they come to the server in a short
    time.

    test 20 times before/after this patchset, time spent: (unit cycle,
    which is the return value of clock())

    before:
      (1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897
      +1941075+1934374+1906614+1912504+1949110+1908790+1909951+1941672
      +1969525+1934984+1934226+1914385)/20.0 = 1927633.75
    after:
      (1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753
      +1874112+1874985+1882706+1884642+1864694+1906508+1916150+1924250
      +1869060+1889506+1871324+1940803)/20.0 = 1894750.45

    (1927633.75 - 1894750.45) / 1927633.75 = 1.65%

    Signed-off-by: Hao Xu <howeyxu@tencent.com>
    Link: https://lore.kernel.org/r/20220514142046.58072-5-haoxu.linux@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
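    Illustration (not part of the commit): the userspace side of
    multishot accept, a sketch assuming liburing 2.2 or later;
    handle_client() and listen_fd are placeholders:

        /* Sketch: one SQE keeps producing a CQE per incoming connection
         * until it is cancelled or errors out. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        struct io_uring_cqe *cqe;

        io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
        io_uring_submit(&ring);

        for (;;) {
            if (io_uring_wait_cqe(&ring, &cqe))
                break;
            if (cqe->res >= 0)
                handle_client(cqe->res);  /* res is the accepted fd */
            /* a clear IORING_CQE_F_MORE means this was the final CQE
             * and the request must be re-armed */
            int done = !(cqe->flags & IORING_CQE_F_MORE);
            io_uring_cqe_seen(&ring, cqe);
            if (done)
                break;
        }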
| * | io_uring: let fast poll support multishot | Hao Xu | 2022-05-14 | 1 | -15/+32

    For operations like accept, multishot is a useful feature, since it
    reduces the number of accept SQEs needed. Let's integrate it into fast
    poll; it may be good for other operations in the future.

    Signed-off-by: Hao Xu <howeyxu@tencent.com>
    Link: https://lore.kernel.org/r/20220514142046.58072-4-haoxu.linux@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add REQ_F_APOLL_MULTISHOT for requests | Hao Xu | 2022-05-14 | 1 | -0/+5

    Add a flag to indicate multishot mode for fast poll. Currently only
    accept uses it, but there may be more operations leveraging it in the
    future. Also add a mask IO_APOLL_MULTI_POLLED, which stands for
    REQ_F_APOLL_MULTISHOT | REQ_F_POLLED, to make the code shorter and
    cleaner.

    Signed-off-by: Hao Xu <howeyxu@tencent.com>
    Link: https://lore.kernel.org/r/20220514142046.58072-3-haoxu.linux@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: only wake when the correct events are set | Dylan Yudaken | 2022-05-13 | 1 | -2/+3

    The check for waking up a request compares the poll_t bits, however
    this will always contain some common flags, so this always wakes up.
    For files with single wait queues such as sockets this can cause the
    request to be sent to the async worker unnecessarily. Further, if it
    is non-blocking, it will complete the request with EAGAIN, which is
    not desired.

    Here exclude these common events, making sure to not exclude POLLERR
    which might be important.

    Fixes: d7718a9d25a6 ("io_uring: use poll driven retry for files that support it")
    Signed-off-by: Dylan Yudaken <dylany@fb.com>
    Link: https://lore.kernel.org/r/20220512091834.728610-3-dylany@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: avoid io-wq -EAGAIN looping for !IOPOLL | Pavel Begunkov | 2022-05-13 | 1 | -0/+2

    If an opcode handler semi-reliably returns -EAGAIN,
    io_wq_submit_work() might continue to busily hammer the same handler
    over and over again, which is not ideal. The -EAGAIN handling in
    question was put there only for IOPOLL, so restrict it to IOPOLL mode
    only, where there is no other recourse than to retry as we cannot
    wait.

    Fixes: def596e9557c9 ("io_uring: support for IO polling")
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/f168b4f24181942f3614dd8ff648221736f572e6.1652433740.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add flag for allocating a fully sparse direct descriptor space | Jens Axboe | 2022-05-13 | 1 | -3/+12

    Currently, to setup a fully sparse descriptor space upfront, the app
    needs to allocate an array of the full size, memset it to -1, and then
    pass that in. Make this a bit easier by allowing a flag that simply
    does this internally rather than needing to copy each slot separately.

    This works with IORING_REGISTER_FILES2, as the flag is set in struct
    io_uring_rsrc_register, and is only allowed when the type is
    IORING_RSRC_FILE, as this doesn't make sense for registered buffers.

    Reviewed-by: Hao Xu <howeyxu@tencent.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
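    Illustration (not part of the commit): liburing wraps this in a
    one-call helper, a sketch assuming liburing 2.2 or later and an
    arbitrary table size:

        /* Sketch: allocate a fully sparse direct descriptor table up
         * front; all 8192 slots start empty, no -1 array needed. */
        io_uring_register_files_sparse(&ring, 8192);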
| * | io_uring: bump max direct descriptor count to 1M | Jens Axboe | 2022-05-13 | 1 | -1/+1

    We currently limit these to 32K, but since we're now backing the table
    space with vmalloc when needed, there's no reason why we can't make it
    bigger. The total space is limited by RLIMIT_NOFILE as well.

    Reviewed-by: Hao Xu <howeyxu@tencent.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: allow allocated fixed files for accept | Jens Axboe | 2022-05-13 | 1 | -2/+2

    If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
    then that's a hint to allocate a fixed file descriptor rather than
    have one be passed in directly. This can be useful for having io_uring
    manage the direct descriptor space, and also allows multi-shot support
    to work with fixed files.

    Normal accept direct requests will complete with 0 for success, and
    < 0 in case of error. If io_uring is asked to allocate the direct
    descriptor, then the direct descriptor is returned in case of success.

    Reviewed-by: Hao Xu <howeyxu@tencent.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
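    Illustration (not part of the commit): a sketch of a direct-descriptor
    accept where the kernel picks the slot, assuming a recent liburing
    that passes IORING_FILE_INDEX_ALLOC through prep_accept_direct and a
    previously registered file table (e.g. the sparse one above);
    listen_fd is a placeholder:

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_accept_direct(sqe, listen_fd, NULL, NULL, 0,
                                    IORING_FILE_INDEX_ALLOC);
        io_uring_submit(&ring);
        /* on success cqe->res holds the allocated fixed-file index,
         * not a regular fd */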
| * | io_uring: allow allocated fixed files for openat/openat2 | Jens Axboe | 2022-05-13 | 1 | -3/+33

    If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
    then that's a hint to allocate a fixed file descriptor rather than
    have one be passed in directly. This can be useful for having io_uring
    manage the direct descriptor space.

    Normal open direct requests will complete with 0 for success, and < 0
    in case of error. If io_uring is asked to allocate the direct
    descriptor, then the direct descriptor is returned in case of success.

    Reviewed-by: Hao Xu <howeyxu@tencent.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
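    Illustration (not part of the commit): the same allocation hint for
    opens, a sketch under the same liburing assumptions; "data.bin" is a
    placeholder path:

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_openat_direct(sqe, AT_FDCWD, "data.bin", O_RDONLY, 0,
                                    IORING_FILE_INDEX_ALLOC);
        io_uring_submit(&ring);
        /* cqe->res >= 0 is the allocated fixed slot; use it in later
         * SQEs with IOSQE_FIXED_FILE */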
| * | io_uring: add basic fixed file allocator | Jens Axboe | 2022-05-13 | 1 | -0/+29

    Applications currently always pick where they want fixed files to go.
    In preparation for allowing these types of commands with multishot
    support, add a basic allocator in the fixed file table.

    Reviewed-by: Hao Xu <howeyxu@tencent.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: track fixed files with a bitmap | Jens Axboe | 2022-05-13 | 1 | -1/+32

    In preparation for adding a basic allocator for direct descriptors,
    add helpers that set/clear whether a file slot is used.

    Reviewed-by: Hao Xu <howeyxu@tencent.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: don't clear req->kbuf when buffer selection is done | Jens Axboe | 2022-05-09 | 1 | -2/+0

    It's not needed, as the REQ_F_BUFFER_SELECTED flag tracks the state of
    whether or not kbuf is valid, so just drop it.

    Suggested-by: Dylan Yudaken <dylany@fb.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: eliminate the need to track provided buffer ID separately | Jens Axboe | 2022-05-09 | 1 | -6/+10

    We have io_kiocb->buf_index which is used for either fixed buffers, or
    for provided buffers. For the latter, it's used to hold the buffer
    group ID for buffer selection. Post selection, req->kbuf->bid is used
    to get the buffer ID.

    Store the buffer ID, when selected, in req->buf_index. If we do end up
    recycling the buffer, reset it back to the buffer group ID.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: move provided buffer state closer to submit state | Jens Axboe | 2022-05-09 | 1 | -3/+5

    The timeout and other items that follow are less hot, so let's move
    the provided buffer state above that.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: move provided and fixed buffers into the same io_kiocb area | Jens Axboe | 2022-05-09 | 1 | -4/+8

    These are mutually exclusive - if you use provided buffers, then you
    cannot use fixed buffers and vice versa. Move them into the same spot
    in the io_kiocb, which is also advantageous for provided buffers as
    they get near the submit side hot cacheline.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: abstract out provided buffer list selection | Jens Axboe | 2022-05-09 | 1 | -11/+23

    In preparation for providing another way to select a buffer, move the
    existing logic into a helper.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: never call io_buffer_select() for a buffer re-select | Jens Axboe | 2022-05-09 | 1 | -12/+17

    Callers already have room to store the addr and length information,
    clean it up by having the caller just assign the previously provided
    data.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: get rid of hashed provided buffer groups | Jens Axboe | 2022-05-09 | 1 | -39/+58

    Use a plain array for any group ID that's less than 64, and punt
    anything beyond that to an xarray. 64 fits in a page even for 4KB page
    sizes and with the planned additions.

    This makes the expected group usage faster by avoiding a hash and
    lookup to find our list, and it uses less memory upfront by not
    allocating any memory for provided buffers unless it's actually being
    used.

    Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
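    The resulting lookup is roughly this shape, a paraphrased kernel-side
    sketch rather than the literal 5.19 code; the BGID_ARRAY cutoff is the
    64 the message describes, and the io_bl/io_bl_xa field names are
    illustrative:

        /* Sketch: two-tier buffer group lookup - small group IDs hit a
         * flat per-ctx array, larger ones fall back to an xarray. */
        #define BGID_ARRAY	64

        static struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
                                                         unsigned int bgid)
        {
            if (bgid < BGID_ARRAY)
                return &ctx->io_bl[bgid];

            return xa_load(&ctx->io_bl_xa, bgid);
        }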
| * | io_uring: always use req->buf_index for the provided buffer group | Jens Axboe | 2022-05-09 | 1 | -13/+13

    The read/write opcodes use it already, but the recv/recvmsg do not. If
    we switch them over and read and validate this at init time while
    we're checking if the opcode supports it anyway, then we can do it in
    one spot and we don't have to pass in a separate group ID for
    io_buffer_select().

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: ignore ->buf_index if REQ_F_BUFFER_SELECT isn't set | Jens Axboe | 2022-05-09 | 1 | -4/+0

    There's no point in validity checking buf_index if the request doesn't
    have REQ_F_BUFFER_SELECT set, as we will never use it for that case.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: kill io_rw_buffer_select() wrapper | Jens Axboe | 2022-05-09 | 1 | -10/+5

    After the recent changes, this is a direct call to io_buffer_select()
    anyway. With this change, there are no wrappers left for provided
    buffer selection.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: make io_buffer_select() return the user address directly | Jens Axboe | 2022-05-09 | 1 | -26/+20

    There's no point in having callers provide a kbuf, we're just
    returning the address anyway.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: kill io_recv_buffer_select() wrapper | Jens Axboe | 2022-05-05 | 1 | -10/+2

    It's just a thin wrapper around io_buffer_select(), get rid of it.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: use 'sr' vs 'req->sr_msg' consistently | Jens Axboe | 2022-05-05 | 1 | -9/+8

    For all of send/sendmsg and recv/recvmsg we have the local 'sr'
    variable, yet some cases still use req->sr_msg which sr points to. Use
    'sr' consistently.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg | Jens Axboe | 2022-05-05 | 1 | -2/+25

    If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg,
    then we arm poll first rather than attempt a receive or send upfront.
    This can be useful if we expect there to be no data (or space)
    available for the request, as we can then avoid wasting time on the
    initial issue attempt.

    Reviewed-by: Hao Xu <howeyxu@tencent.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
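    Illustration (not part of the commit): setting the flag from
    userspace, a sketch where sock_fd and buf are placeholders; for
    send/recv requests this flag lives in sqe->ioprio in the 5.19 uapi:

        /* Sketch: arm poll before attempting the receive. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_recv(sqe, sock_fd, buf, sizeof(buf), 0);
        sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;
        io_uring_submit(&ring);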
| * | io_uring: check IOPOLL/ioprio support upfront | Jens Axboe | 2022-05-05 | 1 | -95/+58

    Don't punt this check to the op prep handlers, add the support to
    io_op_defs and we can check them while setting up the request.

    This reduces the text size by 500 bytes on aarch64, and makes this
    less fragile by having the check in one spot and needing opcodes to
    opt in to IOPOLL or ioprio support.

    Reviewed-by: Hao Xu <howeyxu@tencent.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: replace smp_mb() with smp_mb__after_atomic() in io_sq_thread() | Almog Khaikin | 2022-04-30 | 1 | -1/+1

    The IORING_SQ_NEED_WAKEUP flag is now set using atomic_or() which
    implies a full barrier on some architectures but it is not required to
    do so. Use the more appropriate smp_mb__after_atomic() which avoids
    the extra barrier on those architectures.

    Signed-off-by: Almog Khaikin <almogkh@gmail.com>
    Link: https://lore.kernel.org/r/20220426163403.112692-1-almogkh@gmail.com
    Fixes: 8018823e6987 ("io_uring: serialize ctx->rings->sq_flags with atomic_or/and")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
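    The pattern in question, sketched for illustration (names follow the
    commit's description; this is not the full io_sq_thread() body):

        /* Sketch: a non-value-returning atomic RMW does not imply a full
         * barrier on all architectures, so pair it with the dedicated
         * follow-up primitive instead of a standalone smp_mb(). */
        atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags);
        /* full barrier only where atomic_or() doesn't already give one */
        smp_mb__after_atomic();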
| * | io_uring: add IORING_SETUP_TASKRUN_FLAG | Jens Axboe | 2022-04-30 | 1 | -3/+11

    If IORING_SETUP_COOP_TASKRUN is set to use cooperative scheduling for
    running task_work, then IORING_SETUP_TASKRUN_FLAG can be set so the
    application can tell if task_work is pending in the kernel for this
    ring. This allows use cases like io_uring_peek_cqe() to still function
    appropriately, or for the task to know when it would be useful to call
    io_uring_wait_cqe() to run pending events.

    Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/20220426014904.60384-7-axboe@kernel.dk
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: use TWA_SIGNAL_NO_IPI if IORING_SETUP_COOP_TASKRUN is used | Jens Axboe | 2022-04-30 | 1 | -4/+13

    If this is set, io_uring will never use an IPI to deliver a task_work
    notification. This can be used in the common case where a single task
    or thread communicates with the ring, and doesn't rely on
    io_uring_peek_cqe(). This provides a noticeable win in performance,
    both from eliminating the IPI itself, but also from avoiding
    interrupting the submitting task unnecessarily.

    Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/20220426014904.60384-6-axboe@kernel.dk
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
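    Illustration (not part of the commit): how these two setup flags
    combine from userspace, a hedged sketch; reading the SQ ring flags
    word via liburing's exported sq.kflags pointer is an assumption about
    liburing's struct layout:

        /* Sketch: cooperative task_work with a visible "work pending"
         * flag. IORING_SETUP_TASKRUN_FLAG makes the kernel set
         * IORING_SQ_TASKRUN in the SQ ring flags. */
        struct io_uring ring;
        struct io_uring_params p = { };

        p.flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
        io_uring_queue_init_params(64, &ring, &p);

        /* before trusting a raw CQ peek, check for pending task_work */
        if (IO_URING_READ_ONCE(*ring.sq.kflags) & IORING_SQ_TASKRUN) {
            /* enter the kernel so pending task_work gets run,
             * e.g. by calling io_uring_wait_cqe() */
        }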
| * | io_uring: set task_work notify method at init time | Jens Axboe | 2022-04-30 | 1 | -12/+11

    While doing so, switch SQPOLL to TWA_SIGNAL_NO_IPI as well, as that
    just does a task wakeup and then we can remove the special wakeup we
    have in task_work_add.

    Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/20220426014904.60384-5-axboe@kernel.dk
    Signed-off-by: Jens Axboe <axboe@kernel.dk>