summaryrefslogtreecommitdiffstats
path: root/io_uring
Commit message (Collapse)AuthorAgeFilesLines
* io_uring: fix cancellation overwriting req->flagsPavel Begunkov2024-06-132-2/+3
| | | | | | | | | | | | | | | | | | Only the current owner of a request is allowed to write into req->flags. Hence, the cancellation path should never touch it. Add a new field instead of the flag, move it into the 3rd cache line because it should always be initialised. poll_refs can move further as polling is an involved process anyway. It's a minimal patch, in the future we can and should find a better place for it and remove now unused REQ_F_CANCEL_SEQ. Fixes: 521223d7c229f ("io_uring/cancel: don't default to setting req->work.cancel_seq") Cc: stable@vger.kernel.org Reported-by: Li Shi <sl1589472800@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6827b129f8f0ad76fa9d1f0a773de938b240ffab.1718323430.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/rsrc: don't lock while !TASK_RUNNINGPavel Begunkov2024-06-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a report of io_rsrc_ref_quiesce() locking a mutex while not TASK_RUNNING, which is due to forgetting restoring the state back after io_run_task_work_sig() and attempts to break out of the waiting loop. do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff815d2494>] prepare_to_wait+0xa4/0x380 kernel/sched/wait.c:237 WARNING: CPU: 2 PID: 397056 at kernel/sched/core.c:10099 __might_sleep+0x114/0x160 kernel/sched/core.c:10099 RIP: 0010:__might_sleep+0x114/0x160 kernel/sched/core.c:10099 Call Trace: <TASK> __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0xb4/0x940 kernel/locking/mutex.c:752 io_rsrc_ref_quiesce+0x590/0x940 io_uring/rsrc.c:253 io_sqe_buffers_unregister+0xa2/0x340 io_uring/rsrc.c:799 __io_uring_register io_uring/register.c:424 [inline] __do_sys_io_uring_register+0x5b9/0x2400 io_uring/register.c:613 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xd8/0x270 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6f/0x77 Reported-by: Li Shi <sl1589472800@gmail.com> Fixes: 4ea15b56f0810 ("io_uring/rsrc: use wq for quiescing") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/77966bc104e25b0534995d5dbb152332bc8f31c0.1718196953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: fix possible deadlock in io_register_iowq_max_workers()Hagar Hemdan2024-06-041-0/+4
| | | | | | | | | | | | | | | | | | | The io_register_iowq_max_workers() function calls io_put_sq_data(), which acquires the sqd->lock without releasing the uring_lock. Similar to the commit 009ad9f0c6ee ("io_uring: drop ctx->uring_lock before acquiring sqd->lock"), this can lead to a potential deadlock situation. To resolve this issue, the uring_lock is released before calling io_put_sq_data(), and then it is re-acquired after the function call. This change ensures that the locks are acquired in the correct order, preventing the possibility of a deadlock. Suggested-by: Maximilian Heyne <mheyne@amazon.de> Signed-off-by: Hagar Hemdan <hagarhem@amazon.com> Link: https://lore.kernel.org/r/20240604130527.3597-1-hagarhem@amazon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/io-wq: avoid garbage value of 'match' in io_wq_enqueue()Su Hui2024-06-041-5/+5
| | | | | | | | | | | | | | | | Clang static checker (scan-build) warning: o_uring/io-wq.c:line 1051, column 3 The expression is an uninitialized value. The computed value will also be garbage. 'match.nr_pending' is used in io_acct_cancel_pending_work(), but it is not fully initialized. Change the order of assignment for 'match' to fix this problem. Fixes: 42abc95f05bf ("io-wq: decouple work_list protection from the big wqe->lock") Signed-off-by: Su Hui <suhui@nfschina.com> Link: https://lore.kernel.org/r/20240604121242.2661244-1-suhui@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/napi: fix timeout calculationJens Axboe2024-06-041-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | Not quite sure what __io_napi_adjust_timeout() was attemping to do, it's adjusting both the NAPI timeout and the general overall timeout, and calculating a value that is never used. The overall timeout is a super set of the NAPI timeout, and doesn't need adjusting. The only thing we really need to care about is that the NAPI timeout doesn't exceed the overall timeout. If a user asked for a timeout of eg 5 usec and NAPI timeout is 10 usec, then we should not spin for 10 usec. While in there, sanitize the time checking a bit. If we have a negative value in the passed in timeout, discard it. Round up the value as well, so we don't end up with a NAPI timeout for the majority of the wait, with only a tiny sleep value at the end. Hence the only case we need to care about is if the NAPI timeout is larger than the overall timeout. If it is, cap the NAPI timeout at what the overall timeout is. Cc: stable@vger.kernel.org Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Reported-by: Lewis Baker <lewissbaker@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: check for non-NULL file pointer in io_file_can_poll()Jens Axboe2024-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In earlier kernels, it was possible to trigger a NULL pointer dereference off the forced async preparation path, if no file had been assigned. The trace leading to that looks as follows: BUG: kernel NULL pointer dereference, address: 00000000000000b0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 67 PID: 1633 Comm: buf-ring-invali Not tainted 6.8.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022 RIP: 0010:io_buffer_select+0xc3/0x210 Code: 00 00 48 39 d1 0f 82 ae 00 00 00 48 81 4b 48 00 00 01 00 48 89 73 70 0f b7 50 0c 66 89 53 42 85 ed 0f 85 d2 00 00 00 48 8b 13 <48> 8b 92 b0 00 00 00 48 83 7a 40 00 0f 84 21 01 00 00 4c 8b 20 5b RSP: 0018:ffffb7bec38c7d88 EFLAGS: 00010246 RAX: ffff97af2be61000 RBX: ffff97af234f1700 RCX: 0000000000000040 RDX: 0000000000000000 RSI: ffff97aecfb04820 RDI: ffff97af234f1700 RBP: 0000000000000000 R08: 0000000000200030 R09: 0000000000000020 R10: ffffb7bec38c7dc8 R11: 000000000000c000 R12: ffffb7bec38c7db8 R13: ffff97aecfb05800 R14: ffff97aecfb05800 R15: ffff97af2be5e000 FS: 00007f852f74b740(0000) GS:ffff97b1eeec0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000b0 CR3: 000000016deab005 CR4: 0000000000370ef0 Call Trace: <TASK> ? __die+0x1f/0x60 ? page_fault_oops+0x14d/0x420 ? do_user_addr_fault+0x61/0x6a0 ? exc_page_fault+0x6c/0x150 ? asm_exc_page_fault+0x22/0x30 ? io_buffer_select+0xc3/0x210 __io_import_iovec+0xb5/0x120 io_readv_prep_async+0x36/0x70 io_queue_sqe_fallback+0x20/0x260 io_submit_sqes+0x314/0x630 __do_sys_io_uring_enter+0x339/0xbc0 ? __do_sys_io_uring_register+0x11b/0xc50 ? vm_mmap_pgoff+0xce/0x160 do_syscall_64+0x5f/0x180 entry_SYSCALL_64_after_hwframe+0x46/0x4e RIP: 0033:0x55e0a110a67e Code: ba cc 00 00 00 45 31 c0 44 0f b6 92 d0 00 00 00 31 d2 41 b9 08 00 00 00 41 83 e2 01 41 c1 e2 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 90 89 30 eb a9 0f 1f 40 00 48 8b 42 20 8b 00 a8 06 75 af 85 f6 because the request is marked forced ASYNC and has a bad file fd, and hence takes the forced async prep path. Current kernels with the request async prep cleaned up can no longer hit this issue, but for ease of backporting, let's add this safety check in here too as it really doesn't hurt. For both cases, this will inevitably end with a CQE posted with -EBADF. Cc: stable@vger.kernel.org Fixes: a76c0b31eef5 ("io_uring: commit non-pollable provided mapped buffers upfront") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: assign kmsg inq/flags before buffer selectionJens Axboe2024-05-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reports that recv is using an uninitialized value: ===================================================== BUG: KMSAN: uninit-value in io_req_cqe_overflow io_uring/io_uring.c:810 [inline] BUG: KMSAN: uninit-value in io_req_complete_post io_uring/io_uring.c:937 [inline] BUG: KMSAN: uninit-value in io_issue_sqe+0x1f1b/0x22c0 io_uring/io_uring.c:1763 io_req_cqe_overflow io_uring/io_uring.c:810 [inline] io_req_complete_post io_uring/io_uring.c:937 [inline] io_issue_sqe+0x1f1b/0x22c0 io_uring/io_uring.c:1763 io_wq_submit_work+0xa17/0xeb0 io_uring/io_uring.c:1860 io_worker_handle_work+0xc04/0x2000 io_uring/io-wq.c:597 io_wq_worker+0x447/0x1410 io_uring/io-wq.c:651 ret_from_fork+0x6d/0x90 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Uninit was stored to memory at: io_req_set_res io_uring/io_uring.h:215 [inline] io_recv_finish+0xf10/0x1560 io_uring/net.c:861 io_recv+0x12ec/0x1ea0 io_uring/net.c:1175 io_issue_sqe+0x429/0x22c0 io_uring/io_uring.c:1751 io_wq_submit_work+0xa17/0xeb0 io_uring/io_uring.c:1860 io_worker_handle_work+0xc04/0x2000 io_uring/io-wq.c:597 io_wq_worker+0x447/0x1410 io_uring/io-wq.c:651 ret_from_fork+0x6d/0x90 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Uninit was created at: slab_post_alloc_hook mm/slub.c:3877 [inline] slab_alloc_node mm/slub.c:3918 [inline] __do_kmalloc_node mm/slub.c:4038 [inline] __kmalloc+0x6e4/0x1060 mm/slub.c:4052 kmalloc include/linux/slab.h:632 [inline] io_alloc_async_data+0xc0/0x220 io_uring/io_uring.c:1662 io_msg_alloc_async io_uring/net.c:166 [inline] io_recvmsg_prep_setup io_uring/net.c:725 [inline] io_recvmsg_prep+0xbe8/0x1a20 io_uring/net.c:806 io_init_req io_uring/io_uring.c:2135 [inline] io_submit_sqe io_uring/io_uring.c:2182 [inline] io_submit_sqes+0x1135/0x2f10 io_uring/io_uring.c:2335 __do_sys_io_uring_enter io_uring/io_uring.c:3246 [inline] __se_sys_io_uring_enter+0x40f/0x3c80 io_uring/io_uring.c:3183 __x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3183 x64_sys_call+0x2c0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:427 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f which appears to be io_recv_finish() reading kmsg->msg.msg_inq to decide if it needs to set IORING_CQE_F_SOCK_NONEMPTY or not. If the recv is entered with buffer selection, but no buffer is available, then we jump error path which calls io_recv_finish() without having assigned kmsg->msg_inq. This might cause an errant setting of the NONEMPTY flag for a request get gets errored with -ENOBUFS. Reported-by: syzbot+b1647099e82b3b349fbf@syzkaller.appspotmail.com Fixes: 4a3223f7bfda ("io_uring/net: switch io_recv() to using io_async_msghdr") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/rw: Free iovec before cleaning async dataBreno Leitao2024-05-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kmemleak shows that there is a memory leak in io_uring read operation, where a buffer is allocated at iovec import, but never de-allocated. The memory is allocated at io_async_rw->free_iovec, but, then io_async_rw is kfreed, taking the allocated memory with it. I saw this happening when the read operation fails with -11 (EAGAIN). This is the kmemleak splat. unreferenced object 0xffff8881da591c00 (size 256): ... backtrace (crc 7a15bdee): [<00000000256f2de4>] __kmalloc+0x2d6/0x410 [<000000007a9f5fc7>] iovec_from_user.part.0+0xc6/0x160 [<00000000cecdf83a>] __import_iovec+0x50/0x220 [<00000000d1d586a2>] __io_import_iovec+0x13d/0x220 [<0000000054ee9bd2>] io_prep_rw+0x186/0x340 [<00000000a9c0372d>] io_prep_rwv+0x31/0x120 [<000000001d1170b9>] io_prep_readv+0xe/0x30 [<0000000070b8eb67>] io_submit_sqes+0x1bd/0x780 [<00000000812496d4>] __do_sys_io_uring_enter+0x3ed/0x5b0 [<0000000081499602>] do_syscall_64+0x5d/0x170 [<00000000de1c5a4d>] entry_SYSCALL_64_after_hwframe+0x76/0x7e This occurs because the async data cleanup functions are not set for read/write operations. As a result, the potentially allocated iovec in the rw async data is not freed before the async data is released, leading to a memory leak. With this following patch, kmemleak does not show the leaked memory anymore, and all liburing tests pass. Fixes: a9165b83c193 ("io_uring/rw: always setup io_async_rw for read/write requests") Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240530142340.1248216-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: don't attempt to mmap larger than what the user asks forJens Axboe2024-05-291-2/+3
| | | | | | | | | | | | | | | | | | | If IORING_FEAT_SINGLE_MMAP is ignored, as can happen if an application uses an ancient liburing or does setup manually, then 3 mmap's are required to map the ring into userspace. The kernel will still have collapsed the mappings, however userspace may ask for mapping them individually. If so, then we should not use the full number of ring pages, as it may exceed the partial mapping. Doing so will yield an -EFAULT from vm_insert_pages(), as we pass in more pages than what the application asked for. Cap the number of pages to match what the application asked for, for the particular mapping operation. Reported-by: Lucas Mülling <lmulling@proton.me> Link: https://github.com/axboe/liburing/issues/1157 Fixes: 3ab1db3c6039 ("io_uring: get rid of remap_pfn_range() for mapping rings/sqes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'io_uring-6.10-20240523' of git://git.kernel.dk/linuxLinus Torvalds2024-05-232-6/+6
|\ | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: "Single fix here for a regression in 6.9, and then a simple cleanup removing some dead code" * tag 'io_uring-6.10-20240523' of git://git.kernel.dk/linux: io_uring: remove checks for NULL 'sq_offset' io_uring/sqpoll: ensure that normal task_work is also run timely
| * io_uring: remove checks for NULL 'sq_offset'Jens Axboe2024-05-221-4/+2
| | | | | | | | | | | | | | | | Since the 5.12 kernel release, nobody has been passing NULL as the sq_offset pointer. Remove the checks for it being NULL or not, it will always be valid. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/sqpoll: ensure that normal task_work is also run timelyJens Axboe2024-05-211-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the move to private task_work, SQPOLL neglected to also run the normal task_work, if any is pending. This will eventually get run, but we should run it with the private task_work to ensure that things like a final fput() is processed in a timely fashion. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/313824bc-799d-414f-96b7-e6de57c7e21d@gmail.com/ Reported-by: Andrew Udvare <audvare@gmail.com> Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Tested-by: Christian Heusel <christian@heusel.eu> Tested-by: Andrew Udvare <audvare@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2024-05-211-2/+2
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | Pull misc vfs updates from Al Viro: "Assorted commits that had missed the last merge window..." * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: remove call_{read,write}_iter() functions do_dentry_open(): kill inode argument kernel_file_open(): get rid of inode argument get_file_rcu(): no need to check for NULL separately fd_is_open(): move to fs/file.c close_on_exec(): pass files_struct instead of fdtable
| * remove call_{read,write}_iter() functionsMiklos Szeredi2024-04-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | These have no clear purpose. This is effectively a revert of commit bb7462b6fd64 ("vfs: use helpers for calling f_op->{read,write}_iter()"). The patch was created with the help of a coccinelle script. Fixes: bb7462b6fd64 ("vfs: use helpers for calling f_op->{read,write}_iter()") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds2024-05-191-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
| * | mm: switch mm->get_unmapped_area() to a flagRick Edgecombe2024-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arch's, a function pointer is not really required. In addition future changes will want to add versions of the functions that take additional arguments. So to save a pointers worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into a direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linuxLinus Torvalds2024-05-181-6/+20
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more io_uring updates from Jens Axboe: "This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept requests. This is very similar to previous work that enabled the same hint for doing receives on sockets. By far the majority of the work here is refactoring to enable the networking side to pass back whether or not the socket had more pending requests after accepting the current one, the last patch just wires it up for io_uring. Not only does this enable applications to know whether there are more connections to accept right now, it also enables smarter logic for io_uring multishot accept on whether to retry immediately or wait for a poll trigger" * tag 'net-accept-more-20240515' of git://git.kernel.dk/linux: io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept net: pass back whether socket was empty post accept net: have do_accept() take a struct proto_accept_arg argument net: change proto and proto_ops accept type
| * | | io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for acceptJens Axboe2024-05-131-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the given protocol supports passing back whether or not we had more pending accept post this one, pass back this information to userspace. This is done by setting IORING_CQE_F_SOCK_NONEMPTY in the CQE flags, just like we do for recv/recvmsg if there's more data available post a receive operation. We can also use this information to be smarter about multishot retry, as we don't need to do a pointless retry if we know for a fact that there aren't any more connections to accept. Suggested-by: Norman Maurer <norman_maurer@apple.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | net: have do_accept() take a struct proto_accept_arg argumentJens Axboe2024-05-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for passing in more information via this API, change do_accept() to take a proto_accept_arg struct pointer rather than just the file flags separately. No functional changes in this patch. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linuxLinus Torvalds2024-05-1335-1698/+1898
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring updates from Jens Axboe: - Greatly improve send zerocopy performance, by enabling coalescing of sent buffers. MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well. This feature relies on a shared branch with net-next, which was pulled into both branches. - Unification of how async preparation is done across opcodes. Previously, opcodes that required extra memory for async retry would allocate that as needed, using on-stack state until that was the case. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory. This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side. - Move away from using remap_pfn_range() for mapping the rings. This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contigious, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point. Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory. Hard to see in the diffstat as we're adding a few features as well, but this kills about ~400 lines of code from the codebase as well. - Add support for bundles for send/recv. When used with provided buffers, bundles support sending or receiving more than one buffer at the time, improving the efficiency by only needing to call into the networking stack once for multiple sends or receives. - Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already. - Make the task_work ctx locking unconditional. We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not. The state struct still exists as an empty type, can go away in the future. - Add support for specifying NOP completion values, allowing it to be used for error handling testing. - Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning. - Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably. - Misc fixes, cleanups, and improvements * tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ...
| * | | io_uring: support to inject result for NOPMing Lei2024-05-101-5/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support to inject result for NOP so that we can inject failure from userspace. It is very helpful for covering failure handling code in io_uring core change. With nop flags, it becomes possible to add more test features on NOP in future. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240510035031.78874-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: fail NOP if non-zero op flags is passed inMing Lei2024-05-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The NOP op flags should have been checked from beginning like any other opcode, otherwise NOP may not be extended with the op flags. Given both liburing and Rust io-uring crate always zeros SQE op flags, just ignore users which play raw NOP uring interface without zeroing SQE, because NOP is just for test purpose. Then we can save one NOP2 opcode. Suggested-by: Jens Axboe <axboe@kernel.dk> Fixes: 2b188cc1bb85 ("Add io_uring IO interface") Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240510035031.78874-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: add IORING_ACCEPT_POLL_FIRST flagJens Axboe2024-05-091-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly to how polling first is supported for receive, it makes sense to provide the same for accept. An accept operation does a lot of expensive setup, like allocating an fd, a socket/inode, etc. If no connection request is already pending, this is wasted and will just be cleaned up and freed, only to retry via the usual poll trigger. Add IORING_ACCEPT_POLL_FIRST, which tells accept to only initiate the accept request if poll says we have something to accept. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: add IORING_ACCEPT_DONTWAIT flagJens Axboe2024-05-091-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows the caller to perform a non-blocking attempt, similarly to how recvmsg has MSG_DONTWAIT. If set, and we get -EAGAIN on a connection attempt, propagate the result to userspace rather than arm poll and wait for a retry. Suggested-by: Norman Maurer <norman_maurer@apple.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/filetable: don't unnecessarily clear/reset bitmapJens Axboe2024-05-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we're updating an existing slot, we clear the slot bitmap only to set it again right after. Just leave the bit set rather than toggle it off and on, and move the unused slot setting into the branch of not already having a file occupy this slot. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/io-wq: Use set_bit() and test_bit() at worker->flagsBreno Leitao2024-05-071-23/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Utilize set_bit() and test_bit() on worker->flags within io_uring/io-wq to address potential data races. The structure io_worker->flags may be accessed through various data paths, leading to concurrency issues. When KCSAN is enabled, it reveals data races occurring in io_worker_handle_work and io_wq_activate_free_worker functions. BUG: KCSAN: data-race in io_worker_handle_work / io_wq_activate_free_worker write to 0xffff8885c4246404 of 4 bytes by task 49071 on cpu 28: io_worker_handle_work (io_uring/io-wq.c:434 io_uring/io-wq.c:569) io_wq_worker (io_uring/io-wq.c:?) <snip> read to 0xffff8885c4246404 of 4 bytes by task 49024 on cpu 5: io_wq_activate_free_worker (io_uring/io-wq.c:? io_uring/io-wq.c:285) io_wq_enqueue (io_uring/io-wq.c:947) io_queue_iowq (io_uring/io_uring.c:524) io_req_task_submit (io_uring/io_uring.c:1511) io_handle_tw_list (io_uring/io_uring.c:1198) <snip> Line numbers against commit 18daea77cca6 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm"). These races involve writes and reads to the same memory location by different tasks running on different CPUs. To mitigate this, refactor the code to use atomic operations such as set_bit(), test_bit(), and clear_bit() instead of basic "and" and "or" operations. This ensures thread-safe manipulation of worker flags. Also, move `create_index` to avoid holes in the structure. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240507170002.2269003-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ringJens Axboe2024-05-011-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the posting outside the checking and locking, it's cleaner that way. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: Require zeroed sqe->len on provided-buffers sendGabriel Krisman Bertazi2024-05-011-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When sending from a provided buffer, we set sr->len to be the smallest between the actual buffer size and sqe->len. But, now that we disconnect the buffer from the submission request, we can get in a situation where the buffers and requests mismatch, and only part of a buffer gets sent. Assume: * buf[1]->len = 128; buf[2]->len = 256 * sqe[1]->len = 128; sqe[2]->len = 256 If sqe1 runs first, it picks buff[1] and it's all good. But, if sqe[2] runs first, sqe[1] picks buff[2], and the last half of buff[2] is never sent. While arguably the use-case of different-length sends is questionable, it has already raised confusion with potential users of this feature. Let's make the interface less tricky by forcing the length to only come from the buffer ring entry itself. Fixes: ac5f71a3d9d7 ("io_uring/net: add provided buffer support for IORING_OP_SEND") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/notif: disable LAZY_WAKE for linked notifsPavel Begunkov2024-04-301-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Notifications may now be linked and thus a single tw can post multiple CQEs, it's not safe to use LAZY_WAKE with them. Disable LAZY_WAKE for now, if that'd prove to be a problem we can count them and pass the expected number of CQEs into __io_req_task_work_add(). Fixes: 6fe4220912d19 ("io_uring/notif: implement notification stacking") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a5accdb7d2d0d27ebec14f8106e14e0192fae17.1714488419.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: fix sendzc lazy wake pollingPavel Begunkov2024-04-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SEND[MSG]_ZC produces multiple CQEs via notifications, LAZY_WAKE doesn't handle it and so disable LAZY_WAKE for sendzc polling. It should be fine, sends are not likely to be polled in the first place. Fixes: 6ce4a93dbb5b ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5b360fb352d91e3aec751d75c87dfb4753a084ee.1714488419.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of ↵linke li2024-04-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | re-reading it In io_msg_exec_remote(), ctx->submitter_task is read using READ_ONCE at the beginning of the function, checked, and then re-read from ctx->submitter_task, voiding all guarantees of the checks. Reuse the value that was read by READ_ONCE to ensure the consistency of the task struct throughout the function. Signed-off-by: linke li <lilinke99@qq.com> Link: https://lore.kernel.org/r/tencent_F9B2296C93928D6F68FF0C95C33475C68209@qq.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/rw: reinstate thread check for retriesJens Axboe2024-04-253-25/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allowing retries for everything is arguably the right thing to do, now that every command type is async read from the start. But it's exposed a few issues around missing check for a retry (which cca6571381a0 exposed), and the fixup commit for that isn't necessarily 100% sound in terms of iov_iter state. For now, just revert these two commits. This unfortunately then re-opens the fact that -EAGAIN can get bubbled to userspace for some cases where the kernel very well could just sanely retry them. But until we have all the conditions covered around that, we cannot safely enable that. This reverts commit df604d2ad480fcf7b39767280c9093e13b1de952. This reverts commit cca6571381a0bdc88021a1f7a4c2349df21279f7. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/notif: implement notification stackingPavel Begunkov2024-04-222-7/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The network stack allows only one ubuf_info per skb, and unlike MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate ubuf_info. That means that send requests can't reuse a previosly allocated skb and need to get one more or more of new ones. That's fine for large sends, but otherwise it would spam the stack with lots of skbs carrying just a little data each. To help with that implement linking notification (i.e. an io_uring wrapper around ubuf_info) into a list. Each is refcounted by skbs and the stack as usual. additionally all non head entries keep a reference to the head, which they put down when their refcount hits 0. When the head have no more users, it'll efficiently put all notifications in a batch. As mentioned previously about ->io_link_skb, the callback implementation always allows to bind to an skb without a ubuf_info. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bf1e7f9b72f9ecc99999fdc0d2cded5eea87fd0b.1713369317.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/notif: simplify io_notif_flush()Pavel Begunkov2024-04-222-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_notif_flush() is partially duplicating io_tx_ubuf_complete(), so instead of duplicating it, make the flush call io_tx_ubuf_complete. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19e41652c16718b946a5c80d2ad409df7682e47e.1713369317.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | Merge branch 'for-uring-ubufops' of ↵Jens Axboe2024-04-221-2/+6
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into for-6.10/io_uring Merge net changes required for the upcoming send zerocopy improvements. * 'for-uring-ubufops' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux: net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | net: extend ubuf_info callback to ops structurePavel Begunkov2024-04-221-5/+13
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We'll need to associate additional callbacks with ubuf_info, introduce a structure holding ubuf_info callbacks. Apart from a more smarter io_uring notification management introduced in next patches, it can be used to generalise msg_zerocopy_put_abort() and also store ->sg_from_iter, which is currently passed in struct msghdr. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | io_uring/net: support bundles for recvJens Axboe2024-04-222-22/+97
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If IORING_OP_RECV is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs buffers available and receives into them, posting a single completion for all of it. This can be used with multishot receive as well, or without it. Now that both send and receive support bundles, add a feature flag for it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the ring, then the kernel supports bundles for recv and send. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: support bundles for sendJens Axboe2024-04-221-17/+128
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If IORING_OP_SEND is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer send. The idea is that an application can fill outgoing buffers in a provided buffer group, and then arm a single send that will service them all. Once there are no more buffers to send, or if the requested length has been sent, the request posts a single completion for all the buffers. This only enables it for IORING_OP_SEND, IORING_OP_SENDMSG is coming in a separate patch. However, this patch does do a lot of the prep work that makes wiring up the sendmsg variant pretty trivial. They share the prep side. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/kbuf: add helpers for getting/peeking multiple buffersJens Axboe2024-04-222-12/+198
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our provided buffer interface only allows selection of a single buffer. Add an API that allows getting/peeking multiple buffers at the same time. This is only implemented for the ring provided buffers. It could be added for the legacy provided buffers as well, but since it's strongly encouraged to use the new interface, let's keep it simpler and just provide it for the new API. The legacy interface will always just select a single buffer. There are two new main functions: io_buffers_select(), which selects up as many buffers as it can. The caller supplies the iovec array, and io_buffers_select() may allocate a bigger array if the 'out_len' being passed in is non-zero and bigger than what fits in the provided iovec. Buffers grabbed with this helper are permanently assigned. io_buffers_peek(), which works like io_buffers_select(), except they can be recycled, if needed. Callers using either of these functions should call io_put_kbufs() rather than io_put_kbuf() at completion time. The peek interface must be called with the ctx locked from peek to completion. This add a bit state for the request: - REQ_F_BUFFERS_COMMIT, which means that the the buffers have been peeked and should be committed to the buffer ring head when they are put as part of completion. Prior to this, req->buf_list was cleared to NULL when committed. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: add provided buffer support for IORING_OP_SENDJens Axboe2024-04-222-5/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's pretty trivial to wire up provided buffer support for the send side, just like how it's done the receive side. This enables setting up a buffer ring that an application can use to push pending sends to, and then have a send pick a buffer from that ring. One of the challenges with async IO and networking sends is that you can get into reordering conditions if you have more than one inflight at the same time. Consider the following scenario where everything is fine: 1) App queues sendA for socket1 2) App queues sendB for socket1 3) App does io_uring_submit() 4) sendA is issued, completes successfully, posts CQE 5) sendB is issued, completes successfully, posts CQE All is fine. Requests are always issued in-order, and both complete inline as most sends do. However, if we're flooding socket1 with sends, the following could also result from the same sequence: 1) App queues sendA for socket1 2) App queues sendB for socket1 3) App does io_uring_submit() 4) sendA is issued, socket1 is full, poll is armed for retry 5) Space frees up in socket1, this triggers sendA retry via task_work 6) sendB is issued, completes successfully, posts CQE 7) sendA is retried, completes successfully, posts CQE Now we've sent sendB before sendA, which can make things unhappy. If both sendA and sendB had been using provided buffers, then it would look as follows instead: 1) App queues dataA for sendA, queues sendA for socket1 2) App queues dataB for sendB queues sendB for socket1 3) App does io_uring_submit() 4) sendA is issued, socket1 is full, poll is armed for retry 5) Space frees up in socket1, this triggers sendA retry via task_work 6) sendB is issued, picks first buffer (dataA), completes successfully, posts CQE (which says "I sent dataA") 7) sendA is retried, picks first buffer (dataB), completes successfully, posts CQE (which says "I sent dataB") Now we've sent the data in order, and everybody is happy. It's worth noting that this also opens the door for supporting multishot sends, as provided buffers would be a prerequisite for that. Those can trigger either when new buffers are added to the outgoing ring, or (if stalled due to lack of space) when space frees up in the socket. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: add generic multishot retry helperJens Axboe2024-04-221-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is just moving io_recv_prep_retry() higher up so it can get used for sends as well, and rename it to be generically useful for both sends and receives. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/rw: ensure retry condition isn't lostJens Axboe2024-04-173-7/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A previous commit removed the checking on whether or not it was possible to retry a request, since it's now possible to retry any of them. This would previously have caused the request to have been ended with an error, but now the retry condition can simply get lost instead. Cleanup the retry handling and always just punt it to task_work, which will queue it with io-wq appropriately. Reported-by: Changhui Zhong <czhong@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Fixes: cca6571381a0 ("io_uring/rw: cleanup retry path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io-wq: Drop intermediate step between pending list and active workGabriel Krisman Bertazi2024-04-171-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | next_work is only used to make the work visible for cancellation. Instead, we can just directly write to cur_work before dropping the acct_lock and avoid the extra hop. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240416021054.3940-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io-wq: write next_work before dropping acct_lockGabriel Krisman Bertazi2024-04-171-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 361aee450c6e ("io-wq: add intermediate work step between pending list and active work") closed a race between a cancellation and the work being removed from the wq for execution. To ensure the request is always reachable by the cancellation, we need to move it within the wq lock, which also synchronizes the cancellation. But commit 42abc95f05bf ("io-wq: decouple work_list protection from the big wqe->lock") replaced the wq lock here and accidentally reintroduced the race by releasing the acct_lock too early. In other words: worker | cancellation work = io_get_next_work() | raw_spin_unlock(&acct->lock); | | | io_acct_cancel_pending_work | io_wq_worker_cancel() worker->next_work = work Using acct_lock is still enough since we synchronize on it on io_acct_cancel_pending_work. Fixes: 42abc95f05bf ("io-wq: decouple work_list protection from the big wqe->lock") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240416021054.3940-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/sqpoll: work around a potential audit memory leakJens Axboe2024-04-151-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kmemleak complains that there's a memory leak related to connect handling: unreferenced object 0xffff0001093bdf00 (size 128): comm "iou-sqp-455", pid 457, jiffies 4294894164 hex dump (first 32 bytes): 02 00 fa ea 7f 00 00 01 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 2e481b1a): [<00000000c0a26af4>] kmemleak_alloc+0x30/0x38 [<000000009c30bb45>] kmalloc_trace+0x228/0x358 [<000000009da9d39f>] __audit_sockaddr+0xd0/0x138 [<0000000089a93e34>] move_addr_to_kernel+0x1a0/0x1f8 [<000000000b4e80e6>] io_connect_prep+0x1ec/0x2d4 [<00000000abfbcd99>] io_submit_sqes+0x588/0x1e48 [<00000000e7c25e07>] io_sq_thread+0x8a4/0x10e4 [<00000000d999b491>] ret_from_fork+0x10/0x20 which can can happen if: 1) The command type does something on the prep side that triggers an audit call. 2) The thread hasn't done any operations before this that triggered an audit call inside ->issue(), where we have audit_uring_entry() and audit_uring_exit(). Work around this by issuing a blanket NOP operation before the SQPOLL does anything. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/notif: shrink account_pages to u32Pavel Begunkov2024-04-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ->account_pages is the number of pages we account against the user derived from unsigned len, it definitely fits into unsigned, which saves some space in struct io_notif_data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19f2687fcb36daa74d86f4a27bfb3d35cffec318.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/notif: remove ctx var from io_notif_tw_completePavel Begunkov2024-04-151-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need ctx in the hottest path, i.e. registered buffers, let's get it only when we need it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e7345e268ffaeaf79b4c8f3a5d019d6a87a3d1f1.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/notif: refactor io_tx_ubuf_complete()Pavel Begunkov2024-04-151-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Flip the dec_and_test "if", that makes the function extension easier in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/43939e2b04dff03bff5d7227c98afedf951227b3.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: ensure overflow entries are dropped when ring is exitingJens Axboe2024-04-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A previous consolidation cleanup missed handling the case where the ring is dying, and __io_cqring_overflow_flush() doesn't flush entries if the CQ ring is already full. This is fine for the normal CQE overflow flushing, but if the ring is going away, we need to flush everything, even if it means simply freeing the overflown entries. Fixes: 6c948ec44b29 ("io_uring: consolidate overflow flushing") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/timeout: remove duplicate initialization of the io_timeout list.Ruyi Zhang2024-04-151-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the __io_timeout_prep function, the io_timeout list is initialized twice, removing the meaningless second initialization. Signed-off-by: Ruyi Zhang <ruyi.zhang@samsung.com> Link: https://lore.kernel.org/r/20240411055953.2029218-1-ruyi.zhang@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>