linux.git - Linux kernel mainline tree

	Commit message (Collapse)	Author	Age	Files	Lines
*	io_uring: use io-wq manager as backup task if task is exiting	Jens Axboe	2020-04-03	1	-0/+12
\| \| \| \| \| \| \| \| \| \|	If the original task is (or has) exited, then the task work will not get queued properly. Allow for using the io-wq manager task to queue this work for execution, and ensure that the io-wq manager notices and runs this work if woken up (or exiting). Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: handle hashed writes in chains	Pavel Begunkov	2020-03-23	1	-20/+48
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We always punt async buffered writes to an io-wq helper, as the core kernel does not have IOCB_NOWAIT support for that. Most buffered async writes complete very quickly, as it's just a copy operation. This means that doing multiple locking roundtrips on the shared wqe lock for each buffered write is wasteful. Additionally, buffered writes are hashed work items, which means that any buffered write to a given file is serialized. Keep identicaly hashed work items contiguously in @wqe->work_list, and track a tail for each hash bucket. On dequeue of a hashed item, splice all of the same hash in one go using the tracked tail. Until the batch is done, the caller doesn't have to synchronize with the wqe or worker locks again. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: close cancel gap for hashed linked work	Pavel Begunkov	2020-03-22	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \|	After io_assign_current_work() of a linked work, it can be decided to offloaded to another thread so doing io_wqe_enqueue(). However, until next io_assign_current_work() it can be cancelled, that isn't handled. Don't assign it, if it's not going to be executed. Fixes: 60cf46ae6054 ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: hash dependent work	Pavel Begunkov	2020-03-14	1	-6/+19
\| \| \| \| \| \| \| \|	Enable io-wq hashing stuff for dependent works simply by re-enqueueing such requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: split hashing and enqueueing	Pavel Begunkov	2020-03-14	1	-9/+5
\| \| \| \| \| \| \| \| \| \| \| \|	It's a preparation patch removing io_wq_enqueue_hashed(), which now should be done by io_wq_hash_work() + io_wq_enqueue(). Also, set hash value for dependant works, and do it as late as possible, because req->file can be unavailable before. This hash will be ignored by io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: don't resched if there is no work	Pavel Begunkov	2020-03-14	1	-4/+6
\| \| \| \| \| \| \| \| \| \|	This little tweak restores the behaviour that was before the recent io_worker_handle_work() optimisation patches. It makes the function do cond_resched() and flush_signals() only if there is an actual work to execute. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: remove duplicated cancel code	Pavel Begunkov	2020-03-12	1	-112/+24
\| \| \| \| \| \| \| \| \| \|	Deduplicate cancellation parts, as many of them looks the same, as do e.g. - io_wqe_cancel_cb_work() and io_wqe_cancel_work() - io_wq_worker_cancel() and io_work_cancel() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring/io-wq: forward submission ref to async	Pavel Begunkov	2020-03-04	1	-15/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	First it changes io-wq interfaces. It replaces {get,put}_work() with free_work(), which guaranteed to be called exactly once. It also enforces free_work() callback to be non-NULL. io_uring follows the changes and instead of putting a submission reference in io_put_req_async_completion(), it will be done in io_free_work(). As removes io_get_work() with corresponding refcount_inc(), the ref balance is maintained. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: optimise out *next_work() double lock	Pavel Begunkov	2020-03-04	1	-3/+6
\| \| \| \| \| \| \| \| \| \|	When executing non-linked hashed work, io_worker_handle_work() will lock-unlock wqe->lock to update hash, and then immediately lock-unlock to get next work. Optimise this case and do lock/unlock only once. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: optimise locking in io_worker_handle_work()	Pavel Begunkov	2020-03-04	1	-8/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are 2 optimisations: - Now, io_worker_handler_work() do io_assign_current_work() twice per request, and each one adds lock/unlock(worker->lock) pair. The first is to reset worker->cur_work to NULL, and the second to set a real work shortly after. If there is a dependant work, set it immediately, that effectively removes the extra NULL'ing. - And there is no use in taking wqe->lock for linked works, as they are not hashed now. Optimise it out. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: shuffle io_worker_handle_work() code	Pavel Begunkov	2020-03-04	1	-59/+64
\| \| \| \| \| \| \| \| \| \| \| \|	This is a preparation patch, it adds some helpers and makes the next patches cleaner. - extract io_impersonate_work() and io_assign_current_work() - replace @next label with nested do-while - move put_work() right after NULL'ing cur_work. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: use BIT for ulong hash	Pavel Begunkov	2020-03-02	1	-3/+3
\| \| \| \| \| \| \| \|	@hash_map is unsigned long, but BIT_ULL() is used for manipulations. BIT() is a better match as it returns exactly unsigned long value. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring: remove IO_WQ_WORK_CB	Pavel Begunkov	2020-03-02	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	IO_WQ_WORK_CB is used only for linked timeouts, which will be armed before the work setup (i.e. mm, override creds, etc). The setup shouldn't take long, so it's ok to arm it a bit later and get rid of IO_WQ_WORK_CB. Make io-wq call work->func() only once, callbacks will handle the rest. i.e. the linked timeout handler will do the actual issue. And as a bonus, it removes an extra indirect call. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: remove unused IO_WQ_WORK_HAS_MM	Pavel Begunkov	2020-03-02	1	-2/+0
\| \| \| \| \| \| \|	IO_WQ_WORK_HAS_MM is set but never used, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL	Pavel Begunkov	2020-03-02	1	-37/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	io_wq_flush() is buggy, during cancelation of a flush, the associated work may be passed to the caller's (i.e. io_uring) @match callback. That callback is expecting it to be embedded in struct io_kiocb. Cancelation of internal work probably doesn't make a lot of sense to begin with. As the flush helper is no longer used, just delete it and the associated work flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation	Pavel Begunkov	2020-03-02	1	-6/+14
\| \| \| \| \| \| \| \| \| \| \|	To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may return next work, which will be ignored and lost. Cancel the whole link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: remove spin-for-work optimization	Jens Axboe	2020-02-25	1	-19/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Andres reports that buffered IO seems to suck up more cycles than we would like, and he narrowed it down to the fact that the io-wq workers will briefly spin for more work on completion of a work item. This was a win on the networking side, but apparently some other cases take a hit because of it. Remove the optimization to avoid burning more CPU than we have to for disk IO. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: don't call kXalloc_node() with non-online node	Jens Axboe	2020-02-12	1	-4/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Glauber reports a crash on init on a box he has: RIP: 0010:__alloc_pages_nodemask+0x132/0x340 Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02 RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080 RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002 R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000 R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0 FS: 00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: alloc_slab_page+0x46/0x320 new_slab+0x9d/0x4e0 ___slab_alloc+0x507/0x6a0 ? io_wq_create+0xb4/0x2a0 __slab_alloc+0x1c/0x30 kmem_cache_alloc_node_trace+0xa6/0x260 io_wq_create+0xb4/0x2a0 io_uring_setup+0x97f/0xaa0 ? io_remove_personalities+0x30/0x30 ? io_poll_trigger_evfd+0x30/0x30 do_syscall_64+0x5b/0x1c0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4d116cb1ed which is due to the 'wqe' and 'worker' allocation being node affine. But it isn't valid to call the node affine allocation if the node isn't online. Setup structures for even offline nodes, as usual, but skip them in terms of thread setup to not waste resources. If the node isn't online, just alloc memory with NUMA_NO_NODE. Reported-by: Glauber Costa <glauber@scylladb.com> Tested-by: Glauber Costa <glauber@scylladb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: add io_wq_cancel_pid() to cancel based on a specific pid	Jens Axboe	2020-02-09	1	-0/+29
\| \| \| \| \| \| \| \|	Add a helper that allows the caller to cancel work based on what mm it belongs to. This allows io_uring to cancel work from a given task or thread when it exits. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: make io_wqe_cancel_work() take a match handler	Jens Axboe	2020-02-09	1	-11/+22
\| \| \| \| \| \| \| \| \| \| \|	We want to use the cancel functionality for canceling based on not just the work itself. Instead of matching on the work address manually, allow a match handler to tell us if we found the right work item or not. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: add support for inheriting ->fs	Jens Axboe	2020-02-08	1	-0/+8
\| \| \| \| \| \| \| \|	Some work items need this for relative path lookup, make it available like the other inherited credentials/mm/etc. Cc: stable@vger.kernel.org # 5.3+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring: fix linked command file table usage	Jens Axboe	2020-01-29	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	We're not consistent in how the file table is grabbed and assigned if we have a command linked that requires the use of it. Add ->file_table to the io_op_defs[] array, and use that to determine when to grab the table instead of having the handlers set it if they need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES flag. We always initialize work->files, so io-wq can just check for that. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: allow grabbing existing io-wq	Pavel Begunkov	2020-01-28	1	-0/+8
\| \| \| \| \| \| \| \| \|	Export a helper to attach to an existing io-wq, rather than setting up a new one. This is doable now that we have reference counted io_wq's. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring/io-wq: don't use static creds/mm assignments	Jens Axboe	2020-01-28	1	-22/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We currently setup the io_wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal as we have may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all. Switch to passing in the creds and mm when the work item is setup. This means that async work is no longer deferred to the io_uring mm and creds, it is done with the current mm and creds. Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue. Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: make the io_wq ref counted	Jens Axboe	2020-01-27	1	-1/+10
\| \| \| \| \| \| \| \|	In preparation for sharing an io-wq across different users, add a reference count that manages destruction of it. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: support concurrent non-blocking work	Jens Axboe	2020-01-20	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	io-wq assumes that work will complete fast (and not block), so it doesn't create a new worker when work is enqueued, if we already have at least one worker running. This is done on the assumption that if work is running, then it will complete fast. Add an option to force io-wq to fork a new worker for work queued. This is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that case, io-wq will create a new worker, even though workers are already running. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: add support for uncancellable work	Jens Axboe	2020-01-20	1	-1/+7
\| \| \| \| \| \| \| \| \| \|	Not all work can be cancelled, some of it we may need to guarantee that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL on work that must not be cancelled. Note that the caller work function must also check for IO_WQ_WORK_NO_CANCEL on work that is marked IO_WQ_WORK_CANCEL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: cancel work if we fail getting a mm reference	Jens Axboe	2020-01-14	1	-4/+8
\| \| \| \| \| \| \|	If we require mm and user context, mark the request for cancellation if we fail to acquire the desired mm. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: add cond_resched() to worker thread	Hillf Danton	2019-12-24	1	-0/+2
\| \| \| \| \| \| \| \|	Reschedule the current IO worker to cut the risk that it is becoming a cpu hog. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: remove unused busy list from io_sqe	Hillf Danton	2019-12-23	1	-8/+0
\| \| \| \| \| \| \| \| \| \|	Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items") added a list for io workers in addition to the free and busy lists, not only making worker walk cleaner, but leaving the busy list unused. Let's remove it. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring: fix stale comment and a few typos	Brian Gianforcaro	2019-12-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	- Fix a few typos found while reading the code. - Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter was renamed to 'req', but the comment still holds. Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: briefly spin for new work after finishing work	Jens Axboe	2019-12-10	1	-2/+22
\| \| \| \| \| \| \|	To avoid going to sleep only to get woken shortly thereafter, spin briefly for new work upon completion of work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: remove worker->wait waitqueue	Jens Axboe	2019-12-10	1	-8/+2
\| \| \| \| \| \| \| \| \| \|	We only have one cases of using the waitqueue to wake the worker, the rest are using wake_up_process(). Since we can save some cycles not fiddling with the waitqueue io_wqe_worker(), switch the work activation to task wakeup and get rid of the now unused wait_queue_head_t in struct io_worker. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring: use current task creds instead of allocating a new one	Jens Axboe	2019-12-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	syzbot reports: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Modules linked in: ---[ end trace f2e1a4307fbe2245 ]--- RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 which is caused by slab fault injection triggering a failure in prepare_creds(). We don't actually need to create a copy of the creds as we're not modifying it, we just need a reference on the current task creds. This avoids the failure case as well, and propagates the const throughout the stack. Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds") Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: shrink io_wq_work a bit	Jens Axboe	2019-11-26	1	-13/+23
\| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we're using 40 bytes for the io_wq_work structure, and 16 of those is the doubly link list node. We don't need doubly linked lists, we always add to tail to keep things ordered, and any other use case is list traversal with deletion. For the deletion case, we can easily support any node deletion by keeping track of the previous entry. This shrinks io_wq_work to 32 bytes, and subsequently io_kiock from io_uring to 216 to 208 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: fix handling of NUMA node IDs	Jann Horn	2019-11-26	1	-46/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are several things that can go wrong in the current code on NUMA systems, especially if not all nodes are online all the time: - If the identifiers of the online nodes do not form a single contiguous block starting at zero, wq->wqes will be too small, and OOB memory accesses will occur e.g. in the loop in io_wq_create(). - If a node comes online between the call to num_online_nodes() and the for_each_node() loop in io_wq_create(), an OOB write will occur. - If a node comes online between io_wq_create() and io_wq_enqueue(), a lookup is performed for an element that doesn't exist, and an OOB read will probably occur. Fix it by: - using nr_node_ids instead of num_online_nodes() for the allocation size; nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the highest node ID that could possibly come online at some point, even if those nodes' identifiers are not a contiguous block - creating workers for all possible CPUs, not just all online ones This is basically what the normal workqueue code also does, as far as I can tell. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring: use kzalloc instead of kcalloc for single-element allocations	Jann Horn	2019-11-26	1	-3/+3
\| \| \| \| \| \| \| \|	These allocations are single-element allocations, so don't use the array allocation wrapper for them. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring: async workers should inherit the user creds	Jens Axboe	2019-11-25	1	-0/+10
\| \| \| \| \| \| \| \| \|	If we don't inherit the original task creds, then we can confuse users like fuse that pass creds in the request header. See link below on identical aio issue. Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: have io_wq_create() take a 'data' argument	Jens Axboe	2019-11-25	1	-8/+6
\| \| \| \| \| \| \| \| \| \|	We currently pass in 4 arguments outside of the bounded size. In preparation for adding one more argument, let's bundle them up in a struct to make it more readable. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_uring: close lookup gap for dependent next work	Jens Axboe	2019-11-25	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we find new work to process within the work handler, we queue the linked timeout before we have issued the new work. This can be problematic for very short timeouts, as we have a window where the new work isn't visible. Allow the work handler to store a callback function for this in the work item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that is set, then io-wq will call the callback when it has setup the new work item. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: remove extra space characters	Dan Carpenter	2019-11-25	1	-3/+3
\| \| \| \| \| \| \|	These lines are indented an extra space character. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: wait for io_wq_create() to setup necessary workers	Jens Axboe	2019-11-25	1	-15/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We currently have a race where if setup is really slow, we can be calling io_wq_destroy() before we're done setting up. This will cause the caller to get stuck waiting for the manager to set things up, but the manager already exited. Fix this by doing a sync setup of the manager. This also fixes the case where if we failed creating workers, we'd also get stuck. In practice this race window was really small, as we already wait for the manager to start. Hence someone would have to call io_wq_destroy() after the task has started, but before it started the first loop. The reported test case forked tons of these, which is why it became an issue. Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com Fixes: 771b53d033e8 ("io-wq: small threadpool implementation for io_uring") Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: remove now redundant struct io_wq_nulls_list	Jens Axboe	2019-11-14	1	-19/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Since we don't iterate these lists anymore after commit: e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items") we don't need to retain the nulls value we use for them. That means it's pretty pointless to wrap the hlist_nulls_head in a structure, so get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: ensure free/busy list browsing see all items	Jens Axboe	2019-11-13	1	-30/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have two lists for workers in io-wq, a busy and a free list. For certain operations we want to browse all workers, and we currently do that by browsing the two separate lists. But since these lists are RCU protected, we can potentially miss workers if they move between the two lists while we're browsing them. Add a third list, all_list, that simply holds all workers. A worker is added to that list when it starts, and removed when it exits. This makes the worker iteration cleaner, too. Reported-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: ensure we have a stable view of ->cur_work for cancellations	Jens Axboe	2019-11-13	1	-19/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	worker->cur_work is currently protected by the lock of the wqe that the worker belongs to. When we send a signal to a worker, we need a stable view of ->cur_work, so we need to hold that lock. But this doesn't work so well, since we have the opposite order potentially on queueing work. If POLL_ADD is used with a signalfd, then io_poll_wake() is called with the signal lock, and that sometimes needs to insert work items. Add a specific worker lock that protects the current work item. Then we can guarantee that the task we're sending a signal is currently processing the exact work we think it is. Reported-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io_wq: add get/put_work handlers to io_wq_create()	Jens Axboe	2019-11-13	1	-2/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	For cancellation, we need to ensure that the work item stays valid for as long as ->cur_work is valid. Right now we can't safely dereference the work item even under the wqe->lock, because while the ->cur_work pointer will remain valid, the work could be completing and be freed in parallel. Only invoke ->get/put_work() on items we know that the caller queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed when we're queueing a flush item, for instance. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: add support for bounded vs unbunded work	Jens Axboe	2019-11-07	1	-80/+216
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	io_uring supports request types that basically have two different lifetimes: 1) Bounded completion time. These are requests like disk reads or writes, which we know will finish in a finite amount of time. 2) Unbounded completion time. These are generally networked IO, where we have no idea how long they will take to complete. Another example is POLL commands. This patch provides support for io-wq to handle these differently, so we don't starve bounded requests by tying up workers for too long. By default all work is bounded, unless otherwise specified in the work item. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful()	Jens Axboe	2019-11-09	1	-2/+1
\| \| \| \| \| \| \|	We hold the wqe lock at this point (which is also annotated), so there's no need to use the careful variant of list_empty(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: use proper nesting IRQ disabling spinlocks for cancel	Jens Axboe	2019-11-05	1	-6/+9
\| \| \| \| \| \| \| \|	We don't know what context we'll be called in for cancel, it could very well be with IRQs disabled already. Use the IRQ saving variants of the locking primitives. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	io-wq: use kfree_rcu() to simplify the code	YueHaibing	2019-11-02	1	-8/+1
\| \| \| \| \| \| \| \|	The callback function of call_rcu() just calls kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>