summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* ocfs2: remove unneeded semicolonszhengbin2020-01-312-2/+2
| | | | | | | | | | | | | | | | | | | | Fixes coccicheck warnings: fs/ocfs2/cluster/quorum.c:76:2-3: Unneeded semicolon fs/ocfs2/dlmglue.c:573:2-3: Unneeded semicolon Link: http://lkml.kernel.org/r/6ee3aa16-9078-30b1-df3f-22064950bd98@linux.alibaba.com Signed-off-by: zhengbin <zhengbin13@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: ocfs: remove unnecessary assertion in dlm_migrate_lockresAditya Pakki2020-01-311-2/+0
| | | | | | | | | | | | | | | | | | | In the only caller of dlm_migrate_lockres() - dlm_empty_lockres(), target is checked for O2NM_MAX_NODES. Thus, the assertion in dlm_migrate_lockres() is unnecessary and can be removed. The patch eliminates such a check. Link: http://lkml.kernel.org/r/20191218194111.26041-1-pakki001@umn.edu Signed-off-by: Aditya Pakki <pakki001@umn.edu> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* scripts/spelling.txt: add "issus" typoLuca Ceresoli2020-01-311-0/+1
| | | | | | | | | Add "issus" and correct it as "issues". Link: http://lkml.kernel.org/r/20200105221950.8384-1-luca@lucaceresoli.net Signed-off-by: Luca Ceresoli <luca@lucaceresoli.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* scripts/spelling.txt: add more spellings to spelling.txtXiong2020-01-311-0/+13
| | | | | | | | | | | | | | | Here are some of the common spelling mistakes and typos that I've found while fixing up spelling mistakes in the kernel. Most of them still exist in more than two source files. Link: http://lkml.kernel.org/r/20191229143626.51238-1-xndchn@gmail.com Signed-off-by: Xiong <xndchn@gmail.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Chris Paterson <chris.paterson2@renesas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move_pages: report the number of non-attempted pagesYang Shi2020-01-311-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the semantic of move_pages() has changed to return the number of non-migrated pages if they were result of a non-fatal reasons (usually a busy page). This was an unintentional change that hasn't been noticed except for LTP tests which checked for the documented behavior. There are two ways to go around this change. We can even get back to the original behavior and return -EAGAIN whenever migrate_pages is not able to migrate pages due to non-fatal reasons. Another option would be to simply continue with the changed semantic and extend move_pages documentation to clarify that -errno is returned on an invalid input or when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the number of pages that couldn't have been migrated due to ephemeral reasons (e.g. page is pinned or locked for other reasons). This patch implements the second option because this behavior is in place for some time without anybody complaining and possibly new users depending on it. Also it allows to have a slightly easier error handling as the caller knows that it is worth to retry when err > 0. But since the new semantic would be aborted immediately if migration is failed due to ephemeral reasons, need include the number of non-attempted pages in the return value too. Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Cc: <stable@vger.kernel.org> [4.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: thp: don't need care deferred split queue in memcg charge move pathWei Yang2020-01-311-18/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | If compound is true, this means it is a PMD mapped THP. Which implies the page is not linked to any defer list. So the first code chunk will not be executed. Also with this reason, it would not be proper to add this page to a defer list. So the second code chunk is not correct. Based on this, we should remove the defer list related code. [yang.shi@linux.alibaba.com: better patch title] Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory_hotplug: fix remove_memory() lockdep splatDan Williams2020-01-311-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The daxctl unit test for the dax_kmem driver currently triggers the (false positive) lockdep splat below. It results from the fact that remove_memory_block_devices() is invoked under the mem_hotplug_lock() causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs active state tracking). It is a false positive because the sysfs attribute path triggering the memory remove is not the same attribute path associated with memory-block device. sysfs_break_active_protection() is not applicable since there is no real deadlock conflict, instead move memory-block device removal outside the lock. The mem_hotplug_lock() is not needed to synchronize the memory-block device removal vs the page online state, that is already handled by lock_device_hotplug(). Specifically, lock_device_hotplug() is sufficient to allow try_remove_memory() to check the offline state of the memblocks and be assured that any in progress online attempts are flushed / blocked by kernfs_drain() / attribute removal. The add_memory() path safely creates memblock devices under the mem_hotplug_lock(). There is no kernfs active state synchronization in the memblock device_register() path, so nothing to fix there. This change is only possible thanks to the recent change that refactored memory block device removal out of arch_remove_memory() (commit 4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before arch_remove_memory()"), and David's due diligence tracking down the guarantees afforded by kernfs_drain(). Not flagged for -stable since this only impacts ongoing development and lockdep validation, not a runtime issue. ====================================================== WARNING: possible circular locking dependency detected 5.5.0-rc3+ #230 Tainted: G OE ------------------------------------------------------ lt-daxctl/6459 is trying to acquire lock: ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80 but task is already holding lock: ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (mem_hotplug_lock.rw_sem){++++}: __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 get_online_mems+0x3e/0xb0 kmem_cache_create_usercopy+0x2e/0x260 kmem_cache_create+0x12/0x20 ptlock_cache_init+0x20/0x28 start_kernel+0x243/0x547 secondary_startup_64+0xb6/0xc0 -> #1 (cpu_hotplug_lock.rw_sem){++++}: __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 cpus_read_lock+0x3e/0xb0 online_pages+0x37/0x300 memory_subsys_online+0x17d/0x1c0 device_online+0x60/0x80 state_store+0x65/0xd0 kernfs_fop_write+0xcf/0x1c0 vfs_write+0xdb/0x1d0 ksys_write+0x65/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (kn->count#241){++++}: check_prev_add+0x98/0xa40 validate_chain+0x576/0x860 __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 __kernfs_remove+0x25f/0x2e0 kernfs_remove_by_name_ns+0x41/0x80 remove_files.isra.0+0x30/0x70 sysfs_remove_group+0x3d/0x80 sysfs_remove_groups+0x29/0x40 device_remove_attrs+0x39/0x70 device_del+0x16a/0x3f0 device_unregister+0x16/0x60 remove_memory_block_devices+0x82/0xb0 try_remove_memory+0xb5/0x130 remove_memory+0x26/0x40 dev_dax_kmem_remove+0x44/0x6a [kmem] device_release_driver_internal+0xe4/0x1c0 unbind_store+0xef/0x120 kernfs_fop_write+0xcf/0x1c0 vfs_write+0xdb/0x1d0 ksys_write+0x65/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(mem_hotplug_lock.rw_sem); lock(cpu_hotplug_lock.rw_sem); lock(mem_hotplug_lock.rw_sem); lock(kn->count#241); *** DEADLOCK *** No fixes tag as this has been a long standing issue that predated the addition of kernfs lockdep annotations. Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/migrate.c: also overwrite error when it is bigger than zeroWei Yang2020-01-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | If we get here after successfully adding page to list, err would be 1 to indicate the page is queued in the list. Current code has two problems: * on success, 0 is not returned * on error, if add_page_for_migratioin() return 1, and the following err1 from do_move_pages_to_node() is set, the err1 is not returned since err is 1 And these behaviors break the user interface. Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node"). Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/sparse.c: reset section's mem_map when fully deactivatedPingfan Liu2020-01-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), when a mem section is fully deactivated, section_mem_map still records the section's start pfn, which is not used any more and will be reassigned during re-addition. In analogy with alloc/free pattern, it is better to clear all fields of section_mem_map. Beside this, it breaks the user space tool "makedumpfile" [1], which makes assumption that a hot-removed section has mem_map as NULL, instead of checking directly against SECTION_MARKED_PRESENT bit. (makedumpfile will be better to change the assumption, and need a patch) The bug can be reproduced on IBM POWERVM by "drmgr -c mem -r -q 5" , trigger a crash, and save vmcore by makedumpfile [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version") Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/mempolicy.c: fix out of bounds write in mpol_parse_str()Dan Carpenter2020-01-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | What we are trying to do is change the '=' character to a NUL terminator and then at the end of the function we restore it back to an '='. The problem is there are two error paths where we jump to the end of the function before we have replaced the '=' with NUL. We end up putting the '=' in the wrong place (possibly one element before the start of the buffer). Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Dmitry Vyukov <dvyukov@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: fix a crash in wb_workfn when a device disappearsTheodore Ts'o2020-01-314-21/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Without memcg, there is a one-to-one mapping between the bdi and bdi_writeback structures. In this world, things are fairly straightforward; the first thing bdi_unregister() does is to shutdown the bdi_writeback structure (or wb), and part of that writeback ensures that no other work queued against the wb, and that the wb is fully drained. With memcg, however, there is a one-to-many relationship between the bdi and bdi_writeback structures; that is, there are multiple wb objects which can all point to a single bdi. There is a refcount which prevents the bdi object from being released (and hence, unregistered). So in theory, the bdi_unregister() *should* only get called once its refcount goes to zero (bdi_put will drop the refcount, and when it is zero, release_bdi gets called, which calls bdi_unregister). Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about the Brave New memcg World, and calls bdi_unregister directly. It does this without informing the file system, or the memcg code, or anything else. This causes the root wb associated with the bdi to be unregistered, but none of the memcg-specific wb's are shutdown. So when one of these wb's are woken up to do delayed work, they try to dereference their wb->bdi->dev to fetch the device name, but unfortunately bdi->dev is now NULL, thanks to the bdi_unregister() called by del_gendisk(). As a result, *boom*. Fortunately, it looks like the rest of the writeback path is perfectly happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to create a bdi_dev_name() function which can handle bdi->dev being NULL. This also allows us to bulletproof the writeback tracepoints to prevent them from dereferencing a NULL pointer and crashing the kernel if one is tracing with memcg's enabled, and an iSCSI device dies or a USB storage stick is pulled. The most common way of triggering this will be hotremoval of a device while writeback with memcg enabled is going on. It was triggering several times a day in a heavily loaded production environment. Google Bug Id: 145475544 Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/test_bitmap: correct test data offsets for 32-bitAndy Shevchenko2020-01-311-4/+5
| | | | | | | | | | | | | | | | | On 32-bit platform the size of long is only 32 bits which makes wrong offset in the array of 64 bit size. Calculate offset based on BITS_PER_LONG. Link: http://lkml.kernel.org/r/20200109103601.45929-1-andriy.shevchenko@linux.intel.com Fixes: 30544ed5de43 ("lib/bitmap: introduce bitmap_replace() helper") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Yury Norov <yury.norov@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'for-linus-hmm' of ↵Linus Torvalds2020-01-296-326/+375
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull mmu_notifier updates from Jason Gunthorpe: "This small series revises the names in mmu_notifier to make the code clearer and more readable" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions
| * mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifierJason Gunthorpe2020-01-143-90/+104
| | | | | | | | | | | | | | | | | | The 'interval_sub' is placed on the 'notifier_subscriptions' interval tree. This eliminates the poor name 'mni' for this variable. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
| * mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifierJason Gunthorpe2020-01-142-96/+117
| | | | | | | | | | | | | | | | The 'subscription' is placed on the 'notifier_subscriptions' list. This eliminates the poor name 'mn' for this variable. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
| * mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptionsJason Gunthorpe2020-01-145-170/+184
| | | | | | | | | | | | | | | | | | | | | | The name mmu_notifier_mm implies that the thing is a mm_struct pointer, and is difficult to abbreviate. The struct is actually holding the interval tree and hlist containing the notifiers subscribed to a mm. Use 'subscriptions' as the variable name for this struct instead of the really terrible and misleading 'mmn_mm'. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* | Merge tag 'threads-v5.6' of ↵Linus Torvalds2020-01-2932-7/+427
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread management updates from Christian Brauner: "Sargun Dhillon over the last cycle has worked on the pidfd_getfd() syscall. This syscall allows for the retrieval of file descriptors of a process based on its pidfd. A task needs to have ptrace_may_access() permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and Andy) on the target. One of the main use-cases is in combination with seccomp's user notification feature. As a reminder, seccomp's user notification feature was made available in v5.0. It allows a task to retrieve a file descriptor for its seccomp filter. The file descriptor is usually handed of to a more privileged supervising process. The supervisor can then listen for syscall events caught by the seccomp filter of the supervisee and perform actions in lieu of the supervisee, usually emulating syscalls. pidfd_getfd() is needed to expand its uses. There are currently two major users that wait on pidfd_getfd() and one future user: - Netflix, Sargun said, is working on a service mesh where users should be able to connect to a dns-based VIP. When a user connects to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be redirected to an envoy process. This service mesh uses seccomp user notifications and pidfd to intercept all connect calls and instead of connecting them to 1.2.3.4:80 connects them to e.g. 127.0.0.1:8080. - LXD uses the seccomp notifier heavily to intercept and emulate mknod() and mount() syscalls for unprivileged containers/processes. With pidfd_getfd() more uses-cases e.g. bridging socket connections will be possible. - The patchset has also seen some interest from the browser corner. Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a broker process. In the future glibc will start blocking all signals during dlopen() rendering this type of sandbox impossible. Hence, in the future Firefox will switch to a seccomp-user-nofication based sandbox which also makes use of file descriptor retrieval. The thread for this can be found at https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html With pidfd_getfd() it is e.g. possible to bridge socket connections for the supervisee (binding to a privileged port) and taking actions on file descriptors on behalf of the supervisee in general. Sargun's first version was using an ioctl on pidfds but various people pushed for it to be a proper syscall which he duely implemented as well over various review cycles. Selftests are of course included. I've also added instructions how to deal with merge conflicts below. There's also a small fix coming from the kernel mentee project to correctly annotate struct sighand_struct with __rcu to fix various sparse warnings. We've received a few more such fixes and even though they are mostly trivial I've decided to postpone them until after -rc1 since they came in rather late and I don't want to risk introducing build warnings. Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is needed to avoid allocation recursions triggerable by storage drivers that have userspace parts that run in the IO path (e.g. dm-multipath, iscsi, etc). These allocation recursions deadlock the device. The new prctl() allows such privileged userspace components to avoid allocation recursions by setting the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags. The patch carries the necessary acks from the relevant maintainers and is routed here as part of prctl() thread-management." * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim sched.h: Annotate sighand_struct with __rcu test: Add test for pidfd getfd arch: wire up pidfd_getfd syscall pid: Implement pidfd_getfd syscall vfs, fdtable: Add fget_task helper
| * | prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaimMike Christie2020-01-283-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several storage drivers like dm-multipath, iscsi, tcmu-runner, amd nbd that have userspace components that can run in the IO path. For example, iscsi and nbd's userspace deamons may need to recreate a socket and/or send IO on it, and dm-multipath's daemon multipathd may need to send SG IO or read/write IO to figure out the state of paths and re-set them up. In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the memalloc_*_save/restore functions to control the allocation behavior, but for userspace we would end up hitting an allocation that ended up writing data back to the same device we are trying to allocate for. The device is then in a state of deadlock, because to execute IO the device needs to allocate memory, but to allocate memory the memory layers want execute IO to the device. Here is an example with nbd using a local userspace daemon that performs network IO to a remote server. We are using XFS on top of the nbd device, but it can happen with any FS or other modules layered on top of the nbd device that can write out data to free memory. Here a nbd daemon helper thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute a request. This kicks off a reclaim operation which results in a WRITE to the nbd device and the nbd thread calling back into the mm layer. [ 1626.609191] msgr-worker-1 D 0 1026 1 0x00004000 [ 1626.609193] Call Trace: [ 1626.609195] ? __schedule+0x29b/0x630 [ 1626.609197] ? wait_for_completion+0xe0/0x170 [ 1626.609198] schedule+0x30/0xb0 [ 1626.609200] schedule_timeout+0x1f6/0x2f0 [ 1626.609202] ? blk_finish_plug+0x21/0x2e [ 1626.609204] ? _xfs_buf_ioapply+0x2e6/0x410 [ 1626.609206] ? wait_for_completion+0xe0/0x170 [ 1626.609208] wait_for_completion+0x108/0x170 [ 1626.609210] ? wake_up_q+0x70/0x70 [ 1626.609212] ? __xfs_buf_submit+0x12e/0x250 [ 1626.609214] ? xfs_bwrite+0x25/0x60 [ 1626.609215] xfs_buf_iowait+0x22/0xf0 [ 1626.609218] __xfs_buf_submit+0x12e/0x250 [ 1626.609220] xfs_bwrite+0x25/0x60 [ 1626.609222] xfs_reclaim_inode+0x2e8/0x310 [ 1626.609224] xfs_reclaim_inodes_ag+0x1b6/0x300 [ 1626.609227] xfs_reclaim_inodes_nr+0x31/0x40 [ 1626.609228] super_cache_scan+0x152/0x1a0 [ 1626.609231] do_shrink_slab+0x12c/0x2d0 [ 1626.609233] shrink_slab+0x9c/0x2a0 [ 1626.609235] shrink_node+0xd7/0x470 [ 1626.609237] do_try_to_free_pages+0xbf/0x380 [ 1626.609240] try_to_free_pages+0xd9/0x1f0 [ 1626.609245] __alloc_pages_slowpath+0x3a4/0xd30 [ 1626.609251] ? ___slab_alloc+0x238/0x560 [ 1626.609254] __alloc_pages_nodemask+0x30c/0x350 [ 1626.609259] skb_page_frag_refill+0x97/0xd0 [ 1626.609274] sk_page_frag_refill+0x1d/0x80 [ 1626.609279] tcp_sendmsg_locked+0x2bb/0xdd0 [ 1626.609304] tcp_sendmsg+0x27/0x40 [ 1626.609307] sock_sendmsg+0x54/0x60 [ 1626.609308] ___sys_sendmsg+0x29f/0x320 [ 1626.609313] ? sock_poll+0x66/0xb0 [ 1626.609318] ? ep_item_poll.isra.15+0x40/0xc0 [ 1626.609320] ? ep_send_events_proc+0xe6/0x230 [ 1626.609322] ? hrtimer_try_to_cancel+0x54/0xf0 [ 1626.609324] ? ep_read_events_proc+0xc0/0xc0 [ 1626.609326] ? _raw_write_unlock_irq+0xa/0x20 [ 1626.609327] ? ep_scan_ready_list.constprop.19+0x218/0x230 [ 1626.609329] ? __hrtimer_init+0xb0/0xb0 [ 1626.609331] ? _raw_spin_unlock_irq+0xa/0x20 [ 1626.609334] ? ep_poll+0x26c/0x4a0 [ 1626.609337] ? tcp_tsq_write.part.54+0xa0/0xa0 [ 1626.609339] ? release_sock+0x43/0x90 [ 1626.609341] ? _raw_spin_unlock_bh+0xa/0x20 [ 1626.609342] __sys_sendmsg+0x47/0x80 [ 1626.609347] do_syscall_64+0x5f/0x1c0 [ 1626.609349] ? prepare_exit_to_usermode+0x75/0xa0 [ 1626.609351] entry_SYSCALL_64_after_hwframe+0x44/0xa9 This patch adds a new prctl command that daemons can use after they have done their initial setup, and before they start to do allocations that are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags so both userspace block and FS threads can use it to avoid the allocation recursion and try to prevent from being throttled while writing out data to free up memory. Signed-off-by: Mike Christie <mchristi@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Masato Suzuki <masato.suzuki@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
| * | sched.h: Annotate sighand_struct with __rcuMadhuparna Bhowmik2020-01-262-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the following sparse errors by annotating the sighand_struct with __rcu kernel/fork.c:1511:9: error: incompatible types in comparison expression kernel/exit.c:100:19: error: incompatible types in comparison expression kernel/signal.c:1370:27: error: incompatible types in comparison expression This fix introduces the following sparse error in signal.c due to checking the sighand pointer without rcu primitives: kernel/signal.c:1386:21: error: incompatible types in comparison expression This new sparse error is also fixed in this patch. Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20200124045908.26389-1-madhuparnabhowmik10@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
| * | test: Add test for pidfd getfdSargun Dhillon2020-01-134-1/+260
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following tests: * Fetch FD, and then compare via kcmp * Make sure getfd can be blocked by blocking ptrace_may_access * Making sure fetching bad FDs fails * Make sure trying to set flags to non-zero results in an EINVAL Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200107175927.4558-5-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
| * | arch: wire up pidfd_getfd syscallSargun Dhillon2020-01-1320-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This wires up the pidfd_getfd syscall for all architectures. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-4-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
| * | pid: Implement pidfd_getfd syscallSargun Dhillon2020-01-131-0/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This syscall allows for the retrieval of file descriptors from other processes, based on their pidfd. This is possible using ptrace, and injection of parasitic code to inject code which leverages SCM_RIGHTS to move file descriptors between a tracee and a tracer. Unfortunately, ptrace comes with a high cost of requiring the process to be stopped, and breaks debuggers. This does not require stopping the process under manipulation. One reason to use this is to allow sandboxers to take actions on file descriptors on the behalf of another process. For example, this can be combined with seccomp-bpf's user notification to do on-demand fd extraction and take privileged actions. One such privileged action is binding a socket to a privileged port. /* prototype */ /* flags is currently reserved and should be set to 0 */ int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags); /* testing */ Ran self-test suite on x86_64 Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-3-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
| * | vfs, fdtable: Add fget_task helperSargun Dhillon2020-01-132-2/+22
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This introduces a function which can be used to fetch a file, given an arbitrary task. As long as the user holds a reference (refcnt) to the task_struct it is safe to call, and will either return NULL on failure, or a pointer to the file, with a refcnt. This patch is based on Oleg Nesterov's (cf. [1]) patch from September 2018. [1]: Link: https://lore.kernel.org/r/20180915160423.GA31461@redhat.com Signed-off-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-2-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* | Merge tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-blockLinus Torvalds2020-01-2915-477/+2112
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring updates from Jens Axboe: - Support for various new opcodes (fallocate, openat, close, statx, fadvise, madvise, openat2, non-vectored read/write, send/recv, and epoll_ctl) - Faster ring quiesce for fileset updates - Optimizations for overflow condition checking - Support for max-sized clamping - Support for probing what opcodes are supported - Support for io-wq backend sharing between "sibling" rings - Support for registering personalities - Lots of little fixes and improvements * tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits) io_uring: add support for epoll_ctl(2) eventpoll: support non-blocking do_epoll_ctl() calls eventpoll: abstract out epoll_ctl() handler io_uring: fix linked command file table usage io_uring: support using a registered personality for commands io_uring: allow registering credentials io_uring: add io-wq workqueue sharing io-wq: allow grabbing existing io-wq io_uring/io-wq: don't use static creds/mm assignments io-wq: make the io_wq ref counted io_uring: fix refcounting with batched allocations at OOM io_uring: add comment for drain_next io_uring: don't attempt to copy iovec for READ/WRITE io_uring: honor IOSQE_ASYNC for linked reqs io_uring: prep req when do IOSQE_ASYNC io_uring: use labeled array init in io_op_defs io_uring: optimise sqe-to-req flags translation io_uring: remove REQ_F_IO_DRAINED io_uring: file switch work needs to get flushed on exit io_uring: hide uring_fd in ctx ...
| * | io_uring: add support for epoll_ctl(2)Jens Axboe2020-01-292-0/+72
| | | | | | | | | | | | | | | | | | | | | This adds IORING_OP_EPOLL_CTL, which can perform the same work as the epoll_ctl(2) system call. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | eventpoll: support non-blocking do_epoll_ctl() callsJens Axboe2020-01-292-13/+42
| | | | | | | | | | | | | | | | | | | | | Also make it available outside of epoll, along with the helper that decides if we need to copy the passed in epoll_event. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | eventpoll: abstract out epoll_ctl() handlerJens Axboe2020-01-291-20/+25
| | | | | | | | | | | | | | | | | | No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: fix linked command file table usageJens Axboe2020-01-293-14/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're not consistent in how the file table is grabbed and assigned if we have a command linked that requires the use of it. Add ->file_table to the io_op_defs[] array, and use that to determine when to grab the table instead of having the handlers set it if they need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES flag. We always initialize work->files, so io-wq can just check for that. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: support using a registered personality for commandsJens Axboe2020-01-282-2/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For personalities previously registered via IORING_REGISTER_PERSONALITY, allow any command to select them. This is done through setting sqe->personality to the id returned from registration, and then flagging sqe->flags with IOSQE_PERSONALITY. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: allow registering credentialsJens Axboe2020-01-282-7/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an application wants to use a ring with different kinds of credentials, it can register them upfront. We don't lookup credentials, the credentials of the task calling IORING_REGISTER_PERSONALITY is used. An 'id' is returned for the application to use in subsequent personality support. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add io-wq workqueue sharingPavel Begunkov2020-01-282-15/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to be a valid io_uring fd io-wq of which will be shared with the newly created io_uring instance. If the flag is set but it can't share io-wq, it fails. This allows creation of "sibling" io_urings, where we prefer to keep the SQ/CQ private, but want to share the async backend to minimize the amount of overhead associated with having multiple rings that belong to the same backend. Reported-by: Jens Axboe <axboe@kernel.dk> Reported-by: Daurnimator <quae@daurnimator.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io-wq: allow grabbing existing io-wqPavel Begunkov2020-01-282-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | Export a helper to attach to an existing io-wq, rather than setting up a new one. This is doable now that we have reference counted io_wq's. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring/io-wq: don't use static creds/mm assignmentsJens Axboe2020-01-284-30/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently setup the io_wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal as we have may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all. Switch to passing in the creds and mm when the work item is setup. This means that async work is no longer deferred to the io_uring mm and creds, it is done with the current mm and creds. Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue. Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io-wq: make the io_wq ref countedJens Axboe2020-01-271-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | In preparation for sharing an io-wq across different users, add a reference count that manages destruction of it. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: fix refcounting with batched allocations at OOMPavel Begunkov2020-01-271-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of out of memory the second argument of percpu_ref_put_many() in io_submit_sqes() may evaluate into "nr - (-EAGAIN)", that is clearly wrong. Fixes: 2b85edfc0c90 ("io_uring: batch getting pcpu references") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add comment for drain_nextPavel Begunkov2020-01-271-0/+7
| | | | | | | | | | | | | | | | | | | | | Draining the middle of a link is tricky, so leave a comment there Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: don't attempt to copy iovec for READ/WRITEJens Axboe2020-01-271-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the non-vectored variant of READV/WRITEV, we don't need to setup an async io context, and we flag that appropriately in the io_op_defs array. However, in fixing this for the 5.5 kernel in commit 74566df3a71c we didn't have these opcodes, so the check there was added just for the READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a single check for needing async context, that covers all four of these read/write variants that don't use an iovec. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: honor IOSQE_ASYNC for linked reqsPavel Begunkov2020-01-221-0/+4
| | | | | | | | | | | | | | | | | | | | | REQ_F_FORCE_ASYNC is checked only for the head of a link. Fix it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: prep req when do IOSQE_ASYNCPavel Begunkov2020-01-221-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever IOSQE_ASYNC is set, requests will be punted to async without getting into io_issue_req() and without proper preparation done (e.g. io_req_defer_prep()). Hence they will be left uninitialised. Prepare them before punting. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: use labeled array init in io_op_defsPavel Begunkov2020-01-201-62/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | Don't rely on implicit ordering of IORING_OP_ and explicitly place them at a right place in io_op_defs. Now former comments are now a part of the code and won't ever outdate. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: optimise sqe-to-req flags translationPavel Begunkov2020-01-202-34/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there is a repetitive pattern of their translation: e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG* Use same numeric values/bits for them and copy instead of manual handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: remove REQ_F_IO_DRAINEDPavel Begunkov2020-01-201-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | A request can get into the defer list only once, there is no need for marking it as drained, so remove it. This probably was left after extracting __need_defer() for use in timeouts. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: file switch work needs to get flushed on exitJens Axboe2020-01-201-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | We currently flush early, but if we have something in progress and a new switch is scheduled, we need to ensure to flush after our teardown as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: hide uring_fd in ctxPavel Begunkov2020-01-201-15/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | req->ring_fd and req->ring_file are used only during the prep stage during submission, which is is protected by mutex. There is no need to store them per-request, place them in ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: remove extra check in __io_commit_cqringPavel Begunkov2020-01-201-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | __io_commit_cqring() is almost always called when there is a change in the rings, so the check is rather pessimising. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: optimise use of ctx->drain_nextPavel Begunkov2020-01-201-20/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move setting ctx->drain_next to the only place it could be set, when it got linked non-head requests. The same for checking it, it's interesting only for a head of a link or a non-linked request. No functional changes here. This removes some code from the common path and also removes REQ_F_DRAIN_LINK flag, as it doesn't need it anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add support for probing opcodesJens Axboe2020-01-202-2/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other leg work in terms of setup first. This adds IORING_REGISTER_PROBE, which fills in a structure with info on what it supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: account fixed file references correctly in batchJens Axboe2020-01-201-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can't assume that the whole batch has fixed files in it. If it's a mix, or none at all, then we can end up doing a ref put that either messes up accounting, or causes an oops if we have no fixed files at all. Also ensure we free requests properly between inflight accounted and normal requests. Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases") Reported-by: Dmitrii Dolgov <9erthalion6@gmail.com> Reported-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Dmitrii Dolgov <9erthalion6@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add opcode to issue trace eventJens Axboe2020-01-202-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | For some test apps at least, user_data is just zeroes. So it's not a good way to tell what the command actually is. Add the opcode to the issue trace point. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add support for IORING_OP_OPENAT2Jens Axboe2020-01-202-6/+64
| | | | | | | | | | | | | | | | | | | | | | | | Add support for the new openat2(2) system call. It's trivial to do, as we can have openat(2) just be wrapped around it. Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>