summaryrefslogtreecommitdiffstats
path: root/fs/fuse
Commit message (Collapse)AuthorAgeFilesLines
* fuse: direct-io: don't dirty ITER_BVEC pagesMiklos Szeredi2016-08-241-3/+4
| | | | | | | | | | | | | | | | | | | | When reading from a loop device backed by a fuse file it deadlocks on lock_page(). This is because the page is already locked by the read() operation done on the loop device. In this case we don't want to either lock the page or dirty it. So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors. Reported-by: Sheng Yang <sheng@yasker.org> Fixes: aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.1+ Reviewed-by: Sheng Yang <sheng@yasker.org> Reviewed-by: Ashish Samant <ashish.samant@oracle.com> Tested-by: Sheng Yang <sheng@yasker.org> Tested-by: Ashish Samant <ashish.samant@oracle.com>
* Merge branch 'work.const-qstr' of ↵Linus Torvalds2016-08-063-9/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull qstr constification updates from Al Viro: "Fairly self-contained bunch - surprising lot of places passes struct qstr * as an argument when const struct qstr * would suffice; it complicates analysis for no good reason. I'd prefer to feed that separately from the assorted fixes (those are in #for-linus and with somewhat trickier topology)" * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: qstr: constify instances in adfs qstr: constify instances in lustre qstr: constify instances in f2fs qstr: constify instances in ext2 qstr: constify instances in vfat qstr: constify instances in procfs qstr: constify instances in fuse qstr constify instances in fs/dcache.c qstr: constify instances in nfs qstr: constify instances in ocfs2 qstr: constify instances in autofs4 qstr: constify instances in hfs qstr: constify instances in hfsplus qstr: constify instances in logfs qstr: constify dentry_init_security
| * qstr: constify instances in fuseAl Viro2016-07-303-9/+5
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for-linus' of ↵Linus Torvalds2016-07-294-40/+28
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This fixes error propagation from writeback to fsync/close for writeback cache mode as well as adding a missing capability flag to the INIT message. The rest are cleanups. (The commits are recent but all the code actually sat in -next for a while now. The recommits are due to conflict avoidance and the addition of Cc: stable@...)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: use filemap_check_errors() mm: export filemap_check_errors() to modules fuse: fix wrong assignment of ->flags in fuse_send_init() fuse: fuse_flush must check mapping->flags for errors fuse: fsync() did not return IO errors fuse: don't mess with blocking signals new helper: wait_event_killable_exclusive() fuse: improve aio directIO write performance for size extending writes
| * | fuse: use filemap_check_errors()Miklos Szeredi2016-07-291-12/+2
| | | | | | | | | | | | Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: fix wrong assignment of ->flags in fuse_send_init()Wei Fang2016-07-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | FUSE_HAS_IOCTL_DIR should be assigned to ->flags, it may be a typo. Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 69fe05c90ed5 ("fuse: add missing INIT flags") Cc: <stable@vger.kernel.org>
| * | fuse: fuse_flush must check mapping->flags for errorsMaxim Patlasov2016-07-291-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fuse_flush() calls write_inode_now() that triggers writeback, but actual writeback will happen later, on fuse_sync_writes(). If an error happens, fuse_writepage_end() will set error bit in mapping->flags. So, we have to check mapping->flags after fuse_sync_writes(). Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on") Cc: <stable@vger.kernel.org> # v3.15+
| * | fuse: fsync() did not return IO errorsAlexey Kuznetsov2016-07-291-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to implementation of fuse writeback filemap_write_and_wait_range() does not catch errors. We have to do this directly after fuse_sync_writes() Signed-off-by: Alexey Kuznetsov <kuznet@virtuozzo.com> Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on") Cc: <stable@vger.kernel.org> # v3.15+
| * | Merge branch 'for-miklos' of ↵Miklos Szeredi2016-07-212-28/+4
| |\ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-next
| | * | fuse: don't mess with blocking signalsAl Viro2016-07-191-27/+3
| | | | | | | | | | | | | | | | | | | | | | | | just use wait_event_killable{,_exclusive}(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fuse: improve aio directIO write performance for size extending writesAshish Sangwan2016-06-302-12/+10
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While sending the blocking directIO in fuse, the write request is broken into sub-requests, each of default size 128k and all the requests are sent in non-blocking background mode if async_dio mode is supported by libfuse. The process which issue the write wait for the completion of all the sub-requests. Sending multiple requests parallely gives a chance to perform parallel writes in the user space fuse implementation if it is multi-threaded and hence improves the performance. When there is a size extending aio dio write, we switch to blocking mode so that we can properly update the size of the file after completion of the writes. However, in this situation all the sub-requests are sent in serialized manner where the next request is sent only after receiving the reply of the current request. Hence the multi-threaded user space implementation is not utilized properly. This patch changes the size extending aio dio behavior to exactly follow blocking dio. For multi threaded fuse implementation having 10 threads and using buffer size of 64MB to perform async directIO, we are getting double the speed. Signed-off-by: Ashish Sangwan <ashishsangwan2@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-07-281-4/+4
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge more updates from Andrew Morton: "The rest of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits) mm, compaction: simplify contended compaction handling mm, compaction: introduce direct compaction priority mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations mm, page_alloc: make THP-specific decisions more generic mm, page_alloc: restructure direct compaction handling in slowpath mm, page_alloc: don't retry initial attempt in slowpath mm, page_alloc: set alloc_flags only once in slowpath lib/stackdepot.c: use __GFP_NOWARN for stack allocations mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB mm, kasan: account for object redzone in SLUB's nearest_obj() mm: fix use-after-free if memory allocation failed in vma_adjust() zsmalloc: Delete an unnecessary check before the function call "iput" mm/memblock.c: fix index adjustment error in __next_mem_range_rev() mem-hotplug: alloc new page from a nearest neighbor node when mem-offline mm: optimize copy_page_to/from_iter_iovec mm: add cond_resched() to generic_swapfile_activate() Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements" mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode mm: hwpoison: remove incorrect comments make __section_nr() more efficient ...
| * | | mm: move most file-based accounting to the nodeMel Gorman2016-07-281-4/+4
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are now a number of accounting oddities such as mapped file pages being accounted for on the node while the total number of file pages are accounted on the zone. This can be coped with to some extent but it's confusing so this patch moves the relevant file-based accounted. Due to throttling logic in the page allocator for reliable OOM detection, it is still necessary to track dirty and writeback pages on a per-zone basis. [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting] Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'salted-string-hash'Linus Torvalds2016-07-282-3/+2
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes the vfs dentry hashing to mix in the parent pointer at the _beginning_ of the hash, rather than at the end. That actually improves both the hash and the code generation, because we can move more of the computation to the "static" part of the dcache setup, and do less at lookup runtime. It turns out that a lot of other hash users also really wanted to mix in a base pointer as a 'salt' for the hash, and so the slightly extended interface ends up working well for other cases too. Users that want a string hash that is purely about the string pass in a 'salt' pointer of NULL. * merge branch 'salted-string-hash': fs/dcache.c: Save one 32-bit multiply in dcache lookup vfs: make the string hashes salt the hash
| * | vfs: make the string hashes salt the hashLinus Torvalds2016-06-102-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We always mixed in the parent pointer into the dentry name hash, but we did it late at lookup time. It turns out that we can simplify that lookup-time action by salting the hash with the parent pointer early instead of late. A few other users of our string hashes also wanted to mix in their own pointers into the hash, and those are updated to use the same mechanism. Hash users that don't have any particular initial salt can just use the NULL pointer as a no-salt. Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Use the right predicate in ->atomic_open() instancesAl Viro2016-07-051-1/+1
| |/ |/| | | | | | | | | | | | | ->atomic_open() can be given an in-lookup dentry *or* a negative one found in dcache. Use d_in_lookup() to tell one from another, rather than d_unhashed(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fuse: serialize dirops by defaultMiklos Szeredi2016-06-303-1/+31
|/ | | | | | | | | | Negotiate with userspace filesystems whether they support parallel readdir and lookup. Disable parallelism by default for fear of breaking fuse filesystems. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 9902af79c01a ("parallel lookups: actual switch to rwsem") Fixes: d9b3dbdcfd62 ("fuse: switch to ->iterate_shared()")
* switch ->setxattr() to passing dentry and inode separatelyAl Viro2016-05-271-3/+3
| | | | | | | | | | | | | | | | smack ->d_instantiate() uses ->setxattr(), so to be able to call it before we'd hashed the new dentry and attached it to inode, we need ->setxattr() instances getting the inode as an explicit argument rather than obtaining it from dentry. Similar change for ->getxattr() had been done in commit ce23e64. Unlike ->getxattr() (which is used by both selinux and smack instances of ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately it got missed back then. Reported-by: Seung-Woo Kim <sw0312.kim@samsung.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'work.preadv2' of ↵Linus Torvalds2016-05-171-2/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs cleanups from Al Viro: "More cleanups from Christoph" * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nfsd: use RWF_SYNC fs: add RWF_DSYNC aand RWF_SYNC ceph: use generic_write_sync fs: simplify the generic_write_sync prototype fs: add IOCB_SYNC and IOCB_DSYNC direct-io: remove the offset argument to dio_complete direct-io: eliminate the offset argument to ->direct_IO xfs: eliminate the pos variable in xfs_file_dio_aio_write filemap: remove the pos argument to generic_file_direct_write filemap: remove pos variables in generic_file_read_iter
| * direct-io: eliminate the offset argument to ->direct_IOChristoph Hellwig2016-05-011-1/+2
| | | | | | | | | | | | | | | | Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually work, so eliminate the superflous argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * filemap: remove the pos argument to generic_file_direct_writeChristoph Hellwig2016-05-011-1/+1
| | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'ovl-fixes' into for-linusAl Viro2016-05-171-1/+1
|\ \ | | | | | | | | | | | | | | | Backmerge to resolve a conflict in ovl_lookup_real(); "ovl_lookup_real(): use lookup_one_len_unlocked()" instead, but it was too late in the cycle to rebase.
| * | fuse: Fix return value from fuse_get_user_pages()Ashish Samant2016-04-251-1/+1
| |/ | | | | | | | | | | | | | | | | | | fuse_get_user_pages() should return error or 0. Otherwise fuse_direct_io read will not return 0 to indicate that read has completed. Fixes: 742f992708df ("fuse: return patrial success from fuse_direct_io()") Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | fuse: switch to ->iterate_shared()Al Viro2016-05-021-49/+45
| | | | | | | | | | | | | | Switch dcache pre-seeding on readdir to d_alloc_parallel(); nothing else is needed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge getxattr prototype change into work.lookupsAl Viro2016-05-021-3/+2
|\ \ | |/ |/| | | The rest of work.xattr stuff isn't needed for this branch
| * ->getxattr(): pass dentry and inode as separate argumentsAl Viro2016-04-111-3/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov2016-04-043-57/+57
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fuse: return patrial success from fuse_direct_io()Ashish Samant2016-03-161-12/+11
| | | | | | | | | | | | | | If a user calls writev/readv in direct io mode with partially valid data in the iovec array such that any vector other than the first one in the array contains invalid data, we currently return the error for the invalid iovec. Instead, we should return the number of bytes already written/read and not the error as we do in the non direct_io case. Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: Add reference counting for fuse_io_privSeth Forshee2016-03-143-9/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'reqs' member of fuse_io_priv serves two purposes. First is to track the number of oustanding async requests to the server and to signal that the io request is completed. The second is to be a reference count on the structure to know when it can be freed. For sync io requests these purposes can be at odds. fuse_direct_IO() wants to block until the request is done, and since the signal is sent when 'reqs' reaches 0 it cannot keep a reference to the object. Yet it needs to use the object after the userspace server has completed processing requests. This leads to some handshaking and special casing that it needlessly complicated and responsible for at least one race condition. It's much cleaner and safer to maintain a separate reference count for the object lifecycle and to let 'reqs' just be a count of outstanding requests to the userspace server. Then we can know for sure when it is safe to free the object without any handshaking or special cases. The catch here is that most of the time these objects are stack allocated and should not be freed. Initializing these objects with a single reference that is never released prevents accidental attempts to free the objects. Fixes: 9d5722b7777e ("fuse: handle synchronous iocbs internally") Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: do not use iocb after it may have been freedRobert Doebbelin2016-03-141-3/+4
| | | | | | | | | | | | | | | | There's a race in fuse_direct_IO(), whereby is_sync_kiocb() is called on an iocb that could have been freed if async io has already completed. The fix in this case is simple and obvious: cache the result before starting io. It was discovered by KASan: kernel: ================================================================== kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88036c414390 Signed-off-by: Robert Doebbelin <robert@quobyte.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: bcba24ccdc82 ("fuse: enable asynchronous processing direct IO") Cc: <stable@vger.kernel.org> # 3.10+
* wrappers for ->i_mutex accessAl Viro2016-01-222-23/+23
| | | | | | | | | | | parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2016-01-212-8/+68
|\ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This adds SEEK_HOLE and SEEK_DATA support in lseek" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add support for SEEK_HOLE and SEEK_DATA in lseek
| * fuse: add support for SEEK_HOLE and SEEK_DATA in lseekRavishankar N2015-11-102-8/+68
| | | | | | | | | | | | | | | | | | | | | | A useful performance improvement for accessing virtual machine images via FUSE mount. See https://bugzilla.redhat.com/show_bug.cgi?id=1220173 for a use-case for glusterFS. Signed-off-by: Ravishankar N <ravishankar@redhat.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
* | kmemcg: account certain kmem allocations to memcgVladimir Davydov2016-01-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'work.symlinks' of ↵Linus Torvalds2016-01-111-7/+10
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs RCU symlink updates from Al Viro: "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode even if the symlink is not an embedded one. No changes since the mailbomb on Jan 1" * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch ->get_link() to delayed_call, kill ->put_link() kill free_page_put_link() teach nfs_get_link() to work in RCU mode teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode teach shmem_get_link() to work in RCU mode teach page_get_link() to work in RCU mode replace ->follow_link() with new method that could stay in RCU mode don't put symlink bodies in pagecache into highmem namei: page_getlink() and page_follow_link_light() are the same thing ufs: get rid of ->setattr() for symlinks udf: don't duplicate page_symlink_inode_operations logfs: don't duplicate page_symlink_inode_operations switch befs long symlinks to page_symlink_operations
| * | switch ->get_link() to delayed_call, kill ->put_link()Al Viro2015-12-301-3/+3
| | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | kill free_page_put_link()Al Viro2015-12-291-3/+3
| | | | | | | | | | | | | | | | | | all callers are better off with kfree_put_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | replace ->follow_link() with new method that could stay in RCU modeAl Viro2015-12-081-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | new method: ->get_link(); replacement of ->follow_link(). The differences are: * inode and dentry are passed separately * might be called both in RCU and non-RCU mode; the former is indicated by passing it a NULL dentry. * when called that way it isn't allowed to block and should return ERR_PTR(-ECHILD) if it needs to be called in non-RCU mode. It's a flagday change - the old method is gone, all in-tree instances converted. Conversion isn't hard; said that, so far very few instances do not immediately bail out when called in RCU mode. That'll change in the next commits. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2015-12-112-1/+3
|\ \ \ | |/ / |/| / | |/ | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Two bugfixes, both bound for -stable" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: break infinite loop in fuse_fill_write_pages() cuse: fix memory leak
| * fuse: break infinite loop in fuse_fill_write_pages()Roman Gushchin2015-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got a report about unkillable task eating CPU. Further investigation shows, that the problem is in the fuse_fill_write_pages() function. If iov's first segment has zero length, we get an infinite loop, because we never reach iov_iter_advance() call. Fix this by calling iov_iter_advance() before repeating an attempt to copy data from userspace. A similar problem is described in 124d3b7041f ("fix writev regression: pan hanging unkillable and un-straceable"). If zero-length segmend is followed by segment with invalid address, iov_iter_fault_in_readable() checks only first segment (zero-length), iov_iter_copy_from_user_atomic() skips it, fails at second and returns zero -> goto again without skipping zero-length segment. Patch calls iov_iter_advance() before goto again: we'll skip zero-length segment at second iteraction and iov_iter_fault_in_readable() will detect invalid address. Special thanks to Konstantin Khlebnikov, who helped a lot with the commit description. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Maxim Patlasov <mpatlasov@parallels.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: ea9b9907b82a ("fuse: implement perform_write") Cc: <stable@vger.kernel.org>
| * cuse: fix memory leakMiklos Szeredi2015-11-101-0/+2
| | | | | | | | | | | | | | | | | | | | The problem is that fuse_dev_alloc() acquires an extra reference to cc.fc, and the original ref count is never dropped. Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: cc080e9e9be1 ("fuse: introduce per-instance fuse_dev structure") Cc: <stable@vger.kernel.org> # v4.2+
* | Move locks API users to locks_lock_inode_wait()Benjamin Coddington2015-10-221-1/+1
|/ | | | | | | | | Instead of having users check for FL_POSIX or FL_FLOCK to call the correct locks API function, use the check within locks_lock_inode_wait(). This allows for some later cleanup. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
* fs/fuse: fix ioctl type confusionJann Horn2015-08-161-1/+9
| | | | | | | | | fuse_dev_ioctl() performed fuse_get_dev() on a user-supplied fd, leading to a type confusion issue. Fix it by checking file->f_op. Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-07-041-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...
| * fs: Rename file_remove_suid() to file_remove_privs()Jan Kara2015-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | file_remove_suid() is a misnomer since it removes also file capabilities stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells something else than whether file_remove_suid() call is necessary which leads to bugs. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for-linus' of ↵Linus Torvalds2015-07-031-6/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace updates from Eric Biederman: "Long ago and far away when user namespaces where young it was realized that allowing fresh mounts of proc and sysfs with only user namespace permissions could violate the basic rule that only root gets to decide if proc or sysfs should be mounted at all. Some hacks were put in place to reduce the worst of the damage could be done, and the common sense rule was adopted that fresh mounts of proc and sysfs should allow no more than bind mounts of proc and sysfs. Unfortunately that rule has not been fully enforced. There are two kinds of gaps in that enforcement. Only filesystems mounted on empty directories of proc and sysfs should be ignored but the test for empty directories was insufficient. So in my tree directories on proc, sysctl and sysfs that will always be empty are created specially. Every other technique is imperfect as an ordinary directory can have entries added even after a readdir returns and shows that the directory is empty. Special creation of directories for mount points makes the code in the kernel a smidge clearer about it's purpose. I asked container developers from the various container projects to help test this and no holes were found in the set of mount points on proc and sysfs that are created specially. This set of changes also starts enforcing the mount flags of fresh mounts of proc and sysfs are consistent with the existing mount of proc and sysfs. I expected this to be the boring part of the work but unfortunately unprivileged userspace winds up mounting fresh copies of proc and sysfs with noexec and nosuid clear when root set those flags on the previous mount of proc and sysfs. So for now only the atime, read-only and nodev attributes which userspace happens to keep consistent are enforced. Dealing with the noexec and nosuid attributes remains for another time. This set of changes also addresses an issue with how open file descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of /proc/<pid>/fd has been triggering a WARN_ON that has not been meaningful since it was added (as all of the code in the kernel was converted) and is not now actively wrong. There is also a short list of issues that have not been fixed yet that I will mention briefly. It is possible to rename a directory from below to above a bind mount. At which point any directory pointers below the renamed directory can be walked up to the root directory of the filesystem. With user namespaces enabled a bind mount of the bind mount can be created allowing the user to pick a directory whose children they can rename to outside of the bind mount. This is challenging to fix and doubly so because all obvious solutions must touch code that is in the performance part of pathname resolution. As mentioned above there is also a question of how to ensure that developers by accident or with purpose do not introduce exectuable files on sysfs and proc and in doing so introduce security regressions in the current userspace that will not be immediately obvious and as such are likely to require breaking userspace in painful ways once they are recognized" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: vfs: Remove incorrect debugging WARN in prepend_path mnt: Update fs_fully_visible to test for permanently empty directories sysfs: Create mountpoints with sysfs_create_mount_point sysfs: Add support for permanently empty directories to serve as mount points. kernfs: Add support for always empty directories. proc: Allow creating permanently empty directories that serve as mount points sysctl: Allow creating permanently empty directories that serve as mountpoints. fs: Add helper functions for permanently empty directories. vfs: Ignore unlocked mounts in fs_fully_visible mnt: Modify fs_fully_visible to deal with locked ro nodev and atime mnt: Refactor the logic for mounting sysfs and proc in a user namespace
| * | sysfs: Create mountpoints with sysfs_create_mount_pointEric W. Biederman2015-07-011-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows for better documentation in the code and it allows for a simpler and fully correct version of fs_fully_visible to be written. The mount points converted and their filesystems are: /sys/hypervisor/s390/ s390_hypfs /sys/kernel/config/ configfs /sys/kernel/debug/ debugfs /sys/firmware/efi/efivars/ efivarfs /sys/fs/fuse/connections/ fusectl /sys/fs/pstore/ pstore /sys/kernel/tracing/ tracefs /sys/fs/cgroup/ cgroup /sys/kernel/security/ securityfs /sys/fs/selinux/ selinuxfs /sys/fs/smackfs/ smackfs Cc: stable@vger.kernel.org Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2015-07-025-489/+624
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This is the start of improving fuse scalability. An input queue and a processing queue is split out from the monolithic fuse connection, each of those having their own spinlock. The end of the patchset adds the ability to clone a fuse connection. This means, that instead of having to read/write requests/answers on a single fuse device fd, the fuse daemon can have multiple distinct file descriptors open. Each of those can be used to receive requests and send answers, currently the only constraint is that a request must be answered on the same fd as it was read from. This can be extended further to allow binding a device clone to a specific CPU or NUMA node. Based on a patchset by Srinivas Eeda and Ashish Samant. Thanks to Ashish for the review of this series" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (40 commits) fuse: update MAINTAINERS entry fuse: separate pqueue for clones fuse: introduce per-instance fuse_dev structure fuse: device fd clone fuse: abort: no fc->lock needed for request ending fuse: no fc->lock for pqueue parts fuse: no fc->lock in request_end() fuse: cleanup request_end() fuse: request_end(): do once fuse: add req flag for private list fuse: pqueue locking fuse: abort: group pqueue accesses fuse: cleanup fuse_dev_do_read() fuse: move list_del_init() from request_end() into callers fuse: duplicate ->connected in pqueue fuse: separate out processing queue fuse: simplify request_wait() fuse: no fc->lock for iqueue parts fuse: allow interrupt queuing without fc->lock fuse: iqueue locking ...
| * | | fuse: separate pqueue for clonesMiklos Szeredi2015-07-013-30/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make each fuse device clone refer to a separate processing queue. The only constraint on userspace code is that the request answer must be written to the same device clone as it was read off. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * | | fuse: introduce per-instance fuse_dev structureMiklos Szeredi2015-07-014-34/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow fuse device clones to refer to be distinguished. This patch just adds the infrastructure by associating a separate "struct fuse_dev" with each clone. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: Ashish Samant <ashish.samant@oracle.com>