summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* mm: zero-seek shrinkersJohannes Weiner2018-10-262-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The page cache and most shrinkable slab caches hold data that has been read from disk, but there are some caches that only cache CPU work, such as the dentry and inode caches of procfs and sysfs, as well as the subset of radix tree nodes that track non-resident page cache. Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for the shrinker's seeks setting tells the reclaim algorithm that for every two page cache pages scanned it should scan one slab object. This is a bogus setting. A virtual inode that required no IO to create is not twice as valuable as a page cache page; shadow cache entries with eviction distances beyond the size of memory aren't either. In most cases, the behavior in practice is still fine. Such virtual caches don't tend to grow and assert themselves aggressively, and usually get picked up before they cause problems. But there are scenarios where that's not true. Our database workloads suffer from two of those. For one, their file workingset is several times bigger than available memory, which has the kernel aggressively create shadow page cache entries for the non-resident parts of it. The workingset code does tell the VM that most of these are expendable, but the VM ends up balancing them 2:1 to cache pages as per the seeks setting. This is a huge waste of memory. These workloads also deal with tens of thousands of open files and use /proc for introspection, which ends up growing the proc_inode_cache to absurdly large sizes - again at the cost of valuable cache space, which isn't a reasonable trade-off, given that proc inodes can be re-created without involving the disk. This patch implements a "zero-seek" setting for shrinkers that results in a target ratio of 0:1 between their objects and IO-backed caches. This allows such virtual caches to grow when memory is available (they do cache/avoid CPU work after all), but effectively disables them as soon as IO-backed objects are under pressure. It then switches the shrinkers for procfs and sysfs metadata, as well as excess page cache shadow nodes, to the new zero-seek setting. Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Domas Mituzas <dmituzas@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOADJohannes Weiner2018-10-261-3/+0
| | | | | | | | | | | | | | | | | | | | | | | There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, proc: add KReclaimable to /proc/meminfoVlastimil Babka2018-10-261-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vmstat NR_KERNEL_MISC_RECLAIMABLE counter is for kernel non-slab allocations that can be reclaimed via shrinker. In /proc/meminfo, we can show the sum of all reclaimable kernel allocations (including slab) as "KReclaimable". Add the same counter also to per-node meminfo under /sys With this counter, users will have more complete information about kernel memory usage. Non-slab reclaimable pages (currently just the ION allocator) will not be missing from /proc/meminfo, making users wonder where part of their memory went. More precisely, they already appear in MemAvailable, but without the new counter, it's not obvious why the value in MemAvailable doesn't fully correspond with the sum of other counters participating in it. Link: http://lkml.kernel.org/r/20180731090649.16028-6-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dcache: allocate external names from reclaimable kmalloc cachesVlastimil Babka2018-10-261-29/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We can use the newly introduced kmalloc-reclaimable-X caches, to allocate external names in dcache, which will take care of the proper accounting automatically, and also improve anti-fragmentation page grouping. This effectively reverts commit f1782c9bc547 ("dcache: account external names as indirectly reclaimable memory") and instead passes __GFP_RECLAIMABLE to kmalloc(). The accounting thus moves from NR_INDIRECTLY_RECLAIMABLE_BYTES to NR_SLAB_RECLAIMABLE, which is also considered in MemAvailable calculation and overcommit decisions. Link: http://lkml.kernel.org/r/20180731090649.16028-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cramfs: convert to use vmf_insert_mixedNicolas Pitre2018-10-261-1/+4
| | | | | | | | | | | | | | | cramfs is the only remaining user of vm_insert_mixed() and should be converted to vmf_insert_mixed(). Based on a previous patch from Matthew Wilcox. Link: http://lkml.kernel.org/r/nycvar.YSQ.7.76.1808290945450.10215@knanqh.ubzr Signed-off-by: Nicolas Pitre <nico@linaro.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com>a Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/iomap.c: change return type to vm_fault_tSouptick Joarder2018-10-261-1/+1
| | | | | | | | | | | | | | Change iomap_page_mkwrite() return type to vm_fault_t. see commit 1c8f422059ae ("mm: change return type to vm_fault_t") for reference. Link: http://lkml.kernel.org/r/20180827172050.GA18673@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: remove set but not used variable 'rb'YueHaibing2018-10-261-2/+0
| | | | | | | | | | | | | | | | | | | Fixes gcc '-Wunused-but-set-variable' warning: fs/ocfs2/refcounttree.c: In function 'ocfs2_create_reflink_node': fs/ocfs2/refcounttree.c:4138:31: warning: variable 'rb' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1536198443-113047-1-git-send-email-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/ocfs2/dlm/dlmdebug.c: fix a sleep-in-atomic-context bug in ↵Jia-Ju Bai2018-10-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dlm_print_one_mle() The kernel module may sleep with holding a spinlock. The function call paths (from bottom to top) in Linux-4.16 are: [FUNC] get_zeroed_page(GFP_NOFS) fs/ocfs2/dlm/dlmdebug.c, 332: get_zeroed_page in dlm_print_one_mle fs/ocfs2/dlm/dlmmaster.c, 240: dlm_print_one_mle in __dlm_put_mle fs/ocfs2/dlm/dlmmaster.c, 255: __dlm_put_mle in dlm_put_mle fs/ocfs2/dlm/dlmmaster.c, 254: spin_lock in dlm_put_ml [FUNC] get_zeroed_page(GFP_NOFS) fs/ocfs2/dlm/dlmdebug.c, 332: get_zeroed_page in dlm_print_one_mle fs/ocfs2/dlm/dlmmaster.c, 240: dlm_print_one_mle in __dlm_put_mle fs/ocfs2/dlm/dlmmaster.c, 222: __dlm_put_mle in dlm_put_mle_inuse fs/ocfs2/dlm/dlmmaster.c, 219: spin_lock in dlm_put_mle_inuse To fix this bug, GFP_NOFS is replaced with GFP_ATOMIC. This bug is found by my static analysis tool DSAC. Link: http://lkml.kernel.org/r/20180901112528.27025-1-baijiaju1990@gmail.com Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: remove unneeded null checkDing Xiang2018-10-261-2/+1
| | | | | | | | | | | | | | | Null check for kfree is unnecessary, so remove it. Link: http://lkml.kernel.org/r/1535704514-26559-1-git-send-email-dingxiang@cmss.chinamobile.com Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: remove unused pointer 'eb'Colin Ian King2018-10-261-4/+0
| | | | | | | | | | | | | | | | | | | Pointer 'eb' is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'eb' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/20180828141907.10826-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2/dlm: remove unnecessary parenthesesNathan Chancellor2018-10-261-1/+1
| | | | | | | | | | | | | | | | | | | | | Clang warns when more than one set of parentheses is used for a single conditional statement: fs/ocfs2/dlm/dlmthread.c:534:18: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] if ((res->owner == dlm->node_num)) { ~~~~~~~~~~~^~~~~~~~~~~~~~~~ fs/ocfs2/dlm/dlmthread.c:534:18: note: remove extraneous parentheses around the comparison to silence this warning if ((res->owner == dlm->node_num)) { ~ ^ ~ Link: http://lkml.kernel.org/r/20180924181929.6853-1-natechancellor@gmail.com Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* userfaultfd: disable irqs when taking the waitqueue lockChristoph Hellwig2018-10-261-4/+4
| | | | | | | | | | | | | | | | userfaultfd contains howe-grown locking of the waitqueue lock, and does not disable interrupts. This relies on the fact that no one else takes it from interrupt context and violates an invariat of the normal waitqueue locking scheme. With aio poll it is easy to trigger other locks that disable interrupts (or are called from interrupt context). Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> [4.19.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: /proc/pid/smaps_rollup: fix NULL pointer deref in smaps_pte_range()Vlastimil Babka2018-10-261-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Leonardo reports an apparent regression in 4.19-rc7: BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631 Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018 RIP: 0010:smaps_pte_range+0x32d/0x540 Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 <48> 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8 RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001 RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578 R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728 R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48 FS: 00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __walk_page_range+0x3c2/0x6f0 walk_page_vma+0x42/0x60 smap_gather_stats+0x79/0xe0 ? gather_pte_stats+0x320/0x320 ? gather_hugetlb_stats+0x70/0x70 show_smaps_rollup+0xcd/0x1c0 seq_read+0x157/0x400 __vfs_read+0x3a/0x180 ? security_file_permission+0x93/0xc0 ? security_file_permission+0x93/0xc0 vfs_read+0x8f/0x140 ksys_read+0x55/0xc0 __x64_sys_read+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Decoded code matched to local compilation+disassembly points to smaps_pte_entry(): } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap && pte_none(*pte))) { page = find_get_entry(vma->vm_file->f_mapping, linear_page_index(vma, addr)); Here, vma->vm_file is NULL. mss->check_shmem_swap should be false in that case, however for smaps_rollup, smap_gather_stats() can set the flag true for one vma and leave it true for subsequent vma's where it should be false. To fix, reset the check_shmem_swap flag to false. There's also related bug which sets mss->swap to shmem_swapped, which in the context of smaps_rollup overwrites any value accumulated from previous vma's. Fix that as well. Note that the report suggests a regression between 4.17.19 and 4.19-rc7, which makes the 4.19 series ending with commit 258f669e7e88 ("mm: /proc/pid/smaps_rollup: convert to single value seq_file") suspicious. But the mss was reused for rollup since 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") so let's play it safe with the stable backport. Link: http://lkml.kernel.org/r/555fbd1f-4ac9-0b58-dcd4-5dc4380ff7ca@suse.cz Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377 Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Leonardo Soares Müller <leozinho29_eu@hotmail.com> Tested-by: Leonardo Soares Müller <leozinho29_eu@hotmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Colascione <dancol@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'driver-core-4.20-rc1' of ↵Linus Torvalds2018-10-261-1/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is a small number of driver core patches for 4.20-rc1. Not much happened here this merge window, only a very tiny number of patches that do: - add BUS_ATTR_WO() for use by drivers - component error path fixes - kernfs range check fix - other tiny error path fixes and const changes All of these have been in linux-next with no reported issues for a while" * tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: devres: provide devm_kstrdup_const() mm: move is_kernel_rodata() to asm-generic/sections.h devres: constify p in devm_kfree() driver core: add BUS_ATTR_WO() macro kernfs: Fix range checks in kernfs_get_target_path component: fix loop condition to call unbind() if bind() fails drivers/base/devtmpfs.c: don't pretend path is const in delete_path kernfs: update comment about kernfs_path() return value
| * kernfs: Fix range checks in kernfs_get_target_pathBernd Edlinger2018-09-161-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | The terminating NUL byte is only there because the buffer is allocated with kzalloc(PAGE_SIZE, GFP_KERNEL), but since the range-check is off-by-one, and PAGE_SIZE==PATH_MAX, the returned string may not be zero-terminated if it is exactly PATH_MAX characters long. Furthermore also the initial loop may theoretically exceed PATH_MAX and cause a fault. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'linus' of ↵Linus Torvalds2018-10-252-14/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Remove VLA usage - Add cryptostat user-space interface - Add notifier for new crypto algorithms Algorithms: - Add OFB mode - Remove speck Drivers: - Remove x86/sha*-mb as they are buggy - Remove pcbc(aes) from x86/aesni - Improve performance of arm/ghash-ce by up to 85% - Implement CTS-CBC in arm64/aes-blk, faster by up to 50% - Remove PMULL based arm64/crc32 driver - Use PMULL in arm64/crct10dif - Add aes-ctr support in s5p-sss - Add caam/qi2 driver Others: - Pick better transform if one becomes available in crc-t10dif" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (124 commits) crypto: chelsio - Update ntx queue received from cxgb4 crypto: ccree - avoid implicit enum conversion crypto: caam - add SPDX license identifier to all files crypto: caam/qi - simplify CGR allocation, freeing crypto: mxs-dcp - make symbols 'sha1_null_hash' and 'sha256_null_hash' static crypto: arm64/aes-blk - ensure XTS mask is always loaded crypto: testmgr - fix sizeof() on COMP_BUF_SIZE crypto: chtls - remove set but not used variable 'csk' crypto: axis - fix platform_no_drv_owner.cocci warnings crypto: x86/aes-ni - fix build error following fpu template removal crypto: arm64/aes - fix handling sub-block CTS-CBC inputs crypto: caam/qi2 - avoid double export crypto: mxs-dcp - Fix AES issues crypto: mxs-dcp - Fix SHA null hashes and output length crypto: mxs-dcp - Implement sha import/export crypto: aegis/generic - fix for big endian systems crypto: morus/generic - fix for big endian systems crypto: lrw - fix rebase error after out of bounds fix crypto: cavium/nitrox - use pci_alloc_irq_vectors() while enabling MSI-X. crypto: cavium/nitrox - NITROX command queue changes. ...
| * | crypto: speck - remove SpeckJason A. Donenfeld2018-09-042-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These are unused, undesired, and have never actually been used by anybody. The original authors of this code have changed their mind about its inclusion. While originally proposed for disk encryption on low-end devices, the idea was discarded [1] in favor of something else before that could really get going. Therefore, this patch removes Speck. [1] https://marc.info/?l=linux-crypto-vger&m=153359499015659 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Eric Biggers <ebiggers@google.com> Cc: stable@vger.kernel.org Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | | Merge branch 'next-integrity' of ↵Linus Torvalds2018-10-251-4/+4
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull integrity updates from James Morris: "From Mimi: This contains a couple of bug fixes, including one for a recent problem with calculating file hashes on overlayfs, and some code cleanup" * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: MAINTAINERS: add Jarkko as maintainer for trusted keys ima: open a new file instance if no read permissions ima: fix showing large 'violations' or 'runtime_measurements_count' security/integrity: remove unnecessary 'init_keyring' variable security/integrity: constify some read-only data vfs: require i_size <= SIZE_MAX in kernel_read_file()
| * | | vfs: require i_size <= SIZE_MAX in kernel_read_file()Eric Biggers2018-10-101-4/+4
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32-bit systems, the buffer allocated by kernel_read_file() is too small if the file size is > SIZE_MAX, due to truncation to size_t. Fortunately, since the 'count' argument to kernel_read() is also truncated to size_t, only the allocated space is filled; then, -EIO is returned since 'pos != i_size' after the read loop. But this is not obvious and seems incidental. We should be more explicit about this case. So, fail early if i_size > SIZE_MAX. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
* | | Merge branch 'work.lookup' of ↵Linus Torvalds2018-10-251-9/+3
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more ->lookup() cleanups from Al Viro: "Some ->lookup() instances are still overcomplicating the life for themselves, open-coding the stuff that would be handled by d_splice_alias() just fine. Simplify a couple of such cases caught this cycle and document d_splice_alias() intended use" * 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Document d_splice_alias() calling conventions for ->lookup() users. simplify btrfs_lookup() clean erofs_lookup()
| * | | simplify btrfs_lookup()Al Viro2018-10-101-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | d_splice_alias() is fine with ERR_PTR(-E...) for inode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | Merge branch 'work.compat' of ↵Linus Torvalds2018-10-251-68/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat_ioctl fixes from Al Viro: "A bunch of compat_ioctl fixes, mostly in bluetooth. Hopefully, most of fs/compat_ioctl.c will get killed off over the next few cycles; between this, tty series already merged and Arnd's work this cycle ought to take a good chunk out of the damn thing..." * 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: hidp: fix compat_ioctl hidp: constify hidp_connection_add() cmtp: fix compat_ioctl bnep: fix compat_ioctl compat_ioctl: trim the pointless includes
| * | | | hidp: fix compat_ioctlAl Viro2018-09-101-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) no point putting it into fs/compat_ioctl.c when you handle it in your ->compat_ioctl() anyway. 2) HIDPCONNADD is *not* COMPATIBLE_IOCTL() stuff at all - it does layout massage (pointer-chasing there) 3) use compat_ptr() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | cmtp: fix compat_ioctlAl Viro2018-09-101-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use compat_ptr(). And don't mess with fs/compat_ioctl.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | bnep: fix compat_ioctlAl Viro2018-09-101-11/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | use compat_ptr() properly and don't bother with fs/compat_ioctl.c - it's all handled in ->compat_ioctl() anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | compat_ioctl: trim the pointless includesAl Viro2018-09-101-39/+1
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | Merge branch 'timers-core-for-linus' of ↵Linus Torvalds2018-10-257-60/+60
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping updates from Thomas Gleixner: "The timers and timekeeping departement provides: - Another large y2038 update with further preparations for providing the y2038 safe timespecs closer to the syscalls. - An overhaul of the SHCMT clocksource driver - SPDX license identifier updates - Small cleanups and fixes all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits) tick/sched : Remove redundant cpu_online() check clocksource/drivers/dw_apb: Add reset control clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE clocksource/drivers: Unify the names to timer-* format clocksource/drivers/sh_cmt: Add R-Car gen3 support dt-bindings: timer: renesas: cmt: document R-Car gen3 support clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines clocksource/drivers/sh_cmt: Fixup for 64-bit machines clocksource/drivers/sh_tmu: Convert to SPDX identifiers clocksource/drivers/sh_mtu2: Convert to SPDX identifiers clocksource/drivers/sh_cmt: Convert to SPDX identifiers clocksource/drivers/renesas-ostm: Convert to SPDX identifiers clocksource: Convert to using %pOFn instead of device_node.name tick/broadcast: Remove redundant check RISC-V: Request newstat syscalls y2038: signal: Change rt_sigtimedwait to use __kernel_timespec y2038: socket: Change recvmmsg to use __kernel_timespec y2038: sched: Change sched_rr_get_interval to use __kernel_timespec y2038: utimes: Rework #ifdef guards for compat syscalls ...
| * | | | | y2038: utimes: Rework #ifdef guards for compat syscallsArnd Bergmann2018-08-291-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After changing over to 64-bit time_t syscalls, many architectures will want compat_sys_utimensat() but not respective handlers for utime(), utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that support CONFIG_COMPAT set it, but future 64-bit architectures will not (tile would not have needed it either, but got removed). As older 32-bit architectures get converted to using CONFIG_64BIT_TIME, they will have to use __ARCH_WANT_SYS_UTIME32 instead of __ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't need either of them as they never had a utime syscall. Since the compat_utimbuf structure is now required outside of CONFIG_COMPAT, I'm moving it into compat_time.h. Signed-off-by: Arnd Bergmann <arnd@arndb.de> --- changed from last version: - renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
| * | | | | y2038: Compile utimes()/futimesat() conditionallyArnd Bergmann2018-08-291-29/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are four generations of utimes() syscalls: utime(), utimes(), futimesat() and utimensat(), each one being a superset of the previous one. For y2038 support, we have to add another one, which is the same as the existing utimensat() but always passes 64-bit times_t based timespec values. There are currently 10 architectures that only use utimensat(), two that use utimes(), futimesat() and utimensat() but not utime(), and 11 architectures that have all four, and those define __ARCH_WANT_SYS_UTIME in order to get a sys_utime implementation. Since all the new architectures only want utimensat(), moving all the legacy entry points into a common __ARCH_WANT_SYS_UTIME guard simplifies the logic. Only alpha and ia64 grow a tiny bit as they now also get an unused sys_utime(), but it didn't seem worth the extra complexity of adding yet another ifdef for those. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * | | | | y2038: Change sys_utimensat() to use __kernel_timespecArnd Bergmann2018-08-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When 32-bit architectures get changed to support 64-bit time_t, utimensat() needs to use the new __kernel_timespec structure as its argument. The older utime(), utimes() and futimesat() system calls don't need a corresponding change as they are no longer used on C libraries that have 64-bit time support. As we do for the other syscalls that have timespec arguments, we reuse the 'compat' syscall entry points to implement the traditional four interfaces, and only leave the new utimensat() as a native handler, so that the same code gets used on both 32-bit and 64-bit kernels on each syscall. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * | | | | asm-generic: Remove unneeded __ARCH_WANT_SYS_LLSEEK macroArnd Bergmann2018-08-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sys_llseek sytem call is needed on all 32-bit architectures and none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard and simplify the include/asm-generic/unistd.h header further. Since 32-bit tasks can run either natively or in compat mode on 64-bit architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT. There are a few 64-bit architectures that also reference sys_llseek in their 64-bit ABI (e.g. sparc), but I verified that those all select CONFIG_COMPAT, so the #if check is still correct here. It's a bit odd to include it in the syscall table though, as it's the same as sys_lseek() on 64-bit, but with strange calling conventions. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * | | | | y2038: Remove newstat family from default syscall setArnd Bergmann2018-08-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have four generations of stat() syscalls: - the oldstat syscalls that are only used on the older architectures - the newstat family that is used on all 64-bit architectures but lacked support for large files on 32-bit architectures. - the stat64 family that is used mostly on 32-bit architectures to replace newstat - statx() to replace all of the above, adding 64-bit timestamps among other things. We already compile stat64 only on those architectures that need it, but newstat is always built, including on those that don't reference it. This adds a new __ARCH_WANT_NEW_STAT symbol along the lines of __ARCH_WANT_OLD_STAT and __ARCH_WANT_STAT64 to control compilation of newstat. All architectures that need it use an explict define, the others now get a little bit smaller, and future architecture (including 64-bit targets) won't ever see it. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * | | | | y2038: globally rename compat_time to old_time32Arnd Bergmann2018-08-275-27/+27
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Christoph Hellwig suggested a slightly different path for handling backwards compatibility with the 32-bit time_t based system calls: Rather than simply reusing the compat_sys_* entry points on 32-bit architectures unchanged, we get rid of those entry points and the compat_time types by renaming them to something that makes more sense on 32-bit architectures (which don't have a compat mode otherwise), and then share the entry points under the new name with the 64-bit architectures that use them for implementing the compatibility. The following types and interfaces are renamed here, and moved from linux/compat_time.h to linux/time32.h: old new --- --- compat_time_t old_time32_t struct compat_timeval struct old_timeval32 struct compat_timespec struct old_timespec32 struct compat_itimerspec struct old_itimerspec32 ns_to_compat_timeval() ns_to_old_timeval32() get_compat_itimerspec64() get_old_itimerspec32() put_compat_itimerspec64() put_old_itimerspec32() compat_get_timespec64() get_old_timespec32() compat_put_timespec64() put_old_timespec32() As we already have aliases in place, this patch addresses only the instances that are relevant to the system call interface in particular, not those that occur in device drivers and other modules. Those will get handled separately, while providing the 64-bit version of the respective interfaces. I'm not renaming the timex, rusage and itimerval structures, as we are still debating what the new interface will look like, and whether we will need a replacement at all. This also doesn't change the names of the syscall entry points, which can be done more easily when we actually switch over the 32-bit architectures to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix. Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* | | | | Merge tag 'ext4_for_linus' of ↵Linus Torvalds2018-10-2414-454/+1259
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: - further restructure ext4 documentation - fix up ext4's delayed allocation for bigalloc file systems - fix up some syzbot-detected races in EXT4_IOC_MOVE_EXT, EXT4_IOC_SWAP_BOOT, and ext4_remount - ... and a few other miscellaneous bugs and optimizations. * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) ext4: fix use-after-free race in ext4_remount()'s error path ext4: cache NULL when both default_acl and acl are NULL docs: promote the ext4 data structures book to top level docs: move ext4 administrative docs to admin-guide/ jbd2: fix use after free in jbd2_log_do_checkpoint() ext4: propagate error from dquot_initialize() in EXT4_IOC_FSSETXATTR ext4: fix setattr project check in fssetxattr ioctl docs: make ext4 readme tables readable docs: fix ext4 documentation table formatting problems docs: generate a separate ext4 pdf file from the documentation ext4: convert fault handler to use vm_fault_t type ext4: initialize retries variable in ext4_da_write_inline_data_begin() ext4: fix EXT4_IOC_SWAP_BOOT ext4: fix build error when DX_DEBUG is defined ext4: fix argument checking in EXT4_IOC_MOVE_EXT ext4: fix reserved cluster accounting at page invalidation time ext4: adjust reserved cluster count when removing extents ext4: reduce reserved cluster count by number of allocated clusters ext4: fix reserved cluster accounting at delayed write time ext4: add new pending reservation mechanism ...
| * | | | | ext4: fix use-after-free race in ext4_remount()'s error pathTheodore Ts'o2018-10-122-26/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible for ext4_show_quota_options() to try reading s_qf_names[i] while it is being modified by ext4_remount() --- most notably, in ext4_remount's error path when the original values of the quota file name gets restored. Reported-by: syzbot+a2872d6feea6918008a9@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org # 3.2+
| * | | | | ext4: cache NULL when both default_acl and acl are NULLChengguang Xu2018-10-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | default_acl and acl of newly created inode will be initiated as ACL_NOT_CACHED in vfs function inode_init_always() and later will be updated by calling xxx_init_acl() in specific filesystems. However, when default_acl and acl are NULL then they keep the value of ACL_NOT_CACHED. This patch changes the code to cache NULL for acl / default_acl in this case to save unnecessary ACL lookup attempt. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
| * | | | | jbd2: fix use after free in jbd2_log_do_checkpoint()Jan Kara2018-10-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code cleaning transaction's lists of checkpoint buffers has a bug where it increases bh refcount only after releasing journal->j_list_lock. Thus the following race is possible: CPU0 CPU1 jbd2_log_do_checkpoint() jbd2_journal_try_to_free_buffers() __journal_try_to_free_buffer(bh) ... while (transaction->t_checkpoint_io_list) ... if (buffer_locked(bh)) { <-- IO completes now, buffer gets unlocked --> spin_unlock(&journal->j_list_lock); spin_lock(&journal->j_list_lock); __jbd2_journal_remove_checkpoint(jh); spin_unlock(&journal->j_list_lock); try_to_free_buffers(page); get_bh(bh) <-- accesses freed bh Fix the problem by grabbing bh reference before unlocking journal->j_list_lock. Fixes: dc6e8d669cf5 ("jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()") Fixes: be1158cc615f ("jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()") Reported-by: syzbot+7f4a27091759e2fe7453@syzkaller.appspotmail.com CC: stable@vger.kernel.org Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: propagate error from dquot_initialize() in EXT4_IOC_FSSETXATTRWang Shilong2018-10-031-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We return most failure of dquota_initialize() except inode evict, this could make a bit sense, for example we allow file removal even quota files are broken? But it dosen't make sense to allow setting project if quota files etc are broken. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | | | | ext4: fix setattr project check in fssetxattr ioctlWang Shilong2018-10-031-23/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, project quota could be changed by fssetxattr ioctl, and existed permission check inode_owner_or_capable() is obviously not enough, just think that common users could change project id of file, that could make users to break project quota easily. This patch try to follow same regular of xfs project quota: "Project Quota ID state is only allowed to change from within the init namespace. Enforce that restriction only if we are trying to change the quota ID state. Everything else is allowed in user namespaces." Besides that, check and set project id'state should be an atomic operation, protect whole operation with inode lock, ext4_ioctl_setproject() is only used for ioctl EXT4_IOC_FSSETXATTR, we have held mnt_want_write_file() before ext4_ioctl_setflags(), and ext4_ioctl_setproject() is called after ext4_ioctl_setflags(), we could share codes, so remove it inside ext4_ioctl_setproject(). Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org
| * | | | | ext4: convert fault handler to use vm_fault_t typeSouptick Joarder2018-10-022-16/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return type of ext4_page_mkwrite and ext4_filemap_fault are changed to use vm_fault_t type. With this patch all the callers of block_page_mkwrite_return() are changed to handle vm_fault_t. So converting the return type of block_page_mkwrite_return() to vm_fault_t. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Matthew Wilcox <willy@infradead.org>
| * | | | | ext4: initialize retries variable in ext4_da_write_inline_data_begin()Lukas Czerner2018-10-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Variable retries is not initialized in ext4_da_write_inline_data_begin() which can lead to nondeterministic number of retries in case we hit ENOSPC. Initialize retries to zero as we do everywhere else. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: bc0ca9df3b2a ("ext4: retry allocation when inline->extent conversion failed") Cc: stable@kernel.org
| * | | | | ext4: fix EXT4_IOC_SWAP_BOOTTheodore Ts'o2018-10-021-6/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code EXT4_IOC_SWAP_BOOT ioctl hasn't been updated in a while, and it's a bit broken with respect to more modern ext4 kernels, especially metadata checksums. Other problems fixed with this commit: * Don't allow installing a DAX, swap file, or an encrypted file as a boot loader. * Respect the immutable and append-only flags. * Wait until any DIO operations are finished *before* calling truncate_inode_pages(). * Don't swap inode->i_flags, since these flags have nothing to do with the inode blocks --- and it will give the IMA/audit code heartburn when the inode is evicted. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Reported-by: syzbot+e81ccd4744c6c4f71354@syzkaller.appspotmail.com
| * | | | | ext4: fix build error when DX_DEBUG is definedGabriel Krisman Bertazi2018-10-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enabling DX_DEBUG triggers the build error below. info is an attribute of the dxroot structure. linux/fs/ext4/namei.c:2264:12: error: ‘info’ undeclared (first use in this function); did you mean ‘insl’? info->indirect_levels)); Fixes: e08ac99fa2a2 ("ext4: add largedir feature") Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
| * | | | | ext4: fix argument checking in EXT4_IOC_MOVE_EXTTheodore Ts'o2018-10-021-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the starting block number of either the source or destination file exceeds the EOF, EXT4_IOC_MOVE_EXT should return EINVAL. Also fixed the helper function mext_check_coverage() so that if the logical block is beyond EOF, make it return immediately, instead of looping until the block number wraps all the away around. This takes long enough that if there are multiple threads trying to do pound on an the same inode doing non-sensical things, it can end up triggering the kernel's soft lockup detector. Reported-by: syzbot+c61979f6f2cba5cb3c06@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | | | | ext4: fix reserved cluster accounting at page invalidation timeEric Whitney2018-10-013-19/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new code to count canceled pending cluster reservations on bigalloc file systems and to reduce the cluster reservation count on all file systems using delayed allocation. This replaces old code in ext4_da_page_release_reservations that was incorrect. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: adjust reserved cluster count when removing extentsEric Whitney2018-10-014-114/+198
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Modify ext4_ext_remove_space() and the code it calls to correct the reserved cluster count for pending reservations (delayed allocated clusters shared with allocated blocks) when a block range is removed from the extent tree. Pending reservations may be found for the clusters at the ends of written or unwritten extents when a block range is removed. If a physical cluster at the end of an extent is freed, it's necessary to increment the reserved cluster count to maintain correct accounting if the corresponding logical cluster is shared with at least one delayed and unwritten extent as found in the extents status tree. Add a new function, ext4_rereserve_cluster(), to reapply a reservation on a delayed allocated cluster sharing blocks with a freed allocated cluster. To avoid ENOSPC on reservation, a flag is applied to ext4_free_blocks() to briefly defer updating the freeclusters counter when an allocated cluster is freed. This prevents another thread from allocating the freed block before the reservation can be reapplied. Redefine the partial cluster object as a struct to carry more state information and to clarify the code using it. Adjust the conditional code structure in ext4_ext_remove_space to reduce the indentation level in the main body of the code to improve readability. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: reduce reserved cluster count by number of allocated clustersEric Whitney2018-10-013-160/+207
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ext4 does not always reduce the reserved cluster count by the number of clusters allocated when mapping a delayed extent. It sometimes adds back one or more clusters after allocation if delalloc blocks adjacent to the range allocated by ext4_ext_map_blocks() share the clusters newly allocated for that range. However, this overcounts the number of clusters needed to satisfy future mapping requests (holding one or more reservations for clusters that have already been allocated) and premature ENOSPC and quota failures, etc., result. Ext4 also does not reduce the reserved cluster count when allocating clusters for non-delayed allocated writes that have previously been reserved for delayed writes. This also results in overcounts. To make it possible to handle reserved cluster accounting for fallocated regions in the same manner as used for other non-delayed writes, do the reserved cluster accounting for them at the time of allocation. In the current code, this is only done later when a delayed extent sharing the fallocated region is finally mapped. Address comment correcting handling of unsigned long long constant from Jan Kara's review of RFC version of this patch. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: fix reserved cluster accounting at delayed write timeEric Whitney2018-10-015-18/+206
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code in ext4_da_map_blocks sometimes reserves space for more delayed allocated clusters than it should, resulting in premature ENOSPC, exceeded quota, and inaccurate free space reporting. Fix this by checking for written and unwritten blocks shared in the same cluster with the newly delayed allocated block. A cluster reservation should not be made for a cluster for which physical space has already been allocated. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: add new pending reservation mechanismEric Whitney2018-10-014-0/+249
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new pending reservation mechanism to help manage reserved cluster accounting. Its primary function is to avoid the need to read extents from the disk when invalidating pages as a result of a truncate, punch hole, or collapse range operation. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: generalize extents status tree search functionsEric Whitney2018-10-015-72/+163
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ext4 contains a few functions that are used to search for delayed extents or blocks in the extents status tree. Rather than duplicate code to add new functions to search for extents with different status values, such as written or a combination of delayed and unwritten, generalize the existing code to search for caller-specified extents status values. Also, move this code into extents_status.c where it is better associated with the data structures it operates upon, and where it can be more readily used to implement new extents status tree functions that might want a broader scope for i_es_lock. Three missing static specifiers in RFC version of patch reported and fixed by Fengguang Wu <fengguang.wu@intel.com>. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>