summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
...
* | mm/memory_hotplug.c: remove unused local zone_type from __remove_zone()John Hubbard2017-07-101-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | __remove_zone() sets up up zone_type, but never uses it for anything. This does not cause a warning, due to the (necessary) use of -Wno-unused-but-set-variable. However, it's noise, so just delete it. Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/swap_slots.c: don't disable preemption while taking the per-CPU cacheSebastian Andrzej Siewior2017-07-101-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | get_cpu_var() disables preemption and returns the per-CPU version of the variable. Disabling preemption is useful to ensure atomic access to the variable within the critical section. In this case however, after the per-CPU version of the variable is obtained the ->free_lock is acquired. For that reason it seems the raw accessor could be used. It only seems that ->slots_ret should be retested (because with disabled preemption this variable can not be set to NULL otherwise). This popped up during PREEMPT-RT testing because it tries to take spinlocks in a preempt disabled section. In RT, spinlocks can sleep. Link: http://lkml.kernel.org/r/20170623114755.2ebxdysacvgxzott@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ying Huang <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/page_alloc.c: eliminate unsigned confusion in __rmqueue_fallbackRasmus Villemoes2017-07-101-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since current_order starts as MAX_ORDER-1 and is then only decremented, the second half of the loop condition seems superfluous. However, if order is 0, we may decrement current_order past 0, making it UINT_MAX. This is obviously too subtle ([1], [2]). Since we need to add some comment anyway, change the two variables to signed, making the counting-down for loop look more familiar, and apparently also making gcc generate slightly smaller code. [1] https://lkml.org/lkml/2016/6/20/493 [2] https://lkml.org/lkml/2017/6/19/345 [akpm@linux-foundation.org: fix up reject fixupping] Link: http://lkml.kernel.org/r/20170621185529.2265-1-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reported-by: Hao Lee <haolee.swjtu@gmail.com> Acked-by: Wei Yang <weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: avoid taking zone lock in pagetypeinfo_showmixed()Vinayak Menon2017-07-102-11/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pagetypeinfo_showmixedcount_print is found to take a lot of time to complete and it does this holding the zone lock and disabling interrupts. In some cases it is found to take more than a second (On a 2.4GHz,8Gb RAM,arm64 cpu). Avoid taking the zone lock similar to what is done by read_page_owner, which means possibility of inaccurate results. Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migrationMichal Hocko2017-07-101-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | new_page is yet another duplication of the migration callback which has to handle hugetlb migration specially. We can safely use the generic new_page_nodemask for the same purpose. Please note that gigantic hugetlb pages do not need any special handling because alloc_huge_page_nodemask will make sure to check pages in all per node pools. The reason this was done previously was that alloc_huge_page_node treated NO_NUMA_NODE and a specific node differently and so alloc_huge_page_node(nid) would check on this specific node. Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | hugetlb: add support for preferred node to alloc_huge_page_nodemaskMichal Hocko2017-07-101-44/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | alloc_huge_page_nodemask tries to allocate from any numa node in the allowed node mask starting from lower numa nodes. This might lead to filling up those low NUMA nodes while others are not used. We can reduce this risk by introducing a concept of the preferred node similar to what we have in the regular page allocator. We will start allocating from the preferred nid and then iterate over all allowed nodes in the zonelist order until we try them all. This is mimicing the page allocator logic except it operates on per-node mempools. dequeue_huge_page_vma already does this so distill the zonelist logic into a more generic dequeue_huge_page_nodemask and use it in alloc_huge_page_nodemask. This will allow us to use proper per numa distance fallback also for alloc_huge_page_node which can use alloc_huge_page_nodemask now and we can get rid of alloc_huge_page_node helper which doesn't have any user anymore. Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, hugetlb: unclutter hugetlb allocation layersMichal Hocko2017-07-101-104/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm, hugetlb: allow proper node fallback dequeue". While working on a hugetlb migration issue addressed in a separate patchset[1] I have noticed that the hugetlb allocations from the preallocated pool are quite subotimal. [1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org There is no fallback mechanism implemented and no notion of preferred node. I have tried to work around it but Vlastimil was right to push back for a more robust solution. It seems that such a solution is to reuse zonelist approach we use for the page alloctor. This series has 3 patches. The first one tries to make hugetlb allocation layers more clear. The second one implements the zonelist hugetlb pool allocation and introduces a preferred node semantic which is used by the migration callbacks. The last patch is a clean up. This patch (of 3): Hugetlb allocation path for fresh huge pages is unnecessarily complex and it mixes different interfaces between layers. __alloc_buddy_huge_page is the central place to perform a new allocation. It checks for the hugetlb overcommit and then relies on __hugetlb_alloc_buddy_huge_page to invoke the page allocator. This is all good except that __alloc_buddy_huge_page pushes vma and address down the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with two different allocation modes - one for memory policy and other node specific (or to make it more obscure node non-specific) requests. This just screams for a reorganization. This patch pulls out all the vma specific handling up to __alloc_buddy_huge_page_with_mpol where it belongs. __alloc_buddy_huge_page will get nodemask argument and __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the page allocator. In short: __alloc_buddy_huge_page_with_mpol - memory policy handling __alloc_buddy_huge_page - overcommit handling and accounting __hugetlb_alloc_buddy_huge_page - page allocator layer Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop is not really needed because the page allocator already handles the cpusets update. Finally __hugetlb_alloc_buddy_huge_page had a special case for node specific allocations (when no policy is applied and there is a node given). This has relied on __GFP_THISNODE to not fallback to a different node. alloc_huge_page_node is the only caller which relies on this behavior so move the __GFP_THISNODE there. Not only does this remove quite some code it also should make those layers easier to follow and clear wrt responsibilities. Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/oom_kill.c: add tracepoints for oom reaper-related eventsRoman Gushchin2017-07-101-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo "oom:mark_victim" > set_event $ echo "oom:wake_reaper" >> set_event $ echo "oom:skip_task_reaping" >> set_event $ echo "oom:start_task_reaping" >> set_event $ echo "oom:finish_task_reaping" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | userfaultfd: non-cooperative: add madvise() event for MADV_FREE requestMike Rapoport2017-07-101-20/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd monitor. The monitor has to stop handling #PF events in the range being freed. We are reusing userfaultfd_remove callback along with the logic required to re-get and re-validate the VMA which may change or disappear because userfaultfd_remove releases mmap_sem. Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/truncate.c: fix THP handling in invalidate_mapping_pages()Jan Kara2017-07-101-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The condition checking for THP straddling end of invalidated range is wrong - it checks 'index' against 'end' but 'index' has been already advanced to point to the end of THP and thus the condition can never be true. As a result THP straddling 'end' has been fully invalidated. Given the nature of invalidate_mapping_pages(), this could be only performance issue. In fact, we are lucky the condition is wrong because if it was ever true, we'd leave locked page behind. Fix the condition checking for THP straddling 'end' and also properly unlock the page. Also update the comment before the condition to explain why we decide not to invalidate the page as it was not clear to me and I had to ask Kirill. Link: http://lkml.kernel.org/r/20170619124723.21656-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/hugetlb.c: replace memfmt with string_get_sizeMatthew Wilcox2017-07-101-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hugetlb code has its own function to report human-readable sizes. Convert it to use the shared string_get_size() function. This will lead to a minor difference in user visible output (MiB/GiB instead of MB/GB), but some would argue that's desirable anyway. Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, memcg: fix potential undefined behavior in mem_cgroup_event_ratelimit()Michal Hocko2017-07-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alice has reported the following UBSAN splat: UBSAN: Undefined behaviour in mm/memcontrol.c:661:17 signed integer overflow: -2147483644 - 2147483525 cannot be represented in type 'long int' CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P O 4.9.25-gentoo #4 Hardware name: XXXXXX, BIOS YYYYYY Call Trace: dump_stack+0x59/0x87 ubsan_epilogue+0xe/0x40 handle_overflow+0xbb/0xf0 __ubsan_handle_sub_overflow+0x12/0x20 memcg_check_events.isra.36+0x223/0x360 mem_cgroup_commit_charge+0x55/0x140 wp_page_copy+0x34e/0xb80 do_wp_page+0x1e6/0x1300 handle_mm_fault+0x88b/0x1990 __do_page_fault+0x2de/0x8a0 do_page_fault+0x1a/0x20 error_code+0x67/0x6c The reason is that we subtract two signed types. Let's fix this by truly mimicing time_after and cast the result of the subtraction. Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Alice Ferrazzi <alicef@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, hugetlb: schedule when potentially allocating many hugepagesDavid Rientjes2017-07-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A few hugetlb allocators loop while calling the page allocator and can potentially prevent rescheduling if the page allocator slowpath is not utilized. Conditionally schedule when large numbers of hugepages can be allocated. Anshuman: "Fixes a task which was getting hung while writing like 10000 hugepages (16MB on POWER8) into /proc/sys/vm/nr_hugepages." Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: unify new_node_page and alloc_migrate_targetMichal Hocko2017-07-102-26/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline") has duplicated a large part of alloc_migrate_target with some hotplug specific special casing. To be more precise it tried to enfore the allocation from a different node than the original page. As a result the two function diverged in their shared logic, e.g. the hugetlb allocation strategy. Let's unify the two and express different NUMA requirements by the given nodemask. new_node_page will simply exclude the node it doesn't care about and alloc_migrate_target will use all the available nodes. alloc_migrate_target will then learn to migrate hugetlb pages more sanely and use preallocated pool when possible. Please note that alloc_migrate_target used to call alloc_page resp. alloc_pages_current so the memory policy of the current context which is quite strange when we consider that it is used in the context of alloc_contig_range which just tries to migrate pages which stand in the way. Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | hugetlb, memory_hotplug: prefer to use reserved pages for migrationMichal Hocko2017-07-102-7/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | new_node_page will try to use the origin's next NUMA node as the migration destination for hugetlb pages. If such a node doesn't have any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol to allocate a surplus page instead. This is quite subotpimal for any configuration when hugetlb pages are no distributed to all NUMA nodes evenly. Say we have a hotplugable node 4 and spare hugetlb pages are node 0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000 /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0 Now we consume the whole pool on node 4 and try to offline this node. All the allocated pages should be moved to node0 which has enough preallocated pages to hold them. With the current implementation offlining very likely fails because hugetlb allocations during runtime are much less reliable. Fix this by reusing the nodemask which excludes migration source and try to find a first node which has a page in the preallocated pool first and fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is consumed. [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub] Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, memory_hotplug: simplify empty node mask handling in new_node_pageMichal Hocko2017-07-101-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | new_node_page tries to allocate the target page on a different NUMA node than the source page. This makes sense in most cases during the hotplug because we are likely to offline the whole numa node. But there are cases where there are no other nodes to fallback (e.g. when offlining parts of the only existing node) and we have to fallback to allocating from the source node. The current code does that but it can be simplified by checking the nmask and updating it before we even try to allocate rather than special casing it. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, memory_hotplug: support movable_node for hotpluggable nodesMichal Hocko2017-07-101-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | movable_node kernel parameter allows making hotpluggable NUMA nodes to put all the hotplugable memory into movable zone which allows more or less reliable memory hotremove. At least this is the case for the NUMA nodes present during the boot (see find_zone_movable_pfns_for_nodes). This is not the case for the memory hotplug, though. echo online > /sys/devices/system/memory/memoryXYZ/state will default to a kernel zone (usually ZONE_NORMAL) unless the particular memblock is already in the movable zone range which is not the case normally when onlining the memory from the udev rule context for a freshly hotadded NUMA node. The only option currently is to have a special udev rule to echo online_movable to all memblocks belonging to such a node which is rather clumsy. Not to mention this is inconsistent as well because what ended up in the movable zone during the boot will end up in a kernel zone after hotremove & hotadd without special care. It would be nice to reuse memblock_is_hotpluggable but the runtime hotplug doesn't have that information available because the boot and hotplug paths are not shared and it would be really non trivial to make them use the same code path because the runtime hotplug doesn't play with the memblock allocator at all. Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if movable_node is enabled and the range doesn't overlap with the existing normal zone. This should provide a reasonable default onlining strategy. Strictly speaking the semantic is not identical with the boot time initialization because find_zone_movable_pfns_for_nodes covers only the hotplugable range as described by the BIOS/FW. From my experience this is usually a full node though (except for Node0 which is special and never goes away completely). If this turns out to be a problem in the real life we can tweak the code to store hotplug flag into memblocks but let's keep this simple now. Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: <qiuxishi@huawei.com> Cc: Kani Toshimitsu <toshi.kani@hpe.com> Cc: <slaoub@gmail.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/migrate.c: stabilise page count when migrating transparent hugepagesWill Deacon2017-07-101-13/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When migrating a transparent hugepage, migrate_misplaced_transhuge_page guards itself against a concurrent fastgup of the page by checking that the page count is equal to 2 before and after installing the new pmd. If the page count changes, then the pmd is reverted back to the original entry, however there is a small window where the new (possibly writable) pmd is installed and the underlying page could be written by userspace. Restoring the old pmd could therefore result in loss of data. This patch fixes the problem by freezing the page count whilst updating the page tables, which protects against a concurrent fastgup without the need to restore the old pmd in the failure case (since the page count can no longer change under our feet). Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/hugetlb.c: warn the user when issues arise on boot due to hugepagesLiam R. Howlett2017-07-101-12/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the user specifies too many hugepages or an invalid default_hugepagesz the communication to the user is implicit in the allocation message. This patch adds a warning when the desired page count is not allocated and prints an error when the default_hugepagesz is invalid on boot. During boot hugepages will allocate until there is a fraction of the hugepage size left. That is, we allocate until either the request is satisfied or memory for the pages is exhausted. When memory for the pages is exhausted, it will most likely lead to the system failing with the OOM manager not finding enough (or anything) to kill (unless you're using really big hugepages in the order of 100s of MB or in the GBs). The user will most likely see the OOM messages much later in the boot sequence than the implicitly stated message. Worse yet, you may even get an OOM for each processor which causes many pages of OOMs on modern systems. Although these messages will be printed earlier than the OOM messages, at least giving the user errors and warnings will highlight the configuration as an issue. I'm trying to point the user in the right direction by providing a more robust statement of what is failing. During the sysctl or echo command, the user can check the results much easier than if the system hangs during boot and the scenario of having nothing to OOM for kernel memory is highly unlikely. Mike said: "Before sending out this patch, I asked Liam off list why he was doing it. Was it something he just thought would be useful? Or, was there some type of user situation/need. He said that he had been called in to assist on several occasions when a system OOMed during boot. In almost all of these situations, the user had grossly misconfigured huge pages. DB users want to pre-allocate just the right amount of huge pages, but sometimes they can be really off. In such situations, the huge page init code just allocates as many huge pages as it can and reports the number allocated. There is no indication that it quit allocating because it ran out of memory. Of course, a user could compare the number in the message to what they requested on the command line to determine if they got all the huge pages they requested. The thought was that it would be useful to at least flag this situation. That way, the user might be able to better relate the huge page allocation failure to the OOM. I'm not sure if the e-mail discussion made it obvious that this is something he has seen on several occasions. I see Michal's point that this will only flag the situation where someone configures huge pages very badly. And, a more extensive look at the situation of misconfiguring huge pages might be in order. But, this has happened on several occasions which led to the creation of this patch" [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration] Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/cma.c: warn if the CMA area could not be activatedAnshuman Khandual2017-07-101-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While activating a CMA area we check to make sure that all the PFNs in the range are inside the same zone. This is a requirement for alloc_contig_range() to work. Any CMA area failing the check is disabled for good. This happens silently right now making all future cma_alloc() allocations failure inevitable. Here we add an error message stating that the CMA area could not be activated which makes it easier to explain any future cma_alloc() failures on it. While in there, change the bail out goto label from 'err' to 'not_in_zone' which makes more sense. Link: http://lkml.kernel.org/r/20170605023729.26303-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | vmalloc: show lazy-purged vma info in vmallocinfoYisheng Xie2017-07-101-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ioremap a 67112960 bytes vm_area with the vmallocinfo: [..] 0xec79b000-0xec7fa000 389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap we get the result: 0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap For the align for ioremap must be less than '1 << IOREMAP_MAX_ORDER': if (flags & VM_IOREMAP) align = 1ul << clamp_t(int, get_count_order_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER); So it makes idiot like me a litte puzzled why this was a jump the vm_area from 0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, and leaving 0xed000000-0xf1000000 as a big hole. This patch is to show all of vm_area, including vmas which are freeing but still in the vmap_area_list, to make it more clear about why we will get 0xf1000000-0xf5001000 in the above case. And we will get a vmallocinfo like: [..] 0xec79b000-0xec7fa000 389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap [..] 0xece7c000-0xece7e000 8192 unpurged vm_area 0xece7e000-0xece83000 20480 vm_map_ram 0xf0099000-0xf00aa000 69632 vm_map_ram after this patch. Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: zijun_hu <zijun_hu@htc.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/memcontrol: exclude @root from checks in mem_cgroup_lowSean Christopherson2017-07-101-18/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make @root exclusive in mem_cgroup_low; it is never considered low when looked at directly and is not checked when traversing the tree. In effect, @root is handled identically to how root_mem_cgroup was previously handled by mem_cgroup_low. If @root is not excluded from the checks, a cgroup underneath @root will never be considered low during targeted reclaim of @root, e.g. due to memory.current > memory.high, unless @root is misconfigured to have memory.low > memory.high. Excluding @root enables using memory.low to prioritize memory usage between cgroups within a subtree of the hierarchy that is limited by memory.high or memory.max, e.g. when ROOT owns @root's controls but delegates the @root directory to a USER so that USER can create and administer children of @root. For example, given cgroup A with children B and C: A / \ B C and 1. A/memory.current > A/memory.high 2. A/B/memory.current < A/B/memory.low 3. A/C/memory.current >= A/C/memory.low As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we should reclaim from 'C' until 'A' is no longer high or until we can no longer reclaim from 'C'. If 'A', i.e. @root, isn't excluded by mem_cgroup_low when reclaming from 'A', then 'B' won't be considered low and we will reclaim indiscriminately from both 'B' and 'C'. Here is the test I used to confirm the bug and the patch. 20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test #!/bin/bash x62mb=$((62<<20)) x66mb=$((66<<20)) x94mb=$((94<<20)) x98mb=$((98<<20)) setup() { set -e if [[ -n $DEBUG ]]; then set -x fi trap teardown EXIT HUP INT TERM if [[ ! -e /mnt/1gb.swap ]]; then sudo fallocate -l 1G /mnt/1gb.swap > /dev/null sudo mkswap /mnt/1gb.swap > /dev/null fi if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then sudo swapon /mnt/1gb.swap fi if [[ ! -e /cgroup/cgroup.controllers ]]; then sudo mount -t cgroup2 none /cgroup fi grep -q memory /cgroup/cgroup.controllers sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control" sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control" sudo sh -c "echo '96m' > /cgroup/A/memory.high" mkdir /cgroup/A/0 mkdir /cgroup/A/1 echo 64m > /cgroup/A/0/memory.low } teardown() { set +e trap - EXIT HUP INT TERM if [[ -z $1 ]]; then printf "\n" printf "%0.s*" {1..35} printf "\nFAILED!\n\n" tail /cgroup/A/**/memory.current printf "%0.s*" {1..35} printf "\n\n" fi ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill % sleep 2 if [[ -e /cgroup/A/0 ]]; then rmdir /cgroup/A/0 fi if [[ -e /cgroup/A/1 ]]; then rmdir /cgroup/A/1 fi if [[ -e /cgroup/A ]]; then sudo rmdir /cgroup/A fi } stress_test() { sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs" stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null & sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs" stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null & sudo sh -c "echo $$ > /cgroup/cgroup.procs" sleep 1 # A/0 should be consuming more memory than A/1 [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]] # A/0 should be consuming ~64mb [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]] # A should cumulatively be consuming ~96mb [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]] # Stop the stressors ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill % } teardown 1 setup for ((i=1;i<=$1;i++)); do printf "ITERATION $i of $1 - stress_test 0 1" stress_test 0 1 printf "\x1b[2K\r" printf "ITERATION $i of $1 - stress_test 1 0" stress_test 1 0 printf "\x1b[2K\r" printf "ITERATION $i of $1 - PASSED\n" done teardown 1 echo PASSED! 20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10 Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: make PR_SET_THP_DISABLE immediately activeMichal Hocko2017-07-102-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any existing mapping because it only updated mm->def_flags which is a template for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE flag set. This can be quite surprising for all those applications which do not do prctl(); fork() & exec() and want to control their own THP behavior. Another usecase when the immediate semantic of the prctl might be useful is a combination of pre- and post-copy migration of containers with CRIU. In this case CRIU populates a part of a memory region with data that was saved during the pre-copy stage. Afterwards, the region is registered with userfaultfd and CRIU expects to get page faults for the parts of the region that were not yet populated. However, khugepaged collapses the pages and the expected page faults do not occur. In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a temporary mechanism for enabling/disabling THP process wide. Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is tested when decision whether to use huge pages is taken either during page fault of at the time of THP collapse. It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously. Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE") Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, vmpressure: pass-through notification supportDavid Rientjes2017-07-101-29/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default, vmpressure events are not pass-through, i.e. they propagate up through the memcg hierarchy until an event notifier is found for any threshold level. This presents a difficulty when a thread waiting on a read(2) for a vmpressure event cannot distinguish between local memory pressure and memory pressure in a descendant memcg, especially when that thread may not control the memcg hierarchy. Consider a user-controlled child memcg with a smaller limit than a top-level memcg controlled by the "Activity Manager" specified in Documentation/cgroup-v1/memory.txt. It may register for memory pressure notification for descendant memcgs to make a policy decision: oom kill a low priority job, increase the limit, decrease other limits, etc. If it registers for memory pressure notification on the top-level memcg, it currently cannot distinguish between memory pressure in its own memcg or a descendant memcg, which is user-controlled. Conversely, if a user registers for memory pressure notification on their own descendant memcg, the Activity Manager does not receive any pressure notification for that child memcg hierarchy. Vmpressure events are not received for ancestor memcgs if the memcg experiencing pressure have notifiers registered, perhaps outside the knowledge of the thread waiting on read(2) at the top level. Both of these are consequences of vmpressure notification not being pass-through. This implements a pass-through behavior for vmpressure events. When writing to control.event_control, vmpressure event handlers may optionally specify a mode. There are two new modes: - "hierarchy": always propagate memory pressure events up the hierarchy regardless if descendant memcgs have their own notifiers registered, and - "local": only receive notifications when the memcg for which the event is registered experiences memory pressure. Of course, processes may register for one notification of "low,local", for example, and another for "low". If no mode is specified, the current behavior is maintained for backwards compatibility. See the change to Documentation/cgroup-v1/memory.txt for full specification. [dan.carpenter@oracle.com: free the same pointer we allocated] Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Anton Vorontsov <anton@enomsg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hwpoison: introduce idenfity_page_stateNaoya Horiguchi2017-07-101-32/+25
| | | | | | | | | | | | | | | | | | Factoring duplicate code into a function. Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hugetlb: delete dequeue_hwpoisoned_huge_page()Naoya Horiguchi2017-07-102-45/+0
| | | | | | | | | | | | | | | | | | dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it. Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hwpoison: dissolve in-use hugepage in unrecoverable memory errorNaoya Horiguchi2017-07-101-40/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to keep the error hugepage away from the system, which is OK but not good enough because the hugepage still has a refcount and unpoison doesn't work on the error hugepage (PageHWPoison flags are cleared but pages are still leaked.) And there's "wasting health subpages" issue too. This patch reworks on me_huge_page() to solve these issues. For hugetlb file, recently we have truncating code so let's use it in hugetlbfs specific ->error_remove_page(). For anonymous hugepage, it's helpful to dissolve the error page after freeing it into free hugepage list. Migration entry and PageHWPoison in the head page prevent the access to it. TODO: dissolve_free_huge_page() can fail but we don't considered it yet. It's not critical (and at least no worse that now) because in such case the error hugepage just stays in free hugepage list without being dissolved. By virtue of PageHWPoison in head page, it's never allocated to processes. [akpm@linux-foundation.org: fix unused var warnings] Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace") Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hwpoison: introduce memory_failure_hugetlb()Naoya Horiguchi2017-07-101-52/+82
| | | | | | | | | | | | | | | | | | | | | | | | memory_failure() is a big function and hard to maintain. Handling hugetlb- and non-hugetlb- case in a single function is not good, so this patch separates PageHuge() branch into a new function, which saves many PageHuge() check. Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: soft-offline: dissolve free hugepage if soft-offlinedNaoya Horiguchi2017-07-101-1/+1
| | | | | | | | | | | | | | | | | | | | Now we have code to rescue most of healthy pages from a hwpoisoned hugepage. So let's apply it to soft_offline_free_page too. Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hugetlb: soft-offline: dissolve source hugepage after successful migrationAnshuman Khandual2017-07-103-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently hugepage migrated by soft-offline (i.e. due to correctable memory errors) is contained as a hugepage, which means many non-error pages in it are unreusable, i.e. wasted. This patch solves this issue by dissolving source hugepages into buddy. As done in previous patch, PageHWPoison is set only on a head page of the error hugepage. Then in dissoliving we move the PageHWPoison flag to the raw error page so that all healthy subpages return back to buddy. [arnd@arndb.de: fix warnings: replace some macros with inline functions] Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hwpoison: change PageHWPoison behavior on hugetlb pagesNaoya Horiguchi2017-07-101-63/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We'd like to narrow down the error region in memory error on hugetlb pages. However, currently we set PageHWPoison flags on all subpages in the error hugepage and add # of subpages to num_hwpoison_pages, which doesn't fit our purpose. So this patch changes the behavior and we only set PageHWPoison on the head page then increase num_hwpoison_pages only by 1. This is a preparation for narrow-down part which comes in later patches. Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hugetlb: return immediately for hugetlb page in __delete_from_page_cache()Naoya Horiguchi2017-07-101-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb page now, but it's not enough because later code doesn't handle hugetlb properly. Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the end of this function fires for hugetlb, which makes no sense. So we should return immediately for hugetlb pages. Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hugetlb: prevent reuse of hwpoisoned free hugepagesNaoya Horiguchi2017-07-102-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: hwpoison: fixlet for hugetlb migration". This patchset updates the hwpoison/hugetlb code to address 2 reported issues. One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) First half was already fixed in mainline, and another half about hugetlb cases are solved in this series. Another issue is "narrow-down error affected region into a single 4kB page instead of a whole hugetlb page" issue, which was tried by Anshuman (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com) and I updated it to apply it more widely. This patch (of 9): We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages as we did before. So current dequeue_huge_page_node() doesn't work as intended because it still uses is_migrate_isolate_page() for this check. This patch fixes it with PageHWPoison flag. Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/zsmalloc.c: fix -Wunneeded-internal-declaration warningNick Desaulniers2017-07-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | is_first_page() is only called from the macro VM_BUG_ON_PAGE() which is only compiled in as a runtime check when CONFIG_DEBUG_VM is set, otherwise is checked at compile time and not actually compiled in. Fixes the following warning, found with Clang: mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and will not be emitted [-Wunneeded-internal-declaration] static int is_first_page(struct page *page) ^ Link: http://lkml.kernel.org/r/20170524053859.29059-1-nick.desaulniers@gmail.com Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereferenceGustavo A. R. Silva2017-07-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The NULL check at line 1226: if (!pgdat), implies that pointer pgdat might be NULL. rollback_node_hotadd() dereferences this pointer. Add NULL check to avoid a potential NULL pointer dereference. Addresses-Coverity-ID: 1369133 Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, vmscan: avoid thrashing anon lru when free + file is lowDavid Rientjes2017-07-101-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan: do not swap anon pages just because free+file is low'") reintroduces is to prefer swapping anonymous memory rather than trashing the file lru. If the anonymous inactive lru for the set of eligible zones is considered low, however, or the length of the list for the given reclaim priority does not allow for effective anonymous-only reclaiming, then avoid forcing SCAN_ANON. Forcing SCAN_ANON will end up thrashing the small list and leave unreclaimed memory on the file lrus. If the inactive list is insufficient, fallback to balanced reclaim so the file lru doesn't remain untouched. [akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/memory.c: convert to DEFINE_DEBUGFS_ATTRIBUTEYevgen Pronenko2017-07-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | The preferred strategy to define debugfs attributes is to use the DEFINE_DEBUGFS_ATTRIBUTE() macro and to use debugfs_create_file_unsafe(). Link: http://lkml.kernel.org/r/20170528145948.32127-1-y.pronenko@gmail.com Signed-off-by: Yevgen Pronenko <y.pronenko@gmail.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, page_alloc: fallback to smallest page when not stealing whole pageblockVlastimil Babka2017-07-101-9/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 3bc48f96cf11 ("mm, page_alloc: split smallest stolen page in fallback") we pick the smallest (but sufficient) page of all that have been stolen from a pageblock of different migratetype. However, there are cases when we decide not to steal the whole pageblock. Practically in the current implementation it means that we are trying to fallback for a MIGRATE_MOVABLE allocation of order X, go through the freelists from MAX_ORDER-1 down to X, and find free page of order Y. If Y is less than pageblock_order / 2, we decide not to steal all pages from the pageblock. When Y > X, it means we are potentially splitting a larger page than we need, as there might be other pages of order Z, where X <= Z < Y. Since Y is already too small to steal whole pageblock, picking smallest available Z will result in the same decision and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE pageblock. This patch therefore changes the fallback algorithm so that in the situation described above, we switch the fallback search strategy to go from order X upwards to find the smallest suitable fallback. In theory there shouldn't be a downside of this change wrt fragmentation. This has been tested with mmtests' stress-highalloc performing GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint statistics: 4.12.0-rc2 4.12.0-rc2 1-kernel4 2-kernel4 Page alloc extfrag event 25640976 69680977 Extfrag fragmenting 25621086 69661364 Extfrag fragmenting for unmovable 74409 73204 Extfrag fragmenting unmovable placed with movable 69003 67684 Extfrag fragmenting unmovable placed with reclaim. 5406 5520 Extfrag fragmenting for reclaimable 6398 8467 Extfrag fragmenting reclaimable placed with movable 869 884 Extfrag fragmenting reclaimable placed with unmov. 5529 7583 Extfrag fragmenting for movable 25540279 69579693 Since we force movable allocations to steal the smallest available page (which we then practially always split), we steal less per fallback, so the number of fallbacks increases and steals potentially happen from different pageblocks. This is however not an issue for movable pages that can be compacted. Importantly, the "unmovable placed with movable" statistics is lower, which is the result of less fragmentation in the unmovable pageblocks. The effect on reclaimable allocation is a bit unclear. Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | swap: add block io poll in swapin pathShaohua Li2017-07-104-9/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For fast flash disk, async IO could introduce overhead because of context switch. block-mq now supports IO poll, which improves performance and latency a lot. swapin is a good place to use this technique, because the task is waiting for the swapin page to continue execution. In my virtual machine, directly read 4k data from a NVMe with iopoll is about 60% better than that without poll. With iopoll support in swapin patch, my microbenchmark (a task does random memory write) is about 10%~25% faster. CPU utilization increases a lot though, 2x and even 3x CPU utilization. This will depend on disk speed. While iopoll in swapin isn't intended for all usage cases, it's a win for latency sensistive workloads with high speed swap disk. block layer has knob to control poll in runtime. If poll isn't enabled in block layer, there should be no noticeable change in swapin. I got a chance to run the same test in a NVMe with DRAM as the media. In simple fio IO test, blkpoll boosts 50% performance in single thread test and ~20% in 8 threads test. So this is the base line. In above swap test, blkpoll boosts ~27% performance in single thread test. blkpoll uses 2x CPU time though. If we enable hybid polling, the performance gain has very slight drop but CPU time is only 50% worse than that without blkpoll. Also we can adjust parameter of hybid poll, with it, the CPU time penality is reduced further. In 8 threads test, blkpoll doesn't help though. The performance is similar to that without blkpoll, but cpu utilization is similar too. There is lock contention in swap path. The cpu time spending on blkpoll isn't high. So overall, blkpoll swapin isn't worse than that without it. The swapin readahead might read several pages in in the same time and form a big IO request. Since the IO will take longer time, it doesn't make sense to do poll, so the patch only does iopoll for single page swapin. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c Signed-off-by: Shaohua Li <shli@fb.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds2017-07-081-0/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ARM updates from Russell King: - add support for ftrace-with-registers, which is needed for kgraft and other ftrace tools - support for mremap() for the sigpage/vDSO so that checkpoint/restore can work - add timestamps to each line of the register dump output - remove the unused KTHREAD_SIZE from nommu - align the ARM bitops APIs with the generic API (using unsigned long pointers rather than void pointers) - make the configuration of userspace Thumb support an expert option so that we can default it on, and avoid some hard to debug userspace crashes * 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 8684/1: NOMMU: Remove unused KTHREAD_SIZE definition ARM: 8683/1: ARM32: Support mremap() for sigpage/vDSO ARM: 8679/1: bitops: Align prototypes to generic API ARM: 8678/1: ftrace: Adds support for CONFIG_DYNAMIC_FTRACE_WITH_REGS ARM: make configuration of userspace Thumb support an expert option ARM: 8673/1: Fix __show_regs output timestamps
| * | ARM: 8683/1: ARM32: Support mremap() for sigpage/vDSODmitry Safonov2017-06-211-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CRIU restores application mappings on the same place where they were before Checkpoint. That means, that we need to move vDSO and sigpage during restore on exactly the same place where they were before C/R. Make mremap() code update mm->context.{sigpage,vdso} pointers during VMA move. Sigpage is used for landing after handling a signal - if the pointer is not updated during moving, the application might crash on any signal after mremap(). vDSO pointer on ARM32 is used only for setting auxv at this moment, update it during mremap() in case of future usage. Without those updates, current work of CRIU on ARM32 is not reliable. Historically, we error Checkpointing if we find vDSO page on ARM32 and suggest user to disable CONFIG_VDSO. But that's not correct - it goes from x86 where signal processing is ended in vDSO blob. For arm32 it's sigpage, which is not disabled with `CONFIG_VDSO=n'. Looks like C/R was working by luck - because userspace on ARM32 at this moment always sets SA_RESTORER. Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: linux-arm-kernel@lists.infradead.org Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Christopher Covington <cov@codeaurora.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
* | | Merge tag 'for-linus-v4.13-2' of ↵Linus Torvalds2017-07-072-18/+110
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling updates from Jeff Layton: "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series. The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store. For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday). This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible. But wait...it gets worse! The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0. This pile aims to do three things: 1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing 2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it. 3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation. To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland. Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here: 1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can in via individual filesystem trees. 2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too. If you're interested in the gory details, LWN has some good articles about this: https://lwn.net/Articles/718734/ https://lwn.net/Articles/724307/" * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: btrfs: minimal conversion to errseq_t writeback error reporting on fsync xfs: minimal conversion to errseq_t writeback error reporting ext4: use errseq_t based error handling for reporting data writeback errors fs: convert __generic_file_fsync to use errseq_t based reporting block: convert to errseq_t based writeback error tracking dax: set errors in mapping when writeback fails Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: new infrastructure for writeback error handling and reporting lib: add errseq_t type and infrastructure for handling it mm: don't TestClearPageError in __filemap_fdatawait_range mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails jbd2: don't clear and reset errors after waiting on writeback buffer: set errors in mapping at the time that the error occurs fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: use mapping_set_error instead of setting the flag mm: fix mapping_set_error call in me_pagecache_dirty
| * | | fs: new infrastructure for writeback error handling and reportingJeff Layton2017-07-061-0/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most filesystems currently use mapping_set_error and filemap_check_errors for setting and reporting/clearing writeback errors at the mapping level. filemap_check_errors is indirectly called from most of the filemap_fdatawait_* functions and from filemap_write_and_wait*. These functions are called from all sorts of contexts to wait on writeback to finish -- e.g. mostly in fsync, but also in truncate calls, getattr, etc. The non-fsync callers are problematic. We should be reporting writeback errors during fsync, but many places spread over the tree clear out errors before they can be properly reported, or report errors at nonsensical times. If I get -EIO on a stat() call, there is no reason for me to assume that it is because some previous writeback failed. The fact that it also clears out the error such that a subsequent fsync returns 0 is a bug, and a nasty one since that's potentially silent data corruption. This patch adds a small bit of new infrastructure for setting and reporting errors during address_space writeback. While the above was my original impetus for adding this, I think it's also the case that current fsync semantics are just problematic for userland. Most applications that call fsync do so to ensure that the data they wrote has hit the backing store. In the case where there are multiple writers to the file at the same time, this is really hard to determine. The first one to call fsync will see any stored error, and the rest get back 0. The processes with open fds may not be associated with one another in any way. They could even be in different containers, so ensuring coordination between all fsync callers is not really an option. One way to remedy this would be to track what file descriptor was used to dirty the file, but that's rather cumbersome and would likely be slow. However, there is a simpler way to improve the semantics here without incurring too much overhead. This set adds an errseq_t to struct address_space, and a corresponding one is added to struct file. Writeback errors are recorded in the mapping's errseq_t, and the one in struct file is used as the "since" value. This changes the semantics of the Linux fsync implementation such that applications can now use it to determine whether there were any writeback errors since fsync(fd) was last called (or since the file was opened in the case of fsync having never been called). Note that those writeback errors may have occurred when writing data that was dirtied via an entirely different fd, but that's the case now with the current mapping_set_error/filemap_check_error infrastructure. This will at least prevent you from getting a false report of success. The new behavior is still consistent with the POSIX spec, and is more reliable for application developers. This patch just adds some basic infrastructure for doing this, and ensures that the f_wb_err "cursor" is properly set when a file is opened. Later patches will change the existing code to use this new infrastructure for reporting errors at fsync time. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
| * | | mm: don't TestClearPageError in __filemap_fdatawait_rangeJeff Layton2017-07-061-15/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The -EIO returned here can end up overriding whatever error is marked in the address space, and be returned at fsync time, even when there is a more appropriate error stored in the mapping. Read errors are also sometimes tracked on a per-page level using PG_error. Suppose we have a read error on a page, and then that page is subsequently dirtied by overwriting the whole page. Writeback doesn't clear PG_error, so we can then end up successfully writing back that page and still return -EIO on fsync. Worse yet, PG_error is cleared during a sync() syscall, but the -EIO return from that is silently discarded. Any subsystem that is relying on PG_error to report errors during fsync can easily lose writeback errors due to this. All you need is a stray sync() call to wait for writeback to complete and you've lost the error. Since the handling of the PG_error flag is somewhat inconsistent across subsystems, let's just rely on marking the address space when there are writeback errors. Change the TestClearPageError call to ClearPageError, and make __filemap_fdatawait_range a void return function. Signed-off-by: Jeff Layton <jlayton@redhat.com>
| * | | mm: clear AS_EIO/AS_ENOSPC when writeback initiation failsJeff Layton2017-07-061-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | filemap_write_and_wait{_range} will return an error if writeback initiation fails, but won't clear errors in the address_space. This is particularly problematic on DAX, as filemap_fdatawrite* is effectively synchronous there. Ensure that we clear the AS_EIO/AS_ENOSPC flags when filemap_fdatawrite* returns an error. Signed-off-by: Jeff Layton <jlayton@redhat.com>
| * | | jbd2: don't clear and reset errors after waiting on writebackJeff Layton2017-07-061-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Resetting this flag is almost certainly racy, and will be problematic with some coming changes. Make filemap_fdatawait_keep_errors return int, but not clear the flag(s). Have jbd2 call it instead of filemap_fdatawait and don't attempt to re-set the error flag if it fails. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com>
| * | | mm: fix mapping_set_error call in me_pagecache_dirtyJeff Layton2017-07-061-1/+1
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The error code should be negative. Since this ends up in the default case anyway, this is harmless, but it's less confusing to negate it. Also, later patches will require a negative error code here. Link: http://lkml.kernel.org/r/20170525103355.6760-1-jlayton@redhat.com Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | Merge tag 'for-linus-v4.13-1' of ↵Linus Torvalds2017-07-071-10/+9
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling fixes from Jeff Layton: "The main rationale for all of these changes is to tighten up writeback error reporting to userland. There are many ways now that writeback errors can be lost, such that fsync/fdatasync/msync return 0 when writeback actually failed. This pile contains a small set of cleanups and writeback error handling fixes that I was able to break off from the main pile (#2). Two of the patches in this pile are trivial. The exceptions are the patch to fix up error handling in write_one_page, and the patch to make JFS pay attention to write_one_page errors" * tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove call_fsync helper function mm: clean up error handling in write_one_page JFS: do not ignore return code from write_one_page() mm: drop "wait" parameter from write_one_page()
| * | | mm: clean up error handling in write_one_pageJeff Layton2017-07-051-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't try to check PageError since that's potentially racy and not necessarily going to be set after writepage errors out. Instead, check the mapping for an error after writepage returns. That should also help us detect errors that occurred if the VM tried to clean the page earlier due to memory pressure. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
| * | | mm: drop "wait" parameter from write_one_page()Jeff Layton2017-07-051-7/+7
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The callers all set it to 1. Also, make it clear that this function will not set any sort of AS_* error, and that the caller must do so if necessary. No existing caller uses this on normal files, so none of them need it. Also, add __must_check here since, in general, the callers need to handle an error here in some fashion. Link: http://lkml.kernel.org/r/20170525103303.6524-1-jlayton@redhat.com Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>