From 3e374919b314f20e2a04f641ebc1093d758f66a4 Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Sat, 21 Sep 2013 21:56:34 +0000 Subject: slab_common: Do not check for duplicate slab names SLUB can alias multiple slab kmem_create_requests to one slab cache to save memory and increase the cache hotness. As a result the name of the slab can be stale. Only check the name for duplicates if we are in debug mode where we do not merge multiple caches. This fixes the following problem reported by Jonathan Brassow: The problem with kmem_cache* is this: *) Assume CONFIG_SLUB is set 1) kmem_cache_create(name="foo-a") - creates new kmem_cache structure 2) kmem_cache_create(name="foo-b") - If identical cache characteristics, it will be merged with the previously created cache associated with "foo-a". The cache's refcount will be incremented and an alias will be created via sysfs_slab_alias(). 3) kmem_cache_destroy() - Attempting to destroy cache associated with "foo-a", but instead the refcount is simply decremented. I don't even think the sysfs aliases are ever removed... 4) kmem_cache_create(name="foo-a") - This FAILS because kmem_cache_sanity_check colides with the existing name ("foo-a") associated with the non-removed cache. This is a problem for RAID (specifically dm-raid) because the name used for the kmem_cache_create is ("raid%d-%p", level, mddev). If the cache persists for long enough, the memory address of an old mddev will be reused for a new mddev - causing an identical formulation of the cache name. Even though kmem_cache_destory had long ago been used to delete the old cache, the merging of caches has cause the name and cache of that old instance to be preserved and causes a colision (and thus failure) in kmem_cache_create(). I see this regularly in my testing. Reported-by: Jonathan Brassow Signed-off-by: Christoph Lameter Signed-off-by: Pekka Enberg --- mm/slab_common.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'mm') diff --git a/mm/slab_common.c b/mm/slab_common.c index a3443278ce3a..e2e98af703ea 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -56,6 +56,7 @@ static int kmem_cache_sanity_check(struct mem_cgroup *memcg, const char *name, continue; } +#if !defined(CONFIG_SLUB) || !defined(CONFIG_SLUB_DEBUG_ON) /* * For simplicity, we won't check this in the list of memcg * caches. We have control over memcg naming, and if there @@ -69,6 +70,7 @@ static int kmem_cache_sanity_check(struct mem_cgroup *memcg, const char *name, s = NULL; return -EINVAL; } +#endif } WARN_ON(strchr(name, ' ')); /* It confuses parsers */ -- cgit v1.2.3 From 9c56751271e7a917783fb57ec49fe8382e0dc867 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 16 Oct 2013 13:46:43 -0700 Subject: mm, memcg: protect mem_cgroup_read_events for cpu hotplug for_each_online_cpu() needs the protection of {get,put}_online_cpus() so cpu_online_mask doesn't change during the iteration. cpu_hotplug.lock is held while a cpu is going down, it's a coarse lock that is used kernel-wide to synchronize cpu hotplug activity. Memcg has a cpu hotplug notifier, called while there may not be any cpu hotplug refcounts, which drains per-cpu event counts to memcg->nocpu_base.events to maintain a cumulative event count as cpus disappear. Without get_online_cpus() in mem_cgroup_read_events(), it's possible to account for the event count on a dying cpu twice, and this value may be significantly large. In fact, all memcg->pcp_counter_lock use should be nested by {get,put}_online_cpus(). This fixes that issue and ensures the reported statistics are not vastly over-reported during cpu hotplug. Signed-off-by: David Rientjes Cc: Johannes Weiner Cc: Michal Hocko Cc: KAMEZAWA Hiroyuki Acked-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1c52ddbc839b..5335b2b6be77 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -866,6 +866,7 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg, unsigned long val = 0; int cpu; + get_online_cpus(); for_each_online_cpu(cpu) val += per_cpu(memcg->stat->events[idx], cpu); #ifdef CONFIG_HOTPLUG_CPU @@ -873,6 +874,7 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg, val += memcg->nocpu_base.events[idx]; spin_unlock(&memcg->pcp_counter_lock); #endif + put_online_cpus(); return val; } -- cgit v1.2.3 From ae39332162a837c3791bb21172d22382a90a6fd1 Mon Sep 17 00:00:00 2001 From: Andrew Vagin Date: Wed, 16 Oct 2013 13:46:46 -0700 Subject: mm/vmscan.c: don't forget to free shrinker->nr_deferred This leak was added by commit 1d3d4437eae1 ("vmscan: per-node deferred work"). unreferenced object 0xffff88006ada3bd0 (size 8): comm "criu", pid 14781, jiffies 4295238251 (age 105.641s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: [] kmemleak_alloc+0x5e/0xc0 [] __kmalloc+0x247/0x310 [] register_shrinker+0x3c/0xa0 [] sget+0x5ab/0x670 [] proc_mount+0x54/0x170 [] mount_fs+0x43/0x1b0 [] vfs_kern_mount+0x72/0x110 [] kern_mount_data+0x19/0x30 [] pid_ns_prepare_proc+0x20/0x40 [] alloc_pid+0x466/0x4a0 [] copy_process+0xc6a/0x1860 [] do_fork+0x8b/0x370 [] SyS_clone+0x16/0x20 [] stub_clone+0x69/0x90 [] 0xffffffffffffffff Signed-off-by: Andrew Vagin Cc: Mel Gorman Cc: Michal Hocko Cc: Rik van Riel Cc: Johannes Weiner Cc: Glauber Costa Cc: Chuck Lever Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 53f2f82f83ae..eea668d9cff6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -211,6 +211,7 @@ void unregister_shrinker(struct shrinker *shrinker) down_write(&shrinker_rwsem); list_del(&shrinker->list); up_write(&shrinker_rwsem); + kfree(shrinker->nr_deferred); } EXPORT_SYMBOL(unregister_shrinker); -- cgit v1.2.3 From 16c794b4f3ff38e97a71ce2472c7792f859d1022 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Wed, 16 Oct 2013 13:46:48 -0700 Subject: mm/hugetlb.c: correct missing private flag clearing We should clear the page's private flag when returing the page to the hugepage pool. Otherwise, marked hugepage can be allocated to the user who tries to allocate the non-reserved hugepage. If this user fail to map this hugepage, he would try to return the page to the hugepage pool. Since this page has a private flag, resv_huge_pages would mistakenly increase. This patch fixes this situation. Signed-off-by: Joonsoo Kim Cc: Rik van Riel Cc: Mel Gorman Cc: Michal Hocko Cc: "Aneesh Kumar K.V" Cc: KAMEZAWA Hiroyuki Cc: Hugh Dickins Cc: Davidlohr Bueso Cc: David Gibson Cc: Wanpeng Li Cc: Naoya Horiguchi Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index b49579c7f2a5..691f2264a6ce 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -653,6 +653,7 @@ static void free_huge_page(struct page *page) BUG_ON(page_count(page)); BUG_ON(page_mapcount(page)); restore_reserve = PagePrivate(page); + ClearPagePrivate(page); spin_lock(&hugetlb_lock); hugetlb_cgroup_uncharge_page(hstate_index(h), -- cgit v1.2.3 From c3d16e16522fe3fe8759735850a0676da18f4b1d Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Wed, 16 Oct 2013 13:46:51 -0700 Subject: mm: migration: do not lose soft dirty bit if page is in migration state If page migration is turned on in config and the page is migrating, we may lose the soft dirty bit. If fork and mprotect are called on migrating pages (once migration is complete) pages do not obtain the soft dirty bit in the correspond pte entries. Fix it adding an appropriate test on swap entries. Signed-off-by: Cyrill Gorcunov Cc: Pavel Emelyanov Cc: Andy Lutomirski Cc: Matt Mackall Cc: Xiao Guangrong Cc: Marcelo Tosatti Cc: KOSAKI Motohiro Cc: Stephen Rothwell Cc: Peter Zijlstra Cc: "Aneesh Kumar K.V" Cc: Naoya Horiguchi Cc: Mel Gorman Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 2 ++ mm/migrate.c | 2 ++ mm/mprotect.c | 7 +++++-- 3 files changed, 9 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index ca0003947115..f7b7692c05ed 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -837,6 +837,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, */ make_migration_entry_read(&entry); pte = swp_entry_to_pte(entry); + if (pte_swp_soft_dirty(*src_pte)) + pte = pte_swp_mksoft_dirty(pte); set_pte_at(src_mm, addr, src_pte, pte); } } diff --git a/mm/migrate.c b/mm/migrate.c index a26bccd44ccb..7a7325ee1d08 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -161,6 +161,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma, get_page(new); pte = pte_mkold(mk_pte(new, vma->vm_page_prot)); + if (pte_swp_soft_dirty(*ptep)) + pte = pte_mksoft_dirty(pte); if (is_write_migration_entry(entry)) pte = pte_mkwrite(pte); #ifdef CONFIG_HUGETLB_PAGE diff --git a/mm/mprotect.c b/mm/mprotect.c index 94722a4d6b43..a3af058f68e4 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -94,13 +94,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, swp_entry_t entry = pte_to_swp_entry(oldpte); if (is_write_migration_entry(entry)) { + pte_t newpte; /* * A protection check is difficult so * just be safe and disable write */ make_migration_entry_read(&entry); - set_pte_at(mm, addr, pte, - swp_entry_to_pte(entry)); + newpte = swp_entry_to_pte(entry); + if (pte_swp_soft_dirty(oldpte)) + newpte = pte_swp_mksoft_dirty(newpte); + set_pte_at(mm, addr, pte, newpte); } pages++; } -- cgit v1.2.3 From aa9bca05a467c61dcea4142b2877d5392de5bdce Mon Sep 17 00:00:00 2001 From: Weijie Yang Date: Wed, 16 Oct 2013 13:46:54 -0700 Subject: mm/zswap: bugfix: memory leak when re-swapon zswap_tree is not freed when swapoff, and it got re-kmalloced in swapon, so a memory leak occurs. Free the memory of zswap_tree in zswap_frontswap_invalidate_area(). Signed-off-by: Weijie Yang Reviewed-by: Bob Liu Cc: Minchan Kim Reviewed-by: Minchan Kim Cc: From: Weijie Yang Subject: mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently Consider the following scenario: thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) thread 1: call zswap_frontswap_invalidate_page to invalidate entry x. finished, entry x and its zbud is not freed as its refcount != 0 now, the swap_map[x] = 0 thread 0: now call zswap_get_swap_cache_page swapcache_prepare return -ENOENT because entry x is not used any more zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM zswap_writeback_entry do nothing except put refcount Now, the memory of zswap_entry x and its zpage leak. Modify: - check the refcount in fail path, free memory if it is not referenced. - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path can be not only caused by nomem but also by invalidate. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Weijie Yang Reviewed-by: Bob Liu Cc: Minchan Kim Cc: Acked-by: Seth Jennings Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zswap.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'mm') diff --git a/mm/zswap.c b/mm/zswap.c index 841e35f1db22..d93510c6aa2d 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -804,6 +804,10 @@ static void zswap_frontswap_invalidate_area(unsigned type) } tree->rbroot = RB_ROOT; spin_unlock(&tree->lock); + + zbud_destroy_pool(tree->pool); + kfree(tree); + zswap_trees[type] = NULL; } static struct zbud_ops zswap_zbud_ops = { -- cgit v1.2.3 From ef5a22be2c525293b777ccd879a8017c41c7ed5a Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Wed, 16 Oct 2013 13:46:56 -0700 Subject: mm: hugetlb: initialize PG_reserved for tail pages of gigantic compound pages Commit 11feeb498086 ("kvm: optimize away THP checks in kvm_is_mmio_pfn()") introduced a memory leak when KVM is run on gigantic compound pages. That commit depends on the assumption that PG_reserved is identical for all head and tail pages of a compound page. So that if get_user_pages returns a tail page, we don't need to check the head page in order to know if we deal with a reserved page that requires different refcounting. The assumption that PG_reserved is the same for head and tail pages is certainly correct for THP and regular hugepages, but gigantic hugepages allocated through bootmem don't clear the PG_reserved on the tail pages (the clearing of PG_reserved is done later only if the gigantic hugepage is freed). This patch corrects the gigantic compound page initialization so that we can retain the optimization in 11feeb498086. The cacheline was already modified in order to set PG_tail so this won't affect the boot time of large memory systems. [akpm@linux-foundation.org: tweak comment layout and grammar] Signed-off-by: Andrea Arcangeli Reported-by: andy123 Acked-by: Rik van Riel Cc: Gleb Natapov Cc: Mel Gorman Cc: Hugh Dickins Acked-by: Rafael Aquini Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 691f2264a6ce..0b7656e804d1 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -696,8 +696,22 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order) /* we rely on prep_new_huge_page to set the destructor */ set_compound_order(page, order); __SetPageHead(page); + __ClearPageReserved(page); for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { __SetPageTail(p); + /* + * For gigantic hugepages allocated through bootmem at + * boot, it's safer to be consistent with the not-gigantic + * hugepages and clear the PG_reserved bit from all tail pages + * too. Otherwse drivers using get_user_pages() to access tail + * pages may get the reference counting wrong if they see + * PG_reserved set on a tail page (despite the head page not + * having PG_reserved set). Enforcing this consistency between + * head and tail pages allows drivers to optimize away a check + * on the head page when they need know if put_page() is needed + * after get_user_pages(). + */ + __ClearPageReserved(p); set_page_count(p, 0); p->first_page = page; } @@ -1330,9 +1344,9 @@ static void __init gather_bootmem_prealloc(void) #else page = virt_to_page(m); #endif - __ClearPageReserved(page); WARN_ON(page_count(page) != 1); prep_compound_huge_page(page, h->order); + WARN_ON(PageReserved(page)); prep_new_huge_page(h, page, page_to_nid(page)); /* * If we had gigantic hugepages allocated at boot time, we need -- cgit v1.2.3 From 4942642080ea82d99ab5b653abb9a12b7ba31f4a Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 16 Oct 2013 13:46:59 -0700 Subject: mm: memcg: handle non-error OOM situations more gracefully Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") assumed that only a few places that can trigger a memcg OOM situation do not return VM_FAULT_OOM, like optional page cache readahead. But there are many more and it's impractical to annotate them all. First of all, we don't want to invoke the OOM killer when the failed allocation is gracefully handled, so defer the actual kill to the end of the fault handling as well. This simplifies the code quite a bit for added bonus. Second, since a failed allocation might not be the abrupt end of the fault, the memcg OOM handler needs to be re-entrant until the fault finishes for subsequent allocation attempts. If an allocation is attempted after the task already OOMed, allow it to bypass the limit so that it can quickly finish the fault and invoke the OOM killer. Reported-by: azurIt Signed-off-by: Johannes Weiner Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 11 +---- mm/memcontrol.c | 139 +++++++++++++++++++++----------------------------------- mm/memory.c | 18 +++++--- mm/oom_kill.c | 2 +- 4 files changed, 65 insertions(+), 105 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index 1e6aec4a2d2e..ae4846ff4849 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1616,7 +1616,6 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; struct page *page; - bool memcg_oom; pgoff_t size; int ret = 0; @@ -1625,11 +1624,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? Either - * way, try readahead, but disable the memcg OOM killer for it - * as readahead is optional and no errors are propagated up - * the fault stack. The OOM killer is enabled while trying to - * instantiate the faulting page individually below. + * Do we have something in the page cache already? */ page = find_get_page(mapping, offset); if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) { @@ -1637,14 +1632,10 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ - memcg_oom = mem_cgroup_toggle_oom(false); do_async_mmap_readahead(vma, ra, file, page, offset); - mem_cgroup_toggle_oom(memcg_oom); } else if (!page) { /* No page in the page cache at all */ - memcg_oom = mem_cgroup_toggle_oom(false); do_sync_mmap_readahead(vma, ra, file, offset); - mem_cgroup_toggle_oom(memcg_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5335b2b6be77..65fc6a449841 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2161,110 +2161,59 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) memcg_wakeup_oom(memcg); } -/* - * try to call OOM killer - */ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - bool locked; - int wakeups; - if (!current->memcg_oom.may_oom) return; - - current->memcg_oom.in_memcg_oom = 1; - /* - * As with any blocking lock, a contender needs to start - * listening for wakeups before attempting the trylock, - * otherwise it can miss the wakeup from the unlock and sleep - * indefinitely. This is just open-coded because our locking - * is so particular to memcg hierarchies. + * We are in the middle of the charge context here, so we + * don't want to block when potentially sitting on a callstack + * that holds all kinds of filesystem and mm locks. + * + * Also, the caller may handle a failed allocation gracefully + * (like optional page cache readahead) and so an OOM killer + * invocation might not even be necessary. + * + * That's why we don't do anything here except remember the + * OOM context and then deal with it at the end of the page + * fault when the stack is unwound, the locks are released, + * and when we know whether the fault was overall successful. */ - wakeups = atomic_read(&memcg->oom_wakeups); - mem_cgroup_mark_under_oom(memcg); - - locked = mem_cgroup_oom_trylock(memcg); - - if (locked) - mem_cgroup_oom_notify(memcg); - - if (locked && !memcg->oom_kill_disable) { - mem_cgroup_unmark_under_oom(memcg); - mem_cgroup_out_of_memory(memcg, mask, order); - mem_cgroup_oom_unlock(memcg); - /* - * There is no guarantee that an OOM-lock contender - * sees the wakeups triggered by the OOM kill - * uncharges. Wake any sleepers explicitely. - */ - memcg_oom_recover(memcg); - } else { - /* - * A system call can just return -ENOMEM, but if this - * is a page fault and somebody else is handling the - * OOM already, we need to sleep on the OOM waitqueue - * for this memcg until the situation is resolved. - * Which can take some time because it might be - * handled by a userspace task. - * - * However, this is the charge context, which means - * that we may sit on a large call stack and hold - * various filesystem locks, the mmap_sem etc. and we - * don't want the OOM handler to deadlock on them - * while we sit here and wait. Store the current OOM - * context in the task_struct, then return -ENOMEM. - * At the end of the page fault handler, with the - * stack unwound, pagefault_out_of_memory() will check - * back with us by calling - * mem_cgroup_oom_synchronize(), possibly putting the - * task to sleep. - */ - current->memcg_oom.oom_locked = locked; - current->memcg_oom.wakeups = wakeups; - css_get(&memcg->css); - current->memcg_oom.wait_on_memcg = memcg; - } + css_get(&memcg->css); + current->memcg_oom.memcg = memcg; + current->memcg_oom.gfp_mask = mask; + current->memcg_oom.order = order; } /** * mem_cgroup_oom_synchronize - complete memcg OOM handling + * @handle: actually kill/wait or just clean up the OOM state * - * This has to be called at the end of a page fault if the the memcg - * OOM handler was enabled and the fault is returning %VM_FAULT_OOM. + * This has to be called at the end of a page fault if the memcg OOM + * handler was enabled. * - * Memcg supports userspace OOM handling, so failed allocations must + * Memcg supports userspace OOM handling where failed allocations must * sleep on a waitqueue until the userspace task resolves the * situation. Sleeping directly in the charge context with all kinds * of locks held is not a good idea, instead we remember an OOM state * in the task and mem_cgroup_oom_synchronize() has to be called at - * the end of the page fault to put the task to sleep and clean up the - * OOM state. + * the end of the page fault to complete the OOM handling. * * Returns %true if an ongoing memcg OOM situation was detected and - * finalized, %false otherwise. + * completed, %false otherwise. */ -bool mem_cgroup_oom_synchronize(void) +bool mem_cgroup_oom_synchronize(bool handle) { + struct mem_cgroup *memcg = current->memcg_oom.memcg; struct oom_wait_info owait; - struct mem_cgroup *memcg; + bool locked; /* OOM is global, do not handle */ - if (!current->memcg_oom.in_memcg_oom) - return false; - - /* - * We invoked the OOM killer but there is a chance that a kill - * did not free up any charges. Everybody else might already - * be sleeping, so restart the fault and keep the rampage - * going until some charges are released. - */ - memcg = current->memcg_oom.wait_on_memcg; if (!memcg) - goto out; + return false; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) - goto out_memcg; + if (!handle) + goto cleanup; owait.memcg = memcg; owait.wait.flags = 0; @@ -2273,13 +2222,25 @@ bool mem_cgroup_oom_synchronize(void) INIT_LIST_HEAD(&owait.wait.task_list); prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); - /* Only sleep if we didn't miss any wakeups since OOM */ - if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + mem_cgroup_mark_under_oom(memcg); + + locked = mem_cgroup_oom_trylock(memcg); + + if (locked) + mem_cgroup_oom_notify(memcg); + + if (locked && !memcg->oom_kill_disable) { + mem_cgroup_unmark_under_oom(memcg); + finish_wait(&memcg_oom_waitq, &owait.wait); + mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask, + current->memcg_oom.order); + } else { schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); -out_memcg: - mem_cgroup_unmark_under_oom(memcg); - if (current->memcg_oom.oom_locked) { + mem_cgroup_unmark_under_oom(memcg); + finish_wait(&memcg_oom_waitq, &owait.wait); + } + + if (locked) { mem_cgroup_oom_unlock(memcg); /* * There is no guarantee that an OOM-lock contender @@ -2288,10 +2249,9 @@ out_memcg: */ memcg_oom_recover(memcg); } +cleanup: + current->memcg_oom.memcg = NULL; css_put(&memcg->css); - current->memcg_oom.wait_on_memcg = NULL; -out: - current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2705,6 +2665,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, || fatal_signal_pending(current))) goto bypass; + if (unlikely(task_in_memcg_oom(current))) + goto bypass; + /* * We always charge the cgroup the mm_struct belongs to. * The mm_struct's mem_cgroup changes on task migration if the diff --git a/mm/memory.c b/mm/memory.c index f7b7692c05ed..1311f26497e6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3865,15 +3865,21 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, * space. Kernel faults are handled more gracefully. */ if (flags & FAULT_FLAG_USER) - mem_cgroup_enable_oom(); + mem_cgroup_oom_enable(); ret = __handle_mm_fault(mm, vma, address, flags); - if (flags & FAULT_FLAG_USER) - mem_cgroup_disable_oom(); - - if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) - mem_cgroup_oom_synchronize(); + if (flags & FAULT_FLAG_USER) { + mem_cgroup_oom_disable(); + /* + * The task may have entered a memcg OOM situation but + * if the allocation error was handled gracefully (no + * VM_FAULT_OOM), there is no need to kill anything. + * Just clean up the OOM state peacefully. + */ + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM)) + mem_cgroup_oom_synchronize(false); + } return ret; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 314e9d274381..6738c47f1f72 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -680,7 +680,7 @@ void pagefault_out_of_memory(void) { struct zonelist *zonelist; - if (mem_cgroup_oom_synchronize()) + if (mem_cgroup_oom_synchronize(true)) return; zonelist = node_zonelist(first_online_node, GFP_KERNEL); -- cgit v1.2.3 From 84235de394d9775bfaa7fa9762a59d91fef0c1fc Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 16 Oct 2013 13:47:00 -0700 Subject: fs: buffer: move allocation failure loop into the allocator Buffer allocation has a very crude indefinite loop around waking the flusher threads and performing global NOFS direct reclaim because it can not handle allocation failures. The most immediate problem with this is that the allocation may fail due to a memory cgroup limit, where flushers + direct reclaim might not make any progress towards resolving the situation at all. Because unlike the global case, a memory cgroup may not have any cache at all, only anonymous pages but no swap. This situation will lead to a reclaim livelock with insane IO from waking the flushers and thrashing unrelated filesystem cache in a tight loop. Use __GFP_NOFAIL allocations for buffers for now. This makes sure that any looping happens in the page allocator, which knows how to orchestrate kswapd, direct reclaim, and the flushers sensibly. It also allows memory cgroups to detect allocations that can't handle failure and will allow them to ultimately bypass the limit if reclaim can not make progress. Reported-by: azurIt Signed-off-by: Johannes Weiner Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 65fc6a449841..34d3ca9572d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2766,6 +2766,8 @@ done: return 0; nomem: *ptr = NULL; + if (gfp_mask & __GFP_NOFAIL) + return 0; return -ENOMEM; bypass: *ptr = root_mem_cgroup; -- cgit v1.2.3 From e3b6c655b91e01a1dade056cfa358581b47a5351 Mon Sep 17 00:00:00 2001 From: Fengguang Wu Date: Wed, 16 Oct 2013 13:47:03 -0700 Subject: writeback: fix negative bdi max pause MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Toralf runs trinity on UML/i386. After some time it hangs and the last message line is BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521] It's found that pages_dirtied becomes very large. More than 1000000000 pages in this case: period = HZ * pages_dirtied / task_ratelimit; BUG_ON(pages_dirtied > 2000000000); BUG_ON(pages_dirtied > 1000000000); <--------- UML debug printf shows that we got negative pause here: ick: pause : -984 ick: pages_dirtied : 0 ick: task_ratelimit: 0 pause: + if (pause < 0) { + extern int printf(char *, ...); + printf("ick : pause : %li\n", pause); + printf("ick: pages_dirtied : %lu\n", pages_dirtied); + printf("ick: task_ratelimit: %lu\n", task_ratelimit); + BUG_ON(1); + } trace_balance_dirty_pages(bdi, Since pause is bounded by [min_pause, max_pause] where min_pause is also bounded by max_pause. It's suspected and demonstrated that the max_pause calculation goes wrong: ick: pause : -717 ick: min_pause : -177 ick: max_pause : -717 ick: pages_dirtied : 14 ick: task_ratelimit: 0 The problem lies in the two "long = unsigned long" assignments in bdi_max_pause() which might go negative if the highest bit is 1, and the min_t(long, ...) check failed to protect it falling under 0. Fix all of them by using "unsigned long" throughout the function. Signed-off-by: Fengguang Wu Reported-by: Toralf Förster Tested-by: Toralf Förster Reviewed-by: Jan Kara Cc: Richard Weinberger Cc: Geert Uytterhoeven Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index f5236f804aa6..63807583d8e8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1210,11 +1210,11 @@ static unsigned long dirty_poll_interval(unsigned long dirty, return 1; } -static long bdi_max_pause(struct backing_dev_info *bdi, - unsigned long bdi_dirty) +static unsigned long bdi_max_pause(struct backing_dev_info *bdi, + unsigned long bdi_dirty) { - long bw = bdi->avg_write_bandwidth; - long t; + unsigned long bw = bdi->avg_write_bandwidth; + unsigned long t; /* * Limit pause time for small memory systems. If sleeping for too long @@ -1226,7 +1226,7 @@ static long bdi_max_pause(struct backing_dev_info *bdi, t = bdi_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8)); t++; - return min_t(long, t, MAX_PAUSE); + return min_t(unsigned long, t, MAX_PAUSE); } static long bdi_min_pause(struct backing_dev_info *bdi, -- cgit v1.2.3 From 5b808a2300a5ac45f4798ebfac8b367e98a4b692 Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Wed, 16 Oct 2013 13:47:06 -0700 Subject: swap: fix set_blocksize race during swapon/swapoff Fix race between swapoff and swapon. Swapoff used old_block_size from swap_info outside of swapon_mutex so it could be overwritten by concurrent swapon. The race has visible effect only if more than one swap block device exists with different block sizes (e.g. /dev/sda1 with block size 4096 and /dev/sdb1 with 512). In such case it leads to setting the blocksize of swapped off device with wrong blocksize. The bug can be triggered with multiple concurrent swapoff and swapon: 0. Swap for some device is on. 1. swapoff: First the swapoff is called on this device and "struct swap_info_struct *p" is assigned. This is done under swap_lock however this lock is released for the call try_to_unuse(). 2. swapon: After the assignment above (and before acquiring swapon_mutex & swap_lock by swapoff) the swapon is called on the same device. The p->old_block_size is assigned to the value of block_size the device. This block size should be the same as previous but sometimes it is not. The swapon ends successfully. 3. swapoff: Swapoff resumes, grabs the locks and mutex and continues to disable this swap device. Now it sets the block size to value taken from swap_info which was overwritten by swapon in 2. Signed-off-by: Krzysztof Kozlowski Reported-by: Weijie Yang Cc: Bob Liu Cc: Konrad Rzeszutek Wilk Cc: Shaohua Li Cc: Minchan Kim Acked-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index 3963fc24fcc1..de7c904e52e5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1824,6 +1824,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) struct filename *pathname; int i, type, prev; int err; + unsigned int old_block_size; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -1914,6 +1915,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) } swap_file = p->swap_file; + old_block_size = p->old_block_size; p->swap_file = NULL; p->max = 0; swap_map = p->swap_map; @@ -1938,7 +1940,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) inode = mapping->host; if (S_ISBLK(inode->i_mode)) { struct block_device *bdev = I_BDEV(inode); - set_blocksize(bdev, p->old_block_size); + set_blocksize(bdev, old_block_size); blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL); } else { mutex_lock(&inode->i_mutex); -- cgit v1.2.3 From 750e8165f5e87b6a142be953640eabb13a9d350a Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 16 Oct 2013 13:47:08 -0700 Subject: mm: fix BUG in __split_huge_page_pmd Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). It's invalid: we don't always have down_write of mmap_sem there: a racing do_huge_pmd_wp_page() might have copied-on-write to another huge page before our split_huge_page() got the anon_vma lock. Forget the BUG_ON, just go back and try again if this happens. Signed-off-by: Hugh Dickins Acked-by: Kirill A. Shutemov Cc: Andrea Arcangeli Cc: Naoya Horiguchi Cc: David Rientjes Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7489884682d8..610e3df2768a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2697,6 +2697,7 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address, mmun_start = haddr; mmun_end = haddr + HPAGE_PMD_SIZE; +again: mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); spin_lock(&mm->page_table_lock); if (unlikely(!pmd_trans_huge(*pmd))) { @@ -2719,7 +2720,14 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address, split_huge_page(page); put_page(page); - BUG_ON(pmd_trans_huge(*pmd)); + + /* + * We don't always have down_write of mmap_sem here: a racing + * do_huge_pmd_wp_page() might have copied-on-write to another + * huge page before our split_huge_page() got the anon_vma lock. + */ + if (unlikely(pmd_trans_huge(*pmd))) + goto again; } void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address, -- cgit v1.2.3 From 57a8f0cdb87da776bf0e4ce7554a9133854fa779 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 16 Oct 2013 13:47:09 -0700 Subject: mm: revert mremap pud_free anti-fix Revert commit 1ecfd533f4c5 ("mm/mremap.c: call pud_free() after fail calling pmd_alloc()"). The original code was correct: pud_alloc(), pmd_alloc(), pte_alloc_map() ensure that the pud, pmd, pt is already allocated, and seldom do they need to allocate; on failure, upper levels are freed if appropriate by the subsequent do_munmap(). Whereas commit 1ecfd533f4c5 did an unconditional pud_free() of a most-likely still-in-use pud: saved only by the near-impossiblity of pmd_alloc() failing. Signed-off-by: Hugh Dickins Cc: Chen Gang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mremap.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/mremap.c b/mm/mremap.c index 91b13d6a16d4..0843feb66f3d 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -25,7 +25,6 @@ #include #include #include -#include #include "internal.h" @@ -63,10 +62,8 @@ static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma, return NULL; pmd = pmd_alloc(mm, pud, addr); - if (!pmd) { - pud_free(mm, pud); + if (!pmd) return NULL; - } VM_BUG_ON(pmd_trans_huge(*pmd)); -- cgit v1.2.3 From 1dd49bfa3465756b3ce72214b58a33e4afb67aa3 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:28:42 +0100 Subject: mm: numa: Do not account for a hinting fault if we raced If another task handled a hinting fault in parallel then do not double account for it. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Cc: Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- mm/huge_memory.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 610e3df2768a..33ee637648ba 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1325,8 +1325,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, check_same: spin_lock(&mm->page_table_lock); - if (unlikely(!pmd_same(pmd, *pmdp))) + if (unlikely(!pmd_same(pmd, *pmdp))) { + /* Someone else took our fault */ + current_nid = -1; goto out_unlock; + } clear_pmdnuma: pmd = pmd_mknonnuma(pmd); set_pmd_at(mm, haddr, pmdp, pmd); -- cgit v1.2.3 From 42836f5f8baa33085f547098b74aa98991ee9216 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:28:43 +0100 Subject: mm: Wait for THP migrations to complete during NUMA hinting faults The locking for migrating THP is unusual. While normal page migration prevents parallel accesses using a migration PTE, THP migration relies on a combination of the page_table_lock, the page lock and the existance of the NUMA hinting PTE to guarantee safety but there is a bug in the scheme. If a THP page is currently being migrated and another thread traps a fault on the same page it checks if the page is misplaced. If it is not, then pmd_numa is cleared. The problem is that it checks if the page is misplaced without holding the page lock meaning that the racing thread can be migrating the THP when the second thread clears the NUMA bit and faults a stale page. This patch checks if the page is potentially being migrated and stalls using the lock_page if it is potentially being migrated before checking if the page is misplaced or not. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Cc: Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- mm/huge_memory.c | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 33ee637648ba..e10d780c4781 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1295,13 +1295,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, if (current_nid == numa_node_id()) count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); - target_nid = mpol_misplaced(page, vma, haddr); - if (target_nid == -1) { - put_page(page); - goto clear_pmdnuma; - } + /* + * Acquire the page lock to serialise THP migrations but avoid dropping + * page_table_lock if at all possible + */ + if (trylock_page(page)) + goto got_lock; - /* Acquire the page lock to serialise THP migrations */ + /* Serialise against migrationa and check placement check placement */ spin_unlock(&mm->page_table_lock); lock_page(page); @@ -1312,9 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, put_page(page); goto out_unlock; } - spin_unlock(&mm->page_table_lock); + +got_lock: + target_nid = mpol_misplaced(page, vma, haddr); + if (target_nid == -1) { + unlock_page(page); + put_page(page); + goto clear_pmdnuma; + } /* Migrate the THP to the requested node */ + spin_unlock(&mm->page_table_lock); migrated = migrate_misplaced_transhuge_page(mm, vma, pmdp, pmd, addr, page, target_nid); if (!migrated) -- cgit v1.2.3 From 587fe586f44a48f9691001ba6c45b86c8e4ba21f Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:28:44 +0100 Subject: mm: Prevent parallel splits during THP migration THP migrations are serialised by the page lock but on its own that does not prevent THP splits. If the page is split during THP migration then the pmd_same checks will prevent page table corruption but the unlock page and other fix-ups potentially will cause corruption. This patch takes the anon_vma lock to prevent parallel splits during migration. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Cc: Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- mm/huge_memory.c | 44 ++++++++++++++++++++++++++++++-------------- 1 file changed, 30 insertions(+), 14 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e10d780c4781..d8534b3630e4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1278,18 +1278,18 @@ out: int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, pmd_t pmd, pmd_t *pmdp) { + struct anon_vma *anon_vma = NULL; struct page *page; unsigned long haddr = addr & HPAGE_PMD_MASK; int target_nid; int current_nid = -1; - bool migrated; + bool migrated, page_locked; spin_lock(&mm->page_table_lock); if (unlikely(!pmd_same(pmd, *pmdp))) goto out_unlock; page = pmd_page(pmd); - get_page(page); current_nid = page_to_nid(page); count_vm_numa_event(NUMA_HINT_FAULTS); if (current_nid == numa_node_id()) @@ -1299,12 +1299,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * Acquire the page lock to serialise THP migrations but avoid dropping * page_table_lock if at all possible */ - if (trylock_page(page)) - goto got_lock; + page_locked = trylock_page(page); + target_nid = mpol_misplaced(page, vma, haddr); + if (target_nid == -1) { + /* If the page was locked, there are no parallel migrations */ + if (page_locked) { + unlock_page(page); + goto clear_pmdnuma; + } - /* Serialise against migrationa and check placement check placement */ + /* Otherwise wait for potential migrations and retry fault */ + spin_unlock(&mm->page_table_lock); + wait_on_page_locked(page); + goto out; + } + + /* Page is misplaced, serialise migrations and parallel THP splits */ + get_page(page); spin_unlock(&mm->page_table_lock); - lock_page(page); + if (!page_locked) { + lock_page(page); + page_locked = true; + } + anon_vma = page_lock_anon_vma_read(page); /* Confirm the PTE did not while locked */ spin_lock(&mm->page_table_lock); @@ -1314,14 +1331,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out_unlock; } -got_lock: - target_nid = mpol_misplaced(page, vma, haddr); - if (target_nid == -1) { - unlock_page(page); - put_page(page); - goto clear_pmdnuma; - } - /* Migrate the THP to the requested node */ spin_unlock(&mm->page_table_lock); migrated = migrate_misplaced_transhuge_page(mm, vma, @@ -1330,6 +1339,8 @@ got_lock: goto check_same; task_numa_fault(target_nid, HPAGE_PMD_NR, true); + if (anon_vma) + page_unlock_anon_vma_read(anon_vma); return 0; check_same: @@ -1346,6 +1357,11 @@ clear_pmdnuma: update_mmu_cache_pmd(vma, addr, pmdp); out_unlock: spin_unlock(&mm->page_table_lock); + +out: + if (anon_vma) + page_unlock_anon_vma_read(anon_vma); + if (current_nid != -1) task_numa_fault(current_nid, HPAGE_PMD_NR, false); return 0; -- cgit v1.2.3 From c61109e34f60f6e85bb43c5a1cd51c0e3db40847 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:28:45 +0100 Subject: mm: numa: Sanitize task_numa_fault() callsites There are three callers of task_numa_fault(): - do_huge_pmd_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_pmd_numa_page(): Accounts not at all when the page isn't migrated, otherwise accounts against the node we migrated towards. This seems wrong to me; all three sites should have the same sementaics, furthermore we should accounts against where the page really is, we already know where the task is. So modify all three sites to always account; we did after all receive the fault; and always account to where the page is after migration, regardless of success. They all still differ on when they clear the PTE/PMD; ideally that would get sorted too. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Cc: Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- mm/huge_memory.c | 25 +++++++++++++------------ mm/memory.c | 53 +++++++++++++++++++++-------------------------------- 2 files changed, 34 insertions(+), 44 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d8534b3630e4..00ddfcdd810e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1281,18 +1281,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, struct anon_vma *anon_vma = NULL; struct page *page; unsigned long haddr = addr & HPAGE_PMD_MASK; + int page_nid = -1, this_nid = numa_node_id(); int target_nid; - int current_nid = -1; - bool migrated, page_locked; + bool page_locked; + bool migrated = false; spin_lock(&mm->page_table_lock); if (unlikely(!pmd_same(pmd, *pmdp))) goto out_unlock; page = pmd_page(pmd); - current_nid = page_to_nid(page); + page_nid = page_to_nid(page); count_vm_numa_event(NUMA_HINT_FAULTS); - if (current_nid == numa_node_id()) + if (page_nid == this_nid) count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); /* @@ -1335,19 +1336,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, spin_unlock(&mm->page_table_lock); migrated = migrate_misplaced_transhuge_page(mm, vma, pmdp, pmd, addr, page, target_nid); - if (!migrated) + if (migrated) + page_nid = target_nid; + else goto check_same; - task_numa_fault(target_nid, HPAGE_PMD_NR, true); - if (anon_vma) - page_unlock_anon_vma_read(anon_vma); - return 0; + goto out; check_same: spin_lock(&mm->page_table_lock); if (unlikely(!pmd_same(pmd, *pmdp))) { /* Someone else took our fault */ - current_nid = -1; + page_nid = -1; goto out_unlock; } clear_pmdnuma: @@ -1362,8 +1362,9 @@ out: if (anon_vma) page_unlock_anon_vma_read(anon_vma); - if (current_nid != -1) - task_numa_fault(current_nid, HPAGE_PMD_NR, false); + if (page_nid != -1) + task_numa_fault(page_nid, HPAGE_PMD_NR, migrated); + return 0; } diff --git a/mm/memory.c b/mm/memory.c index 1311f26497e6..d176154c243f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3521,12 +3521,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma, } int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, - unsigned long addr, int current_nid) + unsigned long addr, int page_nid) { get_page(page); count_vm_numa_event(NUMA_HINT_FAULTS); - if (current_nid == numa_node_id()) + if (page_nid == numa_node_id()) count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); return mpol_misplaced(page, vma, addr); @@ -3537,7 +3537,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, { struct page *page = NULL; spinlock_t *ptl; - int current_nid = -1; + int page_nid = -1; int target_nid; bool migrated = false; @@ -3567,15 +3567,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, return 0; } - current_nid = page_to_nid(page); - target_nid = numa_migrate_prep(page, vma, addr, current_nid); + page_nid = page_to_nid(page); + target_nid = numa_migrate_prep(page, vma, addr, page_nid); pte_unmap_unlock(ptep, ptl); if (target_nid == -1) { - /* - * Account for the fault against the current node if it not - * being replaced regardless of where the page is located. - */ - current_nid = numa_node_id(); put_page(page); goto out; } @@ -3583,11 +3578,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Migrate to the requested node */ migrated = migrate_misplaced_page(page, target_nid); if (migrated) - current_nid = target_nid; + page_nid = target_nid; out: - if (current_nid != -1) - task_numa_fault(current_nid, 1, migrated); + if (page_nid != -1) + task_numa_fault(page_nid, 1, migrated); return 0; } @@ -3602,7 +3597,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long offset; spinlock_t *ptl; bool numa = false; - int local_nid = numa_node_id(); spin_lock(&mm->page_table_lock); pmd = *pmdp; @@ -3625,9 +3619,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) { pte_t pteval = *pte; struct page *page; - int curr_nid = local_nid; + int page_nid = -1; int target_nid; - bool migrated; + bool migrated = false; + if (!pte_present(pteval)) continue; if (!pte_numa(pteval)) @@ -3649,25 +3644,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(page_mapcount(page) != 1)) continue; - /* - * Note that the NUMA fault is later accounted to either - * the node that is currently running or where the page is - * migrated to. - */ - curr_nid = local_nid; - target_nid = numa_migrate_prep(page, vma, addr, - page_to_nid(page)); - if (target_nid == -1) { + page_nid = page_to_nid(page); + target_nid = numa_migrate_prep(page, vma, addr, page_nid); + pte_unmap_unlock(pte, ptl); + if (target_nid != -1) { + migrated = migrate_misplaced_page(page, target_nid); + if (migrated) + page_nid = target_nid; + } else { put_page(page); - continue; } - /* Migrate to the requested node */ - pte_unmap_unlock(pte, ptl); - migrated = migrate_misplaced_page(page, target_nid); - if (migrated) - curr_nid = target_nid; - task_numa_fault(curr_nid, 1, migrated); + if (page_nid != -1) + task_numa_fault(page_nid, 1, migrated); pte = pte_offset_map_lock(mm, pmdp, addr, &ptl); } -- cgit v1.2.3 From 3f926ab945b60a5824369d21add7710622a2eac0 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:28:46 +0100 Subject: mm: Close races between THP migration and PMD numa clearing THP migration uses the page lock to guard against parallel allocations but there are cases like this still open Task A Task B --------------------- --------------------- do_huge_pmd_numa_page do_huge_pmd_numa_page lock_page mpol_misplaced == -1 unlock_page goto clear_pmdnuma lock_page mpol_misplaced == 2 migrate_misplaced_transhuge pmd = pmd_mknonnuma set_pmd_at During hours of testing, one crashed with weird errors and while I have no direct evidence, I suspect something like the race above happened. This patch extends the page lock to being held until the pmd_numa is cleared to prevent migration starting in parallel while the pmd_numa is being cleared. It also flushes the old pmd entry and orders pagetable insertion before rmap insertion. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Cc: Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- mm/huge_memory.c | 33 +++++++++++++++------------------ mm/migrate.c | 19 +++++++++++-------- 2 files changed, 26 insertions(+), 26 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 00ddfcdd810e..cca80d96e509 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1304,24 +1304,25 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, target_nid = mpol_misplaced(page, vma, haddr); if (target_nid == -1) { /* If the page was locked, there are no parallel migrations */ - if (page_locked) { - unlock_page(page); + if (page_locked) goto clear_pmdnuma; - } - /* Otherwise wait for potential migrations and retry fault */ + /* + * Otherwise wait for potential migrations and retry. We do + * relock and check_same as the page may no longer be mapped. + * As the fault is being retried, do not account for it. + */ spin_unlock(&mm->page_table_lock); wait_on_page_locked(page); + page_nid = -1; goto out; } /* Page is misplaced, serialise migrations and parallel THP splits */ get_page(page); spin_unlock(&mm->page_table_lock); - if (!page_locked) { + if (!page_locked) lock_page(page); - page_locked = true; - } anon_vma = page_lock_anon_vma_read(page); /* Confirm the PTE did not while locked */ @@ -1329,32 +1330,28 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(!pmd_same(pmd, *pmdp))) { unlock_page(page); put_page(page); + page_nid = -1; goto out_unlock; } - /* Migrate the THP to the requested node */ + /* + * Migrate the THP to the requested node, returns with page unlocked + * and pmd_numa cleared. + */ spin_unlock(&mm->page_table_lock); migrated = migrate_misplaced_transhuge_page(mm, vma, pmdp, pmd, addr, page, target_nid); if (migrated) page_nid = target_nid; - else - goto check_same; goto out; - -check_same: - spin_lock(&mm->page_table_lock); - if (unlikely(!pmd_same(pmd, *pmdp))) { - /* Someone else took our fault */ - page_nid = -1; - goto out_unlock; - } clear_pmdnuma: + BUG_ON(!PageLocked(page)); pmd = pmd_mknonnuma(pmd); set_pmd_at(mm, haddr, pmdp, pmd); VM_BUG_ON(pmd_numa(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); + unlock_page(page); out_unlock: spin_unlock(&mm->page_table_lock); diff --git a/mm/migrate.c b/mm/migrate.c index 7a7325ee1d08..c04692774e88 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1715,12 +1715,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, unlock_page(new_page); put_page(new_page); /* Free it */ - unlock_page(page); + /* Retake the callers reference and putback on LRU */ + get_page(page); putback_lru_page(page); - - count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR); - isolated = 0; - goto out; + mod_zone_page_state(page_zone(page), + NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR); + goto out_fail; } /* @@ -1737,9 +1737,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); entry = pmd_mkhuge(entry); - page_add_new_anon_rmap(new_page, vma, haddr); - + pmdp_clear_flush(vma, haddr, pmd); set_pmd_at(mm, haddr, pmd, entry); + page_add_new_anon_rmap(new_page, vma, haddr); update_mmu_cache_pmd(vma, address, &entry); page_remove_rmap(page); /* @@ -1758,7 +1758,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR); count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR); -out: mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR); @@ -1767,6 +1766,10 @@ out: out_fail: count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR); out_dropref: + entry = pmd_mknonnuma(entry); + set_pmd_at(mm, haddr, pmd, entry); + update_mmu_cache_pmd(vma, address, &entry); + unlock_page(page); put_page(page); return 0; -- cgit v1.2.3 From 0255d491848032f6c601b6410c3b8ebded3a37b1 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:28:47 +0100 Subject: mm: Account for a THP NUMA hinting update as one PTE update A THP PMD update is accounted for as 512 pages updated in vmstat. This is large difference when estimating the cost of automatic NUMA balancing and can be misleading when comparing results that had collapsed versus split THP. This patch addresses the accounting issue. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Cc: Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- mm/mprotect.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/mprotect.c b/mm/mprotect.c index a3af058f68e4..412ba2b7326a 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -148,7 +148,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else if (change_huge_pmd(vma, pmd, addr, newprot, prot_numa)) { - pages += HPAGE_PMD_NR; + pages++; continue; } /* fall through */ -- cgit v1.2.3 From c56b097af26cb11c1f49a4311ba538c825666fed Mon Sep 17 00:00:00 2001 From: Russell King Date: Wed, 30 Oct 2013 14:16:16 +0000 Subject: mm: list_lru: fix almost infinite loop causing effective livelock I've seen a fair number of issues with kswapd and other processes appearing to get stuck in v3.12-rc. Using sysrq-p many times seems to indicate that it gets stuck somewhere in list_lru_walk_node(), called from prune_icache_sb() and super_cache_scan(). I never seem to be able to trigger a calltrace for functions above that point. So I decided to add the following to super_cache_scan(): @@ -81,10 +81,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink, inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid); dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid); total_objects = dentries + inodes + fs_objects + 1; +printk("%s:%u: %s: dentries %lu inodes %lu total %lu\n", current->comm, current->pid, __func__, dentries, inodes, total_objects); /* proportion the scan between the caches */ dentries = mult_frac(sc->nr_to_scan, dentries, total_objects); inodes = mult_frac(sc->nr_to_scan, inodes, total_objects); +printk("%s:%u: %s: dentries %lu inodes %lu\n", current->comm, current->pid, __func__, dentries, inodes); +BUG_ON(dentries == 0); +BUG_ON(inodes == 0); /* * prune the dcache first as the icache is pinned by it, then @@ -99,7 +103,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink, freed += sb->s_op->free_cached_objects(sb, fs_objects, sc->nid); } - +printk("%s:%u: %s: dentries %lu inodes %lu freed %lu\n", current->comm, current->pid, __func__, dentries, inodes, freed); drop_super(sb); return freed; } and shortly thereafter, having applied some pressure, I got this: update-apt-xapi:1616: super_cache_scan: dentries 25632 inodes 2 total 25635 update-apt-xapi:1616: super_cache_scan: dentries 1023 inodes 0 ------------[ cut here ]------------ Kernel BUG at c0101994 [verbose debug info unavailable] Internal error: Oops - BUG: 0 [#3] SMP ARM Modules linked in: fuse rfcomm bnep bluetooth hid_cypress CPU: 0 PID: 1616 Comm: update-apt-xapi Tainted: G D 3.12.0-rc7+ #154 task: daea1200 ti: c3bf8000 task.ti: c3bf8000 PC is at super_cache_scan+0x1c0/0x278 LR is at trace_hardirqs_on+0x14/0x18 Process update-apt-xapi (pid: 1616, stack limit = 0xc3bf8240) ... Backtrace: (super_cache_scan) from [] (shrink_slab+0x254/0x4c8) (shrink_slab) from [] (try_to_free_pages+0x3a0/0x5e0) (try_to_free_pages) from [] (__alloc_pages_nodemask+0x5) (__alloc_pages_nodemask) from [] (__pte_alloc+0x2c/0x13) (__pte_alloc) from [] (handle_mm_fault+0x84c/0x914) (handle_mm_fault) from [] (do_page_fault+0x1f0/0x3bc) (do_page_fault) from [] (do_translation_fault+0xac/0xb8) (do_translation_fault) from [] (do_DataAbort+0x38/0xa0) (do_DataAbort) from [] (__dabt_usr+0x38/0x40) Notice that we had a very low number of inodes, which were reduced to zero my mult_frac(). Now, prune_icache_sb() calls list_lru_walk_node() passing that number of inodes (0) into that as the number of objects to scan: long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan, int nid) { LIST_HEAD(freeable); long freed; freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate, &freeable, &nr_to_scan); which does: unsigned long list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk) { struct list_lru_node *nlru = &lru->node[nid]; struct list_head *item, *n; unsigned long isolated = 0; spin_lock(&nlru->lock); restart: list_for_each_safe(item, n, &nlru->list) { enum lru_status ret; /* * decrement nr_to_walk first so that we don't livelock if we * get stuck on large numbesr of LRU_RETRY items */ if (--(*nr_to_walk) == 0) break; So, if *nr_to_walk was zero when this function was entered, that means we're wanting to operate on (~0UL)+1 objects - which might as well be infinite. Clearly this is not correct behaviour. If we think about the behaviour of this function when *nr_to_walk is 1, then clearly it's wrong - we decrement first and then test for zero - which results in us doing nothing at all. A post-decrement would give the desired behaviour - we'd try to walk one object and one object only if *nr_to_walk were one. It also gives the correct behaviour for zero - we exit at this point. Fixes: 5cedf721a7cd ("list_lru: fix broken LRU_RETRY behaviour") Signed-off-by: Russell King Cc: Dave Chinner Cc: Al Viro Cc: Andrew Morton [ Modified to make sure we never underflow the count: this function gets called in a loop, so the 0 -> ~0ul transition is dangerous - Linus ] Signed-off-by: Linus Torvalds --- mm/list_lru.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/list_lru.c b/mm/list_lru.c index 72467914b856..72f9decb0104 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -81,8 +81,9 @@ restart: * decrement nr_to_walk first so that we don't livelock if we * get stuck on large numbesr of LRU_RETRY items */ - if (--(*nr_to_walk) == 0) + if (!*nr_to_walk) break; + --*nr_to_walk; ret = isolate(item, &nlru->lock, cb_arg); switch (ret) { -- cgit v1.2.3 From 3017f079efd6af199b0852b5c425364513db460e Mon Sep 17 00:00:00 2001 From: Chen LinX Date: Wed, 30 Oct 2013 13:56:18 -0700 Subject: mm/pagewalk.c: fix walk_page_range() access of wrong PTEs When walk_page_range walk a memory map's page tables, it'll skip VM_PFNMAP area, then variable 'next' will to assign to vma->vm_end, it maybe larger than 'end'. In next loop, 'addr' will be larger than 'next'. Then in /proc/XXXX/pagemap file reading procedure, the 'addr' will growing forever in pagemap_pte_range, pte_to_pagemap_entry will access the wrong pte. BUG: Bad page map in process procrank pte:8437526f pmd:785de067 addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping: (null) index:9108d CPU: 1 PID: 4974 Comm: procrank Tainted: G B W O 3.10.1+ #1 Call Trace: dump_stack+0x16/0x18 print_bad_pte+0x114/0x1b0 vm_normal_page+0x56/0x60 pagemap_pte_range+0x17a/0x1d0 walk_page_range+0x19e/0x2c0 pagemap_read+0x16e/0x200 vfs_read+0x84/0x150 SyS_read+0x4a/0x80 syscall_call+0x7/0xb Signed-off-by: Liu ShuoX Signed-off-by: Chen LinX Acked-by: Kirill A. Shutemov Reviewed-by: Naoya Horiguchi Cc: [3.10.x+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/pagewalk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 5da2cbcfdbb5..2beeabf502c5 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -242,7 +242,7 @@ int walk_page_range(unsigned long addr, unsigned long end, if (err) break; pgd++; - } while (addr = next, addr != end); + } while (addr = next, addr < end); return err; } -- cgit v1.2.3 From 5e8cfc3c75b3e43497389896c0ecda62fc311ce9 Mon Sep 17 00:00:00 2001 From: Greg Thelen Date: Wed, 30 Oct 2013 13:56:21 -0700 Subject: memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting As of commit 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting") memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memcg.move_charge_at_immigrate=1, or memcg deletion. An example showing error after memory.force_empty: $ cd /sys/fs/cgroup/memory $ mkdir x $ rm /data/tmp/file $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) & [1] 13600 $ grep ^mapped x/memory.stat mapped_file 1048576 $ echo 13600 > tasks $ echo 1 > x/memory.force_empty $ grep ^mapped x/memory.stat mapped_file 4503599627370496 mapped_file should end with 0. 4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages 1048576 == 0x10,0000 == 0x100 pages This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters are soon disappearing with the entire memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg. The problem is due to memcg use of __this_cpu_from(.., -nr_pages), which is subtly wrong because it subtracts the unsigned int nr_pages (either -1 or -512 for THP) from a signed long percpu counter. When nr_pages=-1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to: long count = 1 unsigned int nr_pages = 1 count += -nr_pages /* -nr_pages == 0xffff,ffff */ count is now 0x1,0000,0000 instead of 0 The fix is to subtract the unsigned page count rather than adding its negation. This only works once "percpu: fix this_cpu_sub() subtrahend casting for unsigneds" is applied to fix this_cpu_sub(). Signed-off-by: Greg Thelen Acked-by: Tejun Heo Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 34d3ca9572d6..497ec33ff22d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3774,7 +3774,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, /* Update stat data for mem_cgroup */ preempt_disable(); WARN_ON_ONCE(from->stat->count[idx] < nr_pages); - __this_cpu_add(from->stat->count[idx], -nr_pages); + __this_cpu_sub(from->stat->count[idx], nr_pages); __this_cpu_add(to->stat->count[idx], nr_pages); preempt_enable(); } -- cgit v1.2.3