summaryrefslogtreecommitdiffstats
path: root/mm/page_alloc.c
Commit message (Collapse)AuthorAgeFilesLines
* cpusets: update mems allowed in page allocatorDavid Rientjes2008-11-121-0/+4
| | | | | | | | | | | | | | | | | | | | | | If all allowable memory is unreclaimable, it is possible to loop forever in the page allocator for ~__GFP_NORETRY allocations. During this time, it is also possible for a task's cpuset to expand its set of allowable nodes so that it now includes free memory. The cached copy of this set, current->mems_allowed, is stale, however, since there has not been a subsequent call to cpuset_update_task_memory_state(). The cached copy of the set of allowable nodes is now updated in the page allocator's slow path so the additional memory is available to get_page_from_freelist(). [akpm@linux-foundation.org: add comment] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: pull gigantic page initialisation out of the default pathAndy Whitcroft2008-11-061-7/+21
| | | | | | | | | | | | | | | | | | As we can determine exactly when a gigantic page is in use we can optimise the common regular page cases by pulling out gigantic page initialisation into its own function. As gigantic pages are never released to buddy we do not need a destructor. This effectivly reverts the previous change to the main buddy allocator. It also adds a paranoid check to ensure we never release gigantic pages from hugetlbfs to the main buddy. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: allocate all page_cgroup at bootKAMEZAWA Hiroyuki2008-10-201-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocate all page_cgroup at boot and remove page_cgroup poitner from struct page. This patch adds an interface as struct page_cgroup *lookup_page_cgroup(struct page*) All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported. Remove page_cgroup pointer reduces the amount of memory by - 4 bytes per PAGE_SIZE. - 8 bytes per PAGE_SIZE if memory controller is disabled. (even if configured.) On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory. On my x86-64 server with 48GB of memory, this saves 96MB of memory. I think this reduction makes sense. By pre-allocation, kmalloc/kfree in charge/uncharge are removed. This means - we're not necessary to be afraid of kmalloc faiulre. (this can happen because of gfp_mask type.) - we can avoid calling kmalloc/kfree. - we can avoid allocating tons of small objects which can be fragmented. - we can know what amount of memory will be used for this extra-lru handling. I added printk message as "allocated %ld bytes of page_cgroup" "please try cgroup_disable=memory option if you don't want" maybe enough informative for users. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* setup_per_zone_pages_min(): take zone->lock instead of zone->lru_lockGerald Schaefer2008-10-201-2/+2
| | | | | | | | | | | | | This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock. There seems to be no need for the lru_lock anymore, but there is a need for zone->lock instead, because that function may call move_freepages() via setup_zone_migrate_reserve(). Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: print out meminit for memmapYinghai Lu2008-10-201-4/+3
| | | | | | | | | Improve debuggability of memory setup problems. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mlock: count attempts to free mlocked pageLee Schermerhorn2008-10-201-0/+1
| | | | | | | | | | | | | Allow free of mlock()ed pages. This shouldn't happen, but during developement, it occasionally did. This patch allows us to survive that condition, while keeping the statistics and events correct for debug. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mlock: mlocked pages are unevictableNick Piggin2008-10-201-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure that mlocked pages also live on the unevictable LRU, so kswapd will not scan them over and over again. This is achieved through various strategies: 1) add yet another page flag--PG_mlocked--to indicate that the page is locked for efficient testing in vmscan and, optionally, fault path. This allows early culling of unevictable pages, preventing them from getting to page_referenced()/try_to_unmap(). Also allows separate accounting of mlock'd pages, as Nick's original patch did. Note: Nick's original mlock patch used a PG_mlocked flag. I had removed this in favor of the PG_unevictable flag + an mlock_count [new page struct member]. I restored the PG_mlocked flag to eliminate the new count field. 2) add the mlock/unevictable infrastructure to mm/mlock.c, with internal APIs in mm/internal.h. This is a rework of Nick's original patch to these files, taking into account that mlocked pages are now kept on unevictable LRU list. 3) update vmscan.c:page_evictable() to check PageMlocked() and, if vma passed in, the vm_flags. Note that the vma will only be passed in for new pages in the fault path; and then only if the "cull unevictable pages in fault path" patch is included. 4) add try_to_unlock() to rmap.c to walk a page's rmap and ClearPageMlocked() if no other vmas have it mlocked. Reuses as much of try_to_unmap() as possible. This effectively replaces the use of one of the lru list links as an mlock count. If this mechanism let's pages in mlocked vmas leak through w/o PG_mlocked set [I don't know that it does], we should catch them later in try_to_unmap(). One hopes this will be rare, as it will be relatively expensive. Original mm/internal.h, mm/rmap.c and mm/mlock.c changes: Signed-off-by: Nick Piggin <npiggin@suse.de> splitlru: introduce __get_user_pages(): New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS. because current get_user_pages() can't grab PROT_NONE pages theresore it cause PROT_NONE pages can't munlock. [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch] [akpm@linux-foundation.org: untangle patch interdependencies] [akpm@linux-foundation.org: fix things after out-of-order merging] [hugh@veritas.com: fix page-flags mess] [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm'] [kosaki.motohiro@jp.fujitsu.com: build fix] [kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments] [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Unevictable LRU Page StatisticsLee Schermerhorn2008-10-201-2/+16
| | | | | | | | | | | | | | | | Report unevictable pages per zone and system wide. Kosaki Motohiro added support for memory controller unevictable statistics. [riel@redhat.com: fix printk in show_free_areas()] [akpm@linux-foundation.org: fix units in /proc/vmstats] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: second chance replacement for anonymous pagesRik van Riel2008-10-201-0/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We avoid evicting and scanning anonymous pages for the most part, but under some workloads we can end up with most of memory filled with anonymous pages. At that point, we suddenly need to clear the referenced bits on all of memory, which can take ages on very large memory systems. We can reduce the maximum number of pages that need to be scanned by not taking the referenced state into account when deactivating an anonymous page. After all, every anonymous page starts out referenced, so why check? If an anonymous page gets referenced again before it reaches the end of the inactive list, we move it back to the active list. To keep the maximum amount of necessary work reasonable, we scale the active to inactive ratio with the size of memory, using the formula active:inactive ratio = sqrt(memory in GB * 10). Kswapd CPU use now seems to scale by the amount of pageout bandwidth, instead of by the amount of memory present in the system. [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg] [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: split LRU lists into anon & file setsRik van Riel2008-10-201-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs. The advantage of doing this is that the VM will not have to scan over lots of anonymous pages (which we generally do not want to swap out), just to find the page cache pages that it should evict. This patch has the infrastructure and a basic policy to balance how much we scan the anon lists and how much we scan the file lists. The big policy changes are in separate patches. [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset] [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru] [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page] [hugh@veritas.com: memcg swapbacked pages active] [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED] [akpm@linux-foundation.org: fix /proc/vmstat units] [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration] [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo] [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* define page_file_cache() functionRik van Riel2008-10-201-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Define page_file_cache() function to answer the question: is page backed by a file? Originally part of Rik van Riel's split-lru patch. Extracted to make available for other, independent reclaim patches. Moved inline function to linux/mm_inline.h where it will be needed by subsequent "split LRU" and "noreclaim" patches. Unfortunately this needs to use a page flag, since the PG_swapbacked state needs to be preserved all the way to the point where the page is last removed from the LRU. Trying to derive the status from other info in the page resulted in wrong VM statistics in earlier split VM patchsets. The total number of page flags in use on a 32 bit machine after this patch is 19. [akpm@linux-foundation.org: fix up out-of-order merge fallout] [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[ Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: MinChan Kim <minchan.kim@gmail.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: Use an indexed array for LRU variablesChristoph Lameter2008-10-201-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we are defining explicit variables for the inactive and active list. An indexed array can be more generic and avoid repeating similar code in several places in the reclaim code. We are saving a few bytes in terms of code size: Before: text data bss dec hex filename 4097753 573120 4092484 8763357 85b7dd vmlinux After: text data bss dec hex filename 4097729 573120 4092484 8763333 85b7c5 vmlinux Having an easy way to add new lru lists may ease future work on the reclaim code. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc.c:free_area_init_nodes() fix inappropriate use of enumAndrew Morton2008-10-161-1/+1
| | | | | | | | | | | | Local variable `i' is a) misleadingly-named for an `enum zone_type' and b) used for indexing zones as well as nodes as well as node_maps. Make it an `int'. Reported-by: Frans Pop <elendil@planet.nl> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: handle initialising compound pages at orders greater than MAX_ORDERAndy Whitcroft2008-10-021-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we initialise a compound page we initialise the page flags and head page pointer for all base pages spanned by that page. When we initialise a gigantic page (a page of order greater than or equal to MAX_ORDER) we have to initialise more than MAX_ORDER_NR_PAGES pages. Currently we assume that all elements of the mem_map in this page are contigious in memory. However this is only guarenteed out to MAX_ORDER_NR_PAGES pages, and with SPARSEMEM enabled they will not be contigious. This leads us to walk off the end of the first section and scribble on everything which follows, BAD. When we reach a MAX_ORDER_NR_PAGES boundary we much locate the next section of the mem_map. As gigantic pages can only be maximally aligned we know this will occur at exact multiple of MAX_ORDER_NR_PAGES pages from the start of the page. This is a bug fix for the gigantic page support in hugetlbfs. Credit to Mel Gorman for spotting the issue. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/bootmem: silence section mismatch warning - ↵Marcin Slusarz2008-09-021-1/+1
| | | | | | | | | | | | | | | | | | contig_page_data/bootmem_node_data WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data The variable contig_page_data references the variable __initdata bootmem_node_data If the reference is valid then annotate the variable with __init* (see linux/init.h) or name the variable: *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console, Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: Sean MacLennan <smaclennan@pikatech.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: make setup_zone_migrate_reserve() aware of overlapping nodesAdam Litke2008-09-021-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | I have gotten to the root cause of the hugetlb badness I reported back on August 15th. My system has the following memory topology (note the overlapping node): Node 0 Memory: 0x8000000-0x44000000 Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000 setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking for a pageblock to move onto the MIGRATE_RESERVE list. Finding no candidates, it happily continues the scan into 0x8000000-0x44000000. When a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on the wrong zone. Oops. setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: use no-panic variant of alloc_bootmem() in ↵Jan Beulich2008-08-121-1/+1
| | | | | | | | | | | | | alloc_large_system_hash() .. since a failed allocation is being (initially) handled gracefully, and panic()-ed upon failure explicitly in the function if retries with smaller sizes failed. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove find_max_pfn_with_active_regionsYinghai Lu2008-07-301-17/+0
| | | | | | | | | | | | It has no user now Also print out info about adding/removing active regions. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* stop_machine: Wean existing callers off stop_machine_run()Rusty Russell2008-07-281-2/+2
| | | | Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* memory hotplug: small fixes to bootmem freeing for memory hotremoveYasunori Goto2008-07-241-1/+1
| | | | | | | | | | | | | | | | | - Change some naming * Magic -> types * MIX_INFO -> MIX_SECTION_INFO * Change definition of bootmem type from direct hex value - __free_pages_bootmem() becomes __meminit. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc.c: cleanupsAdrian Bunk2008-07-241-12/+13
| | | | | | | | | | | | | | | | | | | | This patch contains the following cleanups: - make the following needlessly global variables static: - required_kernelcore - zone_movable_pfn[] - make the following needlessly global functions static: - move_freepages() - move_freepages_block() - setup_pageset() - find_usable_zone_for_movable() - adjust_zone_range_for_zone_movable() - __absent_pages_in_range() - find_min_pfn_for_node() - find_zone_movable_pfns_for_nodes() Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: add alloc_pages_exact() and free_pages_exact()Timur Tabi2008-07-241-0/+53
| | | | | | | | | | | | | | | | | alloc_pages_exact() is similar to alloc_pages(), except that it allocates the minimum number of pages to fulfill the request. This is useful if you want to allocate a very large buffer that is slightly larger than an even power-of-two number of pages. In that case, alloc_pages() will waste a lot of memory. I have a video driver that wants to allocate a 5MB buffer. alloc_pages() wiill waste 3MB of physically-contiguous memory. Signed-off-by: Timur Tabi <timur@freescale.com> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: export prep_compound_page to mmAndi Kleen2008-07-241-1/+1
| | | | | | | | | | | | hugetlb will need to get compound pages from bootmem to handle the case of them being greater than or equal to MAX_ORDER. Export the constructor function needed for this. Acked-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: drop unneeded pgdat argument from free_area_init_node()Johannes Weiner2008-07-241-5/+6
| | | | | | | | | | | | | free_area_init_node() gets passed in the node id as well as the node descriptor. This is redundant as the function can trivially get the node descriptor itself by means of NODE_DATA() and the node's id. I checked all the users and NODE_DATA() seems to be usable everywhere from where this function is called. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* buddy: clarify comments describing buddy mergeAndy Whitcroft2008-07-241-1/+2
| | | | | | | | | | | | | | | | | | | | | In __free_one_page(), the comment "Move the buddy up one level" appears attached to the break and by implication when the break is taken we are moving it up one level: if (!page_is_buddy(page, buddy, order)) break; /* Move the buddy up one level. */ In reality the inverse is true, we break out when we can no longer merge this page with its buddy. Looking back into pre-history (into the full git history) it appears that these two lines accidentally got joined as part of another change. Move the comment down where it belongs below the if and clarify its language. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: inline some __alloc_pages() wrappersKOSAKI Motohiro2008-07-241-17/+2
| | | | | | | | | | | | Two zonelist patch series rewrote __page_alloc() largely. Now, it is just a wrapper function. Inlining them will save a function call. [akpm@linux-foundation.org: export __alloc_pages_internal] Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move bootmem descriptors definition to a single placeJohannes Weiner2008-07-241-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | There are a lot of places that define either a single bootmem descriptor or an array of them. Use only one central array with MAX_NUMNODES items instead. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kyle McMartin <kyle@parisc-linux.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: print out the zonelists on request for manual verificationMel Gorman2008-07-241-0/+1
| | | | | | | | | | | | This patch prints out the zonelists during boot for manual verification by the user if the mminit_loglevel is MMINIT_VERIFY or higher. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: make defensive checks around PFN values registered for memory usageMel Gorman2008-07-241-0/+2
| | | | | | | | | | | | | | | | | | | | There are a number of different views to how much memory is currently active. There is the arch-independent zone-sizing view, the bootmem allocator and memory models view. Architectures register this information at different times and is not necessarily in sync particularly with respect to some SPARSEMEM limitations. This patch introduces mminit_validate_memmodel_limits() which is able to validate and correct PFN ranges with respect to the memory model. It is only SPARSEMEM that currently validates itself. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: verify the page links and memory modelMel Gorman2008-07-241-0/+8
| | | | | | | | | | | | | | | | | | Print out information on how the page flags are being used if mminit_loglevel is MMINIT_VERIFY or higher and unconditionally performs sanity checks on the flags regardless of loglevel. When the page flags are updated with section, node and zone information, a check are made to ensure the values can be retrieved correctly. Finally we confirm that pfn_to_page and page_to_pfn are the correct inverse functions. [akpm@linux-foundation.org: fix printk warnings] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: add a basic debugging framework for memory initialisationMel Gorman2008-07-241-9/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Boot initialisation is very complex, with significant numbers of architecture-specific routines, hooks and code ordering. While significant amounts of the initialisation is architecture-independent, it trusts the data received from the architecture layer. This is a mistake, and has resulted in a number of difficult-to-diagnose bugs. This patchset adds some validation and tracing to memory initialisation. It also introduces a few basic defensive measures. The validation code can be explicitly disabled for embedded systems. This patch: Add additional debugging and verification code for memory initialisation. Once enabled, the verification checks are always run and when required additional debugging information may be outputted via a mminit_loglevel= command-line parameter. The verification code is placed in a new file mm/mm_init.c. Ideally other mm initialisation code will be moved here over time. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'generic-ipi' into generic-ipi-for-linusIngo Molnar2008-07-151-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/powerpc/Kconfig arch/s390/kernel/time.c arch/x86/kernel/apic_32.c arch/x86/kernel/cpu/perfctr-watchdog.c arch/x86/kernel/i8259_64.c arch/x86/kernel/ldt.c arch/x86/kernel/nmi_64.c arch/x86/kernel/smpboot.c arch/x86/xen/smp.c include/asm-x86/hw_irq_32.h include/asm-x86/hw_irq_64.h include/asm-x86/mach-default/irq_vectors.h include/asm-x86/mach-voyager/irq_vectors.h include/asm-x86/smp.h kernel/Makefile Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * on_each_cpu(): kill unused 'retry' parameterJens Axboe2008-06-261-1/+1
| | | | | | | | | | | | | | | | | | It's not even passed on to smp_call_function() anymore, since that was removed. So kill it. Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | mm, generic, x86 boot: more tweaks to hex prints of some pfn addressesPaul Jackson2008-07-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix some problems with (and applies on top of) a previous patch: x86 boot: show pfn addresses in hex not decimal in some kernel info printks Primarily change "0x%8lx" format, which displays with a right aligned space filled hex number (spaces between the "0x" prefix and the number), into "%0#10lx" format, which zero fills instead of space fills, and which uses the printf flag '#' to request the "0x" prefix instead of hard coding it. Also replace some other "0x%lx" formats with "%#lx", making use of the '#' printf flag again. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: "Yinghai Lu" <yhlu.kernel@gmail.com> Cc: "Jack Steiner" <steiner@sgi.com> Cc: "Mike Travis" <travis@sgi.com> Cc: "Huang Cc: Ying" <ying.huang@intel.com> Cc: "Andi Kleen" <andi@firstfloor.org> Cc: "Andrew Morton" <akpm@linux-foundation.org> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | x86 boot: more consistently use type int for node idsPaul Jackson2008-07-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Everywhere I look, node id's are of type 'int', except in this one case, which has 'unsigned long'. Change this one to 'int' as well. There is nothing special about the way this variable 'nid' is used in this routine to justify using an unusual type here. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: "Yinghai Lu" <yhlu.kernel@gmail.com> Cc: "Jack Steiner" <steiner@sgi.com> Cc: "Mike Travis" <travis@sgi.com> Cc: "Huang Cc: Ying" <ying.huang@intel.com> Cc: "Andi Kleen" <andi@firstfloor.org> Cc: "Andrew Morton" <akpm@linux-foundation.org> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | x86 boot: show pfn addresses in hex not decimal in some kernel info printksPaul Jackson2008-07-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Page frame numbers (the portion of physical addresses above the low order page offsets) are displayed in several kernel debug and info prints in decimal, not hex. Decimal addresse are unreadable. Use hex. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: "Yinghai Lu" <yhlu.kernel@gmail.com> Cc: "Jack Steiner" <steiner@sgi.com> Cc: "Mike Travis" <travis@sgi.com> Cc: "Huang Cc: Ying" <ying.huang@intel.com> Cc: "Andi Kleen" <andi@firstfloor.org> Cc: "Andrew Morton" <akpm@linux-foundation.org> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | RFC x86: try to remove arch_get_ram_rangeYinghai Lu2008-07-081-3/+7
| | | | | | | | | | | | | | | | want to remove arch_get_ram_range, and use early_node_map instead. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | x86, mm: use add_highpages_with_active_regions() for high pages init v2Yinghai Lu2008-07-081-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | use early_node_map to init high pages, so we can remove page_is_ram() and page_is_reserved_early() in the big loop with add_one_highpage also remove page_is_reserved_early(), it is not needed anymore. v2: fix the build of other platforms Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | x86: replace shrink_active_range() with remove_active_range()Yinghai Lu2008-07-081-6/+23
| | | | | | | | | | | | | | | | | | | | | | in case we have kva before ramdisk on a node, we still need to use those ranges. v2: reserve_early kva ram area, in case there are holes in highmem, to avoid those area could be treat as free high pages. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | Merge branch 'linus' into tmp.x86.mpparse.newIngo Molnar2008-07-081-34/+9
|\ \
| * | Do not overwrite nr_zones on !NUMA when initialising zlcache_ptrMel Gorman2008-07-031-1/+0
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The non-NUMA case of build_zonelist_cache() would initialize the zlcache_ptr for both node_zonelists[] to NULL. Which is problematic, since non-NUMA only has a single node_zonelists[] entry, and trying to zero the non-existent second one just overwrote the nr_zones field instead. As kswapd uses this value to determine what reclaim work is necessary, the result is that kswapd never reclaims. This causes processes to stall frequently in low-memory situations as they always direct reclaim. This patch initialises zlcache_ptr correctly. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Dan Williams <dan.j.williams@intel.com> [ Simplified patch a bit ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: Minor clean-up of page flags in mm/page_alloc.cRuss Anderson2008-06-091-31/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Minor source code cleanup of page flags in mm/page_alloc.c. Move the definition of the groups of bits to page-flags.h. The purpose of this clean up is that the next patch will conditionally add a page flag to the groups. Doing that in a header file is cleaner than adding #ifdefs to the C code. Signed-off-by: Russ Anderson <rja@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * memory hotplug: fix early allocation handlingHeiko Carstens2008-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Trying to add memory via add_memory() from within an initcall function results in bootmem alloc of 163840 bytes failed! Kernel panic - not syncing: Out of memory This is caused by zone_wait_table_init() which uses system_state to decide if it should use the bootmem allocator or not. When initcalls are handled the system_state is still SYSTEM_BOOTING but the bootmem allocator doesn't work anymore. So the allocation will fail. To fix this use slab_is_available() instead as indicator like we do it everywhere else. [akpm@linux-foundation.org: coding-style fix] Reviewed-by: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * zonelists: handle a node zonelist with no applicable entriesAndy Whitcroft2008-05-241-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing panics when trying node local allocations: BUG: unable to handle kernel NULL pointer dereference at 0000034c IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e *pdpt = 00000000013a7001 *pde = 0000000000000000 Oops: 0000 [#1] SMP Modules linked in: Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82) EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0 EIP is at get_page_from_freelist+0x4a/0x18e EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000 ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000) Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0 f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000 00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0 Call Trace: [<c10426e4>] __alloc_pages_internal+0x99/0x378 [<c10429ca>] __alloc_pages+0x7/0x9 [<c105e0e8>] kmem_getpages+0x66/0xef [<c105ec55>] cache_grow+0x8f/0x123 [<c105f117>] ____cache_alloc_node+0xb9/0xe4 [<c105f427>] kmem_cache_alloc_node+0x92/0xd2 [<c122118c>] setup_cpu_cache+0xaf/0x177 [<c105e6ca>] kmem_cache_create+0x2c8/0x353 [<c13853af>] kmem_cache_init+0x1ce/0x3ad [<c13755c5>] start_kernel+0x178/0x1ee This occurs when we are scanning the zonelists looking for a ZONE_NORMAL page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on node 0, all other nodes are mapped above 4GB physical. Here is a dump of the zonelists from this system: zonelists pgdat=c1400000 0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0 1: c14006c0:2 c1400360:1 c1400000:0 zonelists pgdat=f7c00000 0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0 1: f7c006c0:2 zonelists pgdat=f7e00000 0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0 1: f7e006c0:2 When performing a node local allocation we call get_page_from_freelist() looking for a page. It in turn calls first_zones_zonelist() which returns a preferred_zone. Where there are no applicable zones this will be NULL. However we use this unconditionally, leading to this panic. Where there are no applicable zones there is no possibility of a successful allocation, so simply fail the allocation. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: don't drop a partial page in a zone's memory map sizeJohannes Weiner2008-05-241-1/+2
| | | | | | | | | | | | | | | | | | | | In a zone's present pages number, account for all pages occupied by the memory map, including a partial. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, x86: shrink_active_range() should check allYinghai Lu2008-06-101-10/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we are using register_e820_active_regions() instead of add_active_range() directly. So end_pfn could be different between the value in early_node_map to node_end_pfn. So we need to make shrink_active_range() smarter. shrink_active_range() is a generic MM function in mm/page_alloc.c but it is only used on 32-bit x86. Should we move it back to some file in arch/x86? Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | x86, numa, 32-bit: print out debug info on all kvasYinghai Lu2008-06-031-0/+5
|/ | | | | | | also fix the print out of node_remap_end_vaddr Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* memory_hotplug: always initialize pageblock bitmapHeiko Carstens2008-05-141-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Trying to online a new memory section that was added via memory hotplug sometimes results in crashes when the new pages are added via __free_page. Reason for that is that the pageblock bitmap isn't initialized and hence contains random stuff. That means that get_pageblock_migratetype() returns also random stuff and therefore list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); in __free_one_page() tries to do a list_add to something that isn't even necessarily a list. This happens since 86051ca5eaf5e560113ec7673462804c54284456 ("mm: fix usemap initialization") which makes sure that the pageblock bitmap gets only initialized for pages present in a zone. Unfortunately for hot-added memory the zones "grow" after the memmap and the pageblock memmap have been initialized. Which means that the new pages have an unitialized bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span() are moved to __add_zone() just before the initialization happens. The patch also moves the two functions since __add_zone() is the only caller and I didn't want to add a forward declaration. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* infrastructure to debug (dynamic) objectsThomas Gleixner2008-04-301-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: smarter retry of costly-order allocationsNishanth Aravamudan2008-04-291-5/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because of page order checks in __alloc_pages(), hugepage (and similarly large order) allocations will not retry unless explicitly marked __GFP_REPEAT. However, the current retry logic is nearly an infinite loop (or until reclaim does no progress whatsoever). For these costly allocations, that seems like overkill and could potentially never terminate. Mel observed that allowing current __GFP_REPEAT semantics for hugepage allocations essentially killed the system. I believe this is because we may continue to reclaim small orders of pages all over, but never have enough to satisfy the hugepage allocation request. This is clearly only a problem for large order allocations, of which hugepages are the most obvious (to me). Modify try_to_free_pages() to indicate how many pages were reclaimed. Use that information in __alloc_pages() to eventually fail a large __GFP_REPEAT allocation when we've reclaimed an order of pages equal to or greater than the allocation's order. This relies on lumpy reclaim functioning as advertised. Due to fragmentation, lumpy reclaim may not be able to free up the order needed in one invocation, so multiple iterations may be requred. In other words, the more fragmented memory is, the more retry attempts __GFP_REPEAT will make (particularly for higher order allocations). This changes the semantics of __GFP_REPEAT subtly, but *only* for allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size allocations, we will try up to some point (at least 1<<order reclaimed pages), rather than forever (which is the case for allocations <= PAGE_ALLOC_COSTLY_ORDER). This change improves the /proc/sys/vm/nr_hugepages interface with a follow-on patch that makes pool allocations use __GFP_REPEAT. Rather than administrators repeatedly echo'ing a particular value into the sysctl, and forcing reclaim into action manually, this change allows for the sysctl to attempt a reasonable effort itself. Similarly, dynamic pool growth should be more successful under load, as lumpy reclaim can try to free up pages, rather than failing right away. Choosing to reclaim only up to the order of the requested allocation strikes a balance between not failing hugepage allocations and returning to the caller when it's unlikely to every succeed. Because of lumpy reclaim, if we have freed the order requested, hopefully it has been in big chunks and those chunks will allow our allocation to succeed. If that isn't the case after freeing up the current order, I don't think it is likely to succeed in the future, although it is possible given a particular fragmentation pattern. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Tested-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>