summaryrefslogtreecommitdiffstats
path: root/mm/nobootmem.c
Commit message (Collapse)AuthorAgeFilesLines
* mem-hotplug: reset node managed pages when hot-adding a new pgdatTang Chen2014-11-131-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In free_area_init_core(), zone->managed_pages is set to an approximate value for lowmem, and will be adjusted when the bootmem allocator frees pages into the buddy system. But free_area_init_core() is also called by hotadd_new_pgdat() when hot-adding memory. As a result, zone->managed_pages of the newly added node's pgdat is set to an approximate value in the very beginning. Even if the memory on that node has node been onlined, /sys/device/system/node/nodeXXX/meminfo has wrong value: hot-add node2 (memory not onlined) cat /sys/device/system/node/node2/meminfo Node 2 MemTotal: 33554432 kB Node 2 MemFree: 0 kB Node 2 MemUsed: 33554432 kB Node 2 Active: 0 kB This patch fixes this problem by reset node managed pages to 0 after hot-adding a new node. 1. Move reset_managed_pages_done from reset_node_managed_pages() to reset_all_zones_managed_pages() 2. Make reset_node_managed_pages() non-static 3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat is initialized Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [3.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mem-hotplug: let memblock skip the hotpluggable memory regions in ↵Xishi Qiu2014-09-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __next_mem_range() Let memblock skip the hotpluggable memory regions in __next_mem_range(), it is used to to prevent memblock from allocating hotpluggable memory for the kernel at early time. The code is the same as __next_mem_range_rev(). Clear hotpluggable flag before releasing free pages to the buddy allocator. If we don't clear hotpluggable flag in free_low_memory_core_early(), the memory which marked hotpluggable flag will not free to buddy allocator. Because __next_mem_range() will skip them. free_low_memory_core_early for_each_free_mem_range for_each_mem_range __next_mem_range [akpm@linux-foundation.org: fix warning] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock.c: call kmemleak directly from memblock_(alloc|free)Catalin Marinas2014-06-061-2/+0
| | | | | | | | | | | | | | Kmemleak could ignore memory blocks allocated via memblock_alloc() leading to false positives during scanning. This patch adds the corresponding callbacks and removes kmemleak_free_* calls in mm/nobootmem.c to avoid duplication. The kmemleak_alloc() in mm/nobootmem.c is kept since __alloc_memory_core_early() does not use memblock_alloc() directly. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/nobootmem.c: mark function as staticRashika Kheria2014-04-031-1/+1
| | | | | | | | | | | | | | | Mark function as static in nobootmem.c because it is not used outside this file. This eliminates the following warning in mm/nobootmem.c: mm/nobootmem.c:324:15: warning: no previous prototype for `___alloc_bootmem_node' [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/nobootmem: free_all_bootmem againPhilipp Hachtmann2014-01-231-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | get_allocated_memblock_reserved_regions_info() should work if it is compiled in. Extended the ifdef around get_allocated_memblock_memory_regions_info() to include get_allocated_memblock_reserved_regions_info() as well. Similar changes in nobootmem.c/free_low_memory_core_early() where the two functions are called. [akpm@linux-foundation.org: cleanup] Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Cc: qiuxishi <qiuxishi@huawei.com> Cc: David Howells <dhowells@redhat.com> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: Jiang Liu <liuj97@gmail.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: free memblock.memory in free_all_bootmemPhilipp Hachtmann2014-01-231-1/+9
| | | | | | | | | | | | | | | | | | When calling free_all_bootmem() the free areas under memblock's control are released to the buddy allocator. Additionally the reserved list is freed if it was reallocated by memblock. The same should apply for the memory list. Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/nobootmem.c: add return value check in __alloc_memory_core_early()Philipp Hachtmann2014-01-231-1/+3
| | | | | | | | | | | | | | | | | | | When memblock_reserve() fails because memblock.reserved.regions cannot be resized, the caller (e.g. alloc_bootmem()) is not informed of the failed allocation. Therefore alloc_bootmem() silently returns the same pointer again and again. This patch adds a check for the return value of memblock_reserve() in __alloc_memory_core(). Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock: switch to use NUMA_NO_NODE instead of MAX_NUMNODESGrygorii Strashko2014-01-211-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's recommended to use NUMA_NO_NODE everywhere to select "process any node" behavior or to indicate that "no node id specified". Hence, update __next_free_mem_range*() API's to accept both NUMA_NO_NODE and MAX_NUMNODES, but emit warning once on MAX_NUMNODES, and correct corresponding API's documentation to describe new behavior. Also, update other memblock/nobootmem APIs where MAX_NUMNODES is used dirrectly. The change was suggested by Tejun Heo. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Walmsley <paul@pwsan.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Lindgren <tony@atomide.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock: reorder parameters of memblock_find_in_range_nodeGrygorii Strashko2014-01-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Reorder parameters of memblock_find_in_range_node to be consistent with other memblock APIs. The change was suggested by Tejun Heo <tj@kernel.org>. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Walmsley <paul@pwsan.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Lindgren <tony@atomide.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/nobootmem.c: have __free_pages_memory() free in larger chunks.Robin Holt2013-11-131-17/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On large memory machines it can take a few minutes to get through free_all_bootmem(). Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. BITS_PER_LONG was originally chosen to keep things similar to mm/nobootmem.c. But it is more efficient to limit it to MAX_ORDER. base new change 8TB 202s 172s 30s 16TB 401s 351s 50s That is around 1%-3% improvement on total boot time. This patch was spun off from the boot time rfc Robin and I had been working on. Signed-off-by: Robin Holt <robin.m.holt@gmail.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Robin Holt <robinmholt@linux.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mike Travis <travis@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: concentrate modification of totalram_pages into the mm coreJiang Liu2013-07-031-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Concentrate code to modify totalram_pages into the mm core, so the arch memory initialized code doesn't need to take care of it. With these changes applied, only following functions from mm core modify global variable totalram_pages: free_bootmem_late(), free_all_bootmem(), free_all_bootmem_node(), adjust_managed_page_count(). With this patch applied, it will be much more easier for us to keep totalram_pages and zone->managed_pages in consistence. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Acked-by: David Howells <dhowells@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: <sworddragon2@aol.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: accurately calculate zone->managed_pages for highmem zonesJiang Liu2013-07-031-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit "mm: introduce new field 'managed_pages' to struct zone" assumes that all highmem pages will be freed into the buddy system by function mem_init(). But that's not always true, some architectures may reserve some highmem pages during boot. For example PPC may allocate highmem pages for giagant HugeTLB pages, and several architectures have code to check PageReserved flag to exclude highmem pages allocated during boot when freeing highmem pages into the buddy system. So treat highmem pages in the same way as normal pages, that is to: 1) reset zone->managed_pages to zero in mem_init(). 2) recalculate managed_pages when freeing pages into the buddy system. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: <sworddragon2@aol.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, nobootmem: do memset() after memblock_reserve()Joonsoo Kim2013-04-291-1/+1
| | | | | | | | | | | | Currently, we do memset() before reserving the area. This may not cause any problem, but it is somewhat weird. So change execution order. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jiang Liu <liuj97@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, nobootmem: clean-up of free_low_memory_core_early()Joonsoo Kim2013-04-291-2/+2
| | | | | | | | | | | | Remove unused argument and make function static, because there is no user outside of nobootmem.c Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jiang Liu <liuj97@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: Add alloc_bootmem_low_pages_nopanic()Yinghai Lu2013-01-291-0/+8
| | | | | | | | We don't need to panic in some case, like for swiotlb preallocating. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-35-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* Merge remote-tracking branch 'origin/x86/boot' into x86/mm2H. Peter Anvin2013-01-291-0/+21
|\ | | | | | | | | | | | | | | | | | | | | Coming patches to x86/mm2 require the changes and advanced baseline in x86/boot. Resolved Conflicts: arch/x86/kernel/setup.c mm/nobootmem.c Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * mm: introduce new field "managed_pages" to struct zoneJiang Liu2012-12-121-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently a zone's present_pages is calcuated as below, which is inaccurate and may cause trouble to memory hotplug. spanned_pages - absent_pages - memmap_pages - dma_reserve. During fixing bugs caused by inaccurate zone->present_pages, we found zone->present_pages has been abused. The field zone->present_pages may have different meanings in different contexts: 1) pages existing in a zone. 2) pages managed by the buddy system. For more discussions about the issue, please refer to: http://lkml.org/lkml/2012/11/5/866 https://patchwork.kernel.org/patch/1346751/ This patchset tries to introduce a new field named "managed_pages" to struct zone, which counts "pages managed by the buddy system". And revert zone->present_pages to count "physical pages existing in a zone", which also keep in consistence with pgdat->node_present_pages. We will set an initial value for zone->managed_pages in function free_area_init_core() and will adjust it later if the initial value is inaccurate. For DMA/normal zones, the initial value is set to: (spanned_pages - absent_pages - memmap_pages - dma_reserve) Later zone->managed_pages will be adjusted to the accurate value when the bootmem allocator frees all free pages to the buddy system in function free_all_bootmem_node() and free_all_bootmem(). The bootmem allocator doesn't touch highmem pages, so highmem zones' managed_pages is set to the accurate value "spanned_pages - absent_pages" in function free_area_init_core() and won't be updated anymore. This patch also adds a new field "managed_pages" to /proc/zoneinfo and sysrq showmem. [akpm@linux-foundation.org: small comment tweaks] Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: Kill NO_BOOTMEM version free_all_bootmem_node()Yinghai Lu2012-11-171-14/+0
|/ | | | | | | | | | | | | | | Now NO_BOOTMEM version free_all_bootmem_node() does not really do free_bootmem at all, and it only call register_page_bootmem_info_node for online nodes instead. That is confusing. We can kill that free_all_bootmem_node(), after we kill two callings in x86 and sparc. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1353123563-3103-46-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* revert "mm: fix-up zone present pages"Andrew Morton2012-11-161-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages") That patch tried to fix a issue when calculating zone->present_pages, but it caused a regression on 32bit systems with HIGHMEM. With that change, reset_zone_present_pages() resets all zone->present_pages to zero, and fixup_zone_present_pages() is called to recalculate zone->present_pages when the boot allocator frees core memory pages into buddy allocator. Because highmem pages are not freed by bootmem allocator, all highmem zones' present_pages becomes zero. Various options for improving the situation are being discussed but for now, let's return to the 3.6 code. Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix-up zone present pagesJianguo Wu2012-10-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I think zone->present_pages indicates pages that buddy system can management, it should be: zone->present_pages = spanned pages - absent pages - bootmem pages, but is now: zone->present_pages = spanned pages - absent pages - memmap pages. spanned pages: total size, including holes. absent pages: holes. bootmem pages: pages used in system boot, managed by bootmem allocator. memmap pages: pages used by page structs. This may cause zone->present_pages less than it should be. For example, numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now it also minus memmap pages(free_area_init_core), which are actually allocated from ZONE_MOVABLE. When offlining all memory of a zone, this will cause zone->present_pages less than 0, because present_pages is unsigned long type, it is actually a very large integer, it indirectly caused zone->watermark[WMARK_MIN] becomes a large integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a large integer(calculate_totalreserve_pages()), and finally cause memory allocating failure when fork process(__vm_enough_memory()). [root@localhost ~]# dmesg -bash: fork: Cannot allocate memory I think the bug described in http://marc.info/?l=linux-mm&m=134502182714186&w=2 is also caused by wrong zone present pages. This patch intends to fix-up zone->present_pages when memory are freed to buddy system on x86_64 and IA64 platforms. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Reported-by: Petr Tesarik <ptesarik@suse.cz> Tested-by: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock: cleanup early_node_map[] related commentsWanpeng Li2012-10-091-2/+0
| | | | | | | | | | | | | | Commit 0ee332c14518 ("memblock: Kill early_node_map[]") removed early_node_map[]. Clean up the comments to comply with that change. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memblock: free allocated memblock_reserved_regions laterYinghai Lu2012-07-111-15/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memblock_free_reserved_regions() calls memblock_free(), but memblock_free() would double reserved.regions too, so we could free the old range for reserved.regions. Also tj said there is another bug which could be related to this. | I don't think we're saving any noticeable | amount by doing this "free - give it to page allocator - reserve | again" dancing. We should just allocate regions aligned to page | boundaries and free them later when memblock is no longer in use. in that case, when DEBUG_PAGEALLOC, will get panic: memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39 BUG: unable to handle kernel paging request at ffff88102febd948 IP: [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155 PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447 RIP: 0010:[<ffffffff836a5774>] [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155 See the discussion at https://lkml.org/lkml/2012/6/13/469 So try to allocate with PAGE_SIZE alignment and free it later. Reported-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: sparse: fix usemap allocation above node descriptor sectionYinghai Lu2012-07-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint in alloc_bootmem_section"), usemap allocations may easily be placed outside the optimal section that holds the node descriptor, even if there is space available in that section. This results in unnecessary hotplug dependencies that need to have the node unplugged before the section holding the usemap. The reason is that the bootmem allocator doesn't guarantee a linear search starting from the passed allocation goal but may start out at a much higher address absent an upper limit. Fix this by trying the allocation with the limit at the section end, then retry without if that fails. This keeps the fix from f5bf18fa22f8 of not panicking if the allocation does not fit in the section, but still makes sure to try to stay within the section at first. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove sparsemem allocation details from the bootmem allocatorJohannes Weiner2012-05-291-22/+0
| | | | | | | | | | | | | | | | | | | alloc_bootmem_section() derives allocation area constraints from the specified sparsemem section. This is a bit specific for a generic memory allocator like bootmem, though, so move it over to sparsemem. As __alloc_bootmem_node_nopanic() already retries failed allocations with relaxed area constraints, the fallback code in sparsemem.c can be removed and the code becomes a bit more compact overall. [akpm@linux-foundation.org: fix build] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nobootmem: unify allocation policy of (non-)panicking node allocationsJohannes Weiner2012-05-291-52/+54
| | | | | | | | | | | | | | | | | | While the panicking node-specific allocation function tries to satisfy node+goal, goal, node, anywhere, the non-panicking function still does node+goal, goal, anywhere. Make it simpler: define the panicking version in terms of the non-panicking one, like the node-agnostic interface, so they always behave the same way apart from how to deal with allocation failure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nobootmem: panic on node-specific allocation failureJohannes Weiner2012-05-291-4/+16
| | | | | | | | | | | | | __alloc_bootmem_node and __alloc_bootmem_low_node documentation claims the functions panic on allocation failure. Do it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nobootmem: fix sign extend problem in __free_pages_memory()Russ Anderson2012-05-101-2/+1
| | | | | | | | | | | | | | | | | | | | | | | Systems with 8 TBytes of memory or greater can hit a problem where only the the first 8 TB of memory shows up. This is due to "int i" being smaller than "unsigned long start_aligned", causing the high bits to be dropped. The fix is to change `i' to unsigned long to match start_aligned and end_aligned. Thanks to Jack Steiner for assistance tracking this down. Signed-off-by: Russ Anderson <rja@sgi.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nobootmem: Correct alloc_bootmem semantics.David Miller2012-04-251-2/+8
| | | | | | | | | | | | | | | | | | | | | | | The comments above __alloc_bootmem_node() claim that the code will first try the allocation using 'goal' and if that fails it will try again but with the 'goal' requirement dropped. Unfortunately, this is not what the code does, so fix it to do so. This is important for nobootmem conversions to architectures such as sparc where MAX_DMA_ADDRESS is infinity. On such architectures all of the allocations done by generic spots, such as the sparse-vmemmap implementation, will pass in: __pa(MAX_DMA_ADDRESS) as the goal, and with the limit given as "-1" this will always fail unless we add the appropriate fallback logic here. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'master' into x86/memblockTejun Heo2011-11-281-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts & resolutions: * arch/x86/xen/setup.c dc91c728fd "xen: allow extra memory to be in multiple regions" 24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..." conflicted on xen_add_extra_mem() updates. The resolution is trivial as the latter just want to replace memblock_x86_reserve_range() with memblock_reserve(). * drivers/pci/intel-iommu.c 166e9278a3f "x86/ia64: intel-iommu: move to drivers/iommu/" 5dfe8660a3d "bootmem: Replace work_with_active_regions() with..." conflicted as the former moved the file under drivers/iommu/. Resolved by applying the chnages from the latter on the moved file. * mm/Kconfig 6661672053a "memblock: add NO_BOOTMEM config symbol" c378ddd53f9 "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option" conflicted trivially. Both added config options. Just letting both add their own options resolves the conflict. * mm/memblock.c d1f0ece6cdc "mm/memblock.c: small function definition fixes" ed7b56a799c "memblock: Remove memblock_memory_can_coalesce()" confliected. The former updates function removed by the latter. Resolution is trivial. Signed-off-by: Tejun Heo <tj@kernel.org>
| * mm: Map most files to use export.h instead of module.hPaul Gortmaker2011-10-311-1/+1
| | | | | | | | | | | | | | | | The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* | memblock, x86: Replace memblock_x86_reserve/free_range() with generic onesTejun Heo2011-07-141-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Other than sanity check and debug message, the x86 specific version of memblock reserve/free functions are simple wrappers around the generic versions - memblock_reserve/free(). This patch adds debug messages with caller identification to the generic versions and replaces x86 specific ones and kills them. arch/x86/include/asm/memblock.h and arch/x86/mm/memblock.c are empty after this change and removed. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1310462166-31469-14-git-send-email-tj@kernel.org Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | memblock, x86: Replace __get_free_all_memory_range() with ↵Tejun Heo2011-07-141-12/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for_each_free_mem_range() __get_free_all_memory_range() walks memblock, calculates free memory areas and fills in the specified range. It can be easily replaced with for_each_free_mem_range(). Convert free_low_memory_core_early() and add_highpages_with_active_regions() to for_each_free_mem_range(). This leaves __get_free_all_memory_range() without any user. Kill it and related functions. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1310462166-31469-10-git-send-email-tj@kernel.org Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | memblock, x86: Make free_all_memory_core_early() explicitly free lowmem onlyTejun Heo2011-07-141-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nomemblock is currently used only by x86 and on x86_32 free_all_memory_core_early() silently freed only the low mem because get_free_all_memory_range() in arch/x86/mm/memblock.c implicitly limited range to max_low_pfn. Rename free_all_memory_core_early() to free_low_memory_core_early() and make it call __get_free_all_memory_range() and limit the range to max_low_pfn explicitly. This makes things clearer and also is consistent with the bootmem behavior. This leaves get_free_all_memory_range() without any user. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1310462166-31469-9-git-send-email-tj@kernel.org Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | memblock, x86: Replace memblock_x86_find_in_range_node() with generic ↵Tejun Heo2011-07-141-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memblock calls With the previous changes, generic NUMA aware memblock API has feature parity with memblock_x86_find_in_range_node(). There currently are two users - x86 setup_node_data() and __alloc_memory_core_early() in nobootmem.c. This patch converts the former to use memblock_alloc_nid() and the latter memblock_find_range_in_node(), and kills memblock_x86_find_in_range_node() and related functions including find_memory_early_core_early() in page_alloc.c. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1310460395-30913-9-git-send-email-tj@kernel.org Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | memblock: Kill MEMBLOCK_ERRORTejun Heo2011-07-131-1/+1
|/ | | | | | | | | | | | | 25818f0f28 (memblock: Make MEMBLOCK_ERROR be 0) thankfully made MEMBLOCK_ERROR 0 and there already are codes which expect error return to be 0. There's no point in keeping MEMBLOCK_ERROR around. End its misery. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1310457490-3356-6-git-send-email-tj@kernel.org Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high()Yinghai Lu2011-05-251-23/+0
| | | | | | | | | | | | | | The bootmem wrapper with memblock supports top-down now, so we no longer need this trick. Signed-off-by: Yinghai LU <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Olaf Hering <olaf@aepfle.de> Cc: Tejun Heo <tj@kernel.org> Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix common misspellingsLucas De Marchi2011-03-311-1/+1
| | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, ↵Olaf Hering2011-03-231-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | setup_elfcorehdr and saved_max_pfn The Xen PV drivers in a crashed HVM guest can not connect to the dom0 backend drivers because both frontend and backend drivers are still in connected state. To run the connection reset function only in case of a crashdump, the is_kdump_kernel() function needs to be available for the PV driver modules. Consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn into kernel/crash_dump.c Also export elfcorehdr_addr to make is_kdump_kernel() usable for modules. Leave 'elfcorehdr' as early_param(). This changes powerpc from __setup() to early_param(). It adds an address range check from x86 also on ia64 and powerpc. [akpm@linux-foundation.org: additional #includes] [akpm@linux-foundation.org: remove elfcorehdr_addr export] [akpm@linux-foundation.org: fix for Tejun's mm/nobootmem.c changes] Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bootmem: Move __alloc_memory_core_early() to nobootmem.cYinghai Lu2011-02-241-0/+25
| | | | | | | | | | | | | | | Now that bootmem.c and nobootmem.c are separate, there's no reason to define __alloc_memory_core_early(), which is used only by nobootmem, inside #ifdef in page_alloc.c. Move it to nobootmem.c and make it static. This patch doesn't introduce any behavior change. -tj: Updated commit description. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* bootmem: Move contig_page_data definition to bootmem.c/nobootmem.cYinghai Lu2011-02-241-0/+5
| | | | | | | | | | | | | | | Now that bootmem.c and nobootmem.c are separate, it's cleaner to define contig_page_data in each file than in page_alloc.c with #ifdef. Move it. This patch doesn't introduce any behavior change. -v2: According to Andrew, fixed the struct layout. -tj: Updated commit description. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.cYinghai Lu2011-02-241-0/+405
mm/bootmem.c contained code paths for both bootmem and no bootmem configurations. They implement about the same set of APIs in different ways and as a result bootmem.c contains massive amount of #ifdef CONFIG_NO_BOOTMEM. Separate out CONFIG_NO_BOOTMEM code into mm/nobootmem.c. As the common part is relatively small, duplicate them in nobootmem.c instead of creating a common file or ifdef'ing in bootmem.c. The followings are duplicated. * {min|max}_low_pfn, max_pfn, saved_max_pfn * free_bootmem_late() * ___alloc_bootmem() * __alloc_bootmem_low() The followings are applicable only to nobootmem and moved verbatim. * __free_pages_memory() * free_all_memory_core_early() The followings are not applicable to nobootmem and omitted in nobootmem.c. * reserve_bootmem_node() * reserve_bootmem() The rest split function bodies according to CONFIG_NO_BOOTMEM. Makefile is updated so that only either bootmem.c or nobootmem.c is built according to CONFIG_NO_BOOTMEM. This patch doesn't introduce any behavior change. -tj: Rewrote commit description. Suggested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>