summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
...
* mm/page_owner: keep track of page ownersJoonsoo Kim2014-12-1311-3/+554
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the page owner tracking code which is introduced so far ago. It is resident on Andrew's tree, though, nobody tried to upstream so it remain as is. Our company uses this feature actively to debug memory leak or to find a memory hogger so I decide to upstream this feature. This functionality help us to know who allocates the page. When allocating a page, we store some information about allocation in extra memory. Later, if we need to know status of all pages, we can get and analyze it from this stored information. In previous version of this feature, extra memory is statically defined in struct page, but, in this version, extra memory is allocated outside of struct page. It enables us to turn on/off this feature at boottime without considerable memory waste. Although we already have tracepoint for tracing page allocation/free, using it to analyze page owner is rather complex. We need to enlarge the trace buffer for preventing overlapping until userspace program launched. And, launched program continually dump out the trace buffer for later analysis and it would change system behaviour with more possibility rather than just keeping it in memory, so bad for debug. Moreover, we can use page_owner feature further for various purposes. For example, we can use it for fragmentation statistics implemented in this patch. And, I also plan to implement some CMA failure debugging feature using this interface. I'd like to give the credit for all developers contributed this feature, but, it's not easy because I don't know exact history. Sorry about that. Below is people who has "Signed-off-by" in the patches in Andrew's tree. Contributor: Alexander Nyberg <alexn@dsv.su.se> Mel Gorman <mgorman@suse.de> Dave Hansen <dave@linux.vnet.ibm.com> Minchan Kim <minchan@kernel.org> Michal Nazarewicz <mina86@mina86.com> Andrew Morton <akpm@linux-foundation.org> Jungsoo Son <jungsoo.son@lge.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* stacktrace: introduce snprint_stack_trace for buffer outputJoonsoo Kim2014-12-132-0/+37
| | | | | | | | | | | | | | | | | | | Current stacktrace only have the function for console output. page_owner that will be introduced in following patch needs to print the output of stacktrace into the buffer for our own output format so so new function, snprint_stack_trace(), is needed. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/nommu: use alloc_pages_exact() rather than its own implementationJoonsoo Kim2014-12-131-22/+11
| | | | | | | | | | | | | | | | | | | | | do_mmap_private() in nommu.c try to allocate physically contiguous pages with arbitrary size in some cases and we now have good abstract function to do exactly same thing, alloc_pages_exact(). So, change to use it. There is no functional change. This is the preparation step for support page owner feature accurately. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/debug-pagealloc: make debug-pagealloc boottime configurableJoonsoo Kim2014-12-139-7/+57
| | | | | | | | | | | | | | | | | | | | | | Now, we have prepared to avoid using debug-pagealloc in boottime. So introduce new kernel-parameter to disable debug-pagealloc in boottime, and makes related functions to be disabled in this case. Only non-intuitive part is change of guard page functions. Because guard page is effective only if debug-pagealloc is enabled, turning off according to debug-pagealloc is reasonable thing to do. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/debug-pagealloc: prepare boottime configurable on/offJoonsoo Kim2014-12-138-44/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until now, debug-pagealloc needs extra flags in struct page, so we need to recompile whole source code when we decide to use it. This is really painful, because it takes some time to recompile and sometimes rebuild is not possible due to third party module depending on struct page. So, we can't use this good feature in many cases. Now, we have the page extension feature that allows us to insert extra flags to outside of struct page. This gets rid of third party module issue mentioned above. And, this allows us to determine if we need extra memory for this page extension in boottime. With these property, we can avoid using debug-pagealloc in boottime with low computational overhead in the kernel built with CONFIG_DEBUG_PAGEALLOC. This will help our development process greatly. This patch is the preparation step to achive above goal. debug-pagealloc originally uses extra field of struct page, but, after this patch, it will use field of struct page_ext. Because memory for page_ext is allocated later than initialization of page allocator in CONFIG_SPARSEMEM, we should disable debug-pagealloc feature temporarily until initialization of page_ext. This patch implements this. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_ext: resurrect struct page extending code for debuggingJoonsoo Kim2014-12-137-0/+485
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we debug something, we'd like to insert some information to every page. For this purpose, we sometimes modify struct page itself. But, this has drawbacks. First, it requires re-compile. This makes us hesitate to use the powerful debug feature so development process is slowed down. And, second, sometimes it is impossible to rebuild the kernel due to third party module dependency. At third, system behaviour would be largely different after re-compile, because it changes size of struct page greatly and this structure is accessed by every part of kernel. Keeping this as it is would be better to reproduce errornous situation. This feature is intended to overcome above mentioned problems. This feature allocates memory for extended data per page in certain place rather than the struct page itself. This memory can be accessed by the accessor functions provided by this code. During the boot process, it checks whether allocation of huge chunk of memory is needed or not. If not, it avoids allocating memory at all. With this advantage, we can include this feature into the kernel in default and can avoid rebuild and solve related problems. Until now, memcg uses this technique. But, now, memcg decides to embed their variable to struct page itself and it's code to extend struct page has been removed. I'd like to use this code to develop debug feature, so this patch resurrect it. To help these things to work well, this patch introduces two callbacks for clients. One is the need callback which is mandatory if user wants to avoid useless memory allocation at boot-time. The other is optional, init callback, which is used to do proper initialization after memory is allocated. Detailed explanation about purpose of these functions is in code comment. Please refer it. Others are completely same with previous extension code in memcg. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, gfp: escalatedly define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLEJianyu Zhan2014-12-131-5/+2
| | | | | | | | | | | | | | | | | | | | GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE are escalatedly confined defined, also implied by their names: GFP_USER = GFP_USER GFP_USER + __GFP_HIGHMEM = GFP_HIGHUSER GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE = GFP_HIGHUSER_MOVABLE So just make GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE escalatedly defined to reflect this fact. It also makes the definition clear and texturally warn on any furture break-up of this escalated relastionship. Signed-off-by: Jianyu Zhan <jianyu.zhan@emc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/kmemleak.h: needs slab.hAndrew Morton2014-12-131-0/+2
| | | | | | | | | include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive': include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function) Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache()Zhang Zhen2014-12-132-4/+3
| | | | | | | | The gfp was passed in but never used in this function. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move swp_entry_t definition to include/linux/mm_types.hTejun Heo2014-12-132-8/+8
| | | | | | | | | | | | | | | swp_entry_t being defined in include/linux/swap.h instead of include/linux/mm_types.h causes cyclic include dependency later when include/linux/page_cgroup.h is included from writeback path. Move the definition to include/linux/mm_types.h. While at it, reformat the comment above it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-hotplug: remove redundant call of page_to_pfnZhang Zhen2014-12-131-2/+2
| | | | | | | | | | | | | | This is just a small optimization. The start_pfn can be obtained directly by phys_index << PFN_SECTION_SHIFT. So the call of page_to_pfn() is redundant and remove it. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@sr71.net> Cc: Wang Nan <wangnan0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* iommu/amd: use handle_mm_fault directlyJesse Barnes2014-12-131-31/+53
| | | | | | | | | | | | | | | | This could be useful for debug in the future if we want to track major/minor faults more closely, and also avoids the put_page trick we used with gup. In order to do this, we also track the task struct in the PASID state structure. This lets us update the appropriate task stats after the fault has been handled, and may aid with debug in the future as well. Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Tested-by: Oded Gabbay <oded.gabbay@amd.com> Cc: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: export find_extend_vma() and handle_mm_fault() for driver useJesse Barnes2014-12-132-0/+3
| | | | | | | | | | | This lets drivers like the AMD IOMMUv2 driver handle faults a bit more simply, rather than doing tricks with page refs and get_user_pages(). Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Oded Gabbay <oded.gabbay@amd.com> Cc: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: hugetlb_register_all_nodes(): add __init markerLuiz Capitulino2014-12-131-1/+1
| | | | | | | | | | | | | | | This function is only called during initialization. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: alloc_bootmem_huge_page(): use IS_ALIGNED()Luiz Capitulino2014-12-131-1/+1
| | | | | | | | | | | | | | | No reason to duplicate the code of an existing macro. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: fix hugepages= entry in kernel-parameters.txtLuiz Capitulino2014-12-131-3/+1
| | | | | | | | | | | | | | | | | | | | | | | The hugepages= entry in kernel-parameters.txt states that 1GB pages can only be allocated at boot time and not freed afterwards. This is not true since commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime"), at least for x86_64. Instead of adding arch-specifc observations to the hugepages= entry, this commit just drops the out of date information. Further information about arch-specific support and available features can be obtained in the hugetlb documentation. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: turn memcg_kmem_skip_account into a bit fieldVladimir Davydov2014-12-132-35/+7
| | | | | | | | | | | | | | | | | It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on the task_struct. Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to set/clear the flag inline. Regarding the overwhelming comment to the helpers, which is removed by this patch too, we already have a compact yet accurate explanation in memcg_schedule_cache_create, no need in yet another one. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: only check memcg_kmem_skip_account in __memcg_kmem_get_cacheVladimir Davydov2014-12-131-28/+0
| | | | | | | | | | | | | | | | | __memcg_kmem_get_cache can recurse if it calls kmalloc (which it does if the cgroup's kmem cache doesn't exist), because kmalloc may call __memcg_kmem_get_cache internally again. To avoid the recursion, we use the task_struct->memcg_kmem_skip_account flag. However, there's no need checking the flag in memcg_kmem_newpage_charge, because there's no way how this function could result in recursion, if called from memcg_kmem_get_cache. So let's remove the redundant code. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: zap kmem_account_flagsVladimir Davydov2014-12-131-21/+10
| | | | | | | | | | | | The only such flag is KMEM_ACCOUNTED_ACTIVE, but it's set iff mem_cgroup->kmemcg_id is initialized, so we can check kmemcg_id instead of having a separate flags field. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: mincore: add hwpoison page handleWeijie Yang2014-12-131-2/+5
| | | | | | | | | | | | | | | | | | | When the encountered pte is a swap entry, the current code handles two cases: migration and normal swapentry, but we have a third case: hwpoison page. This patch adds hwpoison page handle, consider hwpoison page incore as same as migration. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/rmap: calculate page offset when neededDavidlohr Bueso2014-12-131-2/+4
| | | | | | | | | | Call page_to_pgoff() to get the page offset once we are sure we actually need it, and any very obvious initial function checks have passed. Trivial micro-optimization, and potentially save some cycles. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/debug-pagealloc: cleanup page guard codeJoonsoo Kim2014-12-131-19/+19
| | | | | | | | | | | | | | Page guard is used by debug-pagealloc feature. Currently, it is open-coded, but, I think that more abstraction of it makes core page allocator code more readable. There is no functional difference. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Gioh Kim <gioh.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUGTony Luck2014-12-131-23/+20
| | | | | | | | | | | | | | | | | | | | | | There is a lot of duplication in the rubric around actually setting or clearing a mem region flag. Create a new helper function to do this and reduce each of memblock_mark_hotplug() and memblock_clear_hotplug() to a single line. This will be useful if someone were to add a new mem region flag - which I hope to be doing some day soon. But it looks like a plausible cleanup even without that - so I'd like to get it out of the way now. Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Emil Medve <Emilian.Medve@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: do not abuse memcg_kmem_skip_accountVladimir Davydov2014-12-131-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | task_struct->memcg_kmem_skip_account was initially introduced to avoid recursion during kmem cache creation: memcg_kmem_get_cache, which is called by kmem_cache_alloc to determine the per-memcg cache to account allocation to, may issue lazy cache creation if the needed cache doesn't exist, which means issuing yet another kmem_cache_alloc. We can't just pass a flag to the nested kmem_cache_alloc disabling kmem accounting, because there are hidden allocations, e.g. in INIT_WORK. So we introduced a flag on the task_struct, memcg_kmem_skip_account, making memcg_kmem_get_cache return immediately. By its nature, the flag may also be used to disable accounting for allocations shared among different cgroups, and currently it is used this way in memcg_activate_kmem. Using it like this looks like abusing it to me. If we want to disable accounting for some allocations (which we will definitely want one day), we should either add GFP_NO_MEMCG or GFP_MEMCG flag in order to blacklist/whitelist some allocations. For now, let's simply remove memcg_stop/resume_kmem_account from memcg_activate_kmem. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: don't check mm in __memcg_kmem_{get_cache,newpage_charge}Vladimir Davydov2014-12-131-2/+2
| | | | | | | | | | | We already assured the current task has mm in memcg_kmem_should_charge, no need to double check. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: __mem_cgroup_free: remove stale disarm_static_keys commentVladimir Davydov2014-12-131-11/+0
| | | | | | | | | | cpuset code stopped using cgroup_lock in favor of cpuset_mutex long ago. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: cma: align to physical address, not CMA region positionGregory Fong2014-12-131-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | The alignment in cma_alloc() was done w.r.t. the bitmap. This is a problem when, for example: - a device requires 16M (order 12) alignment - the CMA region is not 16 M aligned In such a case, can result with the CMA region starting at, say, 0x2f800000 but any allocation you make from there will be aligned from there. Requesting an allocation of 32 M with 16 M alignment will result in an allocation from 0x2f800000 to 0x31800000, which doesn't work very well if your strange device requires 16M alignment. Change to use bitmap_find_next_zero_area_off() to account for the difference in alignment at reserve-time and alloc-time. Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kukjin Kim <kgene.kim@samsung.com> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib: bitmap: add alignment offset for bitmap_find_next_zero_area()Michal Nazarewicz2014-12-132-16/+44
| | | | | | | | | | | | | | | | | | | Add a bitmap_find_next_zero_area_off() function which works like bitmap_find_next_zero_area() function except it allows an offset to be specified when alignment is checked. This lets caller request a bit such that its number plus the offset is aligned according to the mask. [gregory.0xf0@gmail.com: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation] Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kukjin Kim <kgene.kim@samsung.com> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory.c: share the i_mmap_rwsemDavidlohr Bueso2014-12-131-2/+2
| | | | | | | | | | | | | | | | | The unmap_mapping_range family of functions do the unmapping of user pages (ultimately via zap_page_range_single) without touching the actual interval tree, thus share the lock. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/nommu: share the i_mmap_rwsemDavidlohr Bueso2014-12-131-5/+4
| | | | | | | | | | | | | | | | | | | | Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify that any shared mappings of the inode in question aren't broken (dead zone). afaict the only user being ramfs to handle the size change attribute. Pretty much a no-brainer to share the lock. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: share the i_mmap_rwsemDavidlohr Bueso2014-12-131-2/+2
| | | | | | | | | | | | | | | | No brainer conversion: collect_procs_file() only schedules a process for later kill, share the lock, similarly to the anon vma variant. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/xip: share the i_mmap_rwsemDavidlohr Bueso2014-12-131-14/+9
| | | | | | | | | | | | | | | | | | | | __xip_unmap() will remove the xip sparse page from the cache and take down pte mapping, without altering the interval tree, thus share the i_mmap_rwsem when searching for the ptes to unmap. Additionally, tidy up the function a bit and make variables only local to the interval tree walk loop. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* uprobes: share the i_mmap_rwsemDavidlohr Bueso2014-12-131-2/+2
| | | | | | | | | | | | | | | | | | | Both register and unregister call build_map_info() in order to create the list of mappings before installing or removing breakpoints for every mm which maps file backed memory. As such, there is no reason to hold the i_mmap_rwsem exclusively, so share it and allow concurrent readers to build the mapping data. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/rmap: share the i_mmap_rwsemDavidlohr Bueso2014-12-132-3/+13
| | | | | | | | | | | | | | | | | Similarly to the anon memory counterpart, we can share the mapping's lock ownership as the interval tree is not modified when doing doing the walk, only the file page. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: convert i_mmap_mutex to rwsemDavidlohr Bueso2014-12-1310-29/+30
| | | | | | | | | | | | | | | | | | | | | | | | The i_mmap_mutex is a close cousin of the anon vma lock, both protecting similar data, one for file backed pages and the other for anon memory. To this end, this lock can also be a rwsem. In addition, there are some important opportunities to share the lock when there are no tree modifications. This conversion is straightforward. For now, all users take the write lock. [sfr@canb.auug.org.au: update fremap.c] Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use new helper functions around the i_mmap_mutexDavidlohr Bueso2014-12-1312-40/+40
| | | | | | | | | | | | | | | | Convert all open coded mutex_lock/unlock calls to the i_mmap_[lock/unlock]_write() helpers. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm,fs: introduce helpers around the i_mmap_mutexDavidlohr Bueso2014-12-131-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This series is a continuation of the conversion of the i_mmap_mutex to rwsem, following what we have for the anon memory counterpart. With Hugh's feedback from the first iteration. Ultimately, the most obvious paths that require exclusive ownership of the lock is when we modify the VMA interval tree, via vma_interval_tree_insert() and vma_interval_tree_remove() families. Cases such as unmapping, where the ptes content is changed but the tree remains untouched should make it safe to share the i_mmap_rwsem. As such, the code of course is straightforward, however the devil is very much in the details. While its been tested on a number of workloads without anything exploding, I would not be surprised if there are some less documented/known assumptions about the lock that could suffer from these changes. Or maybe I'm just missing something, but either way I believe its at the point where it could use more eyes and hopefully some time in linux-next. Because the lock type conversion is the heart of this patchset, its worth noting a few comparisons between mutex vs rwsem (xadd): (i) Same size, no extra footprint. (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for exclusive lock ownership. (iii) Both can be slightly unfair wrt exclusive ownership, with writer lock stealing properties, not necessarily respecting FIFO order for granting the lock when contended. (iv) Mutexes can be slightly faster than rwsems when the lock is non-contended. (v) Both suck at performance for debug (slowpaths), which shouldn't matter anyway. Sharing the lock is obviously beneficial, and sem writer ownership is close enough to mutexes. The biggest winner of these changes is migration. As for concrete numbers, the following performance results are for a 4-socket 60-core IvyBridge-EX with 130Gb of RAM. Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well with this set, with a steady ~60% throughput (jpm) increase for alltests and up to ~30% for disk for high amounts of concurrency. Lower counts of workload users (< 100) does not show much difference at all, so at least no regressions. 3.18-rc1 3.18-rc1-i_mmap_rwsem alltests-100 17918.72 ( 0.00%) 28417.97 ( 58.59%) alltests-200 16529.39 ( 0.00%) 26807.92 ( 62.18%) alltests-300 16591.17 ( 0.00%) 26878.08 ( 62.00%) alltests-400 16490.37 ( 0.00%) 26664.63 ( 61.70%) alltests-500 16593.17 ( 0.00%) 26433.72 ( 59.30%) alltests-600 16508.56 ( 0.00%) 26409.20 ( 59.97%) alltests-700 16508.19 ( 0.00%) 26298.58 ( 59.31%) alltests-800 16437.58 ( 0.00%) 26433.02 ( 60.81%) alltests-900 16418.35 ( 0.00%) 26241.61 ( 59.83%) alltests-1000 16369.00 ( 0.00%) 26195.76 ( 60.03%) alltests-1100 16330.11 ( 0.00%) 26133.46 ( 60.03%) alltests-1200 16341.30 ( 0.00%) 26084.03 ( 59.62%) alltests-1300 16304.75 ( 0.00%) 26024.74 ( 59.61%) alltests-1400 16231.08 ( 0.00%) 25952.35 ( 59.89%) alltests-1500 16168.06 ( 0.00%) 25850.58 ( 59.89%) alltests-1600 16142.56 ( 0.00%) 25767.42 ( 59.62%) alltests-1700 16118.91 ( 0.00%) 25689.58 ( 59.38%) alltests-1800 16068.06 ( 0.00%) 25599.71 ( 59.32%) alltests-1900 16046.94 ( 0.00%) 25525.92 ( 59.07%) alltests-2000 16007.26 ( 0.00%) 25513.07 ( 59.38%) disk-100 7582.14 ( 0.00%) 7257.48 ( -4.28%) disk-200 6962.44 ( 0.00%) 7109.15 ( 2.11%) disk-300 6435.93 ( 0.00%) 6904.75 ( 7.28%) disk-400 6370.84 ( 0.00%) 6861.26 ( 7.70%) disk-500 6353.42 ( 0.00%) 6846.71 ( 7.76%) disk-600 6368.82 ( 0.00%) 6806.75 ( 6.88%) disk-700 6331.37 ( 0.00%) 6796.01 ( 7.34%) disk-800 6324.22 ( 0.00%) 6788.00 ( 7.33%) disk-900 6253.52 ( 0.00%) 6750.43 ( 7.95%) disk-1000 6242.53 ( 0.00%) 6855.11 ( 9.81%) disk-1100 6234.75 ( 0.00%) 6858.47 ( 10.00%) disk-1200 6312.76 ( 0.00%) 6845.13 ( 8.43%) disk-1300 6309.95 ( 0.00%) 6834.51 ( 8.31%) disk-1400 6171.76 ( 0.00%) 6787.09 ( 9.97%) disk-1500 6139.81 ( 0.00%) 6761.09 ( 10.12%) disk-1600 4807.12 ( 0.00%) 6725.33 ( 39.90%) disk-1700 4669.50 ( 0.00%) 5985.38 ( 28.18%) disk-1800 4663.51 ( 0.00%) 5972.99 ( 28.08%) disk-1900 4674.31 ( 0.00%) 5949.94 ( 27.29%) disk-2000 4668.36 ( 0.00%) 5834.93 ( 24.99%) In addition, a 67.5% increase in successfully migrated NUMA pages, thus improving node locality. The patch layout is simple but designed for bisection (in case reversion is needed if the changes break upstream) and easier review: o Patches 1-4 convert the i_mmap lock from mutex to rwsem. o Patches 5-10 share the lock in specific paths, each patch details the rationale behind why it should be safe. This patchset has been tested with: postgres 9.4 (with brand new hugetlb support), hugetlbfs test suite (all tests pass, in fact more tests pass with these changes than with an upstream kernel), ltp, aim7 benchmarks, memcached and iozone with the -B option for mmap'ing. *Untested* paths are nommu, memory-failure, uprobes and xip. This patch (of 8): Various parts of the kernel acquire and release this mutex, so add i_mmap_lock_write() and immap_unlock_write() helper functions that will encapsulate this logic. The next patch will make use of these. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* MAINTAINERS: update Xiubo's email addressXiubo Li2014-12-131-1/+1
| | | | | | | | | | | | My current email address will be gone shortly, update my email to be a gmail one. Signed-off-by: Xiubo Li <Li.Xiubo@freescale.com> Cc: Timur Tabi <timur@tabi.org> Cc: Takashi Iwai <tiwai@suse.de> Acked-by: Nicolin Chen <nicoleotsuka@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rtc: snvs: fix build with CONFIG_PM_SLEEP disabledGuenter Roeck2014-12-131-2/+9
| | | | | | | | | | | | | | Commit 7654e9d4fd8f ("drivers/rtc/rtc-snvs: fix suspend/resume") replaces SIMPLE_DEV_PM_OPS with direct declaration of snvs_rtc_pm_ops, but does so outside #ifdef CONFIG_PM_SLEEP. This causes the driver build to fail if CONFIG_PM_SLEEP is not configured. Fixes: 7654e9d4fd8f ("drivers/rtc/rtc-snvs: fix suspend/resume") Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Sanchayan Maity <maitysanchayan@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'please-pull-morepstore' of ↵Linus Torvalds2014-12-124-14/+47
|\ | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore update #2 from Tony Luck: "Couple of pstore-ram enhancements to allow use of different memory attributes" * tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore-ram: Allow optional mapping with pgprot_noncached pstore-ram: Fix hangs by using write-combine mappings
| * pstore-ram: Allow optional mapping with pgprot_noncachedTony Lindgren2014-12-114-14/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some ARMs the memory can be mapped pgprot_noncached() and still be working for atomic operations. As pointed out by Colin Cross <ccross@android.com>, in some cases you do want to use pgprot_noncached() if the SoC supports it to see a debug printk just before a write hanging the system. On ARMs, the atomic operations on strongly ordered memory are implementation defined. So let's provide an optional kernel parameter for configuring pgprot_noncached(), and use pgprot_writecombine() by default. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rob Herring <robherring2@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Olof Johansson <olof@lixom.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: stable@vger.kernel.org Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
| * pstore-ram: Fix hangs by using write-combine mappingsRob Herring2014-12-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently trying to use pstore on at least ARMs can hang as we're mapping the peristent RAM with pgprot_noncached(). On ARMs, pgprot_noncached() will actually make the memory strongly ordered, and as the atomic operations pstore uses are implementation defined for strongly ordered memory, they may not work. So basically atomic operations have undefined behavior on ARM for device or strongly ordered memory types. Let's fix the issue by using write-combine variants for mappings. This corresponds to normal, non-cacheable memory on ARM. For many other architectures, this change does not change the mapping type as by default we have: #define pgprot_writecombine pgprot_noncached The reason why pgprot_noncached() was originaly used for pstore is because Colin Cross <ccross@android.com> had observed lost debug prints right before a device hanging write operation on some systems. For the platforms supporting pgprot_noncached(), we can add a an optional configuration option to support that. But let's get pstore working first before adding new features. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Anton Vorontsov <cbouatmailru@gmail.com> Cc: Colin Cross <ccross@android.com> Cc: Olof Johansson <olof@lixom.net> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Rob Herring <rob.herring@calxeda.com> [tony@atomide.com: updated description] Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2014-12-1232-641/+2740
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "From a feature point of view, most of the code here comes from Miao Xie and others at Fujitsu to implement scrubbing and replacing devices on raid56. This has been in development for a while, and it's a big improvement. Filipe and Josef have a great assortment of fixes, many of which solve problems corruptions either after a crash or in error conditions. I still have a round two from Filipe for next week that solves corruptions with discard and block group removal" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits) Btrfs: make get_caching_control unconditionally return the ctl Btrfs: fix unprotected deletion from pending_chunks list Btrfs: fix fs mapping extent map leak Btrfs: fix memory leak after block remove + trimming Btrfs: make btrfs_abort_transaction consider existence of new block groups Btrfs: fix race between writing free space cache and trimming Btrfs: fix race between fs trimming and block group remove/allocation Btrfs, replace: enable dev-replace for raid56 Btrfs: fix freeing used extents after removing empty block group Btrfs: fix crash caused by block group removal Btrfs: fix invalid block group rbtree access after bg is removed Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 Btrfs, replace: write raid56 parity into the replace target device Btrfs, replace: write dirty pages into the replace target device Btrfs, raid56: support parity scrub on raid56 Btrfs, raid56: use a variant to record the operation type Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Btrfs, raid56: don't change bbio and raid_map Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Btrfs: remove noused bbio_ret in __btrfs_map_block in condition ...
| * \ Merge branch 'raid56-scrub-replace' of git://github.com/miaoxie/linux-btrfs ↵Chris Mason2014-12-02212-910/+2916
| |\ \ | | | | | | | | | | | | into for-linus
| | * | Btrfs, replace: enable dev-replace for raid56Zhao Lei2014-12-031-5/+0
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| | * | Btrfs, raid56: fix use-after-free problem in the final device replace ↵Miao Xie2014-12-036-20/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | procedure on raid56 The commit c404e0dc (Btrfs: fix use-after-free in the finishing procedure of the device replace) fixed a use-after-free problem which happened when removing the source device at the end of device replace, but at that time, btrfs didn't support device replace on raid56, so we didn't fix the problem on the raid56 profile. Currently, we implemented device replace for raid56, so we need kick that problem out before we enable that function for raid56. The fix method is very simple, we just increase the bio per-cpu counter before we submit a raid56 io, and decrease the counter when the raid56 io ends. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| | * | Btrfs, replace: write raid56 parity into the replace target deviceMiao Xie2014-12-032-1/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function reused the code of parity scrub, and we just write the right parity or corrected parity into the target device before the parity scrub end. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| | * | Btrfs, replace: write dirty pages into the replace target deviceMiao Xie2014-12-033-43/+97
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The implementation is simple: - In order to avoid changing the code logic of btrfs_map_bio and RAID56, we add the stripes of the replace target devices at the end of the stripe array in btrfs bio, and we sort those target device stripes in the array. And we keep the number of the target device stripes in the btrfs bio. - Except write operation on RAID56, all the other operation don't take the target device stripes into account. - When we do write operation, we read the data from the common devices and calculate the parity. Then write the dirty data and new parity out, at this time, we will find the relative replace target stripes and wirte the relative data into it. Note: The function that copying old data on the source device to the target device was implemented in the past, it is similar to the other RAID type. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| | * | Btrfs, raid56: support parity scrub on raid56Miao Xie2014-12-033-20/+1115
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The implementation is: - Read and check all the data with checksum in the same stripe. All the data which has checksum is COW data, and we are sure that it is not changed though we don't lock the stripe. because the space of that data just can be reclaimed after the current transction is committed, and then the fs can use it to store the other data, but when doing scrub, we hold the current transaction, that is that data can not be recovered, it is safe that read and check it out of the stripe lock. - Lock the stripe - Read out all the data without checksum and parity The data without checksum and the parity may be changed if we don't lock the stripe, so we need read it in the stripe lock context. - Check the parity - Re-calculate the new parity and write back it if the old parity is not right - Unlock the stripe If we can not read out the data or the data we read is corrupted, we will try to repair it. If the repair fails. we will mark the horizontal sub-stripe(pages on the same horizontal) as corrupted sub-stripe, and we will skip the parity check and repair of that horizontal sub-stripe. And in order to skip the horizontal sub-stripe that has no data, we introduce a bitmap. If there is some data on the horizontal sub-stripe, we will the relative bit to 1, and when we check and repair the parity, we will skip those horizontal sub-stripes that the relative bits is 0. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| | * | Btrfs, raid56: use a variant to record the operation typeMiao Xie2014-12-031-14/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will introduce new operation type later, if we still use integer variant as bool variant to record the operation type, we would add new variant and increase the size of raid bio structure. It is not good, by this patch, we define different number for different operation, and we can just use a variant to record the operation type. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>