summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm, slab/slub: introduce kmalloc-reclaimable cachesVlastimil Babka2018-10-262-18/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which indicates they contain objects which can be reclaimed under memory pressure (typically through a shrinker). This makes the slab pages accounted as NR_SLAB_RECLAIMABLE in vmstat, which is reflected also the MemAvailable meminfo counter and in overcommit decisions. The slab pages are also allocated with __GFP_RECLAIMABLE, which is good for anti-fragmentation through grouping pages by mobility. The generic kmalloc-X caches are created without this flag, but sometimes are used also for objects that can be reclaimed, which due to varying size cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A prominent example are dcache external names, which prompted the creation of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES in commit f1782c9bc547 ("dcache: account external names as indirectly reclaimable memory"). To better handle this and any other similar cases, this patch introduces SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X. They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among gfp flags. They are added to the kmalloc_caches array as a new type. Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type cache. This change only applies to SLAB and SLUB, not SLOB. This is fine, since SLOB's target are tiny system and this patch does add some overhead of kmem management objects. Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, slab: combine kmalloc_caches and kmalloc_dma_cachesVlastimil Babka2018-10-264-38/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "kmalloc-reclaimable caches", v4. As discussed at LSF/MM [1] here's a patchset that introduces kmalloc-reclaimable caches (more details in the second patch) and uses them for dcache external names. That allows us to repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series. With patch 3/6, dcache external names are allocated from kmalloc-rcl-* caches, eliminating the need for manual accounting. More importantly, it also ensures the reclaimable kmalloc allocations are grouped in pages separate from the regular kmalloc allocations. The need for proper accounting of dcache external names has shown it's easy for misbehaving process to allocate lots of them, causing premature OOMs. Without the added grouping, it's likely that a similar workload can interleave the dcache external names allocations with regular kmalloc allocations (note: I haven't searched myself for an example of such regular kmalloc allocation, but I would be very surprised if there wasn't some). A pathological case would be e.g. one 64byte regular allocations with 63 external dcache names in a page (64x64=4096), which means the page is not freed even after reclaiming after all dcache names, and the process can thus "steal" the whole page with single 64byte allocation. If other kmalloc users similar to dcache external names become identified, they can also benefit from the new functionality simply by adding __GFP_RECLAIMABLE to the kmalloc calls. Side benefits of the patchset (that could be also merged separately) include removed branch for detecting __GFP_DMA kmalloc(), and shortening kmalloc cache names in /proc/slabinfo output. The latter is potentially an ABI break in case there are tools parsing the names and expecting the values to be in bytes. This is how /proc/slabinfo looks like after booting in virtme: ... kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 ... kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0 kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0 kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0 kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0 ... /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter: ... nr_slab_reclaimable 2817 nr_slab_unreclaimable 1781 ... nr_kernel_misc_reclaimable 0 ... /proc/meminfo with new KReclaimable counter: ... Shmem: 564 kB KReclaimable: 11260 kB Slab: 18368 kB SReclaimable: 11260 kB SUnreclaim: 7108 kB KernelStack: 1248 kB ... This patch (of 6): The kmalloc caches currently mainain separate (optional) array kmalloc_dma_caches for __GFP_DMA allocations. There are tests for __GFP_DMA in the allocation hotpaths. We can avoid the branches by combining kmalloc_caches and kmalloc_dma_caches into a single two-dimensional array where the outer dimension is cache "type". This will also allow to add kmalloc-reclaimable caches as a third type. Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaultsAndrea Arcangeli2018-10-261-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would not be waiting for userfaults before failing and it would hit on a SIGBUS instead. Using get_user_pages_locked/unlocked instead will allow get_mempolicy to allow userfaults to resolve the fault and fill the hole, before grabbing the node id of the page. If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an address inside an area managed by uffd and there is no page at that address, the page allocation from within get_mempolicy() will fail because get_user_pages() does not allow for page fault retry required for uffd; the user will get SIGBUS. With this patch, the page fault will be resolved by the uffd and the get_mempolicy() will continue normally. Background: Via code review, previously the syscall would have returned -EFAULT (vm_fault_to_errno), now it will block and wait for an userfault (if it's waken before the fault is resolved it'll still -EFAULT). This way get_mempolicy will give a chance to an "unaware" app to be compliant with userfaults. The reason this visible change is that becoming "userfault compliant" cannot regress anything: all other syscalls including read(2)/write(2) had to become "userfault compliant" long time ago (that's one of the things userfaultfd can do that PROT_NONE and trapping segfaults can't). So this is just one more syscall that become "userfault compliant" like all other major ones already were. This has been happening on virtio-bridge dpdk process which just called get_mempolicy on the guest space post live migration, but before the memory had a chance to be migrated to destination. I didn't run an strace to be able to show the -EFAULT going away, but I've the confirmation of the below debug aid information (only visible with CONFIG_DEBUG_VM=y) going away with the patch: [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0 [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1 [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017 [20116.371466] Call Trace: [20116.371473] dump_stack+0x5c/0x80 [20116.371476] handle_userfault.cold.37+0x1b/0x22 [20116.371479] ? remove_wait_queue+0x20/0x60 [20116.371481] ? poll_freewait+0x45/0xa0 [20116.371483] ? do_sys_poll+0x31c/0x520 [20116.371485] ? radix_tree_lookup_slot+0x1e/0x50 [20116.371488] shmem_getpage_gfp+0xce7/0xe50 [20116.371491] ? page_add_file_rmap+0x1a/0x2c0 [20116.371493] shmem_fault+0x78/0x1e0 [20116.371495] ? filemap_map_pages+0x3a1/0x450 [20116.371498] __do_fault+0x1f/0xc0 [20116.371500] __handle_mm_fault+0xe2e/0x12f0 [20116.371502] handle_mm_fault+0xda/0x200 [20116.371504] __get_user_pages+0x238/0x790 [20116.371506] get_user_pages+0x3e/0x50 [20116.371510] kernel_get_mempolicy+0x40b/0x700 [20116.371512] ? vfs_write+0x170/0x1a0 [20116.371515] __x64_sys_get_mempolicy+0x21/0x30 [20116.371517] do_syscall_64+0x5b/0x160 [20116.371520] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The above harmless debug message (not a kernel crash, just a dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify and improve kernel spots that may have to become "userfaultfd compliant" like this one (without having to run an strace and search for syscall misbehavior). Spots like the above are more closer to a kernel bug for the non-cooperative usages that Mike focuses on, than for for dpdk qemu-cooperative usages that reproduced it, but it's still nicer to get this fixed for dpdk too. The part of the patch that caused me to think is only the implementation issue of mpol_get, but it looks like it should work safe no matter the kind of mempolicy structure that is (the default static policy also starts at 1 so it'll go to 2 and back to 1 without crashing everything at 0). [rppt@linux.vnet.ibm.com: changelog addition] http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* alpha: switch to NO_BOOTMEMMike Rapoport2018-10-264-188/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | Replace bootmem allocator with memblock and enable use of NO_BOOTMEM like on most other architectures. Alpha gets the description of the physical memory from the firmware as an array of memory clusters. Each cluster that is not reserved by the firmware is added to memblock.memory. Once the memblock.memory is set up, we reserve the kernel and initrd pages with memblock reserve. Since we don't need the bootmem bitmap anymore, the code that finds an appropriate place is removed. The conversion does not take care of NUMA support which is marked broken for more than 10 years now. Link: http://lkml.kernel.org/r/1535952894-10967-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* unicore32: switch to NO_BOOTMEMMike Rapoport2018-10-262-53/+2
| | | | | | | | | | | | | | | | | | | | | | The unicore32 architecture already supports memblock and uses it for some early memory reservations, e.g initrd and the page tables. At some point unicore32 allocates the bootmem bitmap from the memblock and then hands over the memory reservations from memblock to bootmem. This patch removes the bootmem initialization and leaves memblock as the only boot time memory manager for unicore32. Link: http://lkml.kernel.org/r/1533326330-31677-8-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Guan Xuetao <gxt@pku.edu.cn> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Herring <robh@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* um: switch to NO_BOOTMEMMike Rapoport2018-10-262-11/+11
| | | | | | | | | | | | | | | | Replace bootmem initialization with memblock_add and memblock_reserve calls and explicit initialization of {min,max}_low_pfn. Link: http://lkml.kernel.org/r/1533326330-31677-7-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Richard Weinberger <richard@nod.at> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Rob Herring <robh@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* um: setup_physmem: stop using global variablesMike Rapoport2018-10-261-3/+3
| | | | | | | | | | | | | | | | | The setup_physmem() function receives uml_physmem and uml_reserved as parameters and still used these global variables. Replace such usage with local variables. Link: http://lkml.kernel.org/r/1533326330-31677-6-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Richard Weinberger <richard@nod.at> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Rob Herring <robh@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* nios2: switch to NO_BOOTMEMMike Rapoport2018-10-263-39/+7
| | | | | | | | | | | | | | | | Remove bootmem bitmap initialization and replace reserve_bootmem() with memblock_reserve(). Link: http://lkml.kernel.org/r/1533326330-31677-5-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Herring <robh@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* nios2: use generic early_init_dt_add_memory_archMike Rapoport2018-10-263-10/+3
| | | | | | | | | | | | | | | | All we have to do is to enable memblock, the generic FDT code will take care of the rest. Link: http://lkml.kernel.org/r/1533326330-31677-4-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Herring <robh@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* of: ignore sub-page memory regionsMike Rapoport2018-10-261-5/+6
| | | | | | | | | | | | | | | | | | | | | | | Memory region size is rounded down to page boundary and with sub-page region it becomes 0 and there is no point to add an empty region. Moreover, when the base is less than PAGE_SIZE we get a bogus size as (base + size - 1) evaluates to -1. 8cccffc52694 ("of: check for size < 0 after rounding in early_init_dt_add_memory_arch") introduced a test for wrap around for the case when base is not page aligned, the same test can be used to ignore sub-page region sizes. Link: http://lkml.kernel.org/r/1533326330-31677-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Rob Herring <robh@kernel.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hexagon: switch to NO_BOOTMEMMike Rapoport2018-10-262-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | Patch series "switch several architectures NO_BOOTMEM". These patches perform conversion to NO_BOOTMEM of hexagon, nios2, uml and unicore32. This patch (of 7): Add registration of the system memory with memblock, eliminate bootmem initialization and convert early memory reservations from bootmem to memblock. Link: http://lkml.kernel.org/r/1533326330-31677-2-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Richard Kuo <rkuo@codeaurora.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Herring <robh@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: convert insert_pfn() to vm_fault_tMatthew Wilcox2018-10-261-19/+5
| | | | | | | | | | | | | | All callers convert its errno into a vm_fault_t, so convert it to return a vm_fault_t directly. Link: http://lkml.kernel.org/r/20180828145728.11873-11-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: convert __vm_insert_mixed() to vm_fault_tMatthew Wilcox2018-10-261-21/+15
| | | | | | | | | | | | | | Both of its callers currently convert its errno return into a vm_fault_t, so move the conversion into __vm_insert_mixed(). Link: http://lkml.kernel.org/r/20180828145728.11873-10-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: inline vm_insert_pfn_prot() into callerMatthew Wilcox2018-10-261-31/+24
| | | | | | | | | | | | | | vm_insert_pfn_prot() is only called from vmf_insert_pfn_prot(), so inline it and convert some of the errnos into vm_fault codes earlier. Link: http://lkml.kernel.org/r/20180828145728.11873-9-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove vm_insert_pfn()Matthew Wilcox2018-10-262-39/+30
| | | | | | | | | | | | | | | All callers are now converted to vmf_insert_pfn() so convert vmf_insert_pfn() from being a compatibility wrapper around vm_insert_pfn() to being a compatibility wrapper around vmf_insert_pfn_prot(). Link: http://lkml.kernel.org/r/20180828145728.11873-8-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove references to vm_insert_pfn()Matthew Wilcox2018-10-263-5/+5
| | | | | | | | | | | | | Documentation and comments. Link: http://lkml.kernel.org/r/20180828145728.11873-7-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: make vm_insert_pfn_prot() staticMatthew Wilcox2018-10-262-27/+25
| | | | | | | | | | | | | Now this is no longer used outside mm/memory.c, make it static. Link: http://lkml.kernel.org/r/20180828145728.11873-6-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86: convert vdso to use vm_fault_tMatthew Wilcox2018-10-261-15/+9
| | | | | | | | | | | | | | | | Return vm_fault_t codes directly from the appropriate mm routines instead of converting from errnos ourselves. Fixes a minor bug where we'd return SIGBUS instead of the correct OOM code if we ran out of memory allocating page tables. Link: http://lkml.kernel.org/r/20180828145728.11873-5-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce vmf_insert_pfn_prot()Matthew Wilcox2018-10-262-16/+33
| | | | | | | | | | | | | | Like vm_insert_pfn_prot(), but returns a vm_fault_t instead of an errno. Also unexport vm_insert_pfn_prot as it has no modular users. Link: http://lkml.kernel.org/r/20180828145728.11873-4-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove vm_insert_mixed()Matthew Wilcox2018-10-262-18/+11
| | | | | | | | | | | | | | | All callers are now converted to vmf_insert_mixed() so convert vmf_insert_mixed() from being a compatibility wrapper into the real function. Link: http://lkml.kernel.org/r/20180828145728.11873-3-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cramfs: convert to use vmf_insert_mixedNicolas Pitre2018-10-261-1/+4
| | | | | | | | | | | | | | | cramfs is the only remaining user of vm_insert_mixed() and should be converted to vmf_insert_mixed(). Based on a previous patch from Matthew Wilcox. Link: http://lkml.kernel.org/r/nycvar.YSQ.7.76.1808290945450.10215@knanqh.ubzr Signed-off-by: Nicolas Pitre <nico@linaro.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com>a Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: convert to use vm_fault_tSouptick Joarder2018-10-261-2/+2
| | | | | | | | | | | | As part of vm_fault_t conversion filemap_page_mkwrite() for the NOMMU case was missed. Now converted. Link: http://lkml.kernel.org/r/20180828174952.GA29229@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc.c: clean up check_for_memory()Oscar Salvador2018-10-261-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | check_for_memory() looks a bit confusing. First of all, we have this: if (N_MEMORY == N_NORMAL_MEMORY) return; Checking the ENUM declaration, looks like N_MEMORY canot be equal to N_NORMAL_MEMORY. I could not find where N_MEMORY is set to N_NORMAL_MEMORY, or the other way around either, so unless I am missing something, this condition will never evaluate to true. It makes sense to get rid of it. Moving forward, the operations within the loop look a bit confusing as well. We set N_HIGH_MEMORY unconditionally, and then we set N_NORMAL_MEMORY in case we have CONFIG_HIGHMEM (N_NORMAL_MEMORY != N_HIGH_MEMORY) and zone <= ZONE_NORMAL. (N_HIGH_MEMORY falls back to N_NORMAL_MEMORY on !CONFIG_HIGHMEM systems, and that is why we can just go ahead and set N_HIGH_MEMORY unconditionally) Although this works, it is a bit subtle. I think that this could be easier to follow: First, we should only set N_HIGH_MEMORY in case we have CONFIG_HIGHMEM. And then we should set N_NORMAL_MEMORY in case zone <= ZONE_NORMAL, without further checking whether we have CONFIG_HIGHMEM or not. Link: http://lkml.kernel.org/r/20180828210158.4617-1-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michael Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/swapfile.c: clear si->swap_map[] in swap_free_cluster()Huang Ying2018-10-261-3/+1
| | | | | | | | | | | | | | | | | | | | si->swap_map[] of the swap entries in cluster needs to be cleared during freeing. Previously, this is done in the caller of swap_free_cluster(). This may cause code duplication (one user now, will add more users later) and lock/unlock cluster unnecessarily. In this patch, the clearing code is moved to swap_free_cluster() to avoid the downside. Link: http://lkml.kernel.org/r/20180827075535.17406-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/swapfile.c: call free_swap_slot() in __swap_entry_free()Huang Ying2018-10-261-6/+4
| | | | | | | | | | | | | | | | | | | | | This is a code cleanup patch without functionality change. Originally, when __swap_entry_free() is called, and its return value is 0, free_swap_slot() will always be called to free the swap entry to the per-CPU pool. So move the call to free_swap_slot() to __swap_entry_free() to simplify the code. Link: http://lkml.kernel.org/r/20180827075535.17406-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/swapfile.c: use __try_to_reclaim_swap() in free_swap_and_cache()Huang Ying2018-10-261-32/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code path to reclaim the swap entry in free_swap_and_cache() is almost same as that of __try_to_reclaim_swap(). The largest difference is just coding style. So the support to the additional requirement of free_swap_and_cache() is added into __try_to_reclaim_swap(). free_swap_and_cache() is changed to call __try_to_reclaim_swap(), and delete the duplicated code. This will improve code readability and reduce the potential bugs. There are 2 functionality differences between __try_to_reclaim_swap() and swap entry reclaim code of free_swap_and_cache(). - free_swap_and_cache() only reclaims the swap entry if the page is unmapped or swap is getting full. The support has been added into __try_to_reclaim_swap(). - try_to_free_swap() (called by __try_to_reclaim_swap()) checks pm_suspended_storage(), while free_swap_and_cache() not. I think this is OK. Because the page and the swap entry can be reclaimed later eventually. Link: http://lkml.kernel.org/r/20180827075535.17406-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kmemleak: add module param to print warnings to dmesgVincent Whitchurch2018-10-261-7/+35
| | | | | | | | | | | | | | | | | | | | | | | | | Currently, kmemleak only prints the number of suspected leaks to dmesg but requires the user to read a debugfs file to get the actual stack traces of the objects' allocation points. Add a module option to print the full object information to dmesg too. It can be enabled with kmemleak.verbose=1 on the kernel command line, or "echo 1 > /sys/module/kmemleak/parameters/verbose": This allows easier integration of kmemleak into test systems: We have automated test infrastructure to test our Linux systems. With this option, running our tests with kmemleak is as simple as enabling kmemleak and passing this command line option; the test infrastructure knows how to save kernel logs, which will now include kmemleak reports. Without this option, the test infrastructure needs to be specifically taught to read out the kmemleak debugfs file. Removing this need for special handling makes kmemleak more similar to other kernel debug options (slab debugging, debug objects, etc). Link: http://lkml.kernel.org/r/20180903144046.21023-1-vincent.whitchurch@axis.com Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate ↵Michal Hocko2018-10-267-59/+0
| | | | | | | | | | | | | | | | | | | | | | | callbacks" Revert 5ff7091f5a2ca ("mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks"). MMU_INVALIDATE_DOES_NOT_BLOCK flags was the only one used and it is no longer needed since 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers"). We now have a full support for per range !blocking behavior so we can drop the stop gap workaround which the per notifier flag was used for. Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, mmu_notifier: be explicit about range invalition non-blocking modeMichal Hocko2018-10-261-1/+3
| | | | | | | | | | | | | | | | | | If invalidate_range_start() is called for !blocking mode then all callbacks have to guarantee they will no block/sleep. The same obviously applies to invalidate_range_end because this operation pairs with the former and they are called from the same context. Make sure this is appropriately documented. Link: http://lkml.kernel.org/r/20180827112623.8992-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry()Michal Hocko2018-10-261-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tetsuo Handa has reported that it is possible to bypass the short sleep for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") and lock up the system if OOM. The primary reason is that WQ_MEM_RECLAIM WQs are not guaranteed to run even when they have a rescuer available. Those workers might be essential for reclaim to make a forward progress, however. If we are too unlucky all the allocations requests can get stuck waiting for a WQ_MEM_RECLAIM work item and the system is essentially stuck in an OOM condition without much hope to move on. Tetsuo has seen the reclaim stuck on drain_local_pages_wq or xlog_cil_push_work (xfs). There might be others. Since should_reclaim_retry() should be a natural reschedule point, let's do the short sleep for PF_WQ_WORKER threads unconditionally in order to guarantee that other pending work items are started. This will workaround this problem and it is less fragile than hunting down when the sleep is missed. Having a single sleeping point is more robust. [akpm@linux-foundation.org: reflow comment to 80 cols to save a couple of lines] Link: http://lkml.kernel.org/r/20180827135101.15700-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: don't miss the last page because of round-off errorRoman Gushchin2018-10-262-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I've noticed, that dying memory cgroups are often pinned in memory by a single pagecache page. Even under moderate memory pressure they sometimes stayed in such state for a long time. That looked strange. My investigation showed that the problem is caused by applying the LRU pressure balancing math: scan = div64_u64(scan * fraction[lru], denominator), where denominator = fraction[anon] + fraction[file] + 1. Because fraction[lru] is always less than denominator, if the initial scan size is 1, the result is always 0. This means the last page is not scanned and has no chances to be reclaimed. Fix this by rounding up the result of the division. In practice this change significantly improves the speed of dying cgroups reclaim. [guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments] Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: drain memcg stocks on css offliningRoman Gushchin2018-10-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | Memcg charge is batched using per-cpu stocks, so an offline memcg can be pinned by a cached charge up to a moment, when a process belonging to some other cgroup will charge some memory on the same cpu. In other words, cached charges can prevent a memory cgroup from being reclaimed for some time, without any clear need. Let's optimize it by explicit draining of all stocks on css offlining. As draining is performed asynchronously, and is skipped if any parallel draining is happening, it's cheap. Link: http://lkml.kernel.org/r/20180827162621.30187-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: rework memcg kernel stack accountingRoman Gushchin2018-10-262-7/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If CONFIG_VMAP_STACK is set, kernel stacks are allocated using __vmalloc_node_range() with __GFP_ACCOUNT. So kernel stack pages are charged against corresponding memory cgroups on allocation and uncharged on releasing them. The problem is that we do cache kernel stacks in small per-cpu caches and do reuse them for new tasks, which can belong to different memory cgroups. Each stack page still holds a reference to the original cgroup, so the cgroup can't be released until the vmap area is released. To make this happen we need more than two subsequent exits without forks in between on the current cpu, which makes it very unlikely to happen. As a result, I saw a significant number of dying cgroups (in theory, up to 2 * number_of_cpu + number_of_tasks), which can't be released even by significant memory pressure. As a cgroup structure can take a significant amount of memory (first of all, per-cpu data like memcg statistics), it leads to a noticeable waste of memory. Link: http://lkml.kernel.org/r/20180827162621.30187-1-guro@fb.com Fixes: ac496bf48d97 ("fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y") Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slub: extend slub debug to handle multiple slabsAaron Tomlin2018-10-262-9/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend the slub_debug syntax to "slub_debug=<flags>[,<slub>]*", where <slub> may contain an asterisk at the end. For example, the following would poison all kmalloc slabs: slub_debug=P,kmalloc* and the following would apply the default flags to all kmalloc and all block IO slabs: slub_debug=,bio*,kmalloc* Please note that a similar patch was posted by Iliyan Malchev some time ago but was never merged: https://marc.info/?l=linux-mm&m=131283905330474&w=2 Link: http://lkml.kernel.org/r/20180928111139.27962-1-atomlin@redhat.com Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Iliyan Malchev <malchev@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: don't warn about large allocations for slabDmitry Vyukov2018-10-262-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE, instead it falls back to kmalloc_large(). For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls kmalloc_slab() for all allocations relying on NULL return value for over-sized allocations. This inconsistency leads to unwanted warnings from kmalloc_slab() for over-sized allocations for slab. Returning NULL for failed allocations is the expected behavior. Make slub and slab code consistent by checking size > KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab(). While we are here also fix the check in kmalloc_slab(). We should check against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all kinda worked because for slab the constants are the same, and slub always checks the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab(). But if we get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will happen. For example, in case of a newly introduced bug in slub code. Also move the check in kmalloc_slab() from function entry to the size > 192 case. This partially compensates for the additional check in slab code and makes slub code a bit faster (at least theoretically). Also drop __GFP_NOWARN in the warning check. This warning means a bug in slab code itself, user-passed flags have nothing to do with it. Nothing of this affects slob. Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slub.c: switch to bitmap_zalloc()Andy Shevchenko2018-10-261-13/+7
| | | | | | | | | | | | | | | Switch to bitmap_zalloc() to show clearly what we are allocating. Besides that it returns pointer of bitmap type instead of opaque void *. Link: http://lkml.kernel.org/r/20180830104301.61649-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xtensa: use generic vga.hJiri Slaby2018-10-262-19/+1
| | | | | | | | | | | | What xtensa has in asm/vga.h is the same as what can be found in asm-generic/vga.h. So use the latter header. Link: http://lkml.kernel.org/r/20180907132219.12979-1-jslaby@suse.cz Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/iomap.c: change return type to vm_fault_tSouptick Joarder2018-10-262-2/+4
| | | | | | | | | | | | | | Change iomap_page_mkwrite() return type to vm_fault_t. see commit 1c8f422059ae ("mm: change return type to vm_fault_t") for reference. Link: http://lkml.kernel.org/r/20180827172050.GA18673@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: remove set but not used variable 'rb'YueHaibing2018-10-261-2/+0
| | | | | | | | | | | | | | | | | | | Fixes gcc '-Wunused-but-set-variable' warning: fs/ocfs2/refcounttree.c: In function 'ocfs2_create_reflink_node': fs/ocfs2/refcounttree.c:4138:31: warning: variable 'rb' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1536198443-113047-1-git-send-email-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/ocfs2/dlm/dlmdebug.c: fix a sleep-in-atomic-context bug in ↵Jia-Ju Bai2018-10-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dlm_print_one_mle() The kernel module may sleep with holding a spinlock. The function call paths (from bottom to top) in Linux-4.16 are: [FUNC] get_zeroed_page(GFP_NOFS) fs/ocfs2/dlm/dlmdebug.c, 332: get_zeroed_page in dlm_print_one_mle fs/ocfs2/dlm/dlmmaster.c, 240: dlm_print_one_mle in __dlm_put_mle fs/ocfs2/dlm/dlmmaster.c, 255: __dlm_put_mle in dlm_put_mle fs/ocfs2/dlm/dlmmaster.c, 254: spin_lock in dlm_put_ml [FUNC] get_zeroed_page(GFP_NOFS) fs/ocfs2/dlm/dlmdebug.c, 332: get_zeroed_page in dlm_print_one_mle fs/ocfs2/dlm/dlmmaster.c, 240: dlm_print_one_mle in __dlm_put_mle fs/ocfs2/dlm/dlmmaster.c, 222: __dlm_put_mle in dlm_put_mle_inuse fs/ocfs2/dlm/dlmmaster.c, 219: spin_lock in dlm_put_mle_inuse To fix this bug, GFP_NOFS is replaced with GFP_ATOMIC. This bug is found by my static analysis tool DSAC. Link: http://lkml.kernel.org/r/20180901112528.27025-1-baijiaju1990@gmail.com Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: remove unneeded null checkDing Xiang2018-10-261-2/+1
| | | | | | | | | | | | | | | Null check for kfree is unnecessary, so remove it. Link: http://lkml.kernel.org/r/1535704514-26559-1-git-send-email-dingxiang@cmss.chinamobile.com Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: remove unused pointer 'eb'Colin Ian King2018-10-261-4/+0
| | | | | | | | | | | | | | | | | | | Pointer 'eb' is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'eb' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/20180828141907.10826-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2/dlm: remove unnecessary parenthesesNathan Chancellor2018-10-261-1/+1
| | | | | | | | | | | | | | | | | | | | | Clang warns when more than one set of parentheses is used for a single conditional statement: fs/ocfs2/dlm/dlmthread.c:534:18: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] if ((res->owner == dlm->node_num)) { ~~~~~~~~~~~^~~~~~~~~~~~~~~~ fs/ocfs2/dlm/dlmthread.c:534:18: note: remove extraneous parentheses around the comparison to silence this warning if ((res->owner == dlm->node_num)) { ~ ^ ~ Link: http://lkml.kernel.org/r/20180924181929.6853-1-natechancellor@gmail.com Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* scripts/tags.sh: add DECLARE_HASHTABLE()Kirill Tkhai2018-10-261-1/+1
| | | | | | | | | | | | | | | | In addition to DEFINE_HASHTABLE() add DECLARE_ variant. Link: http://lkml.kernel.org/r/153683203215.13678.11468076350083405643.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Constantine Shulyupin <const@MakeLinux.com> Cc: Arend van Spriel <arend.vanspriel@broadcom.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/test_kasan.c: add tests for several string/memory API functionsAndrey Ryabinin2018-10-261-0/+70
| | | | | | | | | | | | | | | | | | | | Arch code may have asm implementation of string/memory API functions instead of using generic one from lib/string.c. KASAN don't see memory accesses in asm code, thus can miss many bugs. E.g. on ARM64 KASAN don't see bugs in memchr(), memcmp(), str[r]chr(), str[n]cmp(), str[n]len(). Add tests for these functions to be sure that we notice the problem on other architectures. Link: http://lkml.kernel.org/r/20180920135631.23833-3-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kyeongdon Kim <kyeongdon.kim@lge.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* arm64: lib: use C string functions with KASAN enabledAndrey Ryabinin2018-10-2610-16/+21
| | | | | | | | | | | | | | | | | | | | | | | ARM64 has asm implementation of memchr(), memcmp(), str[r]chr(), str[n]cmp(), str[n]len(). KASAN don't see memory accesses in asm code, thus it can potentially miss many bugs. Ifdef out __HAVE_ARCH_* defines of these functions when KASAN is enabled, so the generic implementations from lib/string.c will be used. We can't just remove the asm functions because efistub uses them. And we can't have two non-weak functions either, so declare the asm functions as weak. Link: http://lkml.kernel.org/r/20180920135631.23833-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Kyeongdon Kim <kyeongdon.kim@lge.com> Cc: Alexander Potapenko <glider@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/linkage.h: align weak symbolsAndrey Ryabinin2018-10-261-0/+1
| | | | | | | | | | | | | | | | | Since WEAK() supposed to be used instead of ENTRY() to define weak symbols, but unlike ENTRY() it doesn't have ALIGN directive. It seems there is no actual reason to not have, so let's add ALIGN to WEAK() too. Link: http://lkml.kernel.org/r/20180920135631.23833-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Will Deacon <will.deacon@arm.com>, Catalin Marinas <catalin.marinas@arm.com> Cc: Kyeongdon Kim <kyeongdon.kim@lge.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/pfn_t.h: force '~' to be parsed as an unary operatorSebastien Boisvert2018-10-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tracing the event "fs_dax:dax_pmd_insert_mapping" with perf produces this warning: [fs_dax:dax_pmd_insert_mapping] unknown op '~' It is printed in process_op (tools/lib/traceevent/event-parse.c) because '~' is parsed as a binary operator. perf reads the format of fs_dax:dax_pmd_insert_mapping ("print fmt") from /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_insert_mapping/format . The format contains: ~(((u64) ~(~(((1UL) << 12)-1))) ^ \ interpreted as a binary operator by process_op(). This part is generated in the declaration of the event class dax_pmd_insert_mapping_class in include/trace/events/fs_dax.h : __print_flags_u64(__entry->pfn_val & PFN_FLAGS_MASK, "|", PFN_FLAGS_TRACE), This patch adds a pair of parentheses in the declaration of PFN_FLAGS_MASK to make sure that '~' is parsed as a unary operator by perf. The part of the format that was problematic is now: ~(((u64) (~(~(((1UL) << 12)-1)))) Now, all the '~' are parsed as unary operators. Link: http://lkml.kernel.org/r/20181021145939.8760-1-sebhtml@videotron.qc.ca Signed-off-by: Sebastien Boisvert <sebhtml@videotron.qc.ca> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Elenie Godzaridis <arangradient@gmail.com> Cc: <stable@vger.kerenl.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* userfaultfd: disable irqs when taking the waitqueue lockChristoph Hellwig2018-10-261-4/+4
| | | | | | | | | | | | | | | | userfaultfd contains howe-grown locking of the waitqueue lock, and does not disable interrupts. This relies on the fact that no one else takes it from interrupt context and violates an invariat of the normal waitqueue locking scheme. With aio poll it is easy to trigger other locks that disable interrupts (or are called from interrupt context). Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> [4.19.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: /proc/pid/smaps_rollup: fix NULL pointer deref in smaps_pte_range()Vlastimil Babka2018-10-261-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Leonardo reports an apparent regression in 4.19-rc7: BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631 Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018 RIP: 0010:smaps_pte_range+0x32d/0x540 Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 <48> 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8 RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001 RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578 R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728 R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48 FS: 00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __walk_page_range+0x3c2/0x6f0 walk_page_vma+0x42/0x60 smap_gather_stats+0x79/0xe0 ? gather_pte_stats+0x320/0x320 ? gather_hugetlb_stats+0x70/0x70 show_smaps_rollup+0xcd/0x1c0 seq_read+0x157/0x400 __vfs_read+0x3a/0x180 ? security_file_permission+0x93/0xc0 ? security_file_permission+0x93/0xc0 vfs_read+0x8f/0x140 ksys_read+0x55/0xc0 __x64_sys_read+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Decoded code matched to local compilation+disassembly points to smaps_pte_entry(): } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap && pte_none(*pte))) { page = find_get_entry(vma->vm_file->f_mapping, linear_page_index(vma, addr)); Here, vma->vm_file is NULL. mss->check_shmem_swap should be false in that case, however for smaps_rollup, smap_gather_stats() can set the flag true for one vma and leave it true for subsequent vma's where it should be false. To fix, reset the check_shmem_swap flag to false. There's also related bug which sets mss->swap to shmem_swapped, which in the context of smaps_rollup overwrites any value accumulated from previous vma's. Fix that as well. Note that the report suggests a regression between 4.17.19 and 4.19-rc7, which makes the 4.19 series ending with commit 258f669e7e88 ("mm: /proc/pid/smaps_rollup: convert to single value seq_file") suspicious. But the mss was reused for rollup since 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") so let's play it safe with the stable backport. Link: http://lkml.kernel.org/r/555fbd1f-4ac9-0b58-dcd4-5dc4380ff7ca@suse.cz Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377 Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Leonardo Soares Müller <leozinho29_eu@hotmail.com> Tested-by: Leonardo Soares Müller <leozinho29_eu@hotmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Colascione <dancol@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>