summaryrefslogtreecommitdiffstats
path: root/mm/madvise.c
Commit message (Collapse)AuthorAgeFilesLines
* ksm: the mm interface to ksmHugh Dickins2009-09-221-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch presents the mm interface to a dummy version of ksm.c, for better scrutiny of that interface: the real ksm.c follows later. When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and MADV_UNMERGEABLE with EINVAL, since that seems more helpful than pretending that they can be serviced. But when CONFIG_KSM=y, accept them even if KSM is not currently running, and even on areas which KSM will not touch (e.g. hugetlb or shared file or special driver mappings). Like other madvices, report ENOMEM despite success if any area in the range is unmapped, and use EAGAIN to report out of memory. Define vma flag VM_MERGEABLE to identify an area on which KSM may try merging pages: leave it to ksm_madvise() to decide whether to set it. Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain VM_MERGEABLE areas, to minimize callouts when forking or exiting. Based upon earlier patches by Chris Wright and Izik Eidus. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ksm: first tidy up madvise_vma()Hugh Dickins2009-09-221-26/+13
| | | | | | | | | | | | | | | | | | | | | | | | madvise.c has several levels of switch statements, what to do in which? Move MADV_DOFORK code down from madvise_vma() to madvise_behavior(), so madvise_vma() can be a simple router, to madvise_behavior() by default. vma->vm_flags is an unsigned long so use the same type for new_flags. Add missing comment lines to describe MADV_DONTFORK and MADV_DOFORK. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: madvise(): correct return codeNick Piggin2009-06-161-1/+22
| | | | | | | | | | | | | | | | | The posix_madvise() function succeeds (and does nothing) when called with parameters (NULL, 0, -1); according to LSB tests, it should fail with EINVAL because -1 is not a valid flag. When called with a valid address and size, it correctly fails. So perform an initial check for valid flags first. Reported-by: Jiri Dluhos <jdluhos@novell.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-and-Tested-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* readahead: move max_sane_readahead() calls into force_page_cache_readahead()Wu Fengguang2009-06-161-2/+1
| | | | | | | | | | Impact: code simplification. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "Ignore madvise(MADV_WILLNEED) for hugetlbfs-backed regions"Linus Torvalds2009-05-131-8/+0
| | | | | | | | | | | This reverts commit a425a638c858fd10370b573bde81df3ba500e271. Now that the previous commit removed the "readpage" actor for hugetlb files, read-ahead will no longer mess up the mapping, and there's no longer any reason to treat hugetlbfs mappings specially. Tested-and-acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Ignore madvise(MADV_WILLNEED) for hugetlbfs-backed regionsMel Gorman2009-05-051-0/+8
| | | | | | | | | | | | | | | | | | madvise(MADV_WILLNEED) forces page cache readahead on a range of memory backed by a file. The assumption is made that the page required is order-0 and "normal" page cache. On hugetlbfs, this assumption is not true and order-0 pages are allocated and inserted into the hugetlbfs page cache. This leaks hugetlbfs page reservations and can cause BUGs to trigger related to corrupted page tables. This patch causes MADV_WILLNEED to be ignored for hugetlbfs-backed regions. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [CVE-2009-0029] System call wrappers part 14Heiko Carstens2009-01-141-1/+1
| | | | Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
* madvise: update function comment of madvise_dontneedFernando Luis Vazquez Cao2008-07-301-2/+2
| | | | | | | | | Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xip: support non-struct page backed memoryNick Piggin2008-04-281-1/+1
| | | | | | | | | | | | | | | | | | | Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for the user mappings. This requires the get_xip_page API to be changed to an address based one. Improve the API layering a little bit too, while we're here. This is required in order to support XIP filesystems on memory that isn't backed with struct page (but memory with struct page is still supported too). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* speed up madvise_need_mmap_write() usageJason Baron2007-07-161-2/+4
| | | | | | | | | | | | In the new madvise_need_mmap_write() call we can avoid an extra case statement and function call as follows. Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Detach sched.h from mm.hAlexey Dobriyan2007-05-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: madvise avoid exclusive mmap_semNick Piggin2007-05-071-4/+29
| | | | | | | | | | Avoid down_write of the mmap_sem in madvise when we can help it. Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] holepunch: fix mmap_sem i_mutex deadlockHugh Dickins2007-03-291-5/+14
| | | | | | | | | | | | | | | | | | | sys_madvise has down_write of mmap_sem, then madvise_remove calls vmtruncate_range which takes i_mutex and i_alloc_sem: no, we can easily devise deadlocks from that ordering. madvise_remove drop mmap_sem while calling vmtruncate_range: luckily, since madvise_remove doesn't split or merge vmas, it's easy to handle this case with a NULL prev, without restructuring sys_madvise. (Though sad to retake mmap_sem when it's unlikely to be needed, and certainly down_read is sufficient for MADV_REMOVE, unlike the other madvices.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] mm: fix madvise infinine loopNick Piggin2007-03-161-1/+4
| | | | | | | | | | | | madvise(MADV_REMOVE) can go into an infinite loop or cause an oops if the call covers a region from the start of a vma, and extending past that vma. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Fix MADV_REMOVE protection checkingHugh Dickins2006-04-171-0/+3
| | | | | | | | madvise_remove needs to respect file and mmap protections. Signed-off-by: Hugh Dickins <hugh@veritas.com> [ Will the real CVE-2006-1524 stand up, please.. ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] madvise MADV_DONTFORK/MADV_DOFORKMichael S. Tsirkin2006-02-141-4/+17
| | | | | | | | | | | | | | | | | | | | | | | | Currently, copy-on-write may change the physical address of a page even if the user requested that the page is pinned in memory (either by mlock or by get_user_pages). This happens if the process forks meanwhile, and the parent writes to that page. As a result, the page is orphaned: in case of get_user_pages, the application will never see any data hardware DMA's into this page after the COW. In case of mlock'd memory, the parent is not getting the realtime/security benefits of mlock. In particular, this affects the Infiniband modules which do DMA from and into user pages all the time. This patch adds madvise options to control whether memory range is inherited across fork. Useful e.g. for when hardware is doing DMA from/into these pages. Could also be useful to an application wanting to speed up its forks by cutting large areas out of consideration. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing storeBadari Pulavarty2006-01-061-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here is the patch to implement madvise(MADV_REMOVE) - which frees up a given range of pages & its associated backing store. Current implementation supports only shmfs/tmpfs and other filesystems return -ENOSYS. "Some app allocates large tmpfs files, then when some task quits and some client disconnect, some memory can be released. However the only way to release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli Databases want to use this feature to drop a section of their bufferpool (shared memory segments) - without writing back to disk/swap space. This feature is also useful for supporting hot-plug memory on UML. Concerns raised by Andrew Morton: - "We have no plan for holepunching! If we _do_ have such a plan (or might in the future) then what would the API look like? I think sys_holepunch(fd, start, len), so we should start out with that." - Using madvise is very weird, because people will ask "why do I need to mmap my file before I can stick a hole in it?" - None of the other madvise operations call into the filesystem in this manner. A broad question is: is this capability an MM operation or a filesytem operation? truncate, for example, is a filesystem operation which sometimes has MM side-effects. madvise is an mm operation and with this patch, it gains FS side-effects, only they're really, really significant ones." Comments: - Andrea suggested the fs operation too but then it's more efficient to have it as a mm operation with fs side effects, because they don't immediatly know fd and physical offset of the range. It's possible to fixup in userland and to use the fs operation but it's more expensive, the vmas are already in the kernel and we can use them. Short term plan & Future Direction: - We seem to need this interface only for shmfs/tmpfs files in the short term. We have to add hooks into the filesystem for correctness and completeness. This is what this patch does. - In the future, plan is to support both fs and mmap apis also. This also involves (other) filesystem specific functions to be implemented. - Current patch doesn't support VM_NONLINEAR - which can be addressed in the future. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrea Arcangeli <andrea@suse.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* mm: re-architect the VM_UNPAGED logicLinus Torvalds2005-11-281-1/+1
| | | | | | | | | | | | | | | | This replaces the (in my opinion horrible) VM_UNMAPPED logic with very explicit support for a "remapped page range" aka VM_PFNMAP. It allows a VM area to contain an arbitrary range of page table entries that the VM never touches, and never considers to be normal pages. Any user of "remap_pfn_range()" automatically gets this new functionality, and doesn't even have to mark the pages reserved or indeed mark them any other way. It just works. As a side effect, doing mmap() on /dev/mem works for arbitrary ranges. Sparc update from David in the next commit. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] unpaged: VM_UNPAGEDHugh Dickins2005-11-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few drivers set VM_RESERVED on areas which are then populated by nopage. The PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in zap_pte_range, without changing those drivers not to set it: so their pages just leak away. Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core, to flag the special areas where the ptes may have no struct page, or if they have then it's not to be touched. Replace most instances of VM_RESERVED in core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and sparc64 io_remap_pfn_range. Revert addition of VM_RESERVED to powerpc vdso, it's not needed there. Is it needed anywhere? It still governs the mm->reserved_vm statistic, and special vmas not to be merged, and areas not to be core dumped; but could probably be eliminated later (the drivers are probably specifying it because in 2.4 it kept swapout off the vma, but in 2.6 we work from the LRU, which these pages don't get on). Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no purpose whatsoever, and should be removed from drivers when we clean up. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] core remove PageReservedNick Piggin2005-10-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] madvise: Avoid returning error code -EBADF for anonymous mappingsSuzuki2005-10-111-7/+4
| | | | | | | | | | | | | | Revert this recent correctness change: Douglas Crosher <dcrosher@scieneer.com> reported that it broke an existing application, and that madvise() works without error on anonymous mappings on Solaris. This means that madvise() will remain non-standards-compliant: we should return -EBADF for all requests against non-file-backed vma's, but Linux only does this for MADV_WILLNEED requests. Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: fix madvise vma mergingHugh Dickins2005-09-051-4/+5
| | | | | | | | | | | | | | | | | | | | | | | Better late than never, I've at last reviewed the madvise vma merging going into 2.6.13. Remove a pointless check and fix two little bugs - a simple test (with /proc/<pid>/maps hacked to show ReadHints) showed both mismerges in practice: though being madvise, neither was disastrous. 1. Correct placement of the success label in madvise_behavior: as in mprotect_fixup and mlock_fixup, it is necessary to update vm_flags when vma_merge succeeds (to handle the exceptional Case 8 noted in the comments above vma_merge itself). 2. Correct initial value of prev when starting part way into a vma: as in sys_mprotect and do_mlock, it needs to be set to vma in this case (vma_merge handles only that minimum of cases shown in its comments). 3. If find_vma_prev sets prev, then the vma it returns is prev->vm_next, so it's pointless to make that same assignment again in sys_madvise. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] madvise() does not always return -EBADF on non-file mapped areasuzuki2005-07-271-5/+8
| | | | | | | | | | | | | | | The madvise() system call returns -EBADF for areas which does not map to files, only for *behaviour* request MADV_WILLNEED. According to man pages, madvise returns : EBADF - the map exists, but the area maps something that isn't a file. Fixes bug 2995. Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] xip: madvice/fadvice: execute in placeCarsten Otte2005-06-241-0/+5
| | | | | | | | Make sys_madvice/fadvice return sane with xip. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove redundant vm_flags clearing from madvise.cPekka Enberg2005-06-231-1/+0
| | | | | | | | | This patch removes redundant VM_ClearReadHint from mm/madvice.c which was left there by Prasanna's patch. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] madvise: merge the mapsPrasanna Meda2005-06-211-29/+51
| | | | | | | | | | | | | This attempts to merge back the split maps. This code is mostly copied from Chrisw's mlock merging from post 2.6.11 trees. The only difference is in munmapped_error handling. Also passed prev to willneed/dontneed, eventhogh they do not handle it now, since I felt it will be cleaner, instead of handling prev in madvise_vma in some cases and in subfunction in some cases. Signed-off-by: Prasanna Meda <pmeda@akamai.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] madvise: do not split the mapsPrasanna Meda2005-06-211-11/+16
| | | | | | | | | | This attempts to avoid splittings when it is not needed, that is when vm_flags are same as new flags. The idea is from the <2.6.11 mlock_fixup and others. This will provide base for the next madvise merging patch. Signed-off-by: Prasanna Meda <pmeda@akamai.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Linux-2.6.12-rc2v2.6.12-rc2Linus Torvalds2005-04-161-0/+242
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!