summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
* Clean up and export cancel_dirty_page() to modulesLinus Torvalds2006-12-231-4/+8
| | | | | | | | | | | Make cancel_dirty_page() act more like all the other dirty and writeback accounting functions: test for "mapping" being NULL, and do the NR_FILE_DIRY accounting purely based on mapping_cap_account_dirty()). Also, add it to the exports, so that modular filesystems can use it. Acked-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Fix up page_mkclean_one(): virtual caches, s390Peter Zijlstra2006-12-221-10/+13
| | | | | | | | | | - add flush_cache_page() for all those virtual indexed cache architectures. - handle s390. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: more rmap debuggingNick Piggin2006-12-224-7/+14
| | | | | | | | | Add more debugging in the rmap code in an attempt to locate to source of the occasional "mapcount went negative" assertions. Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Fix swapped parameters in mm/vmscan.cNigel Cunningham2006-12-221-2/+2
| | | | | | | | | | | | | The version of mm/vmscan.c in Linus' current tree has swapped parameters in the shrink_all_zones declaration and call, used by the various suspend-to-disk implementations. This doesn't seem to have any great adverse effect, but it's clearly wrong. Signed-off-by: Nigel Cunningham <nigel@suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fix kernel-doc warnings in 2.6.20-rc1Randy Dunlap2006-12-221-0/+1
| | | | | | | | Fix kernel-doc warnings in 2.6.20-rc1. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix kmem_ptr_validate definitionChristoph Lameter2006-12-221-1/+1
| | | | | | | | | | The declaration of kmem_ptr_validate in slab.h does not match the one in slab.c. Remove the fastcall attribute (this is the only use in slab.c). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Fix for shmem_truncate_range() BUG_ON()Badari Pulavarty2006-12-221-1/+6
| | | | | | | | | | | | | | Ran into BUG() while doing madvise(REMOVE) testing. If we are punching a hole into shared memory segment using madvise(REMOVE) and the entire hole is below the indirect blocks, we hit following assert. BUG_ON(limit <= SHMEM_NR_DIRECT); Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] truncate: dirty memory accounting fixAndrew Morton2006-12-221-1/+2
| | | | | | | | | Only (un)account for IO and page-dirtying for devices which have real backing store (ie: not tmpfs or ramdisks). Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] truncate: clear page dirtiness before running try_to_free_buffers()Andrew Morton2006-12-211-4/+5
| | | | | | | | | | | | | | | | | | | | | truncate presently invalidates the dirty page's buffer_heads then shoots down the page. But try_to_free_buffers() will now bale out because the page is dirty. Net effect: the LRU gets filled with dirty pages which have invalidated buffer_heads attached. They have no ->mapping and hence cannot be cleaned. The machine leaks memory at an enormous rate. Fix this by cleaning the page before running try_to_free_buffers(), so try_to_free_buffers() can do its work. Also, remember to do dirty-page-acoounting in cancel_dirty_page() so the machine won't wedge up trying to write non-existent dirty pages. Probably still wrong, but now less so. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* VM: Remove "clear_page_dirty()" and "test_clear_page_dirty()" functionsLinus Torvalds2006-12-212-40/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | They were horribly easy to mis-use because of their tempting naming, and they also did way more than any users of them generally wanted them to do. A dirty page can become clean under two circumstances: (a) when we write it out. We have "clear_page_dirty_for_io()" for this, and that function remains unchanged. In the "for IO" case it is not sufficient to just clear the dirty bit, you also have to mark the page as being under writeback etc. (b) when we actually remove a page due to it becoming inaccessible to users, notably because it was truncate()'d away or the file (or metadata) no longer exists, and we thus want to cancel any outstanding dirty state. For the (b) case, we now introduce "cancel_dirty_page()", which only touches the page state itself, and verifies that the page is not mapped (since cancelling writes on a mapped page would be actively wrong as it is still accessible to users). Some filesystems need to be fixed up for this: CIFS, FUSE, JFS, ReiserFS, XFS all use the old confusing functions, and will be fixed separately in subsequent commits (with some of them just removing the offending logic, and others using clear_page_dirty_for_io()). This was confirmed by Martin Michlmayr to fix the apt database corruption on ARM. Cc: Martin Michlmayr <tbm@cyrius.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Andrei Popa <andrei.popa@i-neo.ro> Cc: Andrew Morton <akpm@osdl.org> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Gordon Farquharson <gordonfarquharson@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sys_mincore: s/max/min/Oleg Nesterov2006-12-171-1/+1
| | | | | | | fix a typo, sys_mincore() needs min(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus "I'm a moron" Torvalds <torvalds@osdl.org>
* Fix up mm/mincore.c error value casesLinus Torvalds2006-12-161-19/+10
| | | | | | | | | | | | | | | | | | | Hugh Dickins correctly points out that mincore() is actually _supposed_ to fail on an unmapped hole in the user address space, rather than return valid ("empty") information about the hole. This just simplifies the problem further (I had been misled by our previous confusing and complicated way of doing mincore()). Also, in the unlikely situation that we can't allocate a temporary kernel buffer, we should actually return EAGAIN, not ENOMEM, to keep the "unmapped hole" and "allocation failure" error cases separate. Finally, add a comment about our stupid historical lack of support for anonymous mappings. I'll fix that if somebody reminds me after 2.6.20 is out. Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Fix incorrect user space access locking in mincore()Linus Torvalds2006-12-161-104/+86
| | | | | | | | | | | | | | | | | | Doug Chapman noticed that mincore() will doa "copy_to_user()" of the result while holding the mmap semaphore for reading, which is a big no-no. While a recursive read-lock on a semaphore in the case of a page fault happens to work, we don't actually allow them due to deadlock schenarios with writers due to fairness issues. Doug and Marcel sent in a patch to fix it, but I decided to just rewrite the mess instead - not just fixing the locking problem, but making the code smaller and (imho) much easier to understand. Cc: Doug Chapman <dchapman@redhat.com> Cc: Marcel Holtmann <holtmann@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Pass vma argument to copy_user_highpage().Atsushi Nemoto2006-12-132-8/+8
| | | | | | | | | | | | | To allow a more effective copy_user_highpage() on certain architectures, a vma argument is added to the function and cow_user_page() allowing the implementation of these functions to check for the VM_EXEC bit. The main part of this patch was originally written by Ralf Baechle; Atushi Nemoto did the the debugging. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] SLAB: use a multiply instead of a divide in obj_to_index()Eric Dumazet2006-12-131-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When some objects are allocated by one CPU but freed by another CPU we can consume lot of cycles doing divides in obj_to_index(). (Typical load on a dual processor machine where network interrupts are handled by one particular CPU (allocating skbufs), and the other CPU is running the application (consuming and freeing skbufs)) Here on one production server (dual-core AMD Opteron 285), I noticed this divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opteron are quite modern cpus and the divide is much more expensive on oldest architectures : On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle for a multiply. Doing some math, we can use a reciprocal multiplication instead of a divide. If we want to compute V = (A / B) (A and B being u32 quantities) we can instead use : V = ((u64)A * RECIPROCAL(B)) >> 32 ; where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B Note : I wrote pure C code for clarity. gcc output for i386 is not optimal but acceptable : mull 0x14(%ebx) mov %edx,%eax // part of the >> 32 xor %edx,%edx // useless mov %eax,(%esp) // could be avoided mov %edx,0x4(%esp) // useless mov (%esp),%ebx [akpm@osdl.org: small cleanups] Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset: rework cpuset_zone_allowed apiPaul Jackson2006-12-135-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] More slab.h cleanupsChristoph Lameter2006-12-132-2/+2
| | | | | | | | | | | | | | | | | | | | | More cleanups for slab.h 1. Remove tabs from weird locations as suggested by Pekka 2. Drop the check for NUMA and SLAB_DEBUG from the fallback section as suggested by Pekka. 3. Uses static inline for the fallback defs as also suggested by Pekka. 4. Make kmem_ptr_valid take a const * argument. 5. Separate the NUMA fallback definitions from the kmalloc_track fallback definitions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Cleanup slab headers / API to allow easy addition of new slab allocatorsChristoph Lameter2006-12-131-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | This is a response to an earlier discussion on linux-mm about splitting slab.h components per allocator. Patch is against 2.6.19-git11. See http://marc.theaimsgroup.com/?l=linux-mm&m=116469577431008&w=2 This patch cleans up the slab header definitions. We define the common functions of slob and slab in slab.h and put the extra definitions needed for slab's kmalloc implementations in <linux/slab_def.h>. In order to get a greater set of common functions we add several empty functions to slob.c and also rename slob's kmalloc to __kmalloc. Slob does not need any special definitions since we introduce a fallback case. If there is no need for a slab implementation to provide its own kmalloc mess^H^H^Hacros then we simply fall back to __kmalloc functions. That is sufficient for SLOB. Sort the function in slab.h according to their functionality. First the functions operating on struct kmem_cache * then the kmalloc related functions followed by special debug and fallback definitions. Also redo a lot of comments. Signed-off-by: Christoph Lameter <clameter@sgi.com>? Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix sleeping in atomic bugChristoph Lameter2006-12-131-0/+6
| | | | | | | | | | | | | Fallback_alloc() does not do the check for GFP_WAIT as done in cache_grow(). Thus interrupts are disabled when we call kmem_getpages() which results in the failure. Duplicate the handling of GFP_WAIT in cache_grow(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Jay Cliburn <jacliburn@bellsouth.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] user of the jiffies rounding patch: SlabArjan van de Ven2006-12-101-3/+5
| | | | | | | | | | | | This patch introduces users of the round_jiffies() function in the slab code. The slab code has a few "run every second" timers for background work; these are obviously not timing critical as long as they happen roughly at the right frequency. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUEDZach Brown2006-12-101-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only time it is safe to call aio_complete() is when the ->ki_retry function returns -EIOCBQUEUED to the AIO core. direct_io_worker() has historically done this by relying on its caller to translate positive return codes into -EIOCBQUEUED for the aio case. It did this by trying to keep conditionals in sync. direct_io_worker() knew when finished_one_bio() was going to call aio_complete(). It would reverse the test and wait and free the dio in the cases it thought that finished_one_bio() wasn't going to. Not surprisingly, it ended up getting it wrong. 'ret' could be a negative errno from the submission path but it failed to communicate this to finished_one_bio(). direct_io_worker() would return < 0, it's callers wouldn't raise -EIOCBQUEUED, and aio_complete() would be called. In the future finished_one_bio()'s tests wouldn't reflect this and aio_complete() would be called for a second time which can manifest as an oops. The previous cleanups have whittled the sync and async completion paths down to the point where we can collapse them and clearly reassert the invariant that we must only call aio_complete() after returning -EIOCBQUEUED. direct_io_worker() will only return -EIOCBQUEUED when it is not the last to drop the dio refcount and the aio bio completion path will only call aio_complete() when it is the last to drop the dio refcount. direct_io_worker() can ensure that it is the last to drop the reference count by waiting for bios to drain. It does this for sync ops, of course, and for partial dio writes that must fall back to buffered and for aio ops that saw errors during submission. This means that operations that end up waiting, even if they were issued as aio ops, will not call aio_complete() from dio. Instead we return the return code of the operation and let the aio core call aio_complete(). This is purposely done to fix a bug where AIO DIO file extensions would call aio_complete() before their callers have a chance to update i_size. Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers no longer have to translate for it. XFS needs to be careful not to free resources that will be used during AIO completion if -EIOCBQUEUED is returned. We maintain the previous behaviour of trying to write fs metadata for O_SYNC aio+dio writes. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: <xfs-masters@oss.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] io-accounting-read-accounting nfs fixAndrew Morton2006-12-101-0/+2
| | | | | | | | | | | | | | | | nfs's ->readpages uses read_cache_pages(). Wire it up there. [wfg@mail.ustc.edu.cn: account only successful nfs/fuse reads] Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] io-accounting: write-cancel accountingAndrew Morton2006-12-101-1/+3
| | | | | | | | | | | | | | | Account for the number of byte writes which this process caused to not happen after all. Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] io-accounting: write accountingAndrew Morton2006-12-101-1/+4
| | | | | | | | | | | | | | | | | | | | Accounting writes is fairly simple: whenever a process flips a page from clean to dirty, we accuse it of having caused a write to underlying storage of PAGE_CACHE_SIZE bytes. This may overestimate the amount of writing: the page-dirtying may cause only one buffer_head's worth of writeout. Fixing that is possible, but probably a bit messy and isn't obviously important. Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] clean up __set_page_dirty_nobuffers()Andrew Morton2006-12-101-45/+43
| | | | | | | | | | | | | | | Save a tabstop in __set_page_dirty_nobuffers() and __set_page_dirty_buffers() and a few other places. No functional changes. Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] read_zero_pagealigned() locking fixHugh Dickins2006-12-101-11/+21
| | | | | | | | | | | | | | | | | | | | | | | | | Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel bugzilla 7645. Right: read_zero_pagealigned uses down_read of mmap_sem, but another thread's racing read of /dev/zero, or a normal fault, can easily set that pte again, in between zap_page_range and zeromap_page_range getting there. It's been wrong ever since 2.4.3. The simple fix is to use down_write instead, but that would serialize reads of /dev/zero more than at present: perhaps some app would be badly affected. So instead let zeromap_page_range return the error instead of BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in that case - there's no need to optimize for it. Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of zeromap_page_range), though it really isn't interesting there. And since mmap_zero wants -EAGAIN for out-of-memory, the zeromaps better return that than -ENOMEM. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Ramiro Voicu: <Ramiro.Voicu@cern.ch> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fault-injection: defaults likely to please a new userDon Mullis2006-12-082-0/+3
| | | | | | | | | | | | | Assign defaults most likely to please a new user: 1) generate some logging output (verbose=2) 2) avoid injecting failures likely to lock up UI (ignore_gfp_wait=1, ignore_gfp_highmem=1) Signed-off-by: Don Mullis <dwm@meer.net> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fault-injection capability for alloc_pages()Akinobu Mita2006-12-081-0/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch provides fault-injection capability for alloc_pages() Boot option: fail_page_alloc=<interval>,<probability>,<space>,<times> <interval> -- specifies the interval of failures. <probability> -- specifies how often it should fail in percent. <space> -- specifies the size of free space where memory can be allocated safely in pages. <times> -- specifies how many times failures may happen at most. Debugfs: /debug/fail_page_alloc/interval /debug/fail_page_alloc/probability /debug/fail_page_alloc/specifies /debug/fail_page_alloc/times /debug/fail_page_alloc/ignore-gfp-highmem /debug/fail_page_alloc/ignore-gfp-wait Example: fail_page_alloc=10,100,0,-1 The page allocation (alloc_pages(), ...) fails once per 10 times. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fault-injection capability for kmallocAkinobu Mita2006-12-081-0/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch provides fault-injection capability for kmalloc. Boot option: failslab=<interval>,<probability>,<space>,<times> <interval> -- specifies the interval of failures. <probability> -- specifies how often it should fail in percent. <space> -- specifies the size of free space where memory can be allocated safely in bytes. <times> -- specifies how many times failures may happen at most. Debugfs: /debug/failslab/interval /debug/failslab/probability /debug/failslab/specifies /debug/failslab/times /debug/failslab/ignore-gfp-highmem /debug/failslab/ignore-gfp-wait Example: failslab=10,100,0,-1 slab allocation (kmalloc(), kmem_cache_alloc(),..) fails once per 10 times. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] LOG2: Implement a general integer log2 facility in the kernelDavid Howells2006-12-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This facility provides three entry points: ilog2() Log base 2 of unsigned long ilog2_u32() Log base 2 of u32 ilog2_u64() Log base 2 of u64 These facilities can either be used inside functions on dynamic data: int do_something(long q) { ...; y = ilog2(x) ...; } Or can be used to statically initialise global variables with constant values: unsigned n = ilog2(27); When performing static initialisation, the compiler will report "error: initializer element is not constant" if asked to take a log of zero or of something not reducible to a constant. They treat negative numbers as unsigned. When not dealing with a constant, they fall back to using fls() which permits them to use arch-specific log calculation instructions - such as BSR on x86/x86_64 or SCAN on FRV - if available. [akpm@osdl.org: MMC fix] Signed-off-by: David Howells <dhowells@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: David Howells <dhowells@redhat.com> Cc: Wojtek Kaniewski <wojtekka@toxygen.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] struct path: convert mmJosef Sipek2006-12-084-10/+10
| | | | | | Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: change uses of f_{dentry,vfsmnt} to use f_pathJosef "Jeff" Sipek2006-12-086-20/+20
| | | | | | | | Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: fallback_alloc cpuset_zone_allowed irq fixPaul Jackson2006-12-081-1/+1
| | | | | | | | | | | | | | | | | | | fallback_alloc() could end up calling cpuset_zone_allowed() with interrupts disabled (by code in kmem_cache_alloc_node()), but without __GFP_HARDWALL set, leading to a possible call of a sleeping function with interrupts disabled. This results in the BUG report: BUG: sleeping function called from invalid context at kernel/cpuset.c:1520 in_atomic():0, irqs_disabled():1 Thanks to Paul Menage for catching this one. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] struct seq_operations and struct file_operations constificationHelge Deller2006-12-076-13/+13
| | | | | | | | | | | | | | - move some file_operations structs into the .rodata section - move static strings from policy_types[] array into the .rodata section - fix generic seq_operations usages, so that those structs may be defined as "const" as well [akpm@osdl.org: couple of fixes] Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbolsAdrian Bunk2006-12-073-8/+0
| | | | | | | | In time for 2.6.20, we can get rid of this junk. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kernel core: replace kmalloc+memset with kzallocBurman Yan2006-12-071-4/+2
| | | | | Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] make mm/shmem.c:shmem_xattr_security_handler staticAdrian Bunk2006-12-071-1/+1
| | | | | | Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hotplug CPU: clean up hotcpu_notifier() useIngo Molnar2006-12-073-6/+2
| | | | | | | | | | | | | | | | | | There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn, prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus generating compiler warnings of unused symbols, hence forcing people to add #ifdefs. the compiler can skip truly unused functions just fine: text data bss dec hex filename 1624412 728710 3674856 6027978 5bfaca vmlinux.before 1624412 728710 3674856 6027978 5bfaca vmlinux.after [akpm@osdl.org: topology.c fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove HASH_HIGHMEMAndrew Morton2006-12-071-1/+1
| | | | | | | | It has no users and it's doubtful that we'll need it again. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] read_cache_pages() cleanupOGAWA Hirofumi2006-12-071-7/+1
| | | | | | | | Use put_pages_list() instead of opencoding it. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: use probe_kernel_address()Andrew Morton2006-12-071-5/+2
| | | | | Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Add include/linux/freezer.h and move definitions from sched.hNigel Cunningham2006-12-072-0/+2
| | | | | | | | | | | | | Move process freezing functions from include/linux/sched.h to freezer.h, so that modifications to the freezer or the kernel configuration don't require recompiling just about everything. [akpm@osdl.org: fix ueagle driver] Signed-off-by: Nigel Cunningham <nigel@suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] swsusp: Improve handling of highmemRafael J. Wysocki2006-12-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently swsusp saves the contents of highmem pages by copying them to the normal zone which is quite inefficient (eg. it requires two normal pages to be used for saving one highmem page). This may be improved by using highmem for saving the contents of saveable highmem pages. Namely, during the suspend phase of the suspend-resume cycle we try to allocate as many free highmem pages as there are saveable highmem pages. If there are not enough highmem image pages to store the contents of all of the saveable highmem pages, some of them will be stored in the "normal" memory. Next, we allocate as many free "normal" pages as needed to store the (remaining) image data. We use a memory bitmap to mark the allocated free pages (ie. highmem as well as "normal" image pages). Now, we use another memory bitmap to mark all of the saveable pages (highmem as well as "normal") and the contents of the saveable pages are copied into the image pages. Then, the second bitmap is used to save the pfns corresponding to the saveable pages and the first one is used to save their data. During the resume phase the pfns of the pages that were saveable during the suspend are loaded from the image and used to mark the "unsafe" page frames. Next, we try to allocate as many free highmem page frames as to load all of the image data that had been in the highmem before the suspend and we allocate so many free "normal" page frames that the total number of allocated free pages (highmem and "normal") is equal to the size of the image. While doing this we have to make sure that there will be some extra free "normal" and "safe" page frames for two lists of PBEs constructed later. Now, the image data are loaded, if possible, into their "original" page frames. The image data that cannot be written into their "original" page frames are loaded into "safe" page frames and their "original" kernel virtual addresses, as well as the addresses of the "safe" pages containing their copies, are stored in one of two lists of PBEs. One list of PBEs is for the copies of "normal" suspend pages (ie. "normal" pages that were saveable during the suspend) and it is used in the same way as previously (ie. by the architecture-dependent parts of swsusp). The other list of PBEs is for the copies of highmem suspend pages. The pages in this list are restored (in a reversible way) right before the arch-dependent code is called. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] swsusp: use block device offsets to identify swap locationsRafael J. Wysocki2006-12-072-45/+17
| | | | | | | | | | | | | | Make swsusp use block device offsets instead of swap offsets to identify swap locations and make it use the same code paths for writing as well as for reading data. This allows us to use the same code for handling swap files and swap partitions and to simplify the code, eg. by dropping rw_swap_page_sync(). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] swsusp: use partition device and offset to identify swap areasRafael J. Wysocki2006-12-071-12/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Linux kernel handles swap files almost in the same way as it handles swap partitions and there are only two differences between these two types of swap areas: (1) swap files need not be contiguous, (2) the header of a swap file is not in the first block of the partition that holds it. From the swsusp's point of view (1) is not a problem, because it is already taken care of by the swap-handling code, but (2) has to be taken into consideration. In principle the location of a swap file's header may be determined with the help of appropriate filesystem driver. Unfortunately, however, it requires the filesystem holding the swap file to be mounted, and if this filesystem is journaled, it cannot be mounted during a resume from disk. For this reason we need some other means by which swap areas can be identified. For example, to identify a swap area we can use the partition that holds the area and the offset from the beginning of this partition at which the swap header is located. The following patch allows swsusp to identify swap areas this way. It changes swap_type_of() so that it takes an additional argument representing an offset of the swap header within the partition represented by its first argument. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] radix-tree: RCU lockless readsideNick Piggin2006-12-071-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make radix tree lookups safe to be performed without locks. Readers are protected against nodes being deleted by using RCU based freeing. Readers are protected against new node insertion by using memory barriers to ensure the node itself will be properly written before it is visible in the radix tree. Each radix tree node keeps a record of their height (above leaf nodes). This height does not change after insertion -- when the radix tree is extended, higher nodes are only inserted in the top. So a lookup can take the pointer to what is *now* the root node, and traverse down it even if the tree is concurrently extended and this node becomes a subtree of a new root. "Direct" pointers (tree height of 0, where root->rnode points directly to the data item) are handled by using the low bit of the pointer to signal whether rnode is a direct pointer or a pointer to a radix tree node. When a reader wants to traverse the next branch, they will take a copy of the pointer. This pointer will be either NULL (and the branch is empty) or non-NULL (and will point to a valid node). [akpm@osdl.org: cleanups] [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications] [clameter@sgi.com: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: make compound page destructor handling explicitAndy Whitcroft2006-12-073-4/+4
| | | | | | | | | | | | | | | Currently we we use the lru head link of the second page of a compound page to hold its destructor. This was ok when it was purely an internal implmentation detail. However, hugetlbfs overrides this destructor violating the layering. Abstract this out as explicit calls, also introduce a type for the callback function allowing them to be type checked. For each callback we pre-declare the function, causing a type error on definition rather than on use elsewhere. [akpm@osdl.org: cleanups] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: better fallback allocation behaviorChristoph Lameter2006-12-071-22/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] GFP_THISNODE must not trigger global reclaimChristoph Lameter2006-12-071-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | The intent of GFP_THISNODE is to make sure that an allocation occurs on a particular node. If this is not possible then NULL needs to be returned so that the caller can choose what to do next on its own (the slab allocator depends on that). However, GFP_THISNODE currently triggers reclaim before returning a failure (GFP_THISNODE means GFP_NORETRY is set). If we have over allocated a node then we will currently do some reclaim before returning NULL. The caller may want memory from other nodes before reclaim should be triggered. (If the caller wants reclaim then he can directly use __GFP_THISNODE instead). There is no flag to avoid reclaim in the page allocator and adding yet another GFP_xx flag would be difficult given that we are out of available flags. So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE, __GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before waking up kswapd. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix two issues in kmalloc_node / __cache_alloc_nodeChristoph Lameter2006-12-071-12/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This addresses two issues: 1. Kmalloc_node() may intermittently return NULL if we are allocating from the current node and are unable to obtain memory for the current node from the page allocator. This is because we call ___cache_alloc() if nodeid == numa_node_id() and ____cache_alloc is not able to fallback to other nodes. This was introduced in the 2.6.19 development cycle. <= 2.6.18 in that case does not do a restricted allocation and blindly trusts the page allocator to have given us memory from the indicated node. It inserts the page regardless of the node it came from into the queues for the current node. 2. If kmalloc_node() is used on a node that has not been bootstrapped yet then we may try to pass an invalid node number to ____cache_alloc_node() triggering a BUG(). Change the function to call fallback_alloc() instead. Only call fallback_alloc() if we are allowed to fallback at all. The need to handle a node not bootstrapped yet also first surfaced in the 2.6.19 cycle. Update the comments since they were still describing the old kmalloc_node from 2.6.12. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>