summaryrefslogtreecommitdiffstats
path: root/mm/oom_kill.c
Commit message (Collapse)AuthorAgeFilesLines
* oom: add sysctl to enable task memory dumpDavid Rientjes2008-02-071-5/+44
| | | | | | | | | | | | | | | | | | | | | | | | | Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcontrol: move oom task exclusion to tasklist scanDavid Rientjes2008-02-071-7/+2
| | | | | | | | | | | | | | | | | | | | Creates a helper function to return non-zero if a task is a member of a memory controller: int task_in_mem_cgroup(const struct task_struct *task, const struct mem_cgroup *mem); When the OOM killer is constrained by the memory controller, the exclusion of tasks that are not a member of that controller was previously misplaced and appeared in the badness scoring function. It should be excluded during the tasklist scan in select_bad_process() instead. [akpm@linux-foundation.org: build fix] Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Memory controller: OOM handlingPavel Emelianov2008-02-071-4/+39
| | | | | | | | | | | | | | | | | | | | | | | Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom_kill: remove uid==0 checksSerge E. Hallyn2008-02-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means "give me extra resources" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add 64-bit capability support to the kernelAndrew Morgan2008-02-051-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched: sched_rt_entityPeter Zijlstra2008-01-251-1/+1
| | | | | | | | | | Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* oom_kill bugAl Viro2007-10-201-1/+1
| | | | | | | Wrong order of arguments Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Use helpers to obtain task pid in printksPavel Emelyanov2007-10-191-2/+3
| | | | | | | | | | | | | | | | The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Isolate some explicit usage of task->tgidPavel Emelyanov2007-10-191-1/+1
| | | | | | | | | | | | | | | | | | | | With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/oom_kill.c: Use list_for_each_entry instead of list_for_eachMatthias Kaehlcke2007-10-191-3/+1
| | | | | | | | | | mm/oom_kill.c: Convert list_for_each to list_for_each_entry in oom_kill_process() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: define is_global_init() and is_container_init()Serge E. Hallyn2007-10-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: convert zone_scan_lock from mutex to spinlockDavid Rientjes2007-10-171-5/+5
| | | | | | | | | | | | | | There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if the lock can't be acquired; it will be available soon enough once the zonelist scanning is done. All other threads waiting for the OOM killer are also contingent on the exiting task being able to acquire the lock in clear_zonelist_oom() so it doesn't make sense to put it to sleep. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: do not take callback_mutexDavid Rientjes2007-10-171-3/+0
| | | | | | | | | | | | Since no task descriptor's 'cpuset' field is dereferenced in the execution of the OOM killer anymore, it is no longer necessary to take callback_mutex. [akpm@linux-foundation.org: restore cpuset_lock for other patches] Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: compare cpuset mems_allowed instead of exclusive ancestorsDavid Rientjes2007-10-171-1/+1
| | | | | | | | | | | | | | | | | | | Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors. This does not require taking callback_mutex since it is only used as a hint in the badness scoring. Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: suppress extraneous stack and memory dumpDavid Rientjes2007-10-171-13/+14
| | | | | | | | | | | | | Suppresses the extraneous stack and memory dump when a parallel OOM killing has been found. There's no need to fill the ring buffer with this information if its already been printed and the condition that triggered the previous OOM killer has not yet been alleviated. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: add oom_kill_allocating_task sysctlDavid Rientjes2007-10-171-5/+8
| | | | | | | | | | | | | | Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill the OOM-triggering task instead of scanning through the tasklist to find a memory-hogging target. This is helpful for systems with an insanely large number of tasks where scanning the tasklist significantly degrades performance. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: add per-zone lockingDavid Rientjes2007-10-171-0/+52
| | | | | | | | | | | | | | | | | | | | | | | | OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of "trylocks." The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the "trylock" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: move constraints to enumDavid Rientjes2007-10-171-9/+3
| | | | | | | | | | | The OOM killer's CONSTRAINT definitions are really more appropriate in an enum, so define them in include/linux/oom.h. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on ↵Christoph Lameter2007-10-161-8/+1
| | | | | | | | | | | | | | | | the fly constrained_alloc() builds its own memory map for nodes with memory. We have that available in N_HIGH_MEMORY now. So simplify the code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: print points as unsigned longDavid Rientjes2007-07-311-1/+1
| | | | | | | | | In badness(), the automatic variable 'points' is unsigned long. Print it as such. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove fs.h from mm.hAlexey Dobriyan2007-07-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: fix constraint deadlockDavid Rientjes2007-05-071-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL. Before the OOM killer checks for the allocation constraint, it takes callback_mutex. constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible. If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset. This results in deadlock. We now take callback_mutex after iterating through the zonelist since we don't need it yet. Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix handling of panic_on_oom when cpusets are in useYasunori Goto2007-05-071-0/+3
| | | | | | | | | | | | | | | | | | The current panic_on_oom may not work if there is a process using cpusets/mempolicy, because other nodes' memory may remain. But some people want failover by panic ASAP even if they are used. This patch makes new setting for its request. This is tested on my ia64 box which has 3 nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ethan Solomita <solo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* allow oom_adj of saintly processesJoshua N Pritikin2007-05-071-2/+4
| | | | | | | | | | | If the badness of a process is zero then oom_adj>0 has no effect. This patch makes sure that the oom_adj shift actually increases badness points appropriately. Signed-off-by: Joshua N. Pritikin <jpritikin@pobox.com> Cc: Andrea Arcangeli <andrea@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fix OOM killing processes wrongly thought MPOL_BINDHugh Dickins2007-04-241-0/+2
| | | | | | | | | | | | | | | I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog to see lots of other processes killed with "No available memory (MPOL_BIND)". memhog is killed correctly once we initialize nodemask in constrained_alloc(). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: kill all threads that share mm with killed taskDavid Rientjes2007-04-241-1/+1
| | | | | | | | | | | | | | | | oom_kill_task() calls __oom_kill_task() to OOM kill a selected task. When finding other threads that share an mm with that task, we need to kill those individual threads and not the same one. (Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584) Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@osdl.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] oom fix: prevent oom from killing a process with children/sibling ↵Ankita Garg2007-03-161-1/+1
| | | | | | | | | | | | | | unkillable Looking at oom_kill.c, found that the intention to not kill the selected process if any of its children/siblings has OOM_DISABLE set, is not being met. Signed-off-by: Ankita Garg <ankita@in.ibm.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] fix OOM killing of swapoffHugh Dickins2007-01-051-6/+6
| | | | | | | | | | | | | These days, if you swapoff when there isn't enough memory, OOM killer gives "BUG: scheduling while atomic" and the machine hangs: badness() needs to do its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any moment, but that doesn't really matter). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fix oom killer kills current every time if there is memory-less-node ↵KAMEZAWA Hiroyuki2006-12-301-1/+6
| | | | | | | | | | | | | | | | | | | | | take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset: rework cpuset_zone_allowed apiPaul Jackson2006-12-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: less memdieNick Piggin2006-12-071-2/+3
| | | | | | | | | | | Don't cause all threads in all other thread groups to gain TIF_MEMDIE otherwise we'll get a thundering herd eating our memory reserve. This may not be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE in the system at once. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: cleanup messagesNick Piggin2006-12-071-14/+13
| | | | | | | | Clean up the OOM killer messages to be more consistent. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: don't kill unkillable children or siblingsNick Piggin2006-12-071-2/+11
| | | | | | | | | | Abort the kill if any of our threads have OOM_DISABLE set. Having this test here also prevents any OOM_DISABLE child of the "selected" process from being killed. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] OOM killer meets userspace headersAlexey Dobriyan2006-10-201-0/+1
| | | | | | | | | | | | | | | | | | | Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process. So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there. c) turn bounding values of /proc/$PID/oom_adj into defines and export them too. Note: mass __KERNEL__ removal will be done later. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: don't kill current when another OOM in progressNick Piggin2006-09-291-6/+17
| | | | | | | | | | | | | | | A previous patch to allow an exiting task to OOM kill itself (and thereby avoid a little deadlock) introduced a problem. We don't want the PF_EXITING task, even if it is 'current', to access mem reserves if there is already a TIF_MEMDIE process in the system sucking up reserves. Also make the commenting a little bit clearer, and note that our current scheme of effectively single threading the OOM killer is not itself perfect. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom_kill_task(): cleanup ->mm checksOleg Nesterov2006-09-291-5/+2
| | | | | | | | | | | - It is not possible to have task->mm == &init_mm. - task_lock() buys nothing for 'if (!p->mm)' check. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] select_bad_process(): cleanup 'releasing' checkOleg Nesterov2006-09-291-10/+9
| | | | | | | | | No logic changes, but imho easier to read. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD checkOleg Nesterov2006-09-291-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only one usage of TASK_DEAD outside of last schedule path, select_bad_process: for_each_task(p) { if (!p->mm) continue; ... if (p->state == TASK_DEAD) continue; ... TASK_DEAD state is set at the end of do_exit(), this means that p->mm was already set == NULL by exit_mm(), so this task was already rejected by 'if (!p->mm)' above. Note also that the caller holds tasklist_lock, this means that p can't pass exit_notify() and then set TASK_DEAD when p->mm != NULL. Also, remove open-coded is_init(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] introduce TASK_DEAD stateOleg Nesterov2006-09-291-1/+1
| | | | | | | | | | | | | | | | I am not sure about this patch, I am asking Ingo to take a decision. task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h. Note that this state is not visible to user-space, get_task_state() masks off unsuitable states. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kill PF_DEAD flagOleg Nesterov2006-09-291-2/+2
| | | | | | | | | | After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we don't need PF_DEAD any longer. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pidspace: is_init()Sukadev Bhattiprolu2006-09-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NUMA: Add zone_to_nid functionChristoph Lameter2006-09-261-2/+1
| | | | | | | | | | | There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom-kill: update comments to reflect current codeRam Gupta2006-09-261-3/+3
| | | | | | | | Update the comments for __oom_kill_task() to reflect the code changes. Signed-off-by: Ram Gupta <r.gupta@astronautics.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: more printkNick Piggin2006-09-261-2/+3
| | | | | | | | | Print the name of the task invoking the OOM killer. Could make debugging easier. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: kthread infinite loop fixNick Piggin2006-09-261-0/+3
| | | | | | | | | | Skip kernel threads, rather than having them return 0 from badness. Theoretically, badness might truncate all results to 0, thus a kernel thread might be picked first, causing an infinite loop. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: swapoff tasks tweakNick Piggin2006-09-261-2/+6
| | | | | | | | | | PF_SWAPOFF processes currently cause select_bad_process to return straight away. Instead, give them high priority, so we will kill them first, however we also first ensure no parallel OOM kills are happening at the same time. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: handle oom_disable exitingNick Piggin2006-09-261-2/+2
| | | | | | | | | | | | | | | Having the oomkilladj == OOM_DISABLE check before the releasing check means that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer. Moving the test down will give the desired behaviour. Also: it will allow them to "OOM-kill" themselves if they are exiting. As per the previous patch, this is required to prevent OOM killer deadlocks (and they don't actually get killed, because they're already exiting -- they're simply allowed access to memory reserves). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: handle current exitingNick Piggin2006-09-261-4/+31
| | | | | | | | | | | | | | | | | | | If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else. Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves. Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE. The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks. In the case of killing a PF_EXITING task, don't make a lot of noise about it. This becomes more important in future patches, where we can "kill" OOM_DISABLE tasks. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: cpuset hintNick Piggin2006-09-261-3/+8
| | | | | | | | | | | | | cpuset_excl_nodes_overlap does not always indicate that killing a task will not free any memory we for us. For example, we may be asking for an allocation from _anywhere_ in the machine, or the task in question may be pinning memory that is outside its cpuset. Fix this by just causing cpuset_excl_nodes_overlap to reduce the badness rather than disallow it. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] out of memory notifierMartin Schwidefsky2006-09-261-0/+22
| | | | | | | | | | | | | | | | | | Add a notifer chain to the out of memory killer. If one of the registered callbacks could release some memory, do not kill the process but return and retry the allocation that forced the oom killer to run. The purpose of the notifier is to add a safety net in the presence of memory ballooners. If the resource manager inflated the balloon to a size where memory allocations can not be satisfied anymore, it is better to deflate the balloon a bit instead of killing processes. The implementation for the s390 ballooner is included. [akpm@osdl.org: cleanups] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>