summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* mm: prepare to remove /proc/sys/vm/hugepages_treat_as_movableNaoya Horiguchi2013-09-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Now hugepage migration is enabled, although restricted on pmd-based hugepages for now (due to lack of testing.) So we should allocate migratable hugepages from ZONE_MOVABLE if possible. This patch makes GFP flags in hugepage allocation dependent on migration support, not only the value of hugepages_treat_as_movable. It provides no change on the behavior for architectures which do not support hugepage migration, Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use zone_end_pfn() instead of zone_start_pfn+spanned_pagesXishi Qiu2013-09-111-6/+6
| | | | | | | | | | | Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages". Simplify the code, no functional change. [akpm@linux-foundation.org: fix build] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: mempolicy: turn vma_set_policy() into vma_dup_policy()Oleg Nesterov2013-09-111-6/+3
| | | | | | | | | | | | | | | Simple cleanup. Every user of vma_set_policy() does the same work, this looks a bit annoying imho. And the new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checksOleg Nesterov2013-09-111-15/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID. Then later copy_process() denies CLONE_SIGHAND if the new process will be in a different pid namespace (task_active_pid_ns() doesn't match current->nsproxy->pid_ns). This looks confusing and inconsistent. CLONE_NEWPID is very similar to the case when ->pid_ns was already unshared, we want the same restrictions so copy_process() should also nack CLONE_PARENT. And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well just for consistency. Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change copy_process() to do the same check along with ->pid_ns check we already have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pidns: kill the unnecessary CLONE_NEWPID in copy_process()Oleg Nesterov2013-09-111-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 8382fcac1b81 ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does the same check. Remove the CLONE_NEWPID check to cleanup the code and prepare for the next change. Test-case: static int child(void *arg) { return 0; } static char stack[16 * 1024]; int main(void) { pid_t pid; assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); pid = clone(child, stack + sizeof(stack) / 2, CLONE_NEWPID | SIGCHLD, NULL); assert(pid < 0 && errno == EINVAL); return 0; } clone(CLONE_NEWPID) correctly fails with or without this change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pidns: fix vfork() after unshare(CLONE_NEWPID)Oleg Nesterov2013-09-111-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 8382fcac1b81 ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared pid_ns, this obviously breaks vfork: int main(void) { assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); assert(vfork() >= 0); _exit(0); return 0; } fails without this patch. Change this check to use CLONE_SIGHAND instead. This also forbids CLONE_THREAD automatically, and this is what the comment implies. We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it would be safer to not do this. The current check denies CLONE_SIGHAND implicitely and there is no reason to change this. Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better. Having shared signal handling between two different pid namespaces is the case that we are fundamentally guarding against." Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Colin Walters <walters@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-09-101-16/+17
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 3 (of many) from Al Viro: "Waiman's conversion of d_path() and bits related to it, kern_path_mountpoint(), several cleanups and fixes (exportfs one is -stable fodder, IMO). There definitely will be more... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: split read_seqretry_or_unlock(), convert d_walk() to resulting primitives dcache: Translating dentry into pathname without taking rename_lock autofs4 - fix device ioctl mount lookup introduce kern_path_mountpoint() rename user_path_umountat() to user_path_mountpoint_at() take unlazy_walk() into umount_lookup_last() Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c prune_super(): sb->s_op is never NULL exportfs: don't assume that ->iterate() won't feed us too long entries afs: get rid of redundant ->d_name.len checks
| * Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.cAl Viro2013-09-071-16/+17
| | | | | | | | | | | | | | | | | | kernel/cgroup.c is the only place in the tree that relies on eventfd.h pulling file.h; move that include there. Switch from eventfd_fget()/fput() to fdget()/fdput(), while we are at it - eventfd_ctx_fileget() will fail on non-eventfd descriptors just fine, no need to do that check twice... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'trace-3.12' of ↵Linus Torvalds2013-09-096-209/+65
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "Not much changes for the 3.12 merge window. The major tracing changes are still in flux, and will have to wait for 3.13. The changes for 3.12 are mostly clean ups and minor fixes. H Peter Anvin added a check to x86_32 static function tracing that helps a small segment of the kernel community. Oleg Nesterov had a few changes from 3.11, but were mostly clean ups and not worth pushing in the -rc time frame. Li Zefan had small clean up with annotating a raw_init with __init. I fixed a slight race in updating function callbacks, but the race is so small and the bug that happens when it occurs is so minor it's not even worth pushing to stable. The only real enhancement is from Alexander Z Lam that made the tracing_cpumask work for trace buffer instances, instead of them all sharing a global cpumask" * tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace/rcu: Do not trace debug_lockdep_rcu_enabled() x86-32, ftrace: Fix static ftrace when early microcode is enabled ftrace: Fix a slight race in modifying what function callback gets traced tracing: Make tracing_cpumask available for all instances tracing: Kill the !CONFIG_MODULES code in trace_events.c tracing: Don't pass file_operations array to event_create_dir() tracing: Kill trace_create_file_ops() and friends tracing/syscalls: Annotate raw_init function with __init
| * | ftrace/rcu: Do not trace debug_lockdep_rcu_enabled()Steven Rostedt (Red Hat)2013-09-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function debug_lockdep_rcu_enabled() is part of the RCU lockdep debugging, and is called very frequently. I found that if I enable a lot of debugging and run the function graph tracer, this function can cause a live lock of the system. We don't usually trace lockdep infrastructure, no need to trace this either. Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | ftrace: Fix a slight race in modifying what function callback gets tracedSteven Rostedt (Red Hat)2013-09-031-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a slight race when going from a list function to a non list function. That is, when only one callback is registered to the function tracer, it gets called directly by the mcount trampoline. But if this function has filters, it may be called by the wrong functions. As the list ops callback that handles multiple callbacks that are registered to ftrace, it also handles what functions they call. While the transaction is taking place, use the list function always, and after all the updates are finished (only the functions that should be traced are being traced), then we can update the trampoline to call the function directly. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracing: Make tracing_cpumask available for all instancesAlexander Z Lam2013-08-222-17/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow tracer instances to disable tracing by cpu by moving the static global tracing_cpumask into trace_array. Link: http://lkml.kernel.org/r/921622317f239bfc2283cac2242647801ef584f2.1375980149.git.azl@google.com Cc: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: David Sharp <dhsharp@google.com> Cc: Alexander Z Lam <lambchop468@gmail.com> Signed-off-by: Alexander Z Lam <azl@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracing: Kill the !CONFIG_MODULES code in trace_events.cOleg Nesterov2013-08-211-12/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move trace_module_nb under CONFIG_MODULES and kill the dummy trace_module_notify(). Imho it doesn't make sense to define "struct notifier_block" and its .notifier_call just to avoid "ifdef" in event_trace_init(), and all other !CONFIG_MODULES code has already gone away. Link: http://lkml.kernel.org/r/20130731173137.GA31043@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracing: Don't pass file_operations array to event_create_dir()Oleg Nesterov2013-08-211-34/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that event_create_dir() and __trace_add_new_event() always use the same file_operations we can kill these arguments and simplify the code. Link: http://lkml.kernel.org/r/20130731173135.GA31040@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracing: Kill trace_create_file_ops() and friendsOleg Nesterov2013-08-211-144/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | trace_create_file_ops() allocates the copy of id/filter/format/enable file_operations to set "f_op->owner = mod" for fops_get(). However after the recent changes there is no reason to prevent rmmod even if one of these files is opened. A file operation can do nothing but fail after remove_event_file_dir() clears ->i_private for every file removed by trace_module_remove_events(). Kill "struct ftrace_module_file_ops" and fix the compilation errors. Link: http://lkml.kernel.org/r/20130731173132.GA31033@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracing/syscalls: Annotate raw_init function with __initLi Zefan2013-08-211-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | init_syscall_trace() can only be called during kernel bootup only, so we can mark it and the functions it calls as __init. Link: http://lkml.kernel.org/r/51528E89.6080508@huawei.com Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | | Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfsLinus Torvalds2013-09-091-0/+1
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs updates from Ben Myers: "For 3.12-rc1 there are a number of bugfixes in addition to work to ease usage of shared code between libxfs and the kernel, the rest of the work to enable project and group quotas to be used simultaneously, performance optimisations in the log and the CIL, directory entry file type support, fixes for log space reservations, some spelling/grammar cleanups, and the addition of user namespace support. - introduce readahead to log recovery - add directory entry file type support - fix a number of spelling errors in comments - introduce new Q_XGETQSTATV quotactl for project quotas - add USER_NS support - log space reservation rework - CIL optimisations - kernel/userspace libxfs rework" * tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits) xfs: XFS_MOUNT_QUOTA_ALL needed by userspace xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino Fix wrong flag ASSERT in xfs_attr_shortform_getvalue xfs: finish removing IOP_* macros. xfs: inode log reservations are too small xfs: check correct status variable for xfs_inobt_get_rec() call xfs: inode buffers may not be valid during recovery readahead xfs: check LSN ordering for v5 superblocks during recovery xfs: btree block LSN escaping to disk uninitialised XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568 xfs: fix bad dquot buffer size in log recovery readahead xfs: don't account buffer cancellation during log recovery readahead xfs: check for underflow in xfs_iformat_fork() xfs: xfs_dir3_sfe_put_ino can be static xfs: introduce object readahead to log recovery xfs: Simplify xfs_ail_min() with list_first_entry_or_null() xfs: Register hotcpu notifier after initialization xfs: add xfs sb v4 support for dirent filetype field xfs: Add write support for dirent filetype field xfs: Add read-only support for dirent filetype field ...
| * | xfs: ioctl check for capabilities in the current user namespaceDwight Engen2013-08-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use inode_capable() to check if SUID|SGID bits should be cleared to match similar check in inode_change_ok(). The check for CAP_LINUX_IMMUTABLE was not modified since all other file systems also check against init_user_ns rather than current_user_ns. Only allow changing of projid from init_user_ns. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-0711-59/+27
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace changes from Eric Biederman: "This is an assorted mishmash of small cleanups, enhancements and bug fixes. The major theme is user namespace mount restrictions. nsown_capable is killed as it encourages not thinking about details that need to be considered. A very hard to hit pid namespace exiting bug was finally tracked and fixed. A couple of cleanups to the basic namespace infrastructure. Finally there is an enhancement that makes per user namespace capabilities usable as capabilities, and an enhancement that allows the per userns root to nice other processes in the user namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Kill nsown_capable it makes the wrong thing easy capabilities: allow nice if we are privileged pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD userns: Allow PR_CAPBSET_DROP in a user namespace. namespaces: Simplify copy_namespaces so it is clear what is going on. pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup sysfs: Restrict mounting sysfs userns: Better restrictions on when proc and sysfs can be mounted vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces kernel/nsproxy.c: Improving a snippet of code. proc: Restrict mounting the proc filesystem vfs: Lock in place mounts from more privileged users
| * | | userns: Kill nsown_capable it makes the wrong thing easyEric W. Biederman2013-08-306-26/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and CAP_SETGID. For the existing users it doesn't noticably simplify things and from the suggested patches I have seen it encourages people to do the wrong thing. So remove nsown_capable. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREADEric W. Biederman2013-08-301-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I goofed when I made unshare(CLONE_NEWPID) only work in a single-threaded process. There is no need for that requirement and in fact I analyzied things right for setns. The hard requirement is for tasks that share a VM to all be in the pid namespace and we properly prevent that in do_fork. Just to be certain I took a look through do_wait and forget_original_parent and there are no cases that make it any harder for children to be in the multiple pid namespaces than it is for children to be in the same pid namespace. I also performed a check to see if there were in uses of task->nsproxy_pid_ns I was not familiar with, but it is only used when allocating a new pid for a new task, and in checks to prevent craziness from happening. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | namespaces: Simplify copy_namespaces so it is clear what is going on.Eric W. Biederman2013-08-301-24/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the test for the impossible case where tsk->nsproxy == NULL. Fork will never be called with tsk->nsproxy == NULL. Only call get_nsproxy when we don't need to generate a new_nsproxy, and mark the case where we don't generate a new nsproxy as likely. Remove the code to drop an unnecessarily acquired nsproxy value. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeupEric W. Biederman2013-08-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Serge Hallyn <serge.hallyn@ubuntu.com> writes: > Since commit af4b8a83add95ef40716401395b44a1b579965f4 it's been > possible to get into a situation where a pidns reaper is > <defunct>, reparented to host pid 1, but never reaped. How to > reproduce this is documented at > > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526 > (and see > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13) > In short, run repeated starts of a container whose init is > > Process.exit(0); > > sysrq-t when such a task is playing zombie shows: > > [ 131.132978] init x ffff88011fc14580 0 2084 2039 0x00000000 > [ 131.132978] ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580 > [ 131.132978] ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000 > [ 131.132978] ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0 > [ 131.132978] Call Trace: > [ 131.132978] [<ffffffff816f6159>] schedule+0x29/0x70 > [ 131.132978] [<ffffffff81064591>] do_exit+0x6e1/0xa40 > [ 131.132978] [<ffffffff81071eae>] ? signal_wake_up_state+0x1e/0x30 > [ 131.132978] [<ffffffff8106496f>] do_group_exit+0x3f/0xa0 > [ 131.132978] [<ffffffff810649e4>] SyS_exit_group+0x14/0x20 > [ 131.132978] [<ffffffff8170102f>] tracesys+0xe1/0xe6 > > Further debugging showed that every time this happened, zap_pid_ns_processes() > started with nr_hashed being 3, while we were expecting it to drop to 2. > Any time it didn't happen, nr_hashed was 1 or 2. So the reaper was > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper > if nr_hashed hits 1. The issue is that when the task group leader of an init process exits before other tasks of the init process when the init process finally exits it will be a secondary task sleeping in zap_pid_ns_processes and waiting to wake up when the number of hashed pids drops to two. This case waits forever as free_pid only sends a wake up when the number of hashed pids drops to 1. To correct this the simple strategy of sending a possibly unncessary wake up when the number of hashed pids drops to 2 is adopted. Sending one extraneous wake up is relatively harmless, at worst we waste a little cpu time in the rare case when a pid namespace appropaches exiting. We can detect the case when the pid namespace drops to just two pids hashed race free in free_pid. Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe without out the tasklist_lock because it is guaranteed that the detach_pid will be called on the child_reaper before it is freed and detach_pid calls __change_pid which calls free_pid which takes the pidmap_lock. __change_pid only calls free_pid if this is the last use of the pid. For a thread that is not the thread group leader the threads pid will only ever have one user because a threads pid is not allowed to be the pid of a process, of a process group or a session. For a thread that is a thread group leader all of the other threads of that process will be reaped before it is allowed for the thread group leader to be reaped ensuring there will only be one user of the threads pid as a process pid. Furthermore because the thread is the init process of a pid namespace all of the other processes in the pid namespace will have also been already freed leading to the fact that the pid will not be used as a session pid or a process group pid for any other running process. CC: stable@vger.kernel.org Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Tested-by: Serge Hallyn <serge.hallyn@canonical.com> Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | userns: Better restrictions on when proc and sysfs can be mountedEric W. Biederman2013-08-262-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rely on the fact that another flavor of the filesystem is already mounted and do not rely on state in the user namespace. Verify that the mounted filesystem is not covered in any significant way. I would love to verify that the previously mounted filesystem has no mounts on top but there are at least the directories /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly for other filesystems to mount on top of. Refactor the test into a function named fs_fully_visible and call that function from the mount routines of proc and sysfs. This makes this test local to the filesystems involved and the results current of when the mounts take place, removing a weird threading of the user namespace, the mount namespace and the filesystems themselves. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | kernel/nsproxy.c: Improving a snippet of code.Raphael S.Carvalho2013-08-261-1/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | It seems GCC generates a better code in that way, so I changed that statement. Btw, they have the same semantic, so I'm sending this patch due to performance issues. Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: Raphael S.Carvalho <raphael.scarv@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds2013-09-071-22/+10
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull crypto update from Herbert Xu: "Here is the crypto update for 3.12: - Added MODULE_SOFTDEP to allow pre-loading of modules. - Reinstated crct10dif driver using the module softdep feature. - Allow via rng driver to be auto-loaded. - Split large input data when necessary in nx. - Handle zero length messages correctly for GCM/XCBC in nx. - Handle SHA-2 chunks bigger than block size properly in nx. - Handle unaligned lengths in omap-aes. - Added SHA384/SHA512 to omap-sham. - Added OMAP5/AM43XX SHAM support. - Added OMAP4 TRNG support. - Misc fixes" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (66 commits) Reinstate "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework" hwrng: via - Add MODULE_DEVICE_TABLE crypto: fcrypt - Fix bitoperation for compilation with clang crypto: nx - fix SHA-2 for chunks bigger than block size crypto: nx - fix GCM for zero length messages crypto: nx - fix XCBC for zero length messages crypto: nx - fix limits to sg lists for AES-CCM crypto: nx - fix limits to sg lists for AES-XCBC crypto: nx - fix limits to sg lists for AES-GCM crypto: nx - fix limits to sg lists for AES-CTR crypto: nx - fix limits to sg lists for AES-CBC crypto: nx - fix limits to sg lists for AES-ECB crypto: nx - add offset to nx_build_sg_lists() padata - Register hotcpu notifier after initialization padata - share code between CPU_ONLINE and CPU_DOWN_FAILED, same to CPU_DOWN_PREPARE and CPU_UP_CANCELED hwrng: omap - reorder OMAP TRNG driver code crypto: omap-sham - correct dma burst size crypto: omap-sham - Enable Polling mode if DMA fails crypto: tegra-aes - bitwise vs logical and crypto: sahara - checking the wrong variable ...
| * \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linuxHerbert Xu2013-09-0769-1908/+2886
| |\ \ \ | | | | | | | | | | | | | | | Merge upstream tree in order to reinstate crct10dif.
| * | | | padata - Register hotcpu notifier after initializationRichard Weinberger2013-08-291-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | padata_cpu_callback() takes pinst->lock, to avoid taking an uninitialized lock, register the notifier after it's initialization. Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
| * | | | padata - share code between CPU_ONLINE and CPU_DOWN_FAILED, same to ↵Chen Gang2013-08-291-16/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CPU_DOWN_PREPARE and CPU_UP_CANCELED Share code between CPU_ONLINE and CPU_DOWN_FAILED, same to CPU_DOWN_PREPARE and CPU_UP_CANCELED. It will fix 2 bugs: "not check the return value of __padata_remove_cpu() and __padata_add_cpu()". "need add 'break' between CPU_UP_CANCELED and CPU_DOWN_FAILED". Signed-off-by: Chen Gang <gang.chen@asianux.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | | | | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-061-41/+66
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "The usual trivial updates all over the tree -- mostly typo fixes and documentation updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits) doc: Documentation/cputopology.txt fix typo treewide: Convert retrun typos to return Fix comment typo for init_cma_reserved_pageblock Documentation/trace: Correcting and extending tracepoint documentation mm/hotplug: fix a typo in Documentation/memory-hotplug.txt power: Documentation: Update s2ram link doc: fix a typo in Documentation/00-INDEX Documentation/printk-formats.txt: No casts needed for u64/s64 doc: Fix typo "is is" in Documentations treewide: Fix printks with 0x%# zram: doc fixes Documentation/kmemcheck: update kmemcheck documentation doc: documentation/hwspinlock.txt fix typo PM / Hibernate: add section for resume options doc: filesystems : Fix typo in Documentations/filesystems scsi/megaraid fixed several typos in comments ppc: init_32: Fix error typo "CONFIG_START_KERNEL" treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks page_isolation: Fix a comment typo in test_pages_isolated() doc: fix a typo about irq affinity ...
| * | | | | workqueue: fix some scripts/kernel-doc warningsYacine Belkadi2013-08-201-41/+66
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When building the htmldocs (in verbose mode), scripts/kernel-doc reports the following type of warnings: Warning(kernel/workqueue.c:653): No description found for return value of 'get_work_pool' Fix them by: - Using "Return:" sections to introduce descriptions of return values - Adding some missing descriptions Signed-off-by: Yacine Belkadi <yacine.belkadi.1@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | | | Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2013-09-051-8/+11
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cputime fix from Ingo Molnar: "This fixes a longer-standing cputime accounting bug that Stanislaw Gruszka finally managed to track down" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cputime: Do not scale when utime == 0
| * | | | | sched/cputime: Do not scale when utime == 0Stanislaw Gruszka2013-09-041-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | scale_stime() silently assumes that stime < rtime, otherwise when stime == rtime and both values are big enough (operations on them do not fit in 32 bits), the resulting scaling stime can be bigger than rtime. In consequence utime = rtime - stime results in negative value. User space visible symptoms of the bug are overflowed TIME values on ps/top, for example: $ ps aux | grep rcu root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0] root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0] root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] or overflowed utime values read directly from /proc/$PID/stat Reference: https://lkml.org/lkml/2013/8/20/259 Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: stable@vger.kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-051-7/+6
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...
| * | | | | | switch copy_module_from_fd() to fdgetAl Viro2013-09-031-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | | | Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2013-09-041-15/+0
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Gleb Natapov: "The highlights of the release are nested EPT and pv-ticketlocks support (hypervisor part, guest part, which is most of the code, goes through tip tree). Apart of that there are many fixes for all arches" Fix up semantic conflicts as discussed in the pull request thread.. * 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits) ARM: KVM: Add newlines to panic strings ARM: KVM: Work around older compiler bug ARM: KVM: Simplify tracepoint text ARM: KVM: Fix kvm_set_pte assignment ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256 ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs ARM: KVM: vgic: fix GICD_ICFGRn access ARM: KVM: vgic: simplify vgic_get_target_reg KVM: MMU: remove unused parameter KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate() KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX KVM: x86: update masterclock when kvmclock_offset is calculated (v2) KVM: PPC: Book3S: Fix compile error in XICS emulation KVM: PPC: Book3S PR: return appropriate error when allocation fails arch: powerpc: kvm: add signed type cast for comparation KVM: x86: add comments where MMIO does not return to the emulator KVM: vmx: count exits to userspace during invalid guest emulation KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp kvm: optimize away THP checks in kvm_is_mmio_pfn() ...
| * | | | | | | remove sched notifier for cross-cpu migrationsMarcelo Tosatti2013-07-181-15/+0
| | |_|_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux as a guest on KVM hypervisor, the only user of the pvclock vsyscall interface, does not require notification on task migration because: 1. cpu ID number maps 1:1 to per-CPU pvclock time info. 2. per-CPU pvclock time info is updated if the underlying CPU changes. 3. that version is increased whenever underlying CPU changes. Which is sufficient to guarantee nanoseconds counter is calculated properly. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* | | | | | | Merge tag 'modules-next-for-linus' of ↵Linus Torvalds2013-09-042-10/+29
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "Minor fixes mainly, including a potential use-after-free on remove found by CONFIG_DEBUG_KOBJECT_RELEASE which may be theoretical" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: module: Fix mod->mkobj.kobj potentially freed too early kernel/params.c: use scnprintf() instead of sprintf() kernel/module.c: use scnprintf() instead of sprintf() module/lsm: Have apparmor module parameters work with no args module: Add NOARG flag for ops with param_set_bool_enable_only() set function module: Add flag to allow mod params to have no arguments modules: add support for soft module dependencies scripts/mod/modpost.c: permit '.cranges' secton for sh64 architecture. module: fix sprintf format specifier in param_get_byte()
| * | | | | | | module: Fix mod->mkobj.kobj potentially freed too earlyLi Zhong2013-09-032-3/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod->mkobj.kobj(ffffffffa01600d0 below) is freed together with mod itself in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child(holders below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. This patch tries to fix it by waiting for the mod->mkobj.kobj to be really released in the module removing process (and some error code paths). [ 1844.175287] kobject: 'holders' (ffff88007c1f1600): kobject_release, parent ffffffffa01600d0 (delayed) [ 1844.178991] kobject: 'notes' (ffff8800370b2a00): kobject_release, parent ffffffffa01600d0 (delayed) [ 1845.180118] kobject: 'holders' (ffff88007c1f1600): kobject_cleanup, parent ffffffffa01600d0 [ 1845.182130] kobject: 'holders' (ffff88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at ffffffffa01601d0 [ 1845.185026] IP: [<ffffffff812cda81>] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: 0000 [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: ffff88007ca51f00 ti: ffff88007ca5c000 task.ti: ffff88007ca5c000 [ 1845.185026] RIP: 0010:[<ffffffff812cda81>] [<ffffffff812cda81>] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:ffff88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 0000000000002000 RBX: ffffffffa01600d0 RCX: ffffffff8177d638 [ 1845.185026] RDX: ffff88007ca5dc18 RSI: 0000000000000000 RDI: ffffffffa01600d0 [ 1845.185026] RBP: ffff88007ca5dd18 R08: ffffffff824e9810 R09: ffffffffffffffff [ 1845.185026] R10: ffff8800ffffffff R11: dead4ead00000001 R12: ffffffff81a95040 [ 1845.185026] R13: ffff88007b27a960 R14: ffff88007c1f1600 R15: 0000000000000000 [ 1845.185026] FS: 0000000000000000(0000) GS:ffffffff81a23000(0000) knlGS:0000000000000000 [ 1845.185026] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1845.185026] CR2: ffffffffa01601d0 CR3: 0000000037207000 CR4: 00000000000006b0 [ 1845.185026] Stack: [ 1845.185026] ffff88007c1f1600 ffff88007c1f1600 ffff88007ca5dd38 ffffffff812cdb7e [ 1845.185026] 0000000000000000 ffff88007c1f1640 ffff88007ca5dd68 ffffffff812cdbfe [ 1845.185026] ffff88007c974800 ffff88007c1f1640 ffff88007ff61a00 0000000000000000 [ 1845.185026] Call Trace: [ 1845.185026] [<ffffffff812cdb7e>] kobject_del+0x2e/0x40 [ 1845.185026] [<ffffffff812cdbfe>] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [<ffffffff81063a45>] process_one_work+0x1e5/0x670 [ 1845.185026] [<ffffffff810639e3>] ? process_one_work+0x183/0x670 [ 1845.185026] [<ffffffff810642b3>] worker_thread+0x113/0x370 [ 1845.185026] [<ffffffff810641a0>] ? rescuer_thread+0x290/0x290 [ 1845.185026] [<ffffffff8106bfba>] kthread+0xda/0xe0 [ 1845.185026] [<ffffffff814ff0f0>] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [<ffffffff8106bee0>] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [<ffffffff8150751a>] ret_from_fork+0x7a/0xb0 [ 1845.185026] [<ffffffff8106bee0>] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d <f6> 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [<ffffffff812cda81>] kobject_put+0x11/0x60 [ 1845.185026] RSP <ffff88007ca5dd08> [ 1845.185026] CR2: ffffffffa01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | | kernel/params.c: use scnprintf() instead of sprintf()Chen Gang2013-08-201-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some strings (e.g. version string), they are permitted to be larger than PAGE_SIZE (although meaningless), so recommend to use scnprintf() instead of sprintf(). Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | | kernel/module.c: use scnprintf() instead of sprintf()Chen Gang2013-08-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some strings, they are permitted to be larger than PAGE_SIZE, so need use scnprintf() instead of sprintf(), or it will cause issue. One case is: if a module version is crazy defined (length more than PAGE_SIZE), 'modinfo' command is still OK (print full contents), but for "cat /sys/modules/'modname'/version", will cause issue in kernel. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | | module: Add NOARG flag for ops with param_set_bool_enable_only() set functionSteven Rostedt2013-08-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ops that uses param_set_bool_enable_only() as its set function can easily handle being used without an argument. There's no reason to fail the loading of the module if it does not have one. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | | module: Add flag to allow mod params to have no argumentsSteven Rostedt2013-08-201-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the params.c code allows only two "set" functions to have no arguments. If a parameter does not have an argument, then it looks at the set function and tests if it is either param_set_bool() or param_set_bint(). If it is not one of these functions, then it fails the loading of the module. But there may be module parameters that have different set functions and still allow no arguments. But unless each of these cases adds their function to the if statement, it wont be allowed to have no arguments. This method gets rather messing and does not scale. Instead, introduce a flags field to the kernel_param_ops, where if the flag KERNEL_PARAM_FL_NOARG is set, the parameter will not fail if it does not contain an argument. It will be expected that the corresponding set function can handle a NULL pointer as "val". Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | | module: fix sprintf format specifier in param_get_byte()Christoph Jaeger2013-08-201-1/+1
| |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In param_get_byte(), to which the macro STANDARD_PARAM_DEF(byte, ...) expands, "%c" is used to print an unsigned char. So it gets printed as a character what is not intended here. Use "%hhu" instead. [Rusty: note drivers which would be effected: drivers/net/wireless/cw1200/main.c drivers/ntb/ntb_transport.c:68 drivers/scsi/lpfc/lpfc_attr.c drivers/usb/atm/speedtch.c drivers/usb/gadget/g_ffs.c ] Acked-by: Jon Mason <jon.mason@intel.com> (for ntb) Acked-by: Michal Nazarewicz <mina86@mina86.com> (for g_ffs.c) Signed-off-by: Christoph Jaeger <christophjaeger@linux.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* | | | | | | Merge branch 'x86-spinlocks-for-linus' of ↵Linus Torvalds2013-09-041-0/+1
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 spinlock changes from Ingo Molnar: "The biggest change here are paravirtualized ticket spinlocks (PV spinlocks), which bring a nice speedup on various benchmarks. The KVM host side will come to you via the KVM tree" * 'x86-spinlocks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kvm/guest: Fix sparse warning: "symbol 'klock_waiting' was not declared as static" kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor kvm guest: Add configuration support to enable debug information for KVM Guests kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi xen, pvticketlock: Allow interrupts to be enabled while blocking x86, ticketlock: Add slowpath logic jump_label: Split jumplabel ratelimit x86, pvticketlock: When paravirtualizing ticket locks, increment by 2 x86, pvticketlock: Use callee-save for lock_spinning xen, pvticketlocks: Add xen_nopvspin parameter to disable xen pv ticketlocks xen, pvticketlock: Xen implementation for PV ticket locks xen: Defer spinlock setup until boot CPU setup x86, ticketlock: Collapse a layer of functions x86, ticketlock: Don't inline _spin_unlock when using paravirt spinlocks x86, spinlock: Replace pv spinlocks with pv ticketlocks
| * | | | | | | jump_label: Split jumplabel ratelimitAndrew Jones2013-08-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b202952075f62603bea9bfb6ebc6b0420db11949 ("perf, core: Rate limit perf_sched_events jump_label patching") introduced rate limiting for jump label disabling. The changes were made in the jump label code in order to be more widely available and to keep things tidier. This is all fine, except now jump_label.h includes linux/workqueue.h, which makes it impossible to include jump_label.h from anything that workqueue.h needs. For example, it's now impossible to include jump_label.h from asm/spinlock.h, which is done in proposed pv-ticketlock patches. This patch splits out the rate limiting related changes from jump_label.h into a new file, jump_label_ratelimit.h, to resolve the issue. Signed-off-by: Andrew Jones <drjones@redhat.com> Link: http://lkml.kernel.org/r/1376058122-8248-10-git-send-email-raghavendra.kt@linux.vnet.ibm.com Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | | | | | | | Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds2013-09-045-127/+117
|\ \ \ \ \ \ \ \ | | |_|_|/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers/nohz changes from Ingo Molnar: "It mostly contains fixes and full dynticks off-case optimizations, by Frederic Weisbecker" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) nohz: Include local CPU in full dynticks global kick nohz: Optimize full dynticks's sched hooks with static keys nohz: Optimize full dynticks state checks with static keys nohz: Rename a few state variables vtime: Always debug check snapshot source _before_ updating it vtime: Always scale generic vtime accounting results vtime: Optimize full dynticks accounting off case with static keys vtime: Describe overriden functions in dedicated arch headers m68k: hardirq_count() only need preempt_mask.h hardirq: Split preempt count mask definitions context_tracking: Split low level state headers vtime: Fix racy cputime delta update vtime: Remove a few unneeded generic vtime state checks context_tracking: User/kernel broundary cross trace events context_tracking: Optimize context switch off case with static keys context_tracking: Optimize guest APIs off case with static key context_tracking: Optimize main APIs off case with static key context_tracking: Ground setup for static key use context_tracking: Remove full dynticks' hacky dependency on wide context tracking nohz: Only enable context tracking on full dynticks CPUs ...
| * | | | | | | nohz: Include local CPU in full dynticks global kickFrederic Weisbecker2013-08-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tick_nohz_full_kick_all() is useful to notify all full dynticks CPUs that there is a system state change to checkout before re-evaluating the need for the tick. Unfortunately this is implemented using smp_call_function_many() that ignores the local CPU. This CPU also needs to re-evaluate the tick. on_each_cpu_mask() is not useful either because we don't want to re-evaluate the tick state in place but asynchronously from an IPI to avoid messing up with any random locking scenario. So lets call tick_nohz_full_kick() from tick_nohz_full_kick_all() so that the usual irq work takes care of it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Kevin Hilman <khilman@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375460996-16329-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | Merge branch 'timers/nohz-v3' of ↵Ingo Molnar2013-08-146-128/+116
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz Pull nohz improvements from Frederic Weisbecker: " It mostly contains fixes and full dynticks off-case optimizations. I believe that distros want to enable this feature so it seems important to optimize the case where the "nohz_full=" parameter is empty. ie: I'm trying to remove any performance regression that comes with NO_HZ_FULL=y when the feature is not used. This patchset improves the current situation a lot (off-case appears to be around 11% faster with hackbench, although I guess it may vary depending on the configuration but it should be significantly faster in any case) now there is still some work to do: I can still observe a remaining loss of 1.6% throughput seen with hackbench compared to CONFIG_NO_HZ_FULL=n. " Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | | | | nohz: Optimize full dynticks's sched hooks with static keysFrederic Weisbecker2013-08-141-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Scheduler IPIs and task context switches are serious fast path. Let's try to hide as much as we can the impact of full dynticks APIs' off case that are called on these sites through the use of static keys. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Kevin Hilman <khilman@linaro.org>