summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* bug: Make BUG() always stop the machineJosh Triplett2014-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When !CONFIG_BUG and !HAVE_ARCH_BUG, define the generic BUG() as an infinite loop rather than a no-op. This avoids undefined behavior if execution ever actually reaches BUG(), and avoids warnings about code after BUG() (such as on non-void functions calling BUG() and then not returning). bloat-o-meter results: add/remove: 0/0 grow/shrink: 43/10 up/down: 235/-98 (137) function old new delta umount_collect 119 138 +19 notify_change 306 324 +18 xstate_enable_boot_cpu 252 269 +17 kunmap 54 70 +16 balloon_page_dequeue 112 126 +14 mm_take_all_locks 223 233 +10 list_lru_walk_node 143 152 +9 vma_adjust 1059 1067 +8 pcpu_setup_first_chunk 1130 1138 +8 mm_drop_all_locks 143 151 +8 ns_capable 55 62 +7 anon_transport_class_unregister 8 15 +7 srcu_init_notifier_head 35 41 +6 shrink_dcache_for_umount 174 180 +6 kunmap_high 99 105 +6 end_page_writeback 43 49 +6 do_exit 1339 1345 +6 __kfifo_dma_out_prepare_r 86 92 +6 __kfifo_dma_in_prepare_r 90 96 +6 fixup_user_fault 120 125 +5 repair_env_string 73 77 +4 read_cache_pages_invalidate_page 56 60 +4 isolate_lru_pages.isra 142 146 +4 do_notify_parent_cldstop 255 259 +4 cpu_init 370 374 +4 utimes_common 270 272 +2 tasklet_hi_action 91 93 +2 tasklet_action 91 93 +2 set_pte_vaddr 46 48 +2 find_get_pages_tag 202 204 +2 early_iounmap 185 187 +2 __native_set_fixmap 36 38 +2 __get_user_pages 822 824 +2 __early_ioremap 299 301 +2 yield_task_stop 1 2 +1 tick_resume 37 38 +1 switched_to_stop 1 2 +1 switched_to_idle 1 2 +1 prio_changed_stop 1 2 +1 prio_changed_idle 1 2 +1 pm_qos_power_read 111 112 +1 arch_cpu_idle_dead 1 2 +1 __insert_vmap_area 140 141 +1 sys_renameat 614 612 -2 mm_fault_error 297 295 -2 SyS_renameat 614 612 -2 sys_linkat 416 413 -3 SyS_linkat 416 413 -3 chmod_common 129 122 -7 proc_cap_handler 240 225 -15 __schedule 849 831 -18 sys_madvise 1077 1054 -23 SyS_madvise 1077 1054 -23 Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bug: when !CONFIG_BUG, make WARN call no_printk to check format and argsJosh Triplett2014-04-071-0/+1
| | | | | | | | | | | | The stub version of WARN for !CONFIG_BUG completely ignored its format string and subsequent arguments; make it check them instead, using no_printk. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/asm-generic/bug.h: style fix: s/while(0)/while (0)/Josh Triplett2014-04-071-3/+3
| | | | | | | | Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bug: when !CONFIG_BUG, simplify WARN_ON_ONCE and familyJosh Triplett2014-04-071-27/+30
| | | | | | | | | | | | | | | | | When !CONFIG_BUG, WARN_ON and family become simple passthroughs of their condition argument; however, WARN_ON_ONCE and family still have conditions and a boolean to detect one-time invocation, even though the warning they'd emit doesn't exist. Make the existing definitions conditional on CONFIG_BUG, and add definitions for !CONFIG_BUG that map to the passthrough versions of WARN and WARN_ON. This saves 4.4k on a minimized configuration (smaller than allnoconfig), and 20.6k with defconfig plus CONFIG_BUG=n. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rapidio: rework device hierarchy and introduce mport class of devicesAlexandre Bounine2014-04-071-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes an artificial RapidIO bus root device and establishes actual device hierarchy by providing reference to real parent devices. It also introduces device class for RapidIO controller devices (on-chip or an eternal bridge, known as "mport"). Existing implementation was sufficient for SoC-based platforms that have a single RapidIO controller. With introduction of devices using multiple RapidIO controllers and PCIe-to-RapidIO bridges the old scheme is very limiting or does not work at all. The implemented changes allow to properly reference platform's local RapidIO mport devices and provide device details needed for upper layers. This change to RapidIO device hierarchy does not break any known existing kernel or user space interfaces. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com> Cc: Stef van Os <stef.van.os@prodrive-technologies.com> Cc: Jerry Jacobs <jerry.jacobs@prodrive-technologies.com> Cc: Arno Tiemersma <arno.tiemersma@prodrive-technologies.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* idr: remove dead codeStephen Hemminger2014-04-071-63/+0
| | | | | | | | | | | | | | | | Remove no longer used deprecated code, and make local functions static. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Jean Delvare <jdelvare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jeff Layton <jlayton@redhat.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: George Spelvin <linux@horizon.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/crash_dump.h: add vmcore_cleanup() prototypeRashika Kheria2014-04-071-0/+1
| | | | | | | | | | | | | Eliminate the following warning in proc/vmcore.c: fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes] [akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-spaceOleg Nesterov2014-04-071-2/+2
| | | | | | | | | | | | | | | | | | | | | get_task_state() uses the most significant bit to report the state to user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition can be noticed via /proc as Z -> X -> Z change. Note that this was possible even before EXIT_TRACE was introduced. This is not really bad but imho it make sense to hide EXIT_TRACE from user-space completely. So the patch simply swaps EXIT_ZOMBIE and EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transitionOleg Nesterov2014-04-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and drops tasklist_lock. If this task is not the natural child and it is traced, we change its state back to EXIT_ZOMBIE for ->real_parent. The last transition is racy, this is even documented in 50b8d257486a "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race". wait_consider_task() tries to detect this transition and clear ->notask_error but we can't rely on ptrace_reparented(), debugger can exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE. And there is another problem which were missed before: this transition can also race with reparent_leader() which doesn't reset >exit_signal if EXIT_DEAD, assuming that this task must be reaped by someone else. So the tracee can be re-parented with ->exit_signal != SIGCHLD, and if /sbin/init doesn't use __WALL it becomes unreapable. This was fixed by the previous commit, but it was the temporary hack. 1. Add the new exit_state, EXIT_TRACE. It means that the task is the traced zombie, debugger is going to detach and notify its natural parent. This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we can avoid the changes in proc/kgdb code, get_task_state() still reports "X (dead)" in this case. Note: with or without this change userspace can see Z -> X -> Z transition. Not really bad, but probably makes sense to fix. 2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD if we need to notify the ->real_parent. 3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD is always the final state we can safely ignore such a task. 4. Change wait_consider_task() to check EXIT_TRACE separately and kill the racy and no longer needed ptrace_reparented() case. If ptrace == T an EXIT_TRACE thread should be simply ignored, the owner of this state is going to ptrace_unlink() this task. We can pretend that it was already removed from ->ptraced list. Otherwise we should skip this thread too but clear ->notask_error, we must be the natural parent and debugger is going to untrace and notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace" even if the task was already untraced. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com> Reported-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* exec: kill bprm->tcomm[], simplify the "basename" logicOleg Nesterov2014-04-073-3/+2
| | | | | | | | | | | | | | | | | | | | Starting from commit c4ad8f98bef7 ("execve: use 'struct filename *' for executable name passing") bprm->filename can not go away after flush_old_exec(), so we do not need to save the binary name in bprm->tcomm[] added by 96e02d158678 ("exec: fix use-after-free bug in setup_new_exec()"). And there was never need for filename_to_taskname-like code, we can simply do set_task_comm(kbasename(filename). This patch has to change set_task_comm() and trace_task_rename() to accept "const char *", but I think this change is also good. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* numa: use LAST_CPUPID_SHIFT to calculate LAST_CPUPID_MASKSrikar Dronamraju2014-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH. However LAST_CPUPID_WIDTH itself can be 0. (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is set). In such a case LAST_CPUPID_MASK turns out to be 0. But with recent commit 1ae71d0319: (mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS) if LAST_CPUPID_MASK is 0, page_cpupid_xchg_last() and page_cpupid_reset_last() causes page->_last_cpupid to be set to 0. This causes performance regression. Its almost as if numa_balancing is off. Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of LAST_CPUPID_WIDTH. Some performance numbers and perf stats with and without the fix. (3.14-rc6) ---------- numa01 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01': 12,27,462 cs [100.00%] 2,41,957 migrations [100.00%] 1,68,01,713 faults [100.00%] 7,99,35,29,041 cache-misses 98,808 migrate:mm_migrate_pages [100.00%] 1407.690148814 seconds time elapsed numa02 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02': 63,065 cs [100.00%] 14,364 migrations [100.00%] 2,08,118 faults [100.00%] 25,32,59,404 cache-misses 12 migrate:mm_migrate_pages [100.00%] 63.840827219 seconds time elapsed (3.14-rc6 with fix) ------------------- numa01 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01': 9,68,911 cs [100.00%] 1,01,414 migrations [100.00%] 88,38,697 faults [100.00%] 4,42,92,51,042 cache-misses 4,25,060 migrate:mm_migrate_pages [100.00%] 685.965331189 seconds time elapsed numa02 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02': 17,543 cs [100.00%] 2,962 migrations [100.00%] 1,17,843 faults [100.00%] 11,80,61,644 cache-misses 12,358 migrate:mm_migrate_pages [100.00%] 20.380132343 seconds time elapsed Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* madvise: correct the comment of MADV_DODUMP flagZhang Yanfei2014-04-071-1/+1
| | | | | | | | s/MADV_NODUMP/MADV_DONTDUMP/ Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/readahead.c: inline ra_submitFabian Frederick2014-04-071-3/+0
| | | | | | | | | | | | | Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left ra_submit with a single function call. Move ra_submit to internal.h and inline it to save some stack. Thanks to Andrew Morton for commenting different versions. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove unused arg of set_page_dirty_balance()Miklos Szeredi2014-04-071-1/+1
| | | | | | | | | | | | There's only one caller of set_page_dirty_balance() and that will call it with page_mkwrite == 0. The page_mkwrite argument was unused since commit b827e496c893 "mm: close page_mkwrite races". Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: rename high level charging functionsMichal Hocko2014-04-071-4/+4
| | | | | | | | | | | | | mem_cgroup_newpage_charge is used only for charging anonymous memory so it is better to rename it to mem_cgroup_charge_anon. mem_cgroup_cache_charge is used for file backed memory so rename it to mem_cgroup_charge_file. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: get_mem_cgroup_from_mm()Johannes Weiner2014-04-071-6/+0
| | | | | | | | | | | | | Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm owner is exiting, just return root_mem_cgroup. This makes sense for all callsites and gets rid of some of them having to fallback manually. [fengguang.wu@intel.com: fix warnings] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use 'const char *' insted of 'char *' for reason in dump_page()Kirill A. Shutemov2014-04-071-2/+2
| | | | | | | | | | | | | | | I tried to use 'dump_page(page, __func__)' for debugging, but it triggers warning: warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default] Let's convert 'reason' to 'const char *' in dump_page() and friends: we shouldn't modify it anyway. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* res_counter: remove interface for locked charging and unchargingDavid Rientjes2014-04-071-5/+1
| | | | | | | | | | | | | | | | | | | | | | The res_counter_{charge,uncharge}_locked() variants are not used in the kernel outside of the resource counter code itself, so remove the interface. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Tim Hockin <thockin@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, mempolicy: remove per-process flagDavid Rientjes2014-04-072-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users. There's no significant performance degradation to checking current->mempolicy rather than current->flags & PF_MEMPOLICY in the allocation path, especially since this is considered unlikely(). Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with 64GB of memory and without a mempolicy: threads before after 16 1249409 1244487 32 1281786 1246783 48 1239175 1239138 64 1244642 1241841 80 1244346 1248918 96 1266436 1254316 112 1307398 1312135 128 1327607 1326502 Per-process flags are a scarce resource so we should free them up whenever possible and make them available. We'll be using it shortly for memcg oom reserves. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Tim Hockin <thockin@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, mempolicy: rename slab_node for clarityDavid Rientjes2014-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | slab_node() is actually a mempolicy function, so rename it to mempolicy_slab_node() to make it clearer that it used for processes with mempolicies. At the same time, cleanup its code by saving numa_mem_id() in a local variable (since we require a node with memory, not just any node) and remove an obsolete comment that assumes the mempolicy is actually passed into the function. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Tim Hockin <thockin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: per-thread vma cachingDavidlohr Bueso2014-04-073-2/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is a continuation of efforts trying to optimize find_vma(), avoiding potentially expensive rbtree walks to locate a vma upon faults. The original approach (https://lkml.org/lkml/2013/11/1/410), where the largest vma was also cached, ended up being too specific and random, thus further comparison with other approaches were needed. There are two things to consider when dealing with this, the cache hit rate and the latency of find_vma(). Improving the hit-rate does not necessarily translate in finding the vma any faster, as the overhead of any fancy caching schemes can be too high to consider. We currently cache the last used vma for the whole address space, which provides a nice optimization, reducing the total cycles in find_vma() by up to 250%, for workloads with good locality. On the other hand, this simple scheme is pretty much useless for workloads with poor locality. Analyzing ebizzy runs shows that, no matter how many threads are running, the mmap_cache hit rate is less than 2%, and in many situations below 1%. The proposed approach is to replace this scheme with a small per-thread cache, maximizing hit rates at a very low maintenance cost. Invalidations are performed by simply bumping up a 32-bit sequence number. The only expensive operation is in the rare case of a seq number overflow, where all caches that share the same address space are flushed. Upon a miss, the proposed replacement policy is based on the page number that contains the virtual address in question. Concretely, the following results are seen on an 80 core, 8 socket x86-64 box: 1) System bootup: Most programs are single threaded, so the per-thread scheme does improve ~50% hit rate by just adding a few more slots to the cache. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 50.61% | 19.90 | | patched | 73.45% | 13.58 | +----------------+----------+------------------+ 2) Kernel build: This one is already pretty good with the current approach as we're dealing with good locality. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 75.28% | 11.03 | | patched | 88.09% | 9.31 | +----------------+----------+------------------+ 3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 70.66% | 17.14 | | patched | 91.15% | 12.57 | +----------------+----------+------------------+ 4) Ebizzy: There's a fair amount of variation from run to run, but this approach always shows nearly perfect hit rates, while baseline is just about non-existent. The amounts of cycles can fluctuate between anywhere from ~60 to ~116 for the baseline scheme, but this approach reduces it considerably. For instance, with 80 threads: +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 1.06% | 91.54 | | patched | 99.97% | 14.18 | +----------------+----------+------------------+ [akpm@linux-foundation.org: fix nommu build, per Davidlohr] [akpm@linux-foundation.org: document vmacache_valid() logic] [akpm@linux-foundation.org: attempt to untangle header files] [akpm@linux-foundation.org: add vmacache_find() BUG_ON] [hughd@google.com: add vmacache_valid_mm() (from Oleg)] [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: adjust and enhance comments] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: implement ->map_pages for page cacheKirill A. Shutemov2014-04-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | filemap_map_pages() is generic implementation of ->map_pages() for filesystems who uses page cache. It should be safe to use filemap_map_pages() for ->map_pages() if filesystem use filemap_fault() for ->fault(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce vm_ops->map_pages()Kirill A. Shutemov2014-04-071-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here's new version of faultaround patchset. It took a while to tune it and collect performance data. First patch adds new callback ->map_pages to vm_operations_struct. ->map_pages() is called when VM asks to map easy accessible pages. Filesystem should find and map pages associated with offsets from "pgoff" till "max_pgoff". ->map_pages() is called with page table locked and must not block. If it's not possible to reach a page without blocking, filesystem should skip it. Filesystem should use do_set_pte() to setup page table entry. Pointer to entry associated with offset "pgoff" is passed in "pte" field in vm_fault structure. Pointers to entries for other offsets should be calculated relative to "pte". Currently VM use ->map_pages only on read page fault path. We try to map FAULT_AROUND_PAGES a time. FAULT_AROUND_PAGES is 16 for now. Performance data for different FAULT_AROUND_ORDER is below. TODO: - implement ->map_pages() for shmem/tmpfs; - modify get_user_pages() to be able to use ->map_pages() and implement mmap(MAP_POPULATE|MAP_NONBLOCK) on top. ========================================================================= Tested on 4-socket machine (120 threads) with 128GiB of RAM. Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is somewhere between 3 and 5. Let's say 4 :) Linux build (make -j60) FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9 minor-faults 283,301,572 247,151,987 212,215,789 204,772,882 199,568,944 194,703,779 193,381,485 time, seconds 151.227629483 153.920996480 151.356125472 150.863792049 150.879207877 151.150764954 151.450962358 Linux rebuild (make -j60) FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9 minor-faults 5,396,854 4,148,444 2,855,286 2,577,282 2,361,957 2,169,573 2,112,643 time, seconds 27.404543757 27.559725591 27.030057426 26.855045126 26.678618635 26.974523490 26.761320095 Git test suite (make -j60 test) FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9 minor-faults 129,591,823 99,200,751 66,106,718 57,606,410 51,510,808 45,776,813 44,085,515 time, seconds 66.087215026 64.784546905 64.401156567 65.282708668 66.034016829 66.793780811 67.237810413 Two synthetic tests: access every word in file in sequential/random order. It doesn't improve much after FAULT_AROUND_ORDER == 4. Sequential access 16GiB file FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9 1 thread minor-faults 4,195,437 2,098,275 525,068 262,251 131,170 32,856 8,282 time, seconds 7.250461742 6.461711074 5.493859139 5.488488147 5.707213983 5.898510832 5.109232856 8 threads minor-faults 33,557,540 16,892,728 4,515,848 2,366,999 1,423,382 442,732 142,339 time, seconds 16.649304881 9.312555263 6.612490639 6.394316732 6.669827501 6.75078944 6.371900528 32 threads minor-faults 134,228,222 67,526,810 17,725,386 9,716,537 4,763,731 1,668,921 537,200 time, seconds 49.164430543 29.712060103 12.938649729 10.175151004 11.840094583 9.594081325 9.928461797 60 threads minor-faults 251,687,988 126,146,952 32,919,406 18,208,804 10,458,947 2,733,907 928,217 time, seconds 86.260656897 49.626551828 22.335007632 17.608243696 16.523119035 16.339489186 16.326390902 120 threads minor-faults 503,352,863 252,939,677 67,039,168 35,191,827 19,170,091 4,688,357 1,471,862 time, seconds 124.589206333 79.757867787 39.508707872 32.167281632 29.972989292 28.729834575 28.042251622 Random access 1GiB file 1 thread minor-faults 262,636 132,743 34,369 17,299 8,527 3,451 1,222 time, seconds 15.351890914 16.613802482 16.569227308 15.179220992 16.557356122 16.578247824 15.365266994 8 threads minor-faults 2,098,948 1,061,871 273,690 154,501 87,110 25,663 7,384 time, seconds 15.040026343 15.096933500 14.474757288 14.289129964 14.411537468 14.296316837 14.395635804 32 threads minor-faults 8,390,734 4,231,023 1,054,432 528,847 269,242 97,746 26,881 time, seconds 20.430433109 21.585235358 22.115062928 14.872878951 14.880856305 14.883370649 14.821261690 60 threads minor-faults 15,733,258 7,892,809 1,973,393 988,266 594,789 164,994 51,691 time, seconds 26.577302548 25.692397770 18.728863715 20.153026398 21.619101933 17.745086260 17.613215273 120 threads minor-faults 31,471,111 15,816,616 3,959,209 1,978,685 1,008,299 264,635 96,010 time, seconds 41.835322703 40.459786095 36.085306105 35.313894834 35.814445675 36.552633793 34.289210594 Touch only one page in page table in 16GiB file FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9 1 thread minor-faults 8,372 8,324 8,270 8,260 8,249 8,239 8,237 time, seconds 0.039892712 0.045369149 0.051846126 0.063681685 0.079095975 0.17652406 0.541213386 8 threads minor-faults 65,731 65,681 65,628 65,620 65,608 65,599 65,596 time, seconds 0.124159196 0.488600638 0.156854426 0.191901957 0.242631486 0.543569456 1.677303984 32 threads minor-faults 262,388 262,341 262,285 262,276 262,266 262,257 263,183 time, seconds 0.452421421 0.488600638 0.565020946 0.648229739 0.789850823 1.651584361 5.000361559 60 threads minor-faults 491,822 491,792 491,723 491,711 491,701 491,691 491,825 time, seconds 0.763288616 0.869620515 0.980727360 1.161732354 1.466915814 3.04041448 9.308612938 120 threads minor-faults 983,466 983,655 983,366 983,372 983,363 984,083 984,164 time, seconds 1.595846553 1.667902182 2.008959376 2.425380942 2.941368804 5.977807890 18.401846125 This patch (of 2): Introduce new vm_ops callback ->map_pages() and uses it for mapping easy accessible pages around fault address. On read page fault, if filesystem provides ->map_pages(), we try to map up to FAULT_AROUND_PAGES pages around page fault address in hope to reduce number of minor page faults. We call ->map_pages first and use ->fault() as fallback if page by the offset is not ready to be mapped (cold page cache or something). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLEAlex Thorlton2014-04-072-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It also adds a prctl control which allows us to set the THP disable bit in mm->def_flags so that VMs will pick up the setting as they are created. Signed-off-by: Alex Thorlton <athorlton@sgi.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2014-04-063-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: - Stable fix for a use after free issue in the NFSv4.1 open code - Fix the SUNRPC bi-directional RPC code to account for TCP segmentation - Optimise usage of readdirplus when confronted with 'ls -l' situations - Soft mount bugfixes - NFS over RDMA bugfixes - NFSv4 close locking fixes - Various NFSv4.x client state management optimisations - Rename/unlink code cleanups" * tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits) nfs: pass string length to pr_notice message about readdir loops NFSv4: Fix a use-after-free problem in open() SUNRPC: rpc_restart_call/rpc_restart_call_prepare should clear task->tk_status SUNRPC: Don't let rpc_delay() clobber non-timeout errors SUNRPC: Ensure call_connect_status() deals correctly with SOFTCONN tasks SUNRPC: Ensure call_status() deals correctly with SOFTCONN tasks NFSv4: Ensure we respect soft mount timeouts during trunking discovery NFSv4: Schedule recovery if nfs40_walk_client_list() is interrupted NFS: advertise only supported callback netids SUNRPC: remove KERN_INFO from dprintk() call sites SUNRPC: Fix large reads on NFS/RDMA NFS: Clean up: revert increase in READDIR RPC buffer max size SUNRPC: Ensure that call_bind times out correctly SUNRPC: Ensure that call_connect times out correctly nfs: emit a fsnotify_nameremove call in sillyrename codepath nfs: remove synchronous rename code nfs: convert nfs_rename to use async_rename infrastructure nfs: make nfs_async_rename non-static nfs: abstract out code needed to complete a sillyrename NFSv4: Clear the open state flags if the new stateid does not match ...
| * Merge branch 'devel' into linux-nextTrond Myklebust2014-03-173-4/+4
| |\
| | * nfs: remove synchronous rename codeJeff Layton2014-03-171-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | Now that nfs_rename uses the async infrastructure, we can remove this. Signed-off-by: Jeff Layton <jlayton@redhat.com> Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * nfs: make nfs_async_rename non-staticJeff Layton2014-03-171-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | ...and move the prototype for nfs_sillyrename to internal.h. Signed-off-by: Jeff Layton <jlayton@redhat.com> Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * nfs: abstract out code needed to complete a sillyrenameJeff Layton2014-03-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The async rename code is currently "polluted" with some parts that are really just for sillyrenames. Add a new "complete" operation vector to the nfs_renamedata to separate out the stuff that just needs to be done for a sillyrename. Signed-off-by: Jeff Layton <jlayton@redhat.com> Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * NFS: Be more aggressive in using readdirplus for 'ls -l' situationsTrond Myklebust2014-02-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Try to detect 'ls -l' by having nfs_getattr() look at whether or not there is an opendir() file descriptor for the parent directory. If so, then assume that we want to force use of readdirplus in order to avoid the multiple GETATTR calls over the wire. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * SUNRPC: RPC callbacks may be split across several TCP segmentsTrond Myklebust2014-02-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since TCP is a stream protocol, our callback read code needs to take into account the fact that RPC callbacks are not always confined to a single TCP segment. This patch adds support for multiple TCP segments by ensuring that we only remove the rpc_rqst structure from the 'free backchannel requests' list once the data has been completely received. We rely on the fact that TCP data is ordered for the duration of the connection. Reported-by: shaobingqing <shaobingqing@bwstor.com.cn> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | | Merge tag 'modules-next-for-linus' of ↵Linus Torvalds2014-04-066-23/+26
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "Nothing major: the stricter permissions checking for sysfs broke a staging driver; fix included. Greg KH said he'd take the patch but hadn't as the merge window opened, so it's included here to avoid breaking build" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: staging: fix up speakup kobject mode Use 'E' instead of 'X' for unsigned module taint flag. VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms. kallsyms: fix percpu vars on x86-64 with relocation. kallsyms: generalize address range checking module: LLVMLinux: Remove unused function warning from __param_check macro Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE module: remove MODULE_GENERIC_TABLE module: allow multiple calls to MODULE_DEVICE_TABLE() per module module: use pr_cont
| * | | VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.Rusty Russell2014-03-243-6/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary of http://lkml.org/lkml/2014/3/14/363 : Ted: module_param(queue_depth, int, 444) Joe: 0444! Rusty: User perms >= group perms >= other perms? Joe: CLASS_ATTR, DEVICE_ATTR, SENSOR_ATTR and SENSOR_ATTR_2? Side effect of stricter permissions means removing the unnecessary S_IFREG from several callers. Note that the BUILD_BUG_ON_ZERO((perm) & 2) test was removed: a fair number of drivers fail this test, so that will be the debate for a future patch. Suggested-by: Joe Perches <joe@perches.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> for drivers/pci/slot.c Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | module: LLVMLinux: Remove unused function warning from __param_check macroMark Charlebois2014-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code makes a compile time type check that is optimized away. Clang complains that it generates an unused function: linux/kernel/panic.c:471:1: warning: unused function '__check_panic' [-Wunused-function] core_param(panic, panic_timeout, int, 0644); ^ linux/moduleparam.h:283:2: note: expanded from macro 'core_param' param_check_##type(name, &(var)); \ ^ <scratch space>:87:1: note: expanded from here param_check_int ^ linux/moduleparam.h:369:34: note: expanded from macro 'param_check_int' #define param_check_int(name, p) __param_check(name, p, int) ^ linux/moduleparam.h:349:22: note: expanded from macro '__param_check' static inline type *__check_##name(void) { return(p); } ^ <scratch space>:88:1: note: expanded from here __check_panic GCC won't complain for a static inline function but would if it was just a static function. Adding the unused attribute to the function declaration removes the warning. Per request from Rusty Russell it is marked as __always_unused as the code is meant to be optimized away. This code works for both GCC and clang. Signed-off-by: Mark Charlebois <charlebm@gmail.com> Signed-off-by: Behan Webster <behanw@converseincode.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULEMathieu Desnoyers2014-03-132-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Users have reported being unable to trace non-signed modules loaded within a kernel supporting module signature. This is caused by tracepoint.c:tracepoint_module_coming() refusing to take into account tracepoints sitting within force-loaded modules (TAINT_FORCED_MODULE). The reason for this check, in the first place, is that a force-loaded module may have a struct module incompatible with the layout expected by the kernel, and can thus cause a kernel crash upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y. Tracepoints, however, specifically accept TAINT_OOT_MODULE and TAINT_CRAP, since those modules do not lead to the "very likely system crash" issue cited above for force-loaded modules. With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed module is tainted re-using the TAINT_FORCED_MODULE taint flag. Unfortunately, this means that Tracepoints treat that module as a force-loaded module, and thus silently refuse to consider any tracepoint within this module. Since an unsigned module does not fit within the "very likely system crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag to specifically address this taint behavior, and accept those modules within Tracepoints. We use the letter 'X' as a taint flag character for a module being loaded that doesn't know how to sign its name (proposed by Steven Rostedt). Also add the missing 'O' entry to trace event show_module_flags() list for the sake of completeness. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> NAKed-by: Ingo Molnar <mingo@redhat.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: David Howells <dhowells@redhat.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | module: remove MODULE_GENERIC_TABLERusty Russell2014-03-132-15/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MODULE_DEVICE_TABLE() calles MODULE_GENERIC_TABLE(); make it do the work directly. This also removes a wart introduced in the last patch, where the alias is defined to be an unknown struct type "struct type##__##name##_device_id" instead of "struct type##_device_id" (it's an extern so GCC doesn't care, but it's wrong). The other user of MODULE_GENERIC_TABLE (ISAPNP_CARD_TABLE) is unused, so delete it. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | module: allow multiple calls to MODULE_DEVICE_TABLE() per moduleTom Gundersen2014-03-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 78551277e4df5: "Input: i8042 - add PNP modaliases" had a bug, where the second call to MODULE_DEVICE_TABLE() overrode the first resulting in not all the modaliases being exposed. This fixes the problem by including the name of the device_id table in the __mod_*_device_table alias, allowing us to export several device_id tables per module. Suggested-by: Kay Sievers <kay@vrfy.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Tom Gundersen <teg@jklm.no> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* | | | Merge tag 'dm-3.15-changes' of ↵Linus Torvalds2014-04-051-3/+5
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - Fix dm-cache corruption caused by discard_block_size > cache_block_size - Fix a lock-inversion detected by LOCKDEP in dm-cache - Fix a dangling bio bug in the dm-thinp target's process_deferred_bios error path - Fix corruption due to non-atomic transaction commit which allowed a metadata superblock to be written before all other metadata was successfully written -- this is common to all targets that use the persistent-data library's transaction manager (dm-thinp, dm-cache and dm-era). - Various small cleanups in the DM core - Add the dm-era target which is useful for keeping track of which blocks were written within a user defined period of time called an 'era'. Use cases include tracking changed blocks for backup software, and partially invalidating the contents of a cache to restore cache coherency after rolling back a vendor snapshot. - Improve the on-disk layout of multithreaded writes to the dm-thin-pool by splitting the pool's deferred bio list to be a per-thin device list and then sorting that list using an rb_tree. The subsequent read throughput of the data written via multiple threads improved by ~70%. - Simplify the multipath target's handling of queuing IO by pushing requests back to the request queue rather than queueing the IO internally. * tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits) dm cache: fix a lock-inversion dm thin: sort the per thin deferred bios using an rb_tree dm thin: use per thin device deferred bio lists dm thin: simplify pool_is_congested dm thin: fix dangling bio in process_deferred_bios error path dm mpath: print more useful warnings in multipath_message() dm-mpath: do not activate failed paths dm mpath: remove extra nesting in map function dm mpath: remove map_io() dm mpath: reduce memory pressure when requeuing dm mpath: remove process_queued_ios() dm mpath: push back requests instead of queueing dm table: add dm_table_run_md_queue_async dm mpath: do not call pg_init when it is already running dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind dm: stop using bi_private dm: remove dm_get_mapinfo dm: make dm_table_alloc_md_mempools static dm: take care to copy the space map roots before locking the superblock dm transaction manager: fix corruption due to non-atomic transaction commit ...
| * | | | dm table: add dm_table_run_md_queue_asyncMike Snitzer2014-03-271-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce dm_table_run_md_queue_async() to run the request_queue of the mapped_device associated with a request-based DM table. Also add dm_md_get_queue() wrapper to extract the request_queue from a mapped_device. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
| * | | | dm: remove dm_get_mapinfoMikulas Patocka2014-03-271-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove dm_get_mapinfo() because no target uses it. Targets can allocate per-bio data using ti->per_bio_data_size, this is much more flexible than union map_info. Leave union map_info only for the request-based multipath target's use. Also delete the unused "unsigned long long ll" field of union map_info. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | | | | Merge tag 'iommu-updates-v3.15' of ↵Linus Torvalds2014-04-054-33/+67
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU upates from Joerg Roedel: "This time a few more updates queued up. - Rework VT-d code to support ACPI devices - Improvements for memory and PCI hotplug support in the VT-d driver - Device-tree support for OMAP IOMMU - Convert OMAP IOMMU to use devm_* interfaces - Fixed PASID support for AMD IOMMU - Other random cleanups and fixes for OMAP, ARM-SMMU and SHMOBILE IOMMU Most of the changes are in the VT-d driver because some rework was necessary for better hotplug and ACPI device support" * tag 'iommu-updates-v3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (75 commits) iommu/vt-d: Fix error handling in ANDD processing iommu/vt-d: returning free pointer in get_domain_for_dev() iommu/vt-d: Only call dmar_acpi_dev_scope_init() if DRHD units present iommu/vt-d: Check for NULL pointer in dmar_acpi_dev_scope_init() iommu/amd: Fix logic to determine and checking max PASID iommu/vt-d: Include ACPI devices in iommu=pt iommu/vt-d: Finally enable translation for non-PCI devices iommu/vt-d: Remove to_pci_dev() in intel_map_page() iommu/vt-d: Remove pdev from intel_iommu_attach_device() iommu/vt-d: Remove pdev from iommu_no_mapping() iommu/vt-d: Make domain_add_dev_info() take struct device iommu/vt-d: Make domain_remove_one_dev_info() take struct device iommu/vt-d: Rename 'hwdev' variables to 'dev' now that that's the norm iommu/vt-d: Remove some pointless to_pci_dev() calls iommu/vt-d: Make get_valid_domain_for_dev() take struct device iommu/vt-d: Make iommu_should_identity_map() take struct device iommu/vt-d: Handle RMRRs for non-PCI devices iommu/vt-d: Make get_domain_for_dev() take struct device iommu/vt-d: Make domain_context_mapp{ed,ing}() take struct device iommu/vt-d: Make device_to_iommu() cope with non-PCI devices ...
| | \ \ \ \
| | \ \ \ \
| *-. \ \ \ \ Merge branches 'iommu/fixes', 'arm/smmu', 'x86/amd', 'arm/omap', ↵Joerg Roedel2014-04-024-33/+67
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | 'arm/shmobile' and 'x86/vt-d' into next
| | | * | | | | iommu/vt-d: Store PCI segment number in struct intel_iommuDavid Woodhouse2014-03-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
| | | * | | | | iommu/vt-d: Change scope lists to struct device, bus, devfnDavid Woodhouse2014-03-241-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's not only for PCI devices any more, and the scope information for an ACPI device provides the bus and devfn so that has to be stored here too. It is the device pointer itself which needs to be protected with RCU, so the __rcu annotation follows it into the definition of struct dmar_dev_scope, since we're no longer just passing arrays of device pointers around. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
| | | * | | | | iommu/vt-d: Add ACPI namespace device reporting structuresDavid Woodhouse2014-03-201-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
| | | * | | | | iommu/vt-d: Update IOMMU state when memory hotplug happensJiang Liu2014-03-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If static identity domain is created, IOMMU driver needs to update si_domain page table when memory hotplug event happens. Otherwise PCI device DMA operations can't access the hot-added memory regions. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Joerg Roedel <joro@8bytes.org>
| | | * | | | | iommu/vt-d: Unify the way to process DMAR device scope arrayJiang Liu2014-03-041-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we have a PCI bus notification based mechanism to update DMAR device scope array, we could extend the mechanism to support boot time initialization too, which will help to unify and simplify the implementation. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Joerg Roedel <joro@8bytes.org>
| | | * | | | | iommu/vt-d: Update DRHD/RMRR/ATSR device scope caches when PCI hotplug happensJiang Liu2014-03-041-2/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current Intel DMAR/IOMMU driver assumes that all PCI devices associated with DMAR/RMRR/ATSR device scope arrays are created at boot time and won't change at runtime, so it caches pointers of associated PCI device object. That assumption may be wrong now due to: 1) introduction of PCI host bridge hotplug 2) PCI device hotplug through sysfs interfaces. Wang Yijing has tried to solve this issue by caching <bus, dev, func> tupple instead of the PCI device object pointer, but that's still unreliable because PCI bus number may change in case of hotplug. Please refer to http://lkml.org/lkml/2013/11/5/64 Message from Yingjing's mail: after remove and rescan a pci device [ 611.857095] dmar: DRHD: handling fault status reg 2 [ 611.857109] dmar: DMAR:[DMA Read] Request device [86:00.3] fault addr ffff7000 [ 611.857109] DMAR:[fault reason 02] Present bit in context entry is clear [ 611.857524] dmar: DRHD: handling fault status reg 102 [ 611.857534] dmar: DMAR:[DMA Read] Request device [86:00.3] fault addr ffff6000 [ 611.857534] DMAR:[fault reason 02] Present bit in context entry is clear [ 611.857936] dmar: DRHD: handling fault status reg 202 [ 611.857947] dmar: DMAR:[DMA Read] Request device [86:00.3] fault addr ffff5000 [ 611.857947] DMAR:[fault reason 02] Present bit in context entry is clear [ 611.858351] dmar: DRHD: handling fault status reg 302 [ 611.858362] dmar: DMAR:[DMA Read] Request device [86:00.3] fault addr ffff4000 [ 611.858362] DMAR:[fault reason 02] Present bit in context entry is clear [ 611.860819] IPv6: ADDRCONF(NETDEV_UP): eth3: link is not ready [ 611.860983] dmar: DRHD: handling fault status reg 402 [ 611.860995] dmar: INTR-REMAP: Request device [[86:00.3] fault index a4 [ 611.860995] INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear This patch introduces a new mechanism to update the DRHD/RMRR/ATSR device scope caches by hooking PCI bus notification. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Joerg Roedel <joro@8bytes.org>
| | | * | | | | iommu/vt-d: Use RCU to protect global resources in interrupt contextJiang Liu2014-03-041-6/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Global DMA and interrupt remapping resources may be accessed in interrupt context, so use RCU instead of rwsem to protect them in such cases. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Joerg Roedel <joro@8bytes.org>
| | | * | | | | iommu/vt-d: Introduce a rwsem to protect global data structuresJiang Liu2014-03-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a global rwsem dmar_global_lock, which will be used to protect DMAR related global data structures from DMAR/PCI/memory device hotplug operations in process context. DMA and interrupt remapping related data structures are read most, and only change when memory/PCI/DMAR hotplug event happens. So a global rwsem solution is adopted for balance between simplicity and performance. For interrupt remapping driver, function intel_irq_remapping_supported(), dmar_table_init(), intel_enable_irq_remapping(), disable_irq_remapping(), reenable_irq_remapping() and enable_drhd_fault_handling() etc are called during booting, suspending and resuming with interrupt disabled, so no need to take the global lock. For interrupt remapping entry allocation, the locking model is: down_read(&dmar_global_lock); /* Find corresponding iommu */ iommu = map_hpet_to_ir(id); if (iommu) /* * Allocate remapping entry and mark entry busy, * the IOMMU won't be hot-removed until the * allocated entry has been released. */ index = alloc_irte(iommu, irq, 1); up_read(&dmar_global_lock); For DMA remmaping driver, we only uses the dmar_global_lock rwsem to protect functions which are only called in process context. For any function which may be called in interrupt context, we will use RCU to protect them in following patches. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Joerg Roedel <joro@8bytes.org>