summaryrefslogtreecommitdiffstats
path: root/init
Commit message (Collapse)AuthorAgeFilesLines
* Make CONFIG_FHANDLE default yAndi Kleen2016-04-011-1/+2
| | | | | | | | | | | | | | | | | | | | Newer Fedora and OpenSUSE didn't boot with my standard configuration. It took me some time to figure out why, in fact I had to write a script to try different config options systematically. The problem is that something (systemd) in dracut depends on CONFIG_FHANDLE, which adds open by file handle syscalls. While it is set in defconfigs it is very easy to miss when updating older configs because it is not default y. Make it default y and also depend on EXPERT, as dracut use is likely widespread. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-4.6' of ↵Linus Torvalds2016-03-181-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "cgroup changes for v4.6-rc1. No userland visible behavior changes in this pull request. I'll send out a separate pull request for the addition of cgroup namespace support. - The biggest change is the revamping of cgroup core task migration and controller handling logic. There are quite a few places where controllers and tasks are manipulated. Previously, many of those places implemented custom operations for each specific use case assuming specific starting conditions. While this worked, it makes the code fragile and difficult to follow. The bulk of this pull request restructures these operations so that most related operations are performed through common helpers which implement recursive (subtrees are always processed consistently) and idempotent (they make cgroup hierarchy converge to the target state rather than performing operations assuming specific starting conditions). This makes the code a lot easier to understand, verify and extend. - Implicit controller support is added. This is primarily for using perf_event on the v2 hierarchy so that perf can match cgroup v2 path without requiring the user to do anything special. The kernel portion of perf_event changes is acked but userland changes are still pending review. - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in certain environments. - There is a regression introduced during v4.4 devel cycle where attempts to migrate zombie tasks can mess up internal object management. This was fixed earlier this week and included in this pull request w/ stable cc'd. - Misc non-critical fixes and improvements" * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits) cgroup: avoid false positive gcc-6 warning cgroup: ignore css_sets associated with dead cgroups during migration Documentation: cgroup v2: Trivial heading correction. cgroup: implement cgroup_subsys->implicit_on_dfl cgroup: use css_set->mg_dst_cgrp for the migration target cgroup cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup cgroup: move migration destination verification out of cgroup_migrate_prepare_dst() cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses() cgroup: Trivial correction to reflect controller. cgroup: remove stale item in cgroup-v1 document INDEX file. cgroup: update css iteration in cgroup_update_dfl_csses() cgroup: allocate 2x cgrp_cset_links when setting up a new root cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends cgroup: use cgroup_apply_enable_control() in cgroup creation path cgroup: combine cgroup_mutex locking and offline css draining cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write() cgroup: introduce cgroup_{save|propagate|restore}_control() cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write() ...
| * cgroup: Trivial correction to reflect controller.Parav Pandit2016-03-051-2/+2
| | | | | | | | | | | | | | | | | | Trivial correction in menuconfig help to reflect PIDs as controller instead of subsystem to align to rest of the text and documentation. Signed-off-by: Parav Pandit <pandit.parav@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | Merge branch 'next' of ↵Linus Torvalds2016-03-171-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "There are a bunch of fixes to the TPM, IMA, and Keys code, with minor fixes scattered across the subsystem. IMA now requires signed policy, and that policy is also now measured and appraised" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (67 commits) X.509: Make algo identifiers text instead of enum akcipher: Move the RSA DER encoding check to the crypto layer crypto: Add hash param to pkcs1pad sign-file: fix build with CMS support disabled MAINTAINERS: update tpmdd urls MODSIGN: linux/string.h should be #included to get memcpy() certs: Fix misaligned data in extra certificate list X.509: Handle midnight alternative notation in GeneralizedTime X.509: Support leap seconds Handle ISO 8601 leap seconds and encodings of midnight in mktime64() X.509: Fix leap year handling again PKCS#7: fix unitialized boolean 'want' firmware: change kernel read fail to dev_dbg() KEYS: Use the symbol value for list size, updated by scripts/insert-sys-cert KEYS: Reserve an extra certificate symbol for inserting without recompiling modsign: hide openssl output in silent builds tpm_tis: fix build warning with tpm_tis_resume ima: require signed IMA policy ima: measure and appraise the IMA policy itself ima: load policy using path ...
| * | akcipher: Move the RSA DER encoding check to the crypto layerDavid Howells2016-03-031-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the RSA EMSA-PKCS1-v1_5 encoding from the asymmetric-key public_key subtype to the rsa crypto module's pkcs1pad template. This means that the public_key subtype no longer has any dependencies on public key type. To make this work, the following changes have been made: (1) The rsa pkcs1pad template is now used for RSA keys. This strips off the padding and returns just the message hash. (2) In a previous patch, the pkcs1pad template gained an optional second parameter that, if given, specifies the hash used. We now give this, and pkcs1pad checks the encoded message E(M) for the EMSA-PKCS1-v1_5 encoding and verifies that the correct digest OID is present. (3) The crypto driver in crypto/asymmetric_keys/rsa.c is now reduced to something that doesn't care about what the encryption actually does and and has been merged into public_key.c. (4) CONFIG_PUBLIC_KEY_ALGO_RSA is gone. Module signing must set CONFIG_CRYPTO_RSA=y instead. Thoughts: (*) Should the encoding style (eg. raw, EMSA-PKCS1-v1_5) also be passed to the padding template? Should there be multiple padding templates registered that share most of the code? Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
* | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-03-162-3/+23
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge first patch-bomb from Andrew Morton: - some misc things - ofs2 updates - about half of MM - checkpatch updates - autofs4 update * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits) autofs4: fix string.h include in auto_dev-ioctl.h autofs4: use pr_xxx() macros directly for logging autofs4: change log print macros to not insert newline autofs4: make autofs log prints consistent autofs4: fix some white space errors autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked() autofs4: fix coding style line length in autofs4_wait() autofs4: fix coding style problem in autofs4_get_set_timeout() autofs4: coding style fixes autofs: show pipe inode in mount options kallsyms: add support for relative offsets in kallsyms address table kallsyms: don't overload absolute symbol type for percpu symbols x86: kallsyms: disable absolute percpu symbols on !SMP checkpatch: fix another left brace warning checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses checkpatch: warn on bare unsigned or signed declarations without int checkpatch: exclude asm volatile from complex macro check mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate() mm: migrate: consolidate mem_cgroup_migrate() calls mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous ...
| * | kallsyms: add support for relative offsets in kallsyms address tableArd Biesheuvel2016-03-151-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to how relative extables are implemented, it is possible to emit the kallsyms table in such a way that it contains offsets relative to some anchor point in the kernel image rather than absolute addresses. On 64-bit architectures, it cuts the size of the kallsyms address table in half, since offsets between kernel symbols can typically be expressed in 32 bits. This saves several hundreds of kilobytes of permanent .rodata on average. In addition, the kallsyms address table is no longer subject to dynamic relocation when CONFIG_RELOCATABLE is in effect, so the relocation work done after decompression now doesn't have to do relocation updates for all these values. This saves up to 24 bytes (i.e., the size of a ELF64 RELA relocation table entry) per value, which easily adds up to a couple of megabytes of uncompressed __init data on ppc64 or arm64. Even if these relocation entries typically compress well, the combined size reduction of 2.8 MB uncompressed for a ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500 KB space saving in the compressed image. Since it is useful for some architectures (like x86) to retain the ability to emit absolute values as well, this patch also adds support for capturing both absolute and relative values when KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu addresses as positive 32-bit values, and addresses relative to the lowest encountered relative symbol as negative values, which are subtracted from the runtime address of this base symbol to produce the actual address. Support for the above is enabled by default for all architectures except IA-64 and Tile-GX, whose symbols are too far apart to capture in this manner. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | x86: kallsyms: disable absolute percpu symbols on !SMPArd Biesheuvel2016-03-151-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | scripts/kallsyms.c has a special --absolute-percpu command line option which deals with the zero based per cpu offsets that are used when building for SMP on x86_64. This means that the option should only be passed in that case, so add a Kconfig symbol with the correct predicate, and use that instead. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | init/main.c: use list_for_each_entry()Geliang Tang2016-03-151-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | Use list_for_each_entry() instead of list_for_each() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds2016-03-151-15/+1
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cpu hotplug updates from Thomas Gleixner: "This is the first part of the ongoing cpu hotplug rework: - Initial implementation of the state machine - Runs all online and prepare down callbacks on the plugged cpu and not on some random processor - Replaces busy loop waiting with completions - Adds tracepoints so the states can be followed" More detailed commentary on this work from an earlier email: "What's wrong with the current cpu hotplug infrastructure? - Asymmetry The hotplug notifier mechanism is asymmetric versus the bringup and teardown. This is mostly caused by the notifier mechanism. - Largely undocumented dependencies While some notifiers use explicitely defined notifier priorities, we have quite some notifiers which use numerical priorities to express dependencies without any documentation why. - Control processor driven Most of the bringup/teardown of a cpu is driven by a control processor. While it is understandable, that preperatory steps, like idle thread creation, memory allocation for and initialization of essential facilities needs to be done before a cpu can boot, there is no reason why everything else must run on a control processor. Before this patch series, bringup looks like this: Control CPU Booting CPU do preparatory steps kick cpu into life do low level init sync with booting cpu sync with control cpu bring the rest up - All or nothing approach There is no way to do partial bringups. That's something which is really desired because we waste e.g. at boot substantial amount of time just busy waiting that the cpu comes to life. That's stupid as we could very well do preparatory steps and the initial IPI for other cpus and then go back and do the necessary low level synchronization with the freshly booted cpu. - Minimal debuggability Due to the notifier based design, it's impossible to switch between two stages of the bringup/teardown back and forth in order to test the correctness. So in many hotplug notifiers the cancel mechanisms are either not existant or completely untested. - Notifier [un]registering is tedious To [un]register notifiers we need to protect against hotplug at every callsite. There is no mechanism that bringup/teardown callbacks are issued on the online cpus, so every caller needs to do it itself. That also includes error rollback. What's the new design? The base of the new design is a symmetric state machine, where both the control processor and the booting/dying cpu execute a well defined set of states. Each state is symmetric in the end, except for some well defined exceptions, and the bringup/teardown can be stopped and reversed at almost all states. So the bringup of a cpu will look like this in the future: Control CPU Booting CPU do preparatory steps kick cpu into life do low level init sync with booting cpu sync with control cpu bring itself up The synchronization step does not require the control cpu to wait. That mechanism can be done asynchronously via a worker or some other mechanism. The teardown can be made very similar, so that the dying cpu cleans up and brings itself down. Cleanups which need to be done after the cpu is gone, can be scheduled asynchronously as well. There is a long way to this, as we need to refactor the notion when a cpu is available. Today we set the cpu online right after it comes out of the low level bringup, which is not really correct. The proper mechanism is to set it to available, i.e. cpu local threads, like softirqd, hotplug thread etc. can be scheduled on that cpu, and once it finished all booting steps, it's set to online, so general workloads can be scheduled on it. The reverse happens on teardown. First thing to do is to forbid scheduling of general workloads, then teardown all the per cpu resources and finally shut it off completely. This patch series implements the basic infrastructure for this at the core level. This includes the following: - Basic state machine implementation with well defined states, so ordering and prioritization can be expressed. - Interfaces to [un]register state callbacks This invokes the bringup/teardown callback on all online cpus with the proper protection in place and [un]installs the callbacks in the state machine array. For callbacks which have no particular ordering requirement we have a dynamic state space, so that drivers don't have to register an explicit hotplug state. If a callback fails, the code automatically does a rollback to the previous state. - Sysfs interface to drive the state machine to a particular step. This is only partially functional today. Full functionality and therefor testability will be achieved once we converted all existing hotplug notifiers over to the new scheme. - Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying processor: Control CPU Booting CPU do preparatory steps kick cpu into life do low level init sync with booting cpu sync with control cpu wait for boot bring itself up Signal completion to control cpu In a previous step of this work we've done a full tree mechanical conversion of all hotplug notifiers to the new scheme. The balance is a net removal of about 4000 lines of code. This is not included in this series, as we decided to take a different approach. Instead of mechanically converting everything over, we will do a proper overhaul of the usage sites one by one so they nicely fit into the symmetric callback scheme. I decided to do that after I looked at the ugliness of some of the converted sites and figured out that their hotplug mechanism is completely buggered anyway. So there is no point to do a mechanical conversion first as we need to go through the usage sites one by one again in order to achieve a full symmetric and testable behaviour" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) cpu/hotplug: Document states better cpu/hotplug: Fix smpboot thread ordering cpu/hotplug: Remove redundant state check cpu/hotplug: Plug death reporting race rcu: Make CPU_DYING_IDLE an explicit call cpu/hotplug: Make wait for dead cpu completion based cpu/hotplug: Let upcoming cpu bring itself fully up arch/hotplug: Call into idle with a proper state cpu/hotplug: Move online calls to hotplugged cpu cpu/hotplug: Create hotplug threads cpu/hotplug: Split out the state walk into functions cpu/hotplug: Unpark smpboot threads from the state machine cpu/hotplug: Move scheduler cpu_online notifier to hotplug core cpu/hotplug: Implement setup/removal interface cpu/hotplug: Make target state writeable cpu/hotplug: Add sysfs state interface cpu/hotplug: Hand in target state to _cpu_up/down cpu/hotplug: Convert the hotplugged cpu work to a state machine cpu/hotplug: Convert to a state machine for the control processor cpu/hotplug: Add tracepoints ...
| * | cpu/hotplug: Unpark smpboot threads from the state machineThomas Gleixner2016-03-011-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Handle the smpboot threads in the state machine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: Rik van Riel <riel@redhat.com> Cc: Rafael Wysocki <rafael.j.wysocki@intel.com> Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/20160226182341.295777684@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | cpu/hotplug: Convert to a state machine for the control processorThomas Gleixner2016-03-011-14/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the split out steps into a callback array and let the cpu_up/down code iterate through the array functions. For now most of the callbacks are asymmetric to resemble the current hotplug maze. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: Rik van Riel <riel@redhat.com> Cc: Rafael Wysocki <rafael.j.wysocki@intel.com> Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/20160226182340.671816690@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | Merge branch 'mm-readonly-for-linus' of ↵Linus Torvalds2016-03-141-4/+23
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull read-only kernel memory updates from Ingo Molnar: "This tree adds two (security related) enhancements to the kernel's handling of read-only kernel memory: - extend read-only kernel memory to a new class of formerly writable kernel data: 'post-init read-only memory' via the __ro_after_init attribute, and mark the ARM and x86 vDSO as such read-only memory. This kind of attribute can be used for data that requires a once per bootup initialization sequence, but is otherwise never modified after that point. This feature was based on the work by PaX Team and Brad Spengler. (by Kees Cook, the ARM vDSO bits by David Brown.) - make CONFIG_DEBUG_RODATA always enabled on x86 and remove the Kconfig option. This simplifies the kernel and also signals that read-only memory is the default model and a first-class citizen. (Kees Cook)" * 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ARM/vdso: Mark the vDSO code read-only after init x86/vdso: Mark the vDSO code read-only after init lkdtm: Verify that '__ro_after_init' works correctly arch: Introduce post-init read-only memory x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings asm-generic: Consolidate mark_rodata_ro()
| * | mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel ↵Kees Cook2016-02-221-4/+23
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mappings It may be useful to debug writes to the readonly sections of memory, so provide a cmdline "rodata=off" to allow for this. This can be expanded in the future to support "log" and "write" modes, but that will need to be architecture-specific. This also makes KDB software breakpoints more usable, as read-only mappings can now be disabled on any kernel. Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Brown <david.brown@linaro.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Emese Revfy <re.emese@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: PaX Team <pageexec@freemail.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-hardening@lists.openwall.com Cc: linux-arch <linux-arch@vger.kernel.org> Link: http://lkml.kernel.org/r/1455748879-21872-3-git-send-email-keescook@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* / locking/lockdep: Eliminate lockdep_init()Andrey Ryabinin2016-02-091-5/+0
|/ | | | | | | | | | | | | | | Lockdep is initialized at compile time now. Get rid of lockdep_init(). Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Krinkin <krinkin.m.u@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: mm-commits@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: memcontrol: rein in the CONFIG space madnessJohannes Weiner2016-01-201-18/+3
| | | | | | | | | | | | | What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory controller code is insignificant, having these conditionals is not worth the complication and fragility that comes with them. [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEMJohannes Weiner2016-01-201-1/+9
| | | | | | | | | | | | | | Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2 interface. This also makes legacy-only code sections stand out better. [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* init/do_mounts: initrd_load() can be booleanYaowei Bai2016-01-202-5/+5
| | | | | | | | | | | Make initrd_load() return bool due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* init/main.c: obsolete_checksetup can be booleanYaowei Bai2016-01-201-5/+5
| | | | | | | | | | | Make obsolete_checksetup() return bool due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds2016-01-171-8/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull audit updates from Paul Moore: "Seven audit patches for 4.5, all very minor despite the diffstat. The diffstat churn for linux/audit.h can be attributed to needing to reshuffle the linux/audit.h header to fix the seccomp auditing issue (see the commit description for details). Besides the seccomp/audit fix, most of the fixes are around trying to improve the connection with the audit daemon and a Kconfig simplification. Nothing crazy, and everything passes our little audit-testsuite" * 'upstream' of git://git.infradead.org/users/pcmoore/audit: audit: always enable syscall auditing when supported and audit is enabled audit: force seccomp event logging to honor the audit_enabled flag audit: Delete unnecessary checks before two function calls audit: wake up threads if queue switched from limited to unlimited audit: include auditd's threads in audit_log_start() wait exception audit: remove audit_backlog_wait_overflow audit: don't needlessly reset valid wait time
| * audit: always enable syscall auditing when supported and audit is enabledPaul Moore2016-01-131-8/+3
| | | | | | | | | | | | | | | | | | To the best of our knowledge, everyone who enables audit at compile time also enables syscall auditing; this patch simplifies the Kconfig menus by removing the option to disable syscall auditing when audit is selected and the target arch supports it. Signed-off-by: Paul Moore <pmoore@redhat.com>
* | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-01-171-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge second patch-bomb from Andrew Morton: - more MM stuff: - Kirill's page-flags rework - Kirill's now-allegedly-fixed THP rework - MADV_FREE implementation - DAX feature work (msync/fsync). This isn't quite complete but DAX is new and it's good enough and the guys have a handle on what needs to be done - I expect this to be wrapped in the next week or two. - some vsprintf maintenance work - various other misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (145 commits) printk: change recursion_bug type to bool lib/vsprintf: factor out %pN[F] handler as netdev_bits() lib/vsprintf: refactor duplicate code to special_hex_number() printk-formats.txt: remove unimplemented %pT printk: help pr_debug and pr_devel to optimize out arguments lib/test_printf.c: test dentry printing lib/test_printf.c: add test for large bitmaps lib/test_printf.c: account for kvasprintf tests lib/test_printf.c: add a few number() tests lib/test_printf.c: test precision quirks lib/test_printf.c: check for out-of-bound writes lib/test_printf.c: don't BUG lib/kasprintf.c: add sanity check to kvasprintf lib/vsprintf.c: warn about too large precisions and field widths lib/vsprintf.c: help gcc make number() smaller lib/vsprintf.c: expand field_width to 24 bits lib/vsprintf.c: eliminate potential race in string() lib/vsprintf.c: move string() below widen_string() lib/vsprintf.c: pull out padding code from dentry_name() printk: do cond_resched() between lines while outputting to consoles ...
| * | uselib: default depending if libc5 was usedRiku Voipio2016-01-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | uselib hasn't been used since libc5; glibc does not use it. Deprecate uselib a bit more, by making the default y only if libc5 was widely used on the plaform. This makes arm64 kernel built with defconfig slightly smaller bloat-o-meter: add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390) function old new delta kernel_config_data 18164 18162 -2 uselib_flags 20 - -20 padzero 216 192 -24 sys_uselib 380 - -380 load_elf_library 964 - -964 Signed-off-by: Riku Voipio <riku.voipio@linaro.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'docs-4.5' of git://git.lwn.net/linuxLinus Torvalds2016-01-171-7/+0
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull documentation updates from Jon Corbet: "A relatively boring cycle in the docs tree. There's a few kernel-doc fixes and various document tweaks. One patch reaches out of the documentation subtree to fix a comment in init/do_mounts_rd.c. There didn't seem to be anybody more appropriate to take that one, so I accepted it" * tag 'docs-4.5' of git://git.lwn.net/linux: (29 commits) thermal: add description for integral_cutoff unit Documentation: update libhugetlbfs site url Documentation: Explain pci=conf1,conf2 more verbosely DMA-API: fix confusing sentence in Documentation/DMA-API.txt Documentation: translations: update linux cross reference link Documentation: fix typo in CodingStyle init, Documentation: Remove ramdisk_blocksize mentions Documentation-getdelays: Apply a recommendation from "checkpatch.pl" in main() Documentation: HOWTO: update versions from 3.x to 4.x Documentation: remove outdated references from translations Doc: treewide: Fix grammar "a" to "an" Documentation: cpu-hotplug: Fix sysfs mount instructions can-doc: Add hint about getting timestamps Fix CFQ I/O scheduler parameter name in documentation Documentation: arm: remove dead links from Marvell Berlin docs Documentation: HOWTO: update code cross reference link Doc: Docbook/iio: Fix typo in iio.tmpl DocBook: make index.html generation less verbose by default DocBook: Cleanup: remove an unused $(call) line DocBook: Add a help message for DOCBOOKS env var ...
| * | init, Documentation: Remove ramdisk_blocksize mentionsRobert Elliott2015-12-261-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The brd driver has never supported the ramdisk_blocksize kernel parameter that was in the rd driver it replaced, so remove mention of this parameter from comments and Documentation. Commit 9db5579be4bb ("rewrite rd") replaced rd with brd, keeping a brd_blocksize variable in struct brd_device but never using it. Commit a2cba2913c76 ("brd: get rid of unused members from struct brd_device") removed the unused variable. Commit f5abc8e75815 ("Documentation/blockdev/ramdisk.txt: updates") removed mentions of ramdisk_blocksize from that file. Signed-off-by: Robert Elliott <elliott@hpe.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
* | | Merge branch 'for-4.5' of ↵Linus Torvalds2016-01-121-127/+114
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cgroup v2 interface is now official. It's no longer hidden behind a devel flag and can be mounted using the new cgroup2 fs type. Unfortunately, cpu v2 interface hasn't made it yet due to the discussion around in-process hierarchical resource distribution and only memory and io controllers can be used on the v2 interface at the moment. - The existing documentation which has always been a bit of mess is relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt is added as the authoritative documentation for the v2 interface. - Some features are added through for-4.5-ancestor-test branch to enable netfilter xt_cgroup match to use cgroup v2 paths. The actual netfilter changes will be merged through the net tree which pulled in the said branch. - Various cleanups * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rename cgroup documentations cgroup: fix a typo. cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX. cgroup: demote subsystem init messages to KERN_DEBUG cgroup: Fix uninitialized variable warning cgroup: put controller Kconfig options in meaningful order cgroup: clean up the kernel configuration menu nomenclature cgroup_pids: fix a typo. Subject: cgroup: Fix incomplete dd command in blkio documentation cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends cpuset: Replace all instances of time_t with time64_t cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/ cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
| * | | cgroup: put controller Kconfig options in meaningful orderJohannes Weiner2015-12-181-107/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To make it easier to quickly find what's needed list the basic resource controllers of cgroup2 first - io, memory, cpu - while pushing the more exotic and/or legacy controllers to the bottom. tj: Removed spurious "&& CGROUPS" from CGROUP_PERF as suggested by Li. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: clean up the kernel configuration menu nomenclatureJohannes Weiner2015-12-181-39/+26
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | The config options for the different cgroup controllers use various terms: resource controller, cgroup subsystem, etc. Simplify this to "controller", which is clear enough in the cgroup context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | Merge branch 'for-mingo' of ↵Ingo Molnar2016-01-061-0/+2
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU changes from Paul E. McKenney: - Adding transitivity uniformly to rcu_node structure ->lock acquisitions. (This is implemented by the first two commits on top of v4.4-rc2 due to the pervasive nature of this change.) - Documentation updates, including RCU requirements. - Expedited grace-period changes. - Miscellaneous fixes. - Linked-list fixes, courtesy of KTSAN. - Torture-test updates. - Late-breaking fix to sysrq-generated crash. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | rcu: Wire up rcu_end_inkernel_boot()Paul E. McKenney2015-12-041-0/+2
| |/ | | | | | | | | | | | | | | | | This commit adds the invocation of rcu_end_inkernel_boot() just before init is invoked. This allows the CONFIG_RCU_EXPEDITE_BOOT Kconfig option to do something useful and prepares for the upcoming rcupdate.rcu_normal_after_boot kernel parameter. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* / kernel: remove stop_machine() Kconfig dependencyChris Wilson2015-12-121-7/+0
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the full stop_machine() routine is only enabled on SMP if module unloading is enabled, or if the CPUs are hotpluggable. This leads to configurations where stop_machine() is broken as it will then only run the callback on the local CPU with irqs disabled, and not stop the other CPUs or run the callback on them. For example, this breaks MTRR setup on x86 in certain configs since ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and text_poke_smp_batch() functions") as the MTRR is only established on the boot CPU. This patch removes the Kconfig option for STOP_MACHINE and uses the SMP and HOTPLUG_CPU config options to compile the correct stop_machine() for the architecture, removing the false dependency on MODULE_UNLOAD in the process. Link: https://lkml.org/lkml/2014/10/8/124 References: https://bugs.freedesktop.org/show_bug.cgi?id=84794 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Iulia Manda <iulia.manda21@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Chuck Ebbert <cebbert.lkml@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sys_membarrier(): system-wide memory barrier (generic, x86)Mathieu Desnoyers2015-09-111-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here is an implementation of a new system call, sys_membarrier(), which executes a memory barrier on all threads running on the system. It is implemented by calling synchronize_sched(). It can be used to distribute the cost of user-space memory barriers asymmetrically by transforming pairs of memory barriers into pairs consisting of sys_membarrier() and a compiler barrier. For synchronization primitives that distinguish between read-side and write-side (e.g. userspace RCU [1], rwlocks), the read-side can be accelerated significantly by moving the bulk of the memory barrier overhead to the write-side. The existing applications of which I am aware that would be improved by this system call are as follows: * Through Userspace RCU library (http://urcu.so) - DNS server (Knot DNS) https://www.knot-dns.cz/ - Network sniffer (http://netsniff-ng.org/) - Distributed object storage (https://sheepdog.github.io/sheepdog/) - User-space tracing (http://lttng.org) - Network storage system (https://www.gluster.org/) - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf) - Financial software (https://lkml.org/lkml/2015/3/23/189) Those projects use RCU in userspace to increase read-side speed and scalability compared to locking. Especially in the case of RCU used by libraries, sys_membarrier can speed up the read-side by moving the bulk of the memory barrier cost to synchronize_rcu(). * Direct users of sys_membarrier - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198) Microsoft core dotnet GC developers are planning to use the mprotect() side-effect of issuing memory barriers through IPIs as a way to implement Windows FlushProcessWriteBuffers() on Linux. They are referring to sys_membarrier in their github thread, specifically stating that sys_membarrier() is what they are looking for. To explain the benefit of this scheme, let's introduce two example threads: Thread A (non-frequent, e.g. executing liburcu synchronize_rcu()) Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock()) In a scheme where all smp_mb() in thread A are ordering memory accesses with respect to smp_mb() present in Thread B, we can change each smp_mb() within Thread A into calls to sys_membarrier() and each smp_mb() within Thread B into compiler barriers "barrier()". Before the change, we had, for each smp_mb() pairs: Thread A Thread B previous mem accesses previous mem accesses smp_mb() smp_mb() following mem accesses following mem accesses After the change, these pairs become: Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses As we can see, there are two possible scenarios: either Thread B memory accesses do not happen concurrently with Thread A accesses (1), or they do (2). 1) Non-concurrent Thread A vs Thread B accesses: Thread A Thread B prev mem accesses sys_membarrier() follow mem accesses prev mem accesses barrier() follow mem accesses In this case, thread B accesses will be weakly ordered. This is OK, because at that point, thread A is not particularly interested in ordering them with respect to its own accesses. 2) Concurrent Thread A vs Thread B accesses Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses In this case, thread B accesses, which are ensured to be in program order thanks to the compiler barrier, will be "upgraded" to full smp_mb() by synchronize_sched(). * Benchmarks On Intel Xeon E5405 (8 cores) (one thread is calling sys_membarrier, the other 7 threads are busy looping) 1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call. * User-space user of this system call: Userspace RCU library Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process threads (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such barrier anyway, because these are implied by the scheduler context switches. Results in liburcu: Operations in 10s, 6 readers, 2 writers: memory barriers in reader: 1701557485 reads, 2202847 writes signal-based scheme: 9830061167 reads, 6700 writes sys_membarrier: 9952759104 reads, 425 writes sys_membarrier (dyn. check): 7970328887 reads, 425 writes The dynamic sys_membarrier availability check adds some overhead to the read-side compared to the signal-based scheme, but besides that, sys_membarrier slightly outperforms the signal-based scheme. However, this non-expedited sys_membarrier implementation has a much slower grace period than signal and memory barrier schemes. Besides diminishing the number of wake-ups, one major advantage of the membarrier system call over the signal-based scheme is that it does not need to reserve a signal. This plays much more nicely with libraries, and with processes injected into for tracing purposes, for which we cannot expect that signals will be unused by the application. An expedited version of this system call can be added later on to speed up the grace period. Its implementation will likely depend on reading the cpu_curr()->mm without holding each CPU's rq lock. This patch adds the system call to x86 and to asm-generic. [1] http://urcu.so membarrier(2) man page: MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2) NAME membarrier - issue memory barriers on a set of threads SYNOPSIS #include <linux/membarrier.h> int membarrier(int cmd, int flags); DESCRIPTION The cmd argument is one of the following: MEMBARRIER_CMD_QUERY Query the set of supported commands. It returns a bitmask of supported commands. MEMBARRIER_CMD_SHARED Execute a memory barrier on all threads running on the system. Upon return from system call, the caller thread is ensured that all running threads have passed through a state where all memory accesses to user-space addresses match program order between entry to and return from the system call (non-running threads are de facto in such a state). This covers threads from all pro=E2=80=90 cesses running on the system. This command returns 0. The flags argument needs to be 0. For future extensions. All memory accesses performed in program order from each targeted thread is guaranteed to be ordered with respect to sys_membarrier(). If we use the semantic "barrier()" to represent a compiler barrier forcing memory accesses to be performed in program order across the barrier, and smp_mb() to represent explicit memory barriers forcing full memory ordering across the barrier, we have the following ordering table for each pair of barrier(), sys_membarrier() and smp_mb(): The pair ordering is detailed as (O: ordered, X: not ordered): barrier() smp_mb() sys_membarrier() barrier() X X O smp_mb() X O O sys_membarrier() O O O RETURN VALUE On success, these system calls return zero. On error, -1 is returned, and errno is set appropriately. For a given command, with flags argument set to 0, this system call is guaranteed to always return the same value until reboot. ERRORS ENOSYS System call is not implemented. EINVAL Invalid arguments. Linux 2015-04-15 MEMBARRIER(2) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nicholas Miell <nmiell@comcast.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kexec: split kexec_load syscall from kexec core codeDave Young2015-09-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two kexec load syscalls, kexec_load another and kexec_file_load. kexec_file_load has been splited as kernel/kexec_file.c. In this patch I split kexec_load syscall code to kernel/kexec.c. And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and use kexec_file_load only, or vice verse. The original requirement is from Ted Ts'o, he want kexec kernel signature being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use kexec_load syscall can bypass the checking. Vivek Goyal proposed to create a common kconfig option so user can compile in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects KEXEC_CORE so that old config files still work. Because there's general code need CONFIG_KEXEC_CORE, so I updated all the architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects KEXEC_CORE in arch Kconfig. Also updated general kernel code with to kexec_load syscall. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kmod: use system_unbound_wq instead of khelperFrederic Weisbecker2015-09-101-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to launch the usermodehelper kernel threads with the widest affinity and this is partly why we use khelper. This workqueue has unbound properties and thus a wide affinity inherited by all its children. Now khelper also has special properties that we aren't much interested in: ordered and singlethread. There is really no need about ordering as all we do is creating kernel threads. This can be done concurrently. And singlethread is a useless limitation as well. The workqueue engine already proposes generic unbound workqueues that don't share these useless properties and handle well parallel jobs. The only worrysome specific is their affinity to the node of the current CPU. It's fine for creating the usermodehelper kernel threads but those inherit this affinity for longer jobs such as requesting modules. This patch proposes to use these node affine unbound workqueues assuming that a node is sufficient to handle several parallel usermodehelper requests. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'next' of ↵Linus Torvalds2015-09-081-19/+21
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Highlights: - PKCS#7 support added to support signed kexec, also utilized for module signing. See comments in 3f1e1bea. ** NOTE: this requires linking against the OpenSSL library, which must be installed, e.g. the openssl-devel on Fedora ** - Smack - add IPv6 host labeling; ignore labels on kernel threads - support smack labeling mounts which use binary mount data - SELinux: - add ioctl whitelisting (see http://kernsec.org/files/lss2015/vanderstoep.pdf) - fix mprotect PROT_EXEC regression caused by mm change - Seccomp: - add ptrace options for suspend/resume" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits) PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them Documentation/Changes: Now need OpenSSL devel packages for module signing scripts: add extract-cert and sign-file to .gitignore modsign: Handle signing key in source tree modsign: Use if_changed rule for extracting cert from module signing key Move certificate handling to its own directory sign-file: Fix warning about BIO_reset() return value PKCS#7: Add MODULE_LICENSE() to test module Smack - Fix build error with bringup unconfigured sign-file: Document dependency on OpenSSL devel libraries PKCS#7: Appropriately restrict authenticated attributes and content type KEYS: Add a name for PKEY_ID_PKCS7 PKCS#7: Improve and export the X.509 ASN.1 time object decoder modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS extract-cert: Cope with multiple X.509 certificates in a single file sign-file: Generate CMS message as signature instead of PKCS#7 PKCS#7: Support CMS messages also [RFC5652] X.509: Change recorded SKID & AKID to not include Subject or Issuer PKCS#7: Check content type and versions MAINTAINERS: The keyrings mailing list has moved ...
| * Move certificate handling to its own directoryDavid Howells2015-08-141-39/+0
| | | | | | | | | | | | | | | | | | Move certificate handling out of the kernel/ directory and into a certs/ directory to get all the weird stuff in one place and move the generated signing keys into this directory. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
| * sign-file: Document dependency on OpenSSL devel librariesDavid Howells2015-08-121-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | The revised sign-file program is no longer a script that wraps the openssl program, but now rather a program that makes use of OpenSSL's crypto library. This means that to build the sign-file program, the kernel build process now has a dependency on the OpenSSL development packages in addition to OpenSSL itself. Document this in Kconfig and in module-signing.txt. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
| * modsign: Add explicit CONFIG_SYSTEM_TRUSTED_KEYS optionDavid Woodhouse2015-08-071-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let the user explicitly provide a file containing trusted keys, instead of just automatically finding files matching *.x509 in the build tree and trusting whatever we find. This really ought to be an *explicit* configuration, and the build rules for dealing with the files were fairly painful too. Fix applied from James Morris that removes an '=' from a macro definition in kernel/Makefile as this is a feature that only exists from GNU make 3.82 onwards. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
| * modsign: Use single PEM file for autogenerated keyDavid Woodhouse2015-08-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current rule for generating signing_key.priv and signing_key.x509 is a classic example of a bad rule which has a tendency to break parallel make. When invoked to create *either* target, it generates the other target as a side-effect that make didn't predict. So let's switch to using a single file signing_key.pem which contains both key and certificate. That matches what we do in the case of an external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also slightly cleaner. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
| * modsign: Extract signing cert from CONFIG_MODULE_SIG_KEY if neededDavid Woodhouse2015-08-071-4/+4
| | | | | | | | | | | | | | | | | | Where an external PEM file or PKCS#11 URI is given, we can get the cert from it for ourselves instead of making the user drop signing_key.x509 in place for us. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
| * modsign: Allow external signing key to be specifiedDavid Woodhouse2015-08-071-0/+14
| | | | | | | | | | Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
| * MODSIGN: Extract the blob PKCS#7 signature verifier from module signingDavid Howells2015-08-071-10/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Extract the function that drives the PKCS#7 signature verification given a data blob and a PKCS#7 blob out from the module signing code and lump it with the system keyring code as it's generic. This makes it independent of module config options and opens it to use by the firmware loader. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ming Lei <ming.lei@canonical.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Kyle McMartin <kyle@kernel.org>
| * MODSIGN: Use PKCS#7 messages as module signaturesDavid Howells2015-08-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move to using PKCS#7 messages as module signatures because: (1) We have to be able to support the use of X.509 certificates that don't have a subjKeyId set. We're currently relying on this to look up the X.509 certificate in the trusted keyring list. (2) PKCS#7 message signed information blocks have a field that supplies the data required to match with the X.509 certificate that signed it. (3) The PKCS#7 certificate carries fields that specify the digest algorithm used to generate the signature in a standardised way and the X.509 certificates specify the public key algorithm in a standardised way - so we don't need our own methods of specifying these. (4) We now have PKCS#7 message support in the kernel for signed kexec purposes and we can make use of this. To make this work, the old sign-file script has been replaced with a program that needs compiling in a previous patch. The rules to build it are added here. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Vivek Goyal <vgoyal@redhat.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2015-09-051-1/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "In this one: - d_move fixes (Eric Biederman) - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error handling ought to be fixed) - switch of sb_writers to percpu rwsem (Oleg Nesterov) - superblock scalability (Josef Bacik and Dave Chinner) - swapon(2) race fix (Hugh Dickins)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits) vfs: Test for and handle paths that are unreachable from their mnt_root dcache: Reduce the scope of i_lock in d_splice_alias dcache: Handle escaped paths in prepend_path mm: fix potential data race in SyS_swapon inode: don't softlockup when evicting inodes inode: rename i_wb_list to i_io_list sync: serialise per-superblock sync operations inode: convert inode_sb_list_lock to per-sb inode: add hlist_fake to avoid the inode hash lock in evict writeback: plug writeback at a high level change sb_writers to use percpu_rw_semaphore shift percpu_counter_destroy() into destroy_super_work() percpu-rwsem: kill CONFIG_PERCPU_RWSEM percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() percpu-rwsem: introduce percpu_down_read_trylock() document rwsem_release() in sb_wait_write() fix the broken lockdep logic in __sb_start_write() introduce __sb_writers_{acquired,release}() helpers ufs_inode_get{frag,block}(): get rid of 'phys' argument ufs_getfrag_block(): tidy up a bit ...
| * | percpu-rwsem: kill CONFIG_PERCPU_RWSEMOleg Nesterov2015-08-151-1/+0
| | | | | | | | | | | | | | | | | | | | | Remove CONFIG_PERCPU_RWSEM, the next patch adds the unconditional user of percpu_rw_semaphore. Signed-off-by: Oleg Nesterov <oleg@redhat.com>
* | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2015-09-051-0/+18
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge patch-bomb from Andrew Morton: - a few misc things - Andy's "ambient capabilities" - fs/nofity updates - the ocfs2 queue - kernel/watchdog.c updates and feature work. - some of MM. Includes Andrea's userfaultfd feature. [ Hadn't noticed that userfaultfd was 'default y' when applying the patches, so that got fixed in this merge instead. We do _not_ mark new features that nobody uses yet 'default y' - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits) mm/hugetlb.c: make vma_has_reserves() return bool mm/madvise.c: make madvise_behaviour_valid() return bool mm/memory.c: make tlb_next_batch() return bool mm/dmapool.c: change is_page_busy() return from int to bool mm: remove struct node_active_region mremap: simplify the "overlap" check in mremap_to() mremap: don't do uneccesary checks if new_len == old_len mremap: don't do mm_populate(new_addr) on failure mm: move ->mremap() from file_operations to vm_operations_struct mremap: don't leak new_vma if f_op->mremap() fails mm/hugetlb.c: make vma_shareable() return bool mm: make GUP handle pfn mapping unless FOLL_GET is requested mm: fix status code which move_pages() returns for zero page mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout() genalloc: add support of multiple gen_pools per device genalloc: add name arg to gen_pool_get() and devm_gen_pool_create() mm/memblock: WARN_ON when nid differs from overlap region Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap mm: defer flush of writable TLB entries mm: send one IPI per CPU to TLB flush all entries after unmapping pages ...
| * | | mm: send one IPI per CPU to TLB flush all entries after unmapping pagesMel Gorman2015-09-041-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An IPI is sent to flush remote TLBs when a page is unmapped that was potentially accesssed by other CPUs. There are many circumstances where this happens but the obvious one is kswapd reclaiming pages belonging to a running process as kswapd and the task are likely running on separate CPUs. On small machines, this is not a significant problem but as machine gets larger with more cores and more memory, the cost of these IPIs can be high. This patch uses a simple structure that tracks CPUs that potentially have TLB entries for pages being unmapped. When the unmapping is complete, the full TLB is flushed on the assumption that a refill cost is lower than flushing individual entries. Architectures wishing to do this must give the following guarantee. If a clean page is unmapped and not immediately flushed, the architecture must guarantee that a write to that linear address from a CPU with a cached TLB entry will trap a page fault. This is essentially what the kernel already depends on but the window is much larger with this patch applied and is worth highlighting. The architecture should consider whether the cost of the full TLB flush is higher than sending an IPI to flush each individual entry. An additional architecture helper called flush_tlb_local is required. It's a trivial wrapper with some accounting in the x86 case. The impact of this patch depends on the workload as measuring any benefit requires both mapped pages co-located on the LRU and memory pressure. The case with the biggest impact is multiple processes reading mapped pages taken from the vm-scalability test suite. The test case uses NR_CPU readers of mapped files that consume 10*RAM. Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%) Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%) Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 581.00 611.43 System 5804.93 4111.76 Elapsed 161.03 122.12 This is showing that the readers completed 24.40% faster with 29% less system CPU time. From vmstats, it is known that the vanilla kernel was interrupted roughly 900K times per second during the steady phase of the test and the patched kernel was interrupts 180K times per second. The impact is lower on a single socket machine. 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%) Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%) Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 58.09 57.64 System 111.82 76.56 Elapsed 27.29 22.55 It's still a noticeable improvement with vmstat showing interrupts went from roughly 500K per second to 45K per second. The patch will have no impact on workloads with no memory pressure or have relatively few mapped pages. It will have an unpredictable impact on the workload running on the CPU being flushed as it'll depend on how many TLB entries need to be refilled and how long that takes. Worst case, the TLB will be completely cleared of active entries when the target PFNs were not resident at all. [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | userfaultfd: buildsystem activationAndrea Arcangeli2015-09-041-0/+11
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows to select the userfaultfd during configuration to build it. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for-4.3' of ↵Linus Torvalds2015-09-021-0/+16
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - a new PIDs controller is added. It turns out that PIDs are actually an independent resource from kmem due to the limited PID space. - more core preparations for the v2 interface. Once cpu side interface is settled, it should be ready for lifting the devel mask. for-4.3-unified-base was temporarily branched so that other trees (block) can pull cgroup core changes that blkcg changes depend on. - a non-critical idr_preload usage bug fix. * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: pids: fix invalid get/put usage cgroup: introduce cgroup_subsys->legacy_name cgroup: don't print subsystems for the default hierarchy cgroup: make cftype->private a unsigned long cgroup: export cgrp_dfl_root cgroup: define controller file conventions cgroup: fix idr_preload usage cgroup: add documentation for the PIDs controller cgroup: implement the PIDs subsystem cgroup: allow a cgroup subsystem to reject a fork
| * | | cgroup: implement the PIDs subsystemAleksa Sarai2015-07-141-0/+16
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds a new single-purpose PIDs subsystem to limit the number of tasks that can be forked inside a cgroup. Essentially this is an implementation of RLIMIT_NPROC that applies to a cgroup rather than a process tree. However, it should be noted that organisational operations (adding and removing tasks from a PIDs hierarchy) will *not* be prevented. Rather, the number of tasks in the hierarchy cannot exceed the limit through forking. This is due to the fact that, in the unified hierarchy, attach cannot fail (and it is not possible for a task to overcome its PIDs cgroup policy limit by attaching to a child cgroup -- even if migrating mid-fork it must be able to fork in the parent first). PIDs are fundamentally a global resource, and it is possible to reach PID exhaustion inside a cgroup without hitting any reasonable kmemcg policy. Once you've hit PID exhaustion, you're only in a marginally better state than OOM. This subsystem allows PID exhaustion inside a cgroup to be prevented. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Tejun Heo <tj@kernel.org>