summaryrefslogtreecommitdiffstats
path: root/security
Commit message (Collapse)AuthorAgeFilesLines
* KEYS: Add a 'trusted' flag and a 'trusted only' flagDavid Howells2013-09-252-0/+12
| | | | | | | | | | | | Add KEY_FLAG_TRUSTED to indicate that a key either comes from a trusted source or had a cryptographic signature chain that led back to a trusted key the kernel already possessed. Add KEY_FLAGS_TRUSTED_ONLY to indicate that a keyring will only accept links to keys marked with KEY_FLAGS_TRUSTED. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org>
* KEYS: Add per-user_namespace registers for persistent per-UID kerberos cachesDavid Howells2013-09-247-0/+213
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for per-user_namespace registers of persistent per-UID kerberos caches held within the kernel. This allows the kerberos cache to be retained beyond the life of all a user's processes so that the user's cron jobs can work. The kerberos cache is envisioned as a keyring/key tree looking something like: struct user_namespace \___ .krb_cache keyring - The register \___ _krb.0 keyring - Root's Kerberos cache \___ _krb.5000 keyring - User 5000's Kerberos cache \___ _krb.5001 keyring - User 5001's Kerberos cache \___ tkt785 big_key - A ccache blob \___ tkt12345 big_key - Another ccache blob Or possibly: struct user_namespace \___ .krb_cache keyring - The register \___ _krb.0 keyring - Root's Kerberos cache \___ _krb.5000 keyring - User 5000's Kerberos cache \___ _krb.5001 keyring - User 5001's Kerberos cache \___ tkt785 keyring - A ccache \___ krbtgt/REDHAT.COM@REDHAT.COM big_key \___ http/REDHAT.COM@REDHAT.COM user \___ afs/REDHAT.COM@REDHAT.COM user \___ nfs/REDHAT.COM@REDHAT.COM user \___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key \___ http/KERNEL.ORG@KERNEL.ORG big_key What goes into a particular Kerberos cache is entirely up to userspace. Kernel support is limited to giving you the Kerberos cache keyring that you want. The user asks for their Kerberos cache by: krb_cache = keyctl_get_krbcache(uid, dest_keyring); The uid is -1 or the user's own UID for the user's own cache or the uid of some other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to mess with the cache. The cache returned is a keyring named "_krb.<uid>" that the possessor can read, search, clear, invalidate, unlink from and add links to. Active LSMs get a chance to rule on whether the caller is permitted to make a link. Each uid's cache keyring is created when it first accessed and is given a timeout that is extended each time this function is called so that the keyring goes away after a while. The timeout is configurable by sysctl but defaults to three days. Each user_namespace struct gets a lazily-created keyring that serves as the register. The cache keyrings are added to it. This means that standard key search and garbage collection facilities are available. The user_namespace struct's register goes away when it does and anything left in it is then automatically gc'd. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Simo Sorce <simo@redhat.com> cc: Serge E. Hallyn <serge.hallyn@ubuntu.com> cc: Eric W. Biederman <ebiederm@xmission.com>
* KEYS: Implement a big key type that can save to tmpfsDavid Howells2013-09-243-0/+216
| | | | | | | | Implement a big key type that can save its contents to tmpfs and thus swapspace when memory is tight. This is useful for Kerberos ticket caches. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Simo Sorce <simo@redhat.com>
* KEYS: Expand the capacity of a keyringDavid Howells2013-09-246-742/+792
| | | | | | | | | | | | | | | | | | | | Expand the capacity of a keyring to be able to hold a lot more keys by using the previously added associative array implementation. Currently the maximum capacity is: (PAGE_SIZE - sizeof(header)) / sizeof(struct key *) which, on a 64-bit system, is a little more 500. However, since this is being used for the NFS uid mapper, we need more than that. The new implementation gives us effectively unlimited capacity. With some alterations, the keyutils testsuite runs successfully to completion after this patch is applied. The alterations are because (a) keyrings that are simply added to no longer appear ordered and (b) some of the errors have changed a bit. Signed-off-by: David Howells <dhowells@redhat.com>
* KEYS: Drop the permissions argument from __keyring_search_one()David Howells2013-09-243-9/+5
| | | | | | | Drop the permissions argument from __keyring_search_one() as the only caller passes 0 here - which causes all checks to be skipped. Signed-off-by: David Howells <dhowells@redhat.com>
* KEYS: Define a __key_get() wrapper to use rather than atomic_inc()David Howells2013-09-243-12/+12
| | | | | | | Define a __key_get() wrapper to use rather than atomic_inc() on the key usage count as this makes it easier to hook in refcount error debugging. Signed-off-by: David Howells <dhowells@redhat.com>
* KEYS: Search for auth-key by name rather than target key IDDavid Howells2013-09-241-14/+7
| | | | | | | | Search for auth-key by name rather than by target key ID as, in a future patch, we'll by searching directly by index key in preference to iteration over all keys. Signed-off-by: David Howells <dhowells@redhat.com>
* KEYS: Introduce a search context structureDavid Howells2013-09-247-158/+174
| | | | | | | | | | | | | | | | | | | Search functions pass around a bunch of arguments, each of which gets copied with each call. Introduce a search context structure to hold these. Whilst we're at it, create a search flag that indicates whether the search should be directly to the description or whether it should iterate through all keys looking for a non-description match. This will be useful when keyrings use a generic data struct with generic routines to manage their content as the search terms can just be passed through to the iterator callback function. Also, for future use, the data to be supplied to the match function is separated from the description pointer in the search context. This makes it clear which is being supplied. Signed-off-by: David Howells <dhowells@redhat.com>
* KEYS: Consolidate the concept of an 'index key' for key accessDavid Howells2013-09-244-62/+67
| | | | | | | | | | | Consolidate the concept of an 'index key' for accessing keys. The index key is the search term needed to find a key directly - basically the key type and the key description. We can add to that the description length. This will be useful when turning a keyring into an associative array rather than just a pointer block. Signed-off-by: David Howells <dhowells@redhat.com>
* KEYS: key_is_dead() should take a const key pointer argumentDavid Howells2013-09-241-1/+1
| | | | | | | key_is_dead() should take a const key pointer argument as it doesn't modify what it points to. Signed-off-by: David Howells <dhowells@redhat.com>
* KEYS: Use bool in make_key_ref() and is_key_possessed()David Howells2013-09-241-2/+3
| | | | | | | Make make_key_ref() take a bool possession parameter and make is_key_possessed() return a bool. Signed-off-by: David Howells <dhowells@redhat.com>
* KEYS: Skip key state checks when checking for possessionDavid Howells2013-09-244-6/+11
| | | | | | | | | | | | | | | | | Skip key state checks (invalidation, revocation and expiration) when checking for possession. Without this, keys that have been marked invalid, revoked keys and expired keys are not given a possession attribute - which means the possessor is not granted any possession permits and cannot do anything with them unless they also have one a user, group or other permit. This causes failures in the keyutils test suite's revocation and expiration tests now that commit 96b5c8fea6c0861621051290d705ec2e971963f1 reduced the initial permissions granted to a key. The failures are due to accesses to revoked and expired keys being given EACCES instead of EKEYREVOKED or EKEYEXPIRED. Signed-off-by: David Howells <dhowells@redhat.com>
* security: remove erroneous comment about capabilities.o link orderingEric Paris2013-09-241-1/+0
| | | | | | | | | | | | | Back when we had half ass LSM stacking we had to link capabilities.o after bigger LSMs so that on initialization the bigger LSM would register first and the capabilities module would be the one stacked as the 'seconday'. Somewhere around 6f0f0fd496333777d53 (back in 2008) we finally removed the last of the kinda module stacking code but this comment in the makefile still lives today. Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-09-071-5/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace changes from Eric Biederman: "This is an assorted mishmash of small cleanups, enhancements and bug fixes. The major theme is user namespace mount restrictions. nsown_capable is killed as it encourages not thinking about details that need to be considered. A very hard to hit pid namespace exiting bug was finally tracked and fixed. A couple of cleanups to the basic namespace infrastructure. Finally there is an enhancement that makes per user namespace capabilities usable as capabilities, and an enhancement that allows the per userns root to nice other processes in the user namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Kill nsown_capable it makes the wrong thing easy capabilities: allow nice if we are privileged pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD userns: Allow PR_CAPBSET_DROP in a user namespace. namespaces: Simplify copy_namespaces so it is clear what is going on. pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup sysfs: Restrict mounting sysfs userns: Better restrictions on when proc and sysfs can be mounted vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces kernel/nsproxy.c: Improving a snippet of code. proc: Restrict mounting the proc filesystem vfs: Lock in place mounts from more privileged users
| * capabilities: allow nice if we are privilegedSerge Hallyn2013-08-301-4/+4
| | | | | | | | | | | | | | | | | | | | | | We allow task A to change B's nice level if it has a supserset of B's privileges, or of it has CAP_SYS_NICE. Also allow it if A has CAP_SYS_NICE with respect to B - meaning it is root in the same namespace, or it created B's namespace. Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * userns: Allow PR_CAPBSET_DROP in a user namespace.Eric W. Biederman2013-08-301-1/+1
| | | | | | | | | | | | | | | | | | | | As the capabilites and capability bounding set are per user namespace properties it is safe to allow changing them with just CAP_SETPCAP permission in the user namespace. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Tested-by: Richard Weinberger <richard@nod.at> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | Merge branch 'next' of ↵Linus Torvalds2013-09-0728-547/+1666
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Nothing major for this kernel, just maintenance updates" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (21 commits) apparmor: add the ability to report a sha1 hash of loaded policy apparmor: export set of capabilities supported by the apparmor module apparmor: add the profile introspection file to interface apparmor: add an optional profile attachment string for profiles apparmor: add interface files for profiles and namespaces apparmor: allow setting any profile into the unconfined state apparmor: make free_profile available outside of policy.c apparmor: rework namespace free path apparmor: update how unconfined is handled apparmor: change how profile replacement update is done apparmor: convert profile lists to RCU based locking apparmor: provide base for multiple profiles to be replaced at once apparmor: add a features/policy dir to interface apparmor: enable users to query whether apparmor is enabled apparmor: remove minimum size check for vmalloc() Smack: parse multiple rules per write to load2, up to PAGE_SIZE-1 bytes Smack: network label match fix security: smack: add a hash table to quicken smk_find_entry() security: smack: fix memleak in smk_write_rules_list() xattr: Constify ->name member of "struct xattr". ...
| * \ Merge branch 'smack-for-3.12' of git://git.gitorious.org/smack-next/kernel ↵James Morris2013-08-234-114/+150
| |\ \ | | | | | | | | | | | | into ra-next
| | * | Smack: parse multiple rules per write to load2, up to PAGE_SIZE-1 bytesRafal Krypa2013-08-121-85/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Smack interface for loading rules has always parsed only single rule from data written to it. This requires user program to call one write() per each rule it wants to load. This change makes it possible to write multiple rules, separated by new line character. Smack will load at most PAGE_SIZE-1 characters and properly return number of processed bytes. In case when user buffer is larger, it will be additionally truncated. All characters after last \n will not get parsed to avoid partial rule near input buffer boundary. Signed-off-by: Rafal Krypa <r.krypa@samsung.com>
| | * | Smack: network label match fixCasey Schaufler2013-08-013-9/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Smack code that matches incoming CIPSO tags with Smack labels reaches through the NetLabel interfaces and compares the network data with the CIPSO header associated with a Smack label. This was done in a ill advised attempt to optimize performance. It works so long as the categories fit in a single capset, but this isn't always the case. This patch changes the Smack code to use the appropriate NetLabel interfaces to compare the incoming CIPSO header with the CIPSO header associated with a label. It will always match the CIPSO headers correctly. Targeted for git://git.gitorious.org/smack-next/kernel.git Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
| | * | security: smack: add a hash table to quicken smk_find_entry()Tomasz Stanislawski2013-08-013-9/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Accepted for the smack-next tree after changing the number of slots from 128 to 16. This patch adds a hash table to quicken searching of a smack label by its name. Basically, the patch improves performance of SMACK initialization. Parsing of rules involves translation from a string to a smack_known (aka label) entity which is done in smk_find_entry(). The current implementation of the function iterates over a global list of smack_known resulting in O(N) complexity for smk_find_entry(). The total complexity of SMACK initialization becomes O(rules * labels). Therefore it scales quadratically with a complexity of a system. Applying the patch reduced the complexity of smk_find_entry() to O(1) as long as number of label is in hundreds. If the number of labels is increased please update SMACK_HASH_SLOTS constant defined in security/smack/smack.h. Introducing the configuration of this constant with Kconfig or cmdline might be a good idea. The size of the hash table was adjusted experimentally. The rule set used by TIZEN contains circa 17K rules for 500 labels. The table above contains results of SMACK initialization using 'time smackctl apply' bash command. The 'Ref' is a kernel without this patch applied. The consecutive values refers to value of SMACK_HASH_SLOTS. Every measurement was repeated three times to reduce noise. | Ref | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 -------------------------------------------------------------------------------------------- Run1 | 1.156 | 1.096 | 0.883 | 0.764 | 0.692 | 0.667 | 0.649 | 0.633 | 0.634 | 0.629 | 0.620 Run2 | 1.156 | 1.111 | 0.885 | 0.764 | 0.694 | 0.661 | 0.649 | 0.651 | 0.634 | 0.638 | 0.623 Run3 | 1.160 | 1.107 | 0.886 | 0.764 | 0.694 | 0.671 | 0.661 | 0.638 | 0.631 | 0.624 | 0.638 AVG | 1.157 | 1.105 | 0.885 | 0.764 | 0.693 | 0.666 | 0.653 | 0.641 | 0.633 | 0.630 | 0.627 Surprisingly, a single hlist is slightly faster than a double-linked list. The speed-up saturates near 64 slots. Therefore I chose value 128 to provide some margin if more labels were used. It looks that IO becomes a new bottleneck. Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
| | * | security: smack: fix memleak in smk_write_rules_list()Tomasz Stanislawski2013-08-011-22/+11
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | The smack_parsed_rule structure is allocated. If a rule is successfully installed then the last reference to the object is lost. This patch fixes this leak. Moreover smack_parsed_rule is allocated on stack because it no longer needed ofter smk_write_rules_list() is finished. Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
| * | apparmor: add the ability to report a sha1 hash of loaded policyJohn Johansen2013-08-148-6/+199
| | | | | | | | | | | | | | | | | | | | | | | | | | | Provide userspace the ability to introspect a sha1 hash value for each profile currently loaded. Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Seth Arnold <seth.arnold@canonical.com>
| * | apparmor: export set of capabilities supported by the apparmor moduleJohn Johansen2013-08-144-1/+15
| | | | | | | | | | | | | | | | | | Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Seth Arnold <seth.arnold@canonical.com>
| * | apparmor: add the profile introspection file to interfaceJohn Johansen2013-08-141-0/+236
| | | | | | | | | | | | | | | | | | | | | | | | Add the dynamic namespace relative profiles file to the interace, to allow introspection of loaded profiles and their modes. Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Kees Cook <kees@ubuntu.com>
| * | apparmor: add an optional profile attachment string for profilesJohn Johansen2013-08-144-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add the ability to take in and report a human readable profile attachment string for profiles so that attachment specifications can be easily inspected. Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Seth Arnold <seth.arnold@canonical.com>
| * | apparmor: add interface files for profiles and namespacesJohn Johansen2013-08-147-29/+436
| | | | | | | | | | | | | | | | | | | | | | | | Add basic interface files to access namespace and profile information. The interface files are created when a profile is loaded and removed when the profile or namespace is removed. Signed-off-by: John Johansen <john.johansen@canonical.com>
| * | apparmor: allow setting any profile into the unconfined stateJohn Johansen2013-08-145-9/+22
| | | | | | | | | | | | | | | | | | | | | | | | Allow emulating the default profile behavior from boot, by allowing loading of a profile in the unconfined state into a new NS. Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Seth Arnold <seth.arnold@canonical.com>
| * | apparmor: make free_profile available outside of policy.cJohn Johansen2013-08-143-7/+7
| | | | | | | | | | | | Signed-off-by: John Johansen <john.johansen@canonical.com>
| * | apparmor: rework namespace free pathJohn Johansen2013-08-142-35/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | namespaces now completely use the unconfined profile to track the refcount and rcu freeing cycle. So rework the code to simplify (track everything through the profile path right up to the end), and move the rcu_head from policy base to profile as the namespace no longer needs it. Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Seth Arnold <seth.arnold@canonical.com>
| * | apparmor: update how unconfined is handledJohn Johansen2013-08-143-83/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ns->unconfined is being used read side without locking, nor rcu but is being updated when a namespace is removed. This works for the root ns which is never removed but has a race window and can cause failures when children namespaces are removed. Also ns and ns->unconfined have a circular refcounting dependency that is problematic and must be broken. Currently this is done incorrectly when the namespace is destroyed. Fix this by forward referencing unconfined via the replacedby infrastructure instead of directly updating the ns->unconfined pointer. Remove the circular refcount dependency by making the ns and its unconfined profile share the same refcount. Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Seth Arnold <seth.arnold@canonical.com>
| * | apparmor: change how profile replacement update is doneJohn Johansen2013-08-146-87/+125
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | remove the use of replaced by chaining and move to profile invalidation and lookup to handle task replacement. Replacement chaining can result in large chains of profiles being pinned in memory when one profile in the chain is use. With implicit labeling this will be even more of a problem, so move to a direct lookup method. Signed-off-by: John Johansen <john.johansen@canonical.com>
| * | apparmor: convert profile lists to RCU based lockingJohn Johansen2013-08-144-111/+167
| | | | | | | | | | | | Signed-off-by: John Johansen <john.johansen@canonical.com>
| * | apparmor: provide base for multiple profiles to be replaced at onceJohn Johansen2013-08-144-146/+283
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | previously profiles had to be loaded one at a time, which could result in cases where a replacement of a set would partially succeed, and then fail resulting in inconsistent policy. Allow multiple profiles to replaced "atomically" so that the replacement either succeeds or fails for the entire set of profiles. Signed-off-by: John Johansen <john.johansen@canonical.com>
| * | apparmor: add a features/policy dir to interfaceJohn Johansen2013-08-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add a policy directory to features to contain features that can affect policy compilation but do not affect mediation. Eg of such features would be types of dfa compression supported, etc. Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Kees Cook <kees@ubuntu.com>
| * | apparmor: enable users to query whether apparmor is enabledJohn Johansen2013-08-141-1/+1
| | | | | | | | | | | | Signed-off-by: John Johansen <john.johansen@canonical.com>
| * | apparmor: remove minimum size check for vmalloc()Tetsuo Handa2013-08-141-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a follow-up to commit b5b3ee6c "apparmor: no need to delay vfree()". Since vmalloc() will do "size = PAGE_ALIGN(size);", we don't need to check for "size >= sizeof(struct work_struct)". Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: John Johansen <john.johansen@canonical.com>
| * | xattr: Constify ->name member of "struct xattr".Tetsuo Handa2013-07-255-24/+14
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since everybody sets kstrdup()ed constant string to "struct xattr"->name but nobody modifies "struct xattr"->name , we can omit kstrdup() and its failure checking by constifying ->name member of "struct xattr". Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Joel Becker <jlbec@evilplan.org> [ocfs2] Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Tested-by: Paul Moore <paul@paul-moore.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2013-09-051-1/+6
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking changes from David Miller: "Noteworthy changes this time around: 1) Multicast rejoin support for team driver, from Jiri Pirko. 2) Centralize and simplify TCP RTT measurement handling in order to reduce the impact of bad RTO seeding from SYN/ACKs. Also, when both timestamps and local RTT measurements are available prefer the later because there are broken middleware devices which scramble the timestamp. From Yuchung Cheng. 3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel memory consumed to queue up unsend user data. From Eric Dumazet. 4) Add a "physical port ID" abstraction for network devices, from Jiri Pirko. 5) Add a "suppress" operation to influence fib_rules lookups, from Stefan Tomanek. 6) Add a networking development FAQ, from Paul Gortmaker. 7) Extend the information provided by tcp_probe and add ipv6 support, from Daniel Borkmann. 8) Use RCU locking more extensively in openvswitch data paths, from Pravin B Shelar. 9) Add SCTP support to openvswitch, from Joe Stringer. 10) Add EF10 chip support to SFC driver, from Ben Hutchings. 11) Add new SYNPROXY netfilter target, from Patrick McHardy. 12) Compute a rate approximation for sending in TCP sockets, and use this to more intelligently coalesce TSO frames. Furthermore, add a new packet scheduler which takes advantage of this estimate when available. From Eric Dumazet. 13) Allow AF_PACKET fanouts with random selection, from Daniel Borkmann. 14) Add ipv6 support to vxlan driver, from Cong Wang" Resolved conflicts as per discussion. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits) openvswitch: Fix alignment of struct sw_flow_key. netfilter: Fix build errors with xt_socket.c tcp: Add missing braces to do_tcp_setsockopt caif: Add missing braces to multiline if in cfctrl_linkup_request bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize vxlan: Fix kernel panic on device delete. net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls net: mvneta: properly disable HW PHY polling and ensure adjust_link() works icplus: Use netif_running to determine device state ethernet/arc/arc_emac: Fix huge delays in large file copies tuntap: orphan frags before trying to set tx timestamp tuntap: purge socket error queue on detach qlcnic: use standard NAPI weights ipv6:introduce function to find route for redirect bnx2x: VF RSS support - VF side bnx2x: VF RSS support - PF side vxlan: Notify drivers for listening UDP port changes net: usbnet: update addr_assign_type if appropriate driver/net: enic: update enic maintainers and driver driver/net: enic: Exposing symbols for Cisco's low latency driver ...
| * \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-08-161-13/+11
| |\ \
| * | | net: split rt_genid for ipv4 and ipv6fan.du2013-07-311-1/+6
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current net name space has only one genid for both IPv4 and IPv6, it has below drawbacks: - Add/delete an IPv4 address will invalidate all IPv6 routing table entries. - Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table entries even when the policy is only applied for one address family. Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6 separately in a fine granularity. Signed-off-by: Fan Du <fan.du@windriver.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge tag 'modules-next-for-linus' of ↵Linus Torvalds2013-09-041-0/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "Minor fixes mainly, including a potential use-after-free on remove found by CONFIG_DEBUG_KOBJECT_RELEASE which may be theoretical" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: module: Fix mod->mkobj.kobj potentially freed too early kernel/params.c: use scnprintf() instead of sprintf() kernel/module.c: use scnprintf() instead of sprintf() module/lsm: Have apparmor module parameters work with no args module: Add NOARG flag for ops with param_set_bool_enable_only() set function module: Add flag to allow mod params to have no arguments modules: add support for soft module dependencies scripts/mod/modpost.c: permit '.cranges' secton for sh64 architecture. module: fix sprintf format specifier in param_get_byte()
| * | | module/lsm: Have apparmor module parameters work with no argsSteven Rostedt2013-08-201-0/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The apparmor module parameters for param_ops_aabool and param_ops_aalockpolicy are both based off of the param_ops_bool, and can handle a NULL value passed in as val. Have it enable the new KERNEL_PARAM_FL_NOARGS flag to allow the parameters to be set without having to state "=y" or "=1". Cc: John Johansen <john.johansen@canonical.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* | | Merge branch 'for-3.12' of ↵Linus Torvalds2013-09-031-39/+26
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on the cgroup front. Most changes aren't visible to userland at all at this point and are laying foundation for the planned unified hierarchy. - The biggest change is decoupling the lifetime management of css (cgroup_subsys_state) from that of cgroup's. Because controllers (cpu, memory, block and so on) will need to be dynamically enabled and disabled, css which is the association point between a cgroup and a controller may come and go dynamically across the lifetime of a cgroup. Till now, css's were created when the associated cgroup was created and stayed till the cgroup got destroyed. Assumptions around this tight coupling permeated through cgroup core and controllers. These assumptions are gradually removed, which consists bulk of patches, and css destruction path is completely decoupled from cgroup destruction path. Note that decoupling of creation path is relatively easy on top of these changes and the patchset is pending for the next window. - cgroup has its own event mechanism cgroup.event_control, which is only used by memcg. It is overly complex trying to achieve high flexibility whose benefits seem dubious at best. Going forward, new events will simply generate file modified event and the existing mechanism is being made specific to memcg. This pull request contains prepatory patches for such change. - Various fixes and cleanups" Fixed up conflict in kernel/cgroup.c as per Tejun. * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits) cgroup: fix cgroup_css() invocation in css_from_id() cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp() cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup cgroup: implement CFTYPE_NO_PREFIX cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax cgroup: fix cgroup_write_event_control() cgroup: fix subsystem file accesses on the root cgroup cgroup: change cgroup_from_id() to css_from_id() cgroup: use css_get() in cgroup_create() to check CSS_ROOT cpuset: remove an unncessary forward declaration cgroup: RCU protect each cgroup_subsys_state release cgroup: move subsys file removal to kill_css() cgroup: factor out kill_css() cgroup: decouple cgroup_subsys_state destruction from cgroup destruction cgroup: replace cgroup->css_kill_cnt with ->nr_css cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item cgroup: move cgroup->subsys[] assignment to online_css() cgroup: reorganize css init / exit paths cgroup: add __rcu modifier to cgroup->subsys[] ...
| * | cgroup: make css_for_each_descendant() and friends include the origin css in ↵Tejun Heo2013-08-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
| * | cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroupTejun Heo2013-08-081-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup is currently in the process of transitioning to using css (cgroup_subsys_state) as the primary handle instead of cgroup in subsystem API. For hierarchy iterators, this is beneficial because * In most cases, css is the only thing subsystems care about anyway. * On the planned unified hierarchy, iterations for different subsystems will need to skip over different subtrees of the hierarchy depending on which subsystems are enabled on each cgroup. Passing around css makes it unnecessary to explicitly specify the subsystem in question as css is intersection between cgroup and subsystem * For the planned unified hierarchy, css's would need to be created and destroyed dynamically independent from cgroup hierarchy. Having cgroup core manage css iteration makes enforcing deref rules a lot easier. Most subsystem conversions are straight-forward. Noteworthy changes are * blkio: cgroup_to_blkcg() is no longer used. Removed. * freezer: cgroup_freezer() is no longer used. Removed. * devices: cgroup_to_devcgroup() is no longer used. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk>
| * | cgroup: pass around cgroup_subsys_state instead of cgroup in file methodsTejun Heo2013-08-081-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
| * | cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methodsTejun Heo2013-08-081-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup * in subsystem implementations for the following reasons. * With unified hierarchy, subsystems will be dynamically bound and unbound from cgroups and thus css's (cgroup_subsys_state) may be created and destroyed dynamically over the lifetime of a cgroup, which is different from the current state where all css's are allocated and destroyed together with the associated cgroup. This in turn means that cgroup_css() should be synchronized and may return NULL, making it more cumbersome to use. * Differing levels of per-subsystem granularity in the unified hierarchy means that the task and descendant iterators should behave differently depending on the specific subsystem the iteration is being performed for. * In majority of the cases, subsystems only care about its part in the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods often obtain the matching css pointer from the cgroup and don't bother with the cgroup pointer itself. Passing around css fits much better. This patch converts all cgroup_subsys methods to take @css instead of @cgroup. The conversions are mostly straight-forward. A few noteworthy changes are * ->css_alloc() now takes css of the parent cgroup rather than the pointer to the new cgroup as the css for the new cgroup doesn't exist yet. Knowing the parent css is enough for all the existing subsystems. * In kernel/cgroup.c::offline_css(), unnecessary open coded css dereference is replaced with local variable access. This patch shouldn't cause any behavior differences. v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced with local variable @css as suggested by Li Zefan. Rebased on top of new for-3.12 which includes for-3.11-fixes so that ->css_free() invocation added by da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails") is converted too. Suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
| * | cgroup: add css_parent()Tejun Heo2013-08-081-13/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, controllers have to explicitly follow the cgroup hierarchy to find the parent of a given css. cgroup is moving towards using cgroup_subsys_state as the main controller interface construct, so let's provide a way to climb the hierarchy using just csses. This patch implements css_parent() which, given a css, returns its parent. The function is guarnateed to valid non-NULL parent css as long as the target css is not at the top of the hierarchy. freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices are converted to use css_parent() instead of accessing cgroup->parent directly. * __parent_ca() is dropped from cpuacct and its usage is replaced with parent_ca(). The only difference between the two was NULL test on cgroup->parent which is now embedded in css_parent() making the distinction moot. Note that eventually a css->parent field will be added to css and the NULL check in css_parent() will go away. This patch shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
| * | cgroup: add/update accessors which obtain subsys specific data from cssTejun Heo2013-08-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | css (cgroup_subsys_state) is usually embedded in a subsys specific data structure. Subsystems either use container_of() directly to cast from css to such data structure or has an accessor function wrapping such cast. As cgroup as whole is moving towards using css as the main interface handle, add and update such accessors to ease dealing with css's. All accessors explicitly handle NULL input and return NULL in those cases. While this looks like an extra branch in the code, as all controllers specific data structures have css as the first field, the casting doesn't involve any offsetting and the compiler can trivially optimize out the branch. * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such accessor. Added. * memory, hugetlb and devices already had one but didn't explicitly handle NULL input. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>