linux.git - Linux kernel mainline tree

	Commit message (Collapse)	Author	Age	Files	Lines
*	rxrpc: Remove skb_count from struct rxrpc_call	David Howells	2016-09-08	2	-23/+12
\| \| \| \| \| \| \|	Remove the sk_buff count from the rxrpc_call struct as it's less useful once we stop queueing sk_buffs. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Convert rxrpc_local::services to an hlist	David Howells	2016-09-08	5	-11/+11
\| \| \| \| \| \| \|	Convert the rxrpc_local::services list to an hlist so that it can be accessed under RCU conditions more readily. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Fix ASSERTCMP and ASSERTIFCMP to handle signed values	David Howells	2016-09-08	1	-11/+13
\| \| \| \| \| \| \| \| \| \| \| \|	Fix ASSERTCMP and ASSERTIFCMP to be able to handle signed values by casting both parameters to the type of the first before comparing. Without this, both values are cast to unsigned long, which means that checks for values less than zero don't work. The downside of this is that the state enum values in struct rxrpc_call and struct rxrpc_connection can't be bitfields as __typeof__ can't handle them. Signed-off-by: David Howells <dhowells@redhat.com>
*	net: xfrm: Change u32 sysctl entries to use proc_douintvec	subashab@codeaurora.org	2016-09-07	1	-2/+2
\| \| \| \| \| \| \| \|	proc_dointvec limits the values to INT_MAX in u32 sysctl entries. proc_douintvec allows to write upto UINT_MAX. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: diag: make udp_diag_destroy work for mapped addresses.	Lorenzo Colitti	2016-09-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	udp_diag_destroy does look up the IPv4 UDP hashtable for mapped addresses, but it gets the IPv4 address to look up from the beginning of the IPv6 address instead of the end. Tested: https://android-review.googlesource.com/269874 Fixes: 5d77dca82839 ("net: diag: support SOCK_DESTROY for UDP sockets") Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	netlink: don't forget to release a rhashtable_iter structure	Andrey Vagin	2016-09-07	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This bug was detected by kmemleak: unreferenced object 0xffff8804269cc3c0 (size 64): comm "criu", pid 1042, jiffies 4294907360 (age 13.713s) hex dump (first 32 bytes): a0 32 cc 2c 04 88 ff ff 00 00 00 00 00 00 00 00 .2.,............ 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................ backtrace: [<ffffffff8184dffa>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8124720f>] kmem_cache_alloc_trace+0x10f/0x280 [<ffffffffa02864cc>] __netlink_diag_dump+0x26c/0x290 [netlink_diag] v2: don't remove a reference on a rhashtable_iter structure to release it from netlink_diag_dump_done Cc: Herbert Xu <herbert@gondor.apana.org.au> Fixes: ad202074320c ("netlink: Use rhashtable walk interface in diag dump") Signed-off-by: Andrei Vagin <avagin@openvz.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	rxrpc: Add tracepoint for working out where aborts happen	David Howells	2016-09-07	8	-90/+91
\| \| \| \| \| \| \| \| \| \| \|	Add a tracepoint for working out where local aborts happen. Each tracepoint call is labelled with a 3-letter code so that they can be distinguished - and the DATA sequence number is added too where available. rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can indicate the circumstances when it aborts a call. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Fix returns of call completion helpers	David Howells	2016-09-07	1	-5/+8
\| \| \| \| \| \| \| \| \| \|	rxrpc_set_call_completion() returns bool, not int, so the ret variable should match this. rxrpc_call_completed() and __rxrpc_call_completed() should return the value of rxrpc_set_call_completion(). Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Calls shouldn't hold socket refs	David Howells	2016-09-07	11	-279/+303
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	rxrpc calls shouldn't hold refs on the sock struct. This was done so that the socket wouldn't go away whilst the call was in progress, such that the call could reach the socket's queues. However, we can mark the socket as requiring an RCU release and rely on the RCU read lock. To make this work, we do: (1) rxrpc_release_call() removes the call's call user ID. This is now only called from socket operations and not from the call processor: rxrpc_accept_call() / rxrpc_kernel_accept_call() rxrpc_reject_call() / rxrpc_kernel_reject_call() rxrpc_kernel_end_call() rxrpc_release_calls_on_socket() rxrpc_recvmsg() Though it is also called in the cleanup path of rxrpc_accept_incoming_call() before we assign a user ID. (2) Pass the socket pointer into rxrpc_release_call() rather than getting it from the call so that we can get rid of uninitialised calls. (3) Fix call processor queueing to pass a ref to the work queue and to release that ref at the end of the processor function (or to pass it back to the work queue if we have to requeue). (4) Skip out of the call processor function asap if the call is complete and don't requeue it if the call is complete. (5) Clean up the call immediately that the refcount reaches 0 rather than trying to defer it. Actual deallocation is deferred to RCU, however. (6) Don't hold socket refs for allocated calls. (7) Use the RCU read lock when queueing a message on a socket and treat the call's socket pointer according to RCU rules and check it for NULL. We also need to use the RCU read lock when viewing a call through procfs. (8) Transmit the final ACK/ABORT to a client call in rxrpc_release_call() if this hasn't been done yet so that we can then disconnect the call. Once the call is disconnected, it won't have any access to the connection struct and the UDP socket for the call work processor to be able to send the ACK. Terminal retransmission will be handled by the connection processor. (9) Release all calls immediately on the closing of a socket rather than trying to defer this. Incomplete calls will be aborted. The call refcount model is much simplified. Refs are held on the call by: (1) A socket's user ID tree. (2) A socket's incoming call secureq and acceptq. (3) A kernel service that has a call in progress. (4) A queued call work processor. We have to take care to put any call that we failed to queue. (5) sk_buffs on a socket's receive queue. A future patch will get rid of this. Whilst we're at it, we can do: (1) Get rid of the RXRPC_CALL_EV_RELEASE event. Release is now done entirely from the socket routines and never from the call's processor. (2) Get rid of the RXRPC_CALL_DEAD state. Calls now end in the RXRPC_CALL_COMPLETE state. (3) Get rid of the rxrpc_call::destroyer work item. Calls are now torn down when their refcount reaches 0 and then handed over to RCU for final cleanup. (4) Get rid of the rxrpc_call::deadspan timer. Calls are cleaned up immediately they're finished with and don't hang around. Post-completion retransmission is handled by the connection processor once the call is disconnected. (5) Get rid of the dead call expiry setting as there's no longer a timer to set. (6) rxrpc_destroy_all_calls() can just check that the call list is empty. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Use rxrpc_is_service_call() rather than rxrpc_conn_is_service()	David Howells	2016-09-07	1	-2/+2
\| \| \| \| \| \| \|	Use rxrpc_is_service_call() rather than rxrpc_conn_is_service() if the call is available just in case call->conn is NULL. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Pass the connection pointer to rxrpc_post_packet_to_call()	David Howells	2016-09-07	1	-3/+4
\| \| \| \| \| \| \| \| \|	Pass the connection pointer to rxrpc_post_packet_to_call() as the call might get disconnected whilst we're looking at it, but the connection pointer determined by rxrpc_data_read() is guaranteed by RCU for the duration of the call. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Cache the security index in the rxrpc_call struct	David Howells	2016-09-07	5	-2/+7
\| \| \| \| \| \| \| \|	Cache the security index in the rxrpc_call struct so that we can get at it even when the call has been disconnected and the connection pointer cleared. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Use call->peer rather than call->conn->params.peer	David Howells	2016-09-07	1	-3/+5
\| \| \| \| \| \| \| \|	Use call->peer rather than call->conn->params.peer to avoid the possibility of call->conn being NULL and, whilst we're at it, check it for NULL before we access it. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Improve the call tracking tracepoint	David Howells	2016-09-07	9	-43/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Improve the call tracking tracepoint by showing more differentiation between some of the put and get events, including: (1) Getting and putting refs for the socket call user ID tree. (2) Getting and putting refs for queueing and failing to queue the call processor work item. Note that these aren't necessarily used in this patch, but will be taken advantage of in future patches. An enum is added for the event subtype numbers rather than coding them directly as decimal numbers and a table of 3-letter strings is provided rather than a sequence of ?: operators. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Delete unused rxrpc_kernel_free_skb()	David Howells	2016-09-07	1	-13/+0
\| \| \| \| \| \|	Delete rxrpc_kernel_free_skb() as it's unused. Signed-off-by: David Howells <dhowells@redhat.com>
*	rxrpc: Whitespace cleanup	David Howells	2016-09-07	1	-2/+1
\| \| \| \| \| \|	Remove some whitespace. Signed-off-by: David Howells <dhowells@redhat.com>
*	Merge tag 'rxrpc-rewrite-20160904-2' of ↵	David S. Miller	2016-09-06	5	-640/+660
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Split output code from sendmsg code Here's a set of small patches that split the packet transmission code from the sendmsg code and simply rearrange the new file to make it more logically laid out ready for being rewritten. An enum is also moved out of the header file to there as it's only used there. This needs to be applied on top of the just-posted fixes patch set. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	rxrpc Move enum rxrpc_command to sendmsg.c	David Howells	2016-09-04	2	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \|	Move enum rxrpc_command to sendmsg.c as it's now only used in that file. Signed-off-by: David Howells <dhowells@redhat.com>
\| *	rxrpc: Rearrange net/rxrpc/sendmsg.c	David Howells	2016-09-04	1	-281/+277
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rearrange net/rxrpc/sendmsg.c to be in a more logical order. This makes it easier to follow and eliminates forward declarations. Signed-off-by: David Howells <dhowells@redhat.com>
\| *	rxrpc: Split sendmsg from packet transmission code	David Howells	2016-09-04	5	-633/+657
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Split the sendmsg code from the packet transmission code (mostly to be found in output.c). Signed-off-by: David Howells <dhowells@redhat.com>
* \|	Merge tag 'rxrpc-rewrite-20160904-1' of ↵	David S. Miller	2016-09-06	4	-31/+23
\|\\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Small fixes Here's a set of small fix patches: (1) Fix some uninitialised variables. (2) Set the client call state before making it live by attaching it to the conn struct. (3) Randomise the epoch and starting client conn ID values, and don't change the epoch when the client conn ID rolls round. (4) Replace deprecated create_singlethread_workqueue() calls. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	rxrpc: Don't change the epoch	David Howells	2016-09-04	1	-24/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	It seems the local epoch should only be changed on boot, so remove the code that changes it for client connections. Signed-off-by: David Howells <dhowells@redhat.com>
\| *	rxrpc: Randomise epoch and starting client conn ID values	David Howells	2016-09-04	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Create a random epoch value rather than a time-based one on startup and set the top bit to indicate that this is the case. Also create a random starting client connection ID value. This will be incremented from here as new client connections are created. Signed-off-by: David Howells <dhowells@redhat.com>
\| *	rxrpc: The client call state must be changed before attachment to conn	David Howells	2016-09-04	2	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We must set the client call state to RXRPC_CALL_CLIENT_SEND_REQUEST before attaching the call to the connection struct, not after, as it's liable to receive errors and conn aborts as soon as the assignment is made - and these will cause its state to be changed outside of the initiating thread's control. Signed-off-by: David Howells <dhowells@redhat.com>
\| *	rxrpc: Fix uninitialised variable warning	David Howells	2016-09-02	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix the following uninitialised variable warning: ../net/rxrpc/call_event.c: In function 'rxrpc_process_call': ../net/rxrpc/call_event.c:879:58: warning: 'error' may be used uninitialized in this function [-Wmaybe-uninitialized] _debug("post net error %d", error); ^ Signed-off-by: David Howells <dhowells@redhat.com>
\| *	rxrpc: fix undefined behavior in rxrpc_mark_call_released	Arnd Bergmann	2016-09-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	gcc -Wmaybe-initialized correctly points out a newly introduced bug through which we can end up calling rxrpc_queue_call() for a dead connection: net/rxrpc/call_object.c: In function 'rxrpc_mark_call_released': net/rxrpc/call_object.c:600:5: error: 'sched' may be used uninitialized in this function [-Werror=maybe-uninitialized] This sets the 'sched' variable to zero to restore the previous behavior. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: f5c17aaeb2ae ("rxrpc: Calls should only have one terminal state") Signed-off-by: David Howells <dhowells@redhat.com>
* \|	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next	David S. Miller	2016-09-06	36	-1554/+1196
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Most relevant updates are the removal of per-conntrack timers to use a workqueue/garbage collection approach instead from Florian Westphal, the hash and numgen expression for nf_tables from Laura Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag, removal of ip_conntrack sysctl and many other incremental updates on our Netfilter codebase. More specifically, they are: 1) Retrieve only 4 bytes to fetch ports in case of non-linear skb transport area in dccp, sctp, tcp, udp and udplite protocol conntrackers, from Gao Feng. 2) Missing whitespace on error message in physdev match, from Hangbin Liu. 3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang. 4) Add nf_ct_expires() helper function and use it, from Florian Westphal. 5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also from Florian. 6) Rename nf_tables set implementation to nft_set_{name}.c 7) Introduce the hash expression to allow arbitrary hashing of selector concatenations, from Laura Garcia Liebana. 8) Remove ip_conntrack sysctl backward compatibility code, this code has been around for long time already, and we have two interfaces to do this already: nf_conntrack sysctl and ctnetlink. 9) Use nf_conntrack_get_ht() helper function whenever possible, instead of opencoding fetch of hashtable pointer and size, patch from Liping Zhang. 10) Add quota expression for nf_tables. 11) Add number generator expression for nf_tables, this supports incremental and random generators that can be combined with maps, very useful for load balancing purpose, again from Laura Garcia Liebana. 12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King. 13) Introduce a nft_chain_parse_hook() helper function to parse chain hook configuration, this is used by a follow up patch to perform better chain update validation. 14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the nft_set_hash implementation to honor the NLM_F_EXCL flag. 15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(), patch from Florian Westphal. 16) Don't use the DYING bit to know if the conntrack event has been already delivered, instead a state variable to track event re-delivery states, also from Florian. 17) Remove the per-conntrack timer, use the workqueue approach that was discussed during the NFWS, from Florian Westphal. 18) Use the netlink conntrack table dump path to kill stale entries, again from Florian. 19) Add a garbage collector to get rid of stale conntracks, from Florian. 20) Reschedule garbage collector if eviction rate is high. 21) Get rid of the __nf_ct_kill_acct() helper. 22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger. 23) Make nf_log_set() interface assertive on unsupported families. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| * \|	netfilter: log: Check param to avoid overflow in nf_log_set	Gao Feng	2016-08-30	5	-11/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The nf_log_set is an interface function, so it should do the strict sanity check of parameters. Convert the return value of nf_log_set as int instead of void. When the pf is invalid, return -EOPNOTSUPP. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: log_arp: Use ARPHRD_ETHER instead of literal '1'	Gao Feng	2016-08-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is one macro ARPHRD_ETHER which defines the ethernet proto for ARP, so we could use it instead of the literal number '1'. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: remove __nf_ct_kill_acct helper	Florian Westphal	2016-08-30	1	-7/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After timer removal this just calls nf_ct_delete so remove the __ prefix version and make nf_ct_kill a shorthand for nf_ct_delete. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: conntrack: resched gc again if eviction rate is high	Florian Westphal	2016-08-30	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we evicted a large fraction of the scanned conntrack entries re-schedule the next gc cycle for immediate execution. This triggers during tests where load is high, then drops to zero and many connections will be in TW/CLOSE state with < 30 second timeouts. Without this change it will take several minutes until conntrack count comes back to normal. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: conntrack: add gc worker to remove timed-out entries	Florian Westphal	2016-08-30	1	-0/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Conntrack gc worker to evict stale entries. GC happens once every 5 seconds, but we only scan at most 1/64th of the table (and not more than 8k) buckets to avoid hogging cpu. This means that a complete scan of the table will take several minutes of wall-clock time. Considering that the gc run will never have to evict any entries during normal operation because those will happen from packet path this should be fine. We only need gc to make sure userspace (conntrack event listeners) eventually learn of the timeout, and for resource reclaim in case the system becomes idle. We do not disable BH and cond_resched for every bucket so this should not introduce noticeable latencies either. A followup patch will add a small change to speed up GC for the extreme case where most entries are timed out on an otherwise idle system. v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on nulls value change in gc worker, suggested by Eric Dumazet. v3: don't call cancel_delayed_work_sync twice (again, Eric). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: evict stale entries on netlink dumps	Florian Westphal	2016-08-30	1	-1/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When dumping we already have to look at the entire table, so we might as well toss those entries whose timeout value is in the past. We also look at every entry during resize operations. However, eviction there is not as simple because we hold the global resize lock so we can't evict without adding a 'expired' list to drop from later. Considering that resizes are very rare it doesn't seem worth doing it. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: conntrack: get rid of conntrack timer	Florian Westphal	2016-08-30	4	-58/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as Eric Dumazet pointed out during netfilter workshop 2016. Eric also says: "Another reason was the fact that Thomas was about to change max timer range [..]" (500462a9de657f8, 'timers: Switch to a non-cascading wheel'). Remove the timer and use a 32bit jiffies value containing timestamp until entry is valid. During conntrack lookup, even before doing tuple comparision, check the timeout value and evict the entry in case it is too old. The dying bit is used as a synchronization point to avoid races where multiple cpus try to evict the same entry. Because lookup is always lockless, we need to bump the refcnt once when we evict, else we could try to evict already-dead entry that is being recycled. This is the standard/expected way when conntrack entries are destroyed. Followup patches will introduce garbage colliction via work queue and further places where we can reap obsoleted entries (e.g. during netlink dumps), this is needed to avoid expired conntracks from hanging around for too long when lookup rate is low after a busy period. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: don't rely on DYING bit to detect when destroy event was sent	Florian Westphal	2016-08-30	1	-8/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The reliable event delivery mode currently (ab)uses the DYING bit to detect which entries on the dying list have to be skipped when re-delivering events from the eache worker in reliable event mode. Currently when we delete the conntrack from main table we only set this bit if we could also deliver the netlink destroy event to userspace. If we fail we move it to the dying list, the ecache worker will reattempt event delivery for all confirmed conntracks on the dying list that do not have the DYING bit set. Once timer is gone, we can no longer use if (del_timer()) to detect when we 'stole' the reference count owned by the timer/hash entry, so we need some other way to avoid racing with other cpu. Pablo suggested to add a marker in the ecache extension that skips entries that have been unhashed from main table but are still waiting for the last reference count to be dropped (e.g. because one skb waiting on nfqueue verdict still holds a reference). We do this by adding a tristate. If we fail to deliver the destroy event, make a note of this in the eache extension. The worker can then skip all entries that are in a different state. Either they never delivered a destroy event, e.g. because the netlink backend was not loaded, or redelivery took place already. Once the conntrack timer is removed we will now be able to replace del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid racing with other cpu that tries to evict the same conntrack. Because DYING will then be set right before we report the destroy event we can no longer skip event reporting when dying bit is set. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: restart search if moved to other chain	Florian Westphal	2016-08-30	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In case nf_conntrack_tuple_taken did not find a conflicting entry check that all entries in this hash slot were tested and restart in case an entry was moved to another chain. Reported-by: Eric Dumazet <edumazet@google.com> Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()") Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_tables: Use nla_put_be32() to dump immediate parameters	Pablo Neira Ayuso	2016-08-26	2	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nft_dump_register() should only be used with registers, not with immediates. Fixes: cb1b69b0b15b ("netfilter: nf_tables: add hash expression") Fixes: 91dbc6be0a62("netfilter: nf_tables: add number generator expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion	Pablo Neira Ayuso	2016-08-26	3	-13/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the NLM_F_EXCL flag is set, then new elements that clash with an existing one return EEXIST. In case you try to add an element whose data area differs from what we have, then this returns EBUSY. If no flag is specified at all, then this returns success to userspace. This patch also update the set insert operation so we can fetch the existing element that clashes with the one you want to add, we need this to make sure the element data doesn't differ. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_tables: reject hook configuration updates on existing chains	Pablo Neira Ayuso	2016-08-23	1	-0/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, if you add a base chain whose name clashes with an existing non-base chain, nf_tables doesn't complain about this. Similarly, if you update the chain type, the hook number and priority. With this patch, nf_tables bails out in case any of this unsupported operations occur by returning EBUSY. # nft add table x # nft add chain x y # nft add chain x y { type nat hook input priority 0\; } <cmdline>:1:1-49: Error: Could not process rule: Device or resource busy add chain x y { type nat hook input priority 0; } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_tables: introduce nft_chain_parse_hook()	Pablo Neira Ayuso	2016-08-23	1	-63/+89
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce a new function to wrap the code that parses the chain hook configuration so we can reuse this code to validate chain updates. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nft_hash: fix non static symbol warning	Wei Yongjun	2016-08-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixes the following sparse warning: net/netfilter/nft_hash.c:40:25: warning: symbol 'nft_hash_policy' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: fix spelling mistake: "delimitter" -> "delimiter"	Colin Ian King	2016-08-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	trivial fix to spelling mistake in pr_debug message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_tables: add number generator expression	Laura Garcia Liebana	2016-08-22	3	-0/+199
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds the numgen expression that allows us to generated incremental and random numbers, this generator is bound to a upper limit that is specified by userspace. This expression is useful to distribute packets in a round-robin fashion as well as randomly. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_tables: add quota expression	Pablo Neira Ayuso	2016-08-22	3	-0/+128
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds the quota expression. This new stateful expression integrate easily into the dynset expression to build 'hashquota' flow tables. Arguably, we could use instead "counter bytes > 1000" instead, but this approach has several problems: 1) We only support for one single stateful expression in dynamic set definitions, and the expression above is a composite of two expressions: get counter + comparison. 2) We would need to restore the packed counter representation (that we used to have) based on seqlock to synchronize this, since per-cpu is not suitable for this. So instead of bloating the counter expression back with the seqlock representation and extending the existing set infrastructure to make it more complex for the composite described above, let's follow the more simple approach of adding a quota expression that we can plug into our existing infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_conntrack: restore nf_conntrack_htable_size as exported symbol	Pablo Neira Ayuso	2016-08-18	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is required to iterate over the hash table in cttimeout, ctnetlink and nf_conntrack_ipv4. >> ERROR: "nf_conntrack_htable_size" [net/netfilter/nfnetlink_cttimeout.ko] undefined! ERROR: "nf_conntrack_htable_size" [net/netfilter/nf_conntrack_netlink.ko] undefined! ERROR: "nf_conntrack_htable_size" [net/ipv4/netfilter/nf_conntrack_ipv4.ko] undefined! Fixes: adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code") Reported-by: kbuild test robot <fengguang.wu@intel.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: conntrack: simplify the code by using nf_conntrack_get_ht	Liping Zhang	2016-08-18	1	-36/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit 64b87639c9cb ("netfilter: conntrack: fix race between nf_conntrack proc read and hash resize") introduce the nf_conntrack_get_ht, so there's no need to check nf_conntrack_generation again and again to get the hash table and hash size. And convert nf_conntrack_get_ht to inline function here. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: remove ip_conntrack* sysctl compat code	Pablo Neira Ayuso	2016-08-13	11	-993/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This backward compatibility has been around for more than ten years, since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we have alternate /proc/net/nf_conntrack* entries, the ctnetlink interface and the conntrack utility got adopted by many people in the user community according to what I observed on the netfilter user mailing list. So let's get rid of this. Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do not need to be exported as symbol anymore. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_tables: add hash expression	Laura Garcia Liebana	2016-08-12	3	-0/+143
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds a new hash expression, this provides jhash support but this can be extended to support for other hash functions. The modulus and seed already comes embedded into this new expression. Use case example: ... meta mark set hash ip saddr mod 10 Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	netfilter: nf_tables: rename set implementations	Pablo Neira Ayuso	2016-08-12	4	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use nft_set_* prefix for backend set implementations, thus we can use nft_hash for the new hash expression. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| * \|	ipvs: use nf_ct_kill helper	Florian Westphal	2016-08-12	1	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Once timer is removed from nf_conn struct we cannot open-code the removal sequence anymore. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>