summaryrefslogtreecommitdiffstats
path: root/net/rxrpc
Commit message (Collapse)AuthorAgeFilesLines
* rxrpc: Fix net namespace cleanupDavid Howells2019-05-051-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit b13023421b5179413421333f602850914f6a7ad8 ] In rxrpc_destroy_all_calls(), there are two phases: (1) make sure the ->calls list is empty, emitting error messages if not, and (2) wait for the RCU cleanup to happen on outstanding calls (ie. ->nr_calls becomes 0). To avoid taking the call_lock, the function prechecks ->calls and if empty, it returns to avoid taking the lock - this is wrong, however: it still needs to go and do the second phase and wait for ->nr_calls to become 0. Without this, the rxrpc_net struct may get deallocated before we get to the RCU cleanup for the last calls. This can lead to: Slab corruption (Not tainted): kmalloc-16k start=ffff88802b178000, len=16384 050: 6b 6b 6b 6b 6b 6b 6b 6b 61 6b 6b 6b 6b 6b 6b 6b kkkkkkkkakkkkkkk Note the "61" at offset 0x58. This corresponds to the ->nr_calls member of struct rxrpc_net (which is >9k in size, and thus allocated out of the 16k slab). Fix this by flipping the condition on the if-statement, putting the locked section inside the if-body and dropping the return from there. The function will then always go on to wait for the RCU cleanup on outstanding calls. Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rxrpc: fix race condition in rxrpc_input_packet()Eric Dumazet2019-05-022-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 032be5f19a94de51093851757089133dcc1e92aa upstream. After commit 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook"), rxrpc_input_packet() is directly called from lockless UDP receive path, under rcu_read_lock() protection. It must therefore use RCU rules : - udp_sk->sk_user_data can be cleared at any point in this function. rcu_dereference_sk_user_data() is what we need here. - Also, since sk_user_data might have been set in rxrpc_open_socket() we must observe a proper RCU grace period before kfree(local) in rxrpc_lookup_local() v4: @local can be NULL in xrpc_lookup_local() as reported by kbuild test robot <lkp@intel.com> and Julia Lawall <julia.lawall@lip6.fr>, thanks ! v3,v2 : addressed David Howells feedback, thanks ! syzbot reported : kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 19236 Comm: syz-executor703 Not tainted 5.1.0-rc6 #79 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__lock_acquire+0xbef/0x3fb0 kernel/locking/lockdep.c:3573 Code: 00 0f 85 a5 1f 00 00 48 81 c4 10 01 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 4a 21 00 00 49 81 7d 00 20 54 9c 89 0f 84 cf f4 RSP: 0018:ffff88809d7aef58 EFLAGS: 00010002 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000026 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88809d7af090 R08: 0000000000000001 R09: 0000000000000001 R10: ffffed1015d05bc7 R11: ffff888089428600 R12: 0000000000000000 R13: 0000000000000130 R14: 0000000000000001 R15: 0000000000000001 FS: 00007f059044d700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004b6040 CR3: 00000000955ca000 CR4: 00000000001406f0 Call Trace: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4211 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:152 skb_queue_tail+0x26/0x150 net/core/skbuff.c:2972 rxrpc_reject_packet net/rxrpc/input.c:1126 [inline] rxrpc_input_packet+0x4a0/0x5536 net/rxrpc/input.c:1414 udp_queue_rcv_one_skb+0xaf2/0x1780 net/ipv4/udp.c:2011 udp_queue_rcv_skb+0x128/0x730 net/ipv4/udp.c:2085 udp_unicast_rcv_skb.isra.0+0xb9/0x360 net/ipv4/udp.c:2245 __udp4_lib_rcv+0x701/0x2ca0 net/ipv4/udp.c:2301 udp_rcv+0x22/0x30 net/ipv4/udp.c:2482 ip_protocol_deliver_rcu+0x60/0x8f0 net/ipv4/ip_input.c:208 ip_local_deliver_finish+0x23b/0x390 net/ipv4/ip_input.c:234 NF_HOOK include/linux/netfilter.h:289 [inline] NF_HOOK include/linux/netfilter.h:283 [inline] ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:255 dst_input include/net/dst.h:450 [inline] ip_rcv_finish+0x1e1/0x300 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:289 [inline] NF_HOOK include/linux/netfilter.h:283 [inline] ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0x115/0x1a0 net/core/dev.c:4987 __netif_receive_skb+0x2c/0x1c0 net/core/dev.c:5099 netif_receive_skb_internal+0x117/0x660 net/core/dev.c:5202 napi_frags_finish net/core/dev.c:5769 [inline] napi_gro_frags+0xade/0xd10 net/core/dev.c:5843 tun_get_user+0x2f24/0x3fb0 drivers/net/tun.c:1981 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2027 call_write_iter include/linux/fs.h:1866 [inline] do_iter_readv_writev+0x5e1/0x8e0 fs/read_write.c:681 do_iter_write fs/read_write.c:957 [inline] do_iter_write+0x184/0x610 fs/read_write.c:938 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1002 do_writev+0x15e/0x370 fs/read_write.c:1037 __do_sys_writev fs/read_write.c:1110 [inline] __se_sys_writev fs/read_write.c:1107 [inline] __x64_sys_writev+0x75/0xb0 fs/read_write.c:1107 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rxrpc: Fix client call connect/disconnect raceDavid Howells2019-04-201-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 930c9f9125c85b5134b3e711bc252ecc094708e3 ] rxrpc_disconnect_client_call() reads the call's connection ID protocol value (call->cid) as part of that function's variable declarations. This is bad because it's not inside the locked section and so may race with someone granting use of the channel to the call. This manifests as an assertion failure (see below) where the call in the presumed channel (0 because call->cid wasn't set when we read it) doesn't match the call attached to the channel we were actually granted (if 1, 2 or 3). Fix this by moving the read and dependent calculations inside of the channel_lock section. Also, only set the channel number and pointer variables if cid is not zero (ie. unset). This problem can be induced by injecting an occasional error in rxrpc_wait_for_channel() before the call to schedule(). Make two further changes also: (1) Add a trace for wait failure in rxrpc_connect_call(). (2) Drop channel_lock before BUG'ing in the case of the assertion failure. The failure causes a trace akin to the following: rxrpc: Assertion failed - 18446612685268945920(0xffff8880beab8c00) == 18446612685268621312(0xffff8880bea69800) is false ------------[ cut here ]------------ kernel BUG at net/rxrpc/conn_client.c:824! ... RIP: 0010:rxrpc_disconnect_client_call+0x2bf/0x99d ... Call Trace: rxrpc_connect_call+0x902/0x9b3 ? wake_up_q+0x54/0x54 rxrpc_new_client_call+0x3a0/0x751 ? rxrpc_kernel_begin_call+0x141/0x1bc ? afs_alloc_call+0x1b5/0x1b5 rxrpc_kernel_begin_call+0x141/0x1bc afs_make_call+0x20c/0x525 ? afs_alloc_call+0x1b5/0x1b5 ? __lock_is_held+0x40/0x71 ? lockdep_init_map+0xaf/0x193 ? lockdep_init_map+0xaf/0x193 ? __lock_is_held+0x40/0x71 ? yfs_fs_fetch_data+0x33b/0x34a yfs_fs_fetch_data+0x33b/0x34a afs_fetch_data+0xdc/0x3b7 afs_read_dir+0x52d/0x97f afs_dir_iterate+0xa0/0x661 ? iterate_dir+0x63/0x141 iterate_dir+0xa2/0x141 ksys_getdents64+0x9f/0x11b ? filldir+0x111/0x111 ? do_syscall_64+0x3e/0x1a0 __x64_sys_getdents64+0x16/0x19 do_syscall_64+0x7d/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 45025bceef17 ("rxrpc: Improve management and caching of client connection objects") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* rxrpc: Fix client call queueing, waiting for channelDavid Howells2019-03-191-2/+2
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 69ffaebb90369ce08657b5aea4896777b9d6e8fc ] rxrpc_get_client_conn() adds a new call to the front of the waiting_calls queue if the connection it's going to use already exists. This is bad as it allows calls to get starved out. Fix this by adding to the tail instead. Also change the other enqueue point in the same function to put it on the front (ie. when we have a new connection). This makes the point that in the case of a new connection the new call goes at the front (though it doesn't actually matter since the queue should be unoccupied). Fixes: 45025bceef17 ("rxrpc: Improve management and caching of client connection objects") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rxrpc: bad unlock balance in rxrpc_recvmsgEric Dumazet2019-02-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6dce3c20ac429e7a651d728e375853370c796e8d ] When either "goto wait_interrupted;" or "goto wait_error;" paths are taken, socket lock has already been released. This patch fixes following syzbot splat : WARNING: bad unlock balance detected! 5.0.0-rc4+ #59 Not tainted ------------------------------------- syz-executor223/8256 is trying to release lock (sk_lock-AF_RXRPC) at: [<ffffffff86651353>] rxrpc_recvmsg+0x6d3/0x3099 net/rxrpc/recvmsg.c:598 but there are no more locks to release! other info that might help us debug this: 1 lock held by syz-executor223/8256: #0: 00000000fa9ed0f4 (slock-AF_RXRPC){+...}, at: spin_lock_bh include/linux/spinlock.h:334 [inline] #0: 00000000fa9ed0f4 (slock-AF_RXRPC){+...}, at: release_sock+0x20/0x1c0 net/core/sock.c:2798 stack backtrace: CPU: 1 PID: 8256 Comm: syz-executor223 Not tainted 5.0.0-rc4+ #59 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_unlock_imbalance_bug kernel/locking/lockdep.c:3391 [inline] print_unlock_imbalance_bug.cold+0x114/0x123 kernel/locking/lockdep.c:3368 __lock_release kernel/locking/lockdep.c:3601 [inline] lock_release+0x67e/0xa00 kernel/locking/lockdep.c:3860 sock_release_ownership include/net/sock.h:1471 [inline] release_sock+0x183/0x1c0 net/core/sock.c:2808 rxrpc_recvmsg+0x6d3/0x3099 net/rxrpc/recvmsg.c:598 sock_recvmsg_nosec net/socket.c:794 [inline] sock_recvmsg net/socket.c:801 [inline] sock_recvmsg+0xd0/0x110 net/socket.c:797 __sys_recvfrom+0x1ff/0x350 net/socket.c:1845 __do_sys_recvfrom net/socket.c:1863 [inline] __se_sys_recvfrom net/socket.c:1859 [inline] __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:1859 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x446379 Code: e8 2c b3 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fe5da89fd98 EFLAGS: 00000246 ORIG_RAX: 000000000000002d RAX: ffffffffffffffda RBX: 00000000006dbc28 RCX: 0000000000446379 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c R13: 0000000000000000 R14: 0000000000000000 R15: 20c49ba5e353f7cf Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Howells <dhowells@redhat.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rxrpc: Fix lockup due to no error backoff after ack transmit errorDavid Howells2018-11-233-8/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c7e86acfcee30794dc99a0759924bf7b9d43f1ca ] If the network becomes (partially) unavailable, say by disabling IPv6, the background ACK transmission routine can get itself into a tizzy by proposing immediate ACK retransmission. Since we're in the call event processor, that happens immediately without returning to the workqueue manager. The condition should clear after a while when either the network comes back or the call times out. Fix this by: (1) When re-proposing an ACK on failed Tx, don't schedule it immediately. This will allow a certain amount of time to elapse before we try again. (2) Enforce a return to the workqueue manager after a certain number of iterations of the call processing loop. (3) Add a backoff delay that increases the delay on deferred ACKs by a jiffy per failed transmission to a limit of HZ. The backoff delay is cleared on a successful return from kernel_sendmsg(). (4) Cancel calls immediately if the opening sendmsg fails. The layer above can arrange retransmission or rotate to another server. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "net: simplify sock_poll_wait"Karsten Graul2018-11-041-1/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit 89ab066d4229acd32e323f1569833302544a4186 ] This reverts commit dd979b4df817e9976f18fb6f9d134d6bc4a3c317. This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an internal TCP socket for the initial handshake with the remote peer. Whenever the SMC connection can not be established this TCP socket is used as a fallback. All socket operations on the SMC socket are then forwarded to the TCP socket. In case of poll, the file->private_data pointer references the SMC socket because the TCP socket has no file assigned. This causes tcp_poll to wait on the wrong socket. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rxrpc: Fix a missing rxrpc_put_peer() in the error_report handlerDavid Howells2018-10-151-0/+1
| | | | | | | | | | | | | Fix a missing call to rxrpc_put_peer() on the main path through the rxrpc_error_report() function. This manifests itself as a ref leak whenever an ICMP packet or other error comes in. In commit f334430316e7, the hand-off of the ref to a work item was removed and was not replaced with a put. Fixes: f334430316e7 ("rxrpc: Fix error distribution") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rxrpc: Fix incorrect conditional on IPV6David Howells2018-10-151-1/+1
| | | | | | | | | | | | | | | | | | | The udpv6_encap_enable() function is part of the ipv6 code, and if that is configured as a loadable module and rxrpc is built in then a build failure will occur because the conditional check is wrong: net/rxrpc/local_object.o: In function `rxrpc_lookup_local': local_object.c:(.text+0x2688): undefined reference to `udpv6_encap_enable' Use the correct config symbol (CONFIG_AF_RXRPC_IPV6) in the conditional check rather than CONFIG_IPV6 as that will do the right thing. Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Reported-by: kbuild-all@01.org Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* rxrpc: use correct kvec num when sending BUSY response packetYueHaibing2018-10-151-1/+2
| | | | | | | | | | | | | | | | Fixes gcc '-Wunused-but-set-variable' warning: net/rxrpc/output.c: In function 'rxrpc_reject_packets': net/rxrpc/output.c:527:11: warning: variable 'ioc' set but not used [-Wunused-but-set-variable] 'ioc' is the correct kvec num when sending a BUSY (or an ABORT) response packet. Fixes: ece64fec164f ("rxrpc: Emit BUSY packets when supposed to rather than ABORTs") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rxrpc: Fix an uninitialised variableDavid Howells2018-10-151-1/+1
| | | | | | | | | | Fix an uninitialised variable introduced by the last patch. This can cause a crash when a new call comes in to a local service, such as when an AFS fileserver calls back to the local cache manager. Fixes: c1e15b4944c9 ("rxrpc: Fix the packet reception routine") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rxrpc: Fix the packet reception routineDavid Howells2018-10-086-50/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rxrpc_input_packet() function and its call tree was built around the assumption that data_ready() handler called from UDP to inform a kernel service that there is data to be had was non-reentrant. This means that certain locking could be dispensed with. This, however, turns out not to be the case with a multi-queue network card that can deliver packets to multiple cpus simultaneously. Each of those cpus can be in the rxrpc_input_packet() function at the same time. Fix by adding or changing some structure members: (1) Add peer->rtt_input_lock to serialise access to the RTT buffer. (2) Make conn->service_id into a 32-bit variable so that it can be cmpxchg'd on all arches. (3) Add call->input_lock to serialise access to the Rx/Tx state. Note that although the Rx and Tx states are (almost) entirely separate, there's no point completing the separation and having separate locks since it's a bi-phasal RPC protocol rather than a bi-direction streaming protocol. Data transmission and data reception do not take place simultaneously on any particular call. and making the following functional changes: (1) In rxrpc_input_data(), hold call->input_lock around the core to prevent simultaneous producing of packets into the Rx ring and updating of tracking state for a particular call. (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and check it before checking RXRPC_CALL_PINGING as that's a cheaper test. The bit test and bit clear can then be combined. No further locking is needed here. (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of the ACK packet. The superseded ACK check is then done both before and after the lock is taken. The handing of ackinfo data is split, parsing before the lock is taken and processing with it held. This is keyed on rxMTU being non-zero. Congestion management is also done within the locked section. (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window rotation. The ACKALL packet carries no information and is only really useful after all packets have been transmitted since it's imprecise. (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to prevent calls being simultaneously implicitly ended on two cpus and also to prevent any races with incoming call setup. (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade on a connection. It is only permitted to happen once for a connection. (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside rx->incoming_lock to see if someone else set up the call, connection or peer whilst we were getting there. We can't trust the values from the earlier routing check unless we pin refs on them - which we want to avoid. Further, we need to allow for an incoming call to have its state changed on another CPU between us making it live and us adjusting it because the conn is now in the RXRPC_CONN_SERVICE state. (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access to the RTT buffer. Don't need to lock around setting peer->rtt. For reference, the inventory of state-accessing or state-altering functions used by the packet input procedure is: > rxrpc_input_packet() * PACKET CHECKING * ROUTING > rxrpc_post_packet_to_local() > rxrpc_find_connection_rcu() - uses RCU > rxrpc_lookup_peer_rcu() - uses RCU > rxrpc_find_service_conn_rcu() - uses RCU > idr_find() - uses RCU * CONNECTION-LEVEL PROCESSING - Service upgrade - Can only happen once per conn ! Changed to use cmpxchg > rxrpc_post_packet_to_conn() - Setting conn->hi_serial - Probably safe not using locks - Maybe use cmpxchg * CALL-LEVEL PROCESSING > Old-call checking > rxrpc_input_implicit_end_call() > rxrpc_call_completed() > rxrpc_queue_call() ! Need to take rx->incoming_lock > __rxrpc_disconnect_call() > rxrpc_notify_socket() > rxrpc_new_incoming_call() - Uses rx->incoming_lock for the entire process - Might be able to drop this earlier in favour of the call lock > rxrpc_incoming_call() ! Conflicts with rxrpc_input_implicit_end_call() > rxrpc_send_ping() - Don't need locks to check rtt state > rxrpc_propose_ACK * PACKET DISTRIBUTION > rxrpc_input_call_packet() > rxrpc_input_data() * QUEUE DATA PACKET ON CALL > rxrpc_reduce_call_timer() - Uses timer_reduce() ! Needs call->input_lock() > rxrpc_receiving_reply() ! Needs locking around ack state > rxrpc_rotate_tx_window() > rxrpc_end_tx_phase() > rxrpc_proto_abort() > rxrpc_input_dup_data() - Fills the Rx buffer - rxrpc_propose_ACK() - rxrpc_notify_socket() > rxrpc_input_ack() * APPLY ACK PACKET TO CALL AND DISCARD PACKET > rxrpc_input_ping_response() - Probably doesn't need any extra locking ! Need READ_ONCE() on call->ping_serial > rxrpc_input_check_for_lost_ack() - Takes call->lock to consult Tx buffer > rxrpc_peer_add_rtt() ! Needs to take a lock (peer->rtt_input_lock) ! Could perhaps manage with cmpxchg() and xadd() instead > rxrpc_input_requested_ack - Consults Tx buffer ! Probably needs a lock > rxrpc_peer_add_rtt() > rxrpc_propose_ack() > rxrpc_input_ackinfo() - Changes call->tx_winsize ! Use cmpxchg to handle change ! Should perhaps track serial number - Uses peer->lock to record MTU specification changes > rxrpc_proto_abort() ! Need to take call->input_lock > rxrpc_rotate_tx_window() > rxrpc_end_tx_phase() > rxrpc_input_soft_acks() - Consults the Tx buffer > rxrpc_congestion_management() - Modifies the Tx annotations ! Needs call->input_lock() > rxrpc_queue_call() > rxrpc_input_abort() * APPLY ABORT PACKET TO CALL AND DISCARD PACKET > rxrpc_set_call_completion() > rxrpc_notify_socket() > rxrpc_input_ackall() * APPLY ACKALL PACKET TO CALL AND DISCARD PACKET ! Need to take call->input_lock > rxrpc_rotate_tx_window() > rxrpc_end_tx_phase() > rxrpc_reject_packet() There are some functions used by the above that queue the packet, after which the procedure is terminated: - rxrpc_post_packet_to_local() - local->event_queue is an sk_buff_head - local->processor is a work_struct - rxrpc_post_packet_to_conn() - conn->rx_queue is an sk_buff_head - conn->processor is a work_struct - rxrpc_reject_packet() - local->reject_queue is an sk_buff_head - local->processor is a work_struct And some that offload processing to process context: - rxrpc_notify_socket() - Uses RCU lock - Uses call->notify_lock to call call->notify_rx - Uses call->recvmsg_lock to queue recvmsg side - rxrpc_queue_call() - call->processor is a work_struct - rxrpc_propose_ACK() - Uses call->lock to wrap __rxrpc_propose_ACK() And a bunch that complete a call, all of which use call->state_lock to protect the call state: - rxrpc_call_completed() - rxrpc_set_call_completion() - rxrpc_abort_call() - rxrpc_proto_abort() - Also uses rxrpc_queue_call() Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Fix connection-level abort handlingDavid Howells2018-10-083-15/+19
| | | | | | | | | | | | Fix connection-level abort handling to cache the abort and error codes properly so that a new incoming call can be properly aborted if it races with the parent connection being aborted by another CPU. The abort_code and error parameters can then be dropped from rxrpc_abort_calls(). Fixes: f5c17aaeb2ae ("rxrpc: Calls should only have one terminal state") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Only take the rwind and mtu values from latest ACKDavid Howells2018-10-081-9/+10
| | | | | | | | | Move the out-of-order and duplicate ACK packet check to before the call to rxrpc_input_ackinfo() so that the receive window size and MTU size are only checked in the latest ACK packet and don't regress. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Carry call state out of locked section in rxrpc_rotate_tx_window()David Howells2018-10-081-6/+8
| | | | | | | | Carry the call state out of the locked section in rxrpc_rotate_tx_window() rather than sampling it afterwards. This is only used to select tracepoint data, but could have changed by the time we do the tracepoint. Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Don't check RXRPC_CALL_TX_LAST after calling rxrpc_rotate_tx_window()David Howells2018-10-081-16/+19
| | | | | | | | | | | | | | We should only call the function to end a call's Tx phase if we rotated the marked-last packet out of the transmission buffer. Make rxrpc_rotate_tx_window() return an indication of whether it just rotated the packet marked as the last out of the transmit buffer, carrying the information out of the locked section in that function. We can then check the return value instead of examining RXRPC_CALL_TX_LAST. Fixes: 70790dbe3f66 ("rxrpc: Pass the last Tx packet marker in the annotation buffer") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Don't need to take the RCU read lock in the packet receiverDavid Howells2018-10-081-28/+13
| | | | | | | | | | | | We don't need to take the RCU read lock in the rxrpc packet receive function because it's held further up the stack in the IP input routine around the UDP receive routines. Fix this by dropping the RCU read lock calls from rxrpc_input_packet(). This simplifies the code. Fixes: 70790dbe3f66 ("rxrpc: Pass the last Tx packet marker in the annotation buffer") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Use the UDP encap_rcv hookDavid Howells2018-10-083-44/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception in which a packet is placed onto the UDP receive queue and then immediately removed again by rxrpc. Going via the queue in this manner seems like it should be unnecessary. This does, however, require the invention of a value to place in encap_type as that's one of the conditions to switch packets out to the encap_rcv hook. Possibly the value doesn't actually matter for anything other than sockopts on the UDP socket, which aren't accessible outside of rxrpc anyway. This seems to cut a bit of time out of the time elapsed between each sk_buff being timestamped and turning up in rxrpc (the final number in the following trace excerpts). I measured this by making the rxrpc_rx_packet trace point print the time elapsed between the skb being timestamped and the current time (in ns), e.g.: ... 424.278721: rxrpc_rx_packet: ... ACK 25026 So doing a 512MiB DIO read from my test server, with an unmodified kernel: N min max sum mean stddev 27605 2626 7581 7.83992e+07 2840.04 181.029 and with the patch applied: N min max sum mean stddev 27547 1895 12165 6.77461e+07 2459.29 255.02 Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Fix the data_ready handlerDavid Howells2018-10-052-35/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the rxrpc_data_ready() function to pick up all packets and to not miss any. There are two problems: (1) The sk_data_ready pointer on the UDP socket is set *after* it is bound. This means that it's open for business before we're ready to dequeue packets and there's a tiny window exists in which a packet can sneak onto the receive queue, but we never know about it. Fix this by setting the pointers on the socket prior to binding it. (2) skb_recv_udp() will return an error (such as ENETUNREACH) if there was an error on the transmission side, even though we set the sk_error_report hook. Because rxrpc_data_ready() returns immediately in such a case, it never actually removes its packet from the receive queue. Fix this by abstracting out the UDP dequeuing and checksumming into a separate function that keeps hammering on skb_recv_udp() until it returns -EAGAIN, passing the packets extracted to the remainder of the function. and two potential problems: (3) It might be possible in some circumstances or in the future for packets to be being added to the UDP receive queue whilst rxrpc is running consuming them, so the data_ready() handler might get called less often than once per packet. Allow for this by fully draining the queue on each call as (2). (4) If a packet fails the checksum check, the code currently returns after discarding the packet without checking for more. Allow for this by fully draining the queue on each call as (2). Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com>
* rxrpc: Fix some missed refs to init_netDavid Howells2018-10-056-24/+34
| | | | | | | | | Fix some refs to init_net that should've been changed to the appropriate network namespace. Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com>
* rxrpc: Fix error distributionDavid Howells2018-09-286-61/+15
| | | | | | | | | | | | | | | | | | | | Fix error distribution by immediately delivering the errors to all the affected calls rather than deferring them to a worker thread. The problem with the latter is that retries and things can happen in the meantime when we want to stop that sooner. To this end: (1) Stop the error distributor from removing calls from the error_targets list so that peer->lock isn't needed to synchronise against other adds and removals. (2) Require the peer's error_targets list to be accessed with RCU, thereby avoiding the need to take peer->lock over distribution. (3) Don't attempt to affect a call's state if it is already marked complete. Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Fix transport sockopts to get IPv4 errors on an IPv6 socketDavid Howells2018-09-281-10/+13
| | | | | | | | | | | | | It seems that enabling IPV6_RECVERR on an IPv6 socket doesn't also turn on IP_RECVERR, so neither local errors nor ICMP-transported remote errors from IPv4 peer addresses are returned to the AF_RXRPC protocol. Make the sockopt setting code in rxrpc_open_socket() fall through from the AF_INET6 case to the AF_INET case to turn on all the AF_INET options too in the AF_INET6 case. Fixes: f2aeed3a591f ("rxrpc: Fix error reception on AF_INET6 sockets") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Make service call handling more robustDavid Howells2018-09-285-60/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the following changes to improve the robustness of the code that sets up a new service call: (1) Cache the rxrpc_sock struct obtained in rxrpc_data_ready() to do a service ID check and pass that along to rxrpc_new_incoming_call(). This means that I can remove the check from rxrpc_new_incoming_call() without the need to worry about the socket attached to the local endpoint getting replaced - which would invalidate the check. (2) Cache the rxrpc_peer struct, thereby allowing the peer search to be done once. The peer is passed to rxrpc_new_incoming_call(), thereby saving the need to repeat the search. This also reduces the possibility of rxrpc_publish_service_conn() BUG()'ing due to the detection of a duplicate connection, despite the initial search done by rxrpc_find_connection_rcu() having turned up nothing. This BUG() shouldn't ever get hit since rxrpc_data_ready() *should* be non-reentrant and the result of the initial search should still hold true, but it has proven possible to hit. I *think* this may be due to __rxrpc_lookup_peer_rcu() cutting short the iteration over the hash table if it finds a matching peer with a zero usage count, but I don't know for sure since it's only ever been hit once that I know of. Another possibility is that a bug in rxrpc_data_ready() that checked the wrong byte in the header for the RXRPC_CLIENT_INITIATED flag might've let through a packet that caused a spurious and invalid call to be set up. That is addressed in another patch. (3) Fix __rxrpc_lookup_peer_rcu() to skip peer records that have a zero usage count rather than stopping and returning not found, just in case there's another peer record behind it in the bucket. (4) Don't search the peer records in rxrpc_alloc_incoming_call(), but rather either use the peer cached in (2) or, if one wasn't found, preemptively install a new one. Fixes: 8496af50eb38 ("rxrpc: Use RCU to access a peer's service connection tree") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Improve up-front incoming packet checkingDavid Howells2018-09-282-28/+50
| | | | | | | | | | | Do more up-front checking on incoming packets to weed out invalid ones and also ones aimed at services that we don't support. Whilst we're at it, replace the clearing of call and skew if we don't find a connection with just initialising the variables to zero at the top of the function. Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Emit BUSY packets when supposed to rather than ABORTsDavid Howells2018-09-284-18/+26
| | | | | | | | | | | | | | | | | | | | | In the input path, a received sk_buff can be marked for rejection by setting RXRPC_SKB_MARK_* in skb->mark and, if needed, some auxiliary data (such as an abort code) in skb->priority. The rejection is handled by queueing the sk_buff up for dealing with in process context. The output code reads the mark and priority and, theoretically, generates an appropriate response packet. However, if RXRPC_SKB_MARK_BUSY is set, this isn't noticed and an ABORT message with a random abort code is generated (since skb->priority wasn't set to anything). Fix this by outputting the appropriate sort of packet. Also, whilst we're at it, most of the marks are no longer used, so remove them and rename the remaining two to something more obvious. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Fix RTT gatheringDavid Howells2018-09-283-15/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix RTT information gathering in AF_RXRPC by the following means: (1) Enable Rx timestamping on the transport socket with SO_TIMESTAMPNS. (2) If the sk_buff doesn't have a timestamp set when rxrpc_data_ready() collects it, set it at that point. (3) Allow ACKs to be requested on the last packet of a client call, but not a service call. We need to be careful lest we undo: bf7d620abf22c321208a4da4f435e7af52551a21 Author: David Howells <dhowells@redhat.com> Date: Thu Oct 6 08:11:51 2016 +0100 rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase but that only really applies to service calls that we're handling, since the client side gets to send the final ACK (or not). (4) When about to transmit an ACK or DATA packet, record the Tx timestamp before only; don't update the timestamp afterwards. (5) Switch the ordering between recording the serial and recording the timestamp to always set the serial number first. The serial number shouldn't be seen referenced by an ACK packet until we've transmitted the packet bearing it - so in the Rx path, we don't need the timestamp until we've checked the serial number. Fixes: cf1a6474f807 ("rxrpc: Add per-peer RTT tracker") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Fix checks as to whether we should set up a new callDavid Howells2018-09-283-9/+15
| | | | | | | | | | | There's a check in rxrpc_data_ready() that's checking the CLIENT_INITIATED flag in the packet type field rather than in the packet flags field. Fix this by creating a pair of helper functions to check whether the packet is going to the client or to the server and use them generally. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>
* rxrpc: Remove dup code from rxrpc_find_connection_rcu()David Howells2018-09-271-3/+0
| | | | | | | rxrpc_find_connection_rcu() initialises variable k twice with the same information. Remove one of the initialisations. Signed-off-by: David Howells <dhowells@redhat.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2018-08-1512-67/+123
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: "Highlights: - Gustavo A. R. Silva keeps working on the implicit switch fallthru changes. - Support 802.11ax High-Efficiency wireless in cfg80211 et al, From Luca Coelho. - Re-enable ASPM in r8169, from Kai-Heng Feng. - Add virtual XFRM interfaces, which avoids all of the limitations of existing IPSEC tunnels. From Steffen Klassert. - Convert GRO over to use a hash table, so that when we have many flows active we don't traverse a long list during accumluation. - Many new self tests for routing, TC, tunnels, etc. Too many contributors to mention them all, but I'm really happy to keep seeing this stuff. - Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu. - Lots of cleanups and fixes in L2TP code from Guillaume Nault. - Add IPSEC offload support to netdevsim, from Shannon Nelson. - Add support for slotting with non-uniform distribution to netem packet scheduler, from Yousuk Seung. - Add UDP GSO support to mlx5e, from Boris Pismenny. - Support offloading of Team LAG in NFP, from John Hurley. - Allow to configure TX queue selection based upon RX queue, from Amritha Nambiar. - Support ethtool ring size configuration in aquantia, from Anton Mikaev. - Support DSCP and flowlabel per-transport in SCTP, from Xin Long. - Support list based batching and stack traversal of SKBs, this is very exciting work. From Edward Cree. - Busyloop optimizations in vhost_net, from Toshiaki Makita. - Introduce the ETF qdisc, which allows time based transmissions. IGB can offload this in hardware. From Vinicius Costa Gomes. - Add parameter support to devlink, from Moshe Shemesh. - Several multiplication and division optimizations for BPF JIT in nfp driver, from Jiong Wang. - Lots of prepatory work to make more of the packet scheduler layer lockless, when possible, from Vlad Buslov. - Add ACK filter and NAT awareness to sch_cake packet scheduler, from Toke Høiland-Jørgensen. - Support regions and region snapshots in devlink, from Alex Vesker. - Allow to attach XDP programs to both HW and SW at the same time on a given device, with initial support in nfp. From Jakub Kicinski. - Add TLS RX offload and support in mlx5, from Ilya Lesokhin. - Use PHYLIB in r8169 driver, from Heiner Kallweit. - All sorts of changes to support Spectrum 2 in mlxsw driver, from Ido Schimmel. - PTP support in mv88e6xxx DSA driver, from Andrew Lunn. - Make TCP_USER_TIMEOUT socket option more accurate, from Jon Maxwell. - Support for templates in packet scheduler classifier, from Jiri Pirko. - IPV6 support in RDS, from Ka-Cheong Poon. - Native tproxy support in nf_tables, from Máté Eckl. - Maintain IP fragment queue in an rbtree, but optimize properly for in-order frags. From Peter Oskolkov. - Improvde handling of ACKs on hole repairs, from Yuchung Cheng" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits) bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT" hv/netvsc: Fix NULL dereference at single queue mode fallback net: filter: mark expected switch fall-through xen-netfront: fix warn message as irq device name has '/' cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0 net: dsa: mv88e6xxx: missing unlock on error path rds: fix building with IPV6=m inet/connection_sock: prefer _THIS_IP_ to current_text_addr net: dsa: mv88e6xxx: bitwise vs logical bug net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd() ieee802154: hwsim: using right kind of iteration net: hns3: Add vlan filter setting by ethtool command -K net: hns3: Set tx ring' tc info when netdev is up net: hns3: Remove tx ring BD len register in hns3_enet net: hns3: Fix desc num set to default when setting channel net: hns3: Fix for phy link issue when using marvell phy driver net: hns3: Fix for information of phydev lost problem when down/up net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero net: hns3: Add support for serdes loopback selftest bnxt_en: take coredump_record structure off stack ...
| * rxrpc: remove redundant static int 'zero'Colin Ian King2018-08-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | The static int 'zero' is defined but is never used hence it is redundant and can be removed. The use of this variable was removed with commit a158bdd3247b ("rxrpc: Fix call timeouts"). Cleans up clang warning: warning: 'zero' defined but not used [-Wunused-const-variable=] Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-08-097-90/+109
| |\ | | | | | | | | | | | | | | | | | | Overlapping changes in RXRPC, changing to ktime_get_seconds() whilst adding some tracepoints. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | rxrpc: Push iov_iter up from rxrpc_kernel_recv_data() to callerDavid Howells2018-08-031-22/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | Push iov_iter up from rxrpc_kernel_recv_data() to its caller to allow non-contiguous iovs to be passed down, thereby permitting file reading to be simplified in the AFS filesystem in a future patch. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | rxrpc: Reuse SKCIPHER_REQUEST_ON_STACK bufferKees Cook2018-08-031-12/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The use of SKCIPHER_REQUEST_ON_STACK() will trigger FRAME_WARN warnings (when less than 2048) once the VLA is no longer hidden from the check: net/rxrpc/rxkad.c:398:1: warning: the frame size of 1152 bytes is larger than 1024 bytes [-Wframe-larger-than=] net/rxrpc/rxkad.c:242:1: warning: the frame size of 1152 bytes is larger than 1024 bytes [-Wframe-larger-than=] This passes the initial SKCIPHER_REQUEST_ON_STACK allocation to the leaf functions for reuse. Two requests allocated on the stack is not needed when only one is used at a time. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-08-021-2/+2
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The BTF conflicts were simple overlapping changes. The virtio_net conflict was an overlap of a fix of statistics counter, happening alongisde a move over to a bonafide statistics structure rather than counting value on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | rxrpc: Remove set but not used variable 'nowj'Wei Yongjun2018-08-021-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes gcc '-Wunused-but-set-variable' warning: net/rxrpc/proc.c: In function 'rxrpc_call_seq_show': net/rxrpc/proc.c:66:29: warning: variable 'nowj' set but not used [-Wunused-but-set-variable] unsigned long timeout = 0, nowj; ^ Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | rxrpc: Transmit more ACKs during data receptionDavid Howells2018-08-012-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Immediately flush any outstanding ACK on entry to rxrpc_recvmsg_data() - which transfers data to the target buffers - if we previously had an Rx underrun (ie. we returned -EAGAIN because we ran out of received data). This lets the server know what we've managed to receive something. Also flush any outstanding ACK after calling the function if it hit -EAGAIN to let the server know we processed some data. It might be better to send more ACKs, possibly on a time-based scheme, but that needs some more consideration. With this and some additional AFS patches, it is possible to get large unencrypted O_DIRECT reads to be almost as fast as NFS over TCP. It looks like it might be theoretically possible to improve performance yet more for a server running a single operation as investigation of packet timestamps indicates that the server keeps stalling. The issue appears to be that rxrpc runs in to trouble with ACK packets getting batched together (up to ~32 at a time) somewhere between the IP transmit queue on the client and the ethernet receive queue on the server. However, this case isn't too much of a worry as even a lightly loaded server should be receiving sufficient packet flux to flush the ACK packets to the UDP socket. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | rxrpc: Propose, but don't immediately transmit, the final ACK for a callDavid Howells2018-08-011-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The final ACK that closes out an rxrpc call needs to be transmitted by the client unless we're going to follow up with a DATA packet for a new call on the same channel (which implicitly ACK's the previous call, thereby saving an ACK). Currently, we don't do that, so if no follow on call is immediately forthcoming, the server will resend the last DATA packet - at which point rxrpc_conn_retransmit_call() will be triggered and will (re)send the final ACK. But the server has to hold on to the last packet until the ACK is received, thereby holding up its resources. Fix the client side to propose a delayed final ACK, to be transmitted after a short delay, assuming the call isn't superseded by a new one. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | rxrpc: Increase the size of a call's Rx windowDavid Howells2018-08-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Increase the size of a call's Rx window from 32 to 63 - ie. one less than the size of the ring buffer. This makes large data transfers perform better when the Tx window on the other side is around 64 (as is the case with Auristor's YFS fileserver). If the server window size is ~32 or smaller, this should make no difference. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | rxrpc: Trace socket notificationDavid Howells2018-08-011-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Trace notifications from the softirq side of the socket to the process-context side. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | rxrpc: Trace packet transmissionDavid Howells2018-08-017-15/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Trace successful packet transmission (kernel_sendmsg() succeeded, that is) in AF_RXRPC. We can share the enum that defines the transmission points with the trace_rxrpc_tx_fail() tracepoint, so rename its constants to be applicable to both. Also, save the internal call->debug_id in the rxrpc_channel struct so that it can be used in retransmission trace lines. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | rxrpc: Fix the trace for terminal ACK (re)transmissionDavid Howells2018-08-011-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the trace for terminal ACK (re)transmission to put in the right parameters. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | rxrpc: Show some more information through /proc filesDavid Howells2018-08-011-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Show the four current call IDs in /proc/net/rxrpc/conns. Show the current packet Rx serial number in /proc/net/rxrpc/calls. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | rxrpc: Display call expect-receive-by timeout in procDavid Howells2018-08-011-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Display in /proc/net/rxrpc/calls the timeout by which a call next expects to receive a packet. This makes it easier to debug timeout issues. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | rxrpc: remove redundant variables 'sp' and 'did_discard'YueHaibing2018-08-012-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Variables 'sp' and 'did_discard' are being assigned, but are never used, hence they are redundant and can be removed. fix following warning: net/rxrpc/call_event.c:165:25: warning: variable 'sp' set but not used [-Wunused-but-set-variable] net/rxrpc/conn_client.c:1054:7: warning: variable 'did_discard' set but not used [-Wunused-but-set-variable] Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David Howells <dhowells@redhat.com>
| * | | net: simplify sock_poll_waitChristoph Hellwig2018-07-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The wait_address argument is always directly derived from the filp argument, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'locking-core-for-linus' of ↵Linus Torvalds2018-08-134-5/+5
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking/atomics update from Thomas Gleixner: "The locking, atomics and memory model brains delivered: - A larger update to the atomics code which reworks the ordering barriers, consolidates the atomic primitives, provides the new atomic64_fetch_add_unless() primitive and cleans up the include hell. - Simplify cmpxchg() instrumentation and add instrumentation for xchg() and cmpxchg_double(). - Updates to the memory model and documentation" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits) locking/atomics: Rework ordering barriers locking/atomics: Instrument cmpxchg_double*() locking/atomics: Instrument xchg() locking/atomics: Simplify cmpxchg() instrumentation locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation tools/memory-model: Rename litmus tests to comply to norm7 tools/memory-model/Documentation: Fix typo, smb->smp sched/Documentation: Update wake_up() & co. memory-barrier guarantees locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock() sched/core: Use smp_mb() in wake_woken_function() tools/memory-model: Add informal LKMM documentation to MAINTAINERS locking/atomics/Documentation: Describe atomic_set() as a write operation tools/memory-model: Make scripts executable tools/memory-model: Remove ACCESS_ONCE() from model tools/memory-model: Remove ACCESS_ONCE() from recipes locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example MAINTAINERS: Add Daniel Lustig as an LKMM reviewer tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name tools/memory-model: Add litmus test for full multicopy atomicity locking/refcount: Always allow checked forms ...
| * | | Merge tag 'v4.18-rc5' into locking/core, to pick up fixesIngo Molnar2018-07-171-3/+7
| |\| | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | atomics/treewide: Rename __atomic_add_unless() => atomic_fetch_add_unless()Mark Rutland2018-06-214-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While __atomic_add_unless() was originally intended as a building-block for atomic_add_unless(), it's now used in a number of places around the kernel. It's the only common atomic operation named __atomic*(), rather than atomic_*(), and for consistency it would be better named atomic_fetch_add_unless(). This lack of consistency is slightly confusing, and gets in the way of scripting atomics. Given that, let's clean things up and promote it to an official part of the atomics API, in the form of atomic_fetch_add_unless(). This patch converts definitions and invocations over to the new name, including the instrumented version, using the following script: ---- git grep -w __atomic_add_unless | while read line; do sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}"; done git grep -w __arch_atomic_add_unless | while read line; do sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}"; done ---- Note that we do not have atomic{64,_long}_fetch_add_unless(), which will be introduced by later patches. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Palmer Dabbelt <palmer@sifive.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | rxrpc: Fix the keepalive generator [ver #2]David Howells2018-08-087-89/+109
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AF_RXRPC has a keepalive message generator that generates a message for a peer ~20s after the last transmission to that peer to keep firewall ports open. The implementation is incorrect in the following ways: (1) It mixes up ktime_t and time64_t types. (2) It uses ktime_get_real(), the output of which may jump forward or backward due to adjustments to the time of day. (3) If the current time jumps forward too much or jumps backwards, the generator function will crank the base of the time ring round one slot at a time (ie. a 1s period) until it catches up, spewing out VERSION packets as it goes. Fix the problem by: (1) Only using time64_t. There's no need for sub-second resolution. (2) Use ktime_get_seconds() rather than ktime_get_real() so that time isn't perceived to go backwards. (3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two parts: (a) The "worker" function that manages the buckets and the timer. (b) The "dispatch" function that takes the pending peers and potentially transmits a keepalive packet before putting them back in the ring into the slot appropriate to the revised last-Tx time. (4) Taking everything that's pending out of the ring and splicing it into a temporary collector list for processing. In the case that there's been a significant jump forward, the ring gets entirely emptied and then the time base can be warped forward before the peers are processed. The warping can't happen if the ring isn't empty because the slot a peer is in is keepalive-time dependent, relative to the base time. (5) Limit the number of iterations of the bucket array when scanning it. (6) Set the timer to skip any empty slots as there's no point waking up if there's nothing to do yet. This can be triggered by an incoming call from a server after a reboot with AF_RXRPC and AFS built into the kernel causing a peer record to be set up before userspace is started. The system clock is then adjusted by userspace, thereby potentially causing the keepalive generator to have a meltdown - which leads to a message like: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23] ... Workqueue: krxrpcd rxrpc_peer_keepalive_worker EIP: lock_acquire+0x69/0x80 ... Call Trace: ? rxrpc_peer_keepalive_worker+0x5e/0x350 ? _raw_spin_lock_bh+0x29/0x60 ? rxrpc_peer_keepalive_worker+0x5e/0x350 ? rxrpc_peer_keepalive_worker+0x5e/0x350 ? __lock_acquire+0x3d3/0x870 ? process_one_work+0x110/0x340 ? process_one_work+0x166/0x340 ? process_one_work+0x110/0x340 ? worker_thread+0x39/0x3c0 ? kthread+0xdb/0x110 ? cancel_delayed_work+0x90/0x90 ? kthread_stop+0x70/0x70 ? ret_from_fork+0x19/0x24 Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | rxrpc: Fix user call ID check in rxrpc_service_prealloc_oneYueHaibing2018-08-011-2/+2
| |/ |/| | | | | | | | | | | | | | | | | | | | | There just check the user call ID isn't already in use, hence should compare user_call_ID with xcall->user_call_ID, which is current node's user_call_ID. Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg") Suggested-by: David Howells <dhowells@redhat.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>