summaryrefslogtreecommitdiffstats
path: root/net/smc
Commit message (Collapse)AuthorAgeFilesLines
* net/smc: add net device tracker to struct smc_pnetentryEric Dumazet2021-12-071-4/+5
| | | | | Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/smc: Clear memory when release and reuse bufferTony Lu2021-12-061-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, buffers are cleared when smc connections are created and buffers are reused. This slows down the speed of establishing new connections. In most cases, the applications want to establish connections as quickly as possible. This patch moves memset() from connection creation path to release and buffer unuse path, this trades off between speed of establishing and release. Test environments: - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4 - socket sndbuf / rcvbuf: 16384 / 131072 bytes - w/o first round, 5 rounds, avg, 100 conns batch per round - smc_buf_create() use bpftrace kprobe, introduces extra latency Latency benchmarks for smc_buf_create(): w/o patch : 19040.0 ns w/ patch : 1932.6 ns ratio : 10.2% (-89.8%) Latency benchmarks for socket create and connect: w/o patch : 143.3 us w/ patch : 102.2 us ratio : 71.3% (-28.7%) The latency of establishing connections is reduced by 28.7%. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/r/20211203113331.2818873-1-kgraul@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-12-022-6/+9
|\ | | | | | | Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: Keep smc_close_final rc during active closeTony Lu2021-12-021-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When smc_close_final() returns error, the return code overwrites by kernel_sock_shutdown() in smc_close_active(). The return code of smc_close_final() is more important than kernel_sock_shutdown(), and it will pass to userspace directly. Fix it by keeping both return codes, if smc_close_final() raises an error, return it or kernel_sock_shutdown()'s. Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/ Fixes: 606a63c9783a ("net/smc: Ensure the active closing peer first closes clcsock") Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: fix wrong list_del in smc_lgr_cleanup_earlyDust Li2021-12-021-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | smc_lgr_cleanup_early() meant to delete the link group from the link group list, but it deleted the list head by mistake. This may cause memory corruption since we didn't remove the real link group from the list and later memseted the link group structure. We got a list corruption panic when testing: [  231.277259] list_del corruption. prev->next should be ffff8881398a8000, but was 0000000000000000 [  231.278222] ------------[ cut here ]------------ [  231.278726] kernel BUG at lib/list_debug.c:53! [  231.279326] invalid opcode: 0000 [#1] SMP NOPTI [  231.279803] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.10.46+ #435 [  231.280466] Hardware name: Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014 [  231.281248] Workqueue: events smc_link_down_work [  231.281732] RIP: 0010:__list_del_entry_valid+0x70/0x90 [  231.282258] Code: 4c 60 82 e8 7d cc 6a 00 0f 0b 48 89 fe 48 c7 c7 88 4c 60 82 e8 6c cc 6a 00 0f 0b 48 89 fe 48 c7 c7 c0 4c 60 82 e8 5b cc 6a 00 <0f> 0b 48 89 fe 48 c7 c7 00 4d 60 82 e8 4a cc 6a 00 0f 0b cc cc cc [  231.284146] RSP: 0018:ffffc90000033d58 EFLAGS: 00010292 [  231.284685] RAX: 0000000000000054 RBX: ffff8881398a8000 RCX: 0000000000000000 [  231.285415] RDX: 0000000000000001 RSI: ffff88813bc18040 RDI: ffff88813bc18040 [  231.286141] RBP: ffffffff8305ad40 R08: 0000000000000003 R09: 0000000000000001 [  231.286873] R10: ffffffff82803da0 R11: ffffc90000033b90 R12: 0000000000000001 [  231.287606] R13: 0000000000000000 R14: ffff8881398a8000 R15: 0000000000000003 [  231.288337] FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 [  231.289160] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [  231.289754] CR2: 0000000000e72058 CR3: 000000010fa96006 CR4: 00000000003706f0 [  231.290485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [  231.291211] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [  231.291940] Call Trace: [  231.292211]  smc_lgr_terminate_sched+0x53/0xa0 [  231.292677]  smc_switch_conns+0x75/0x6b0 [  231.293085]  ? update_load_avg+0x1a6/0x590 [  231.293517]  ? ttwu_do_wakeup+0x17/0x150 [  231.293907]  ? update_load_avg+0x1a6/0x590 [  231.294317]  ? newidle_balance+0xca/0x3d0 [  231.294716]  smcr_link_down+0x50/0x1a0 [  231.295090]  ? __wake_up_common_lock+0x77/0x90 [  231.295534]  smc_link_down_work+0x46/0x60 [  231.295933]  process_one_work+0x18b/0x350 Fixes: a0a62ee15a829 ("net/smc: separate locks for SMCD and SMCR link group lists") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-11-263-22/+37
|\| | | | | | | | | | | | | | | | | | | drivers/net/ipa/ipa_main.c 8afc7e471ad3 ("net: ipa: separate disabling setup from modem stop") 76b5fbcd6b47 ("net: ipa: kill ipa_modem_init()") Duplicated include, drop one. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: Don't call clcsock shutdown twice when smc shutdownTony Lu2021-11-261-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When applications call shutdown() with SHUT_RDWR in userspace, smc_close_active() calls kernel_sock_shutdown(), and it is called twice in smc_shutdown(). This fixes this by checking sk_state before do clcsock shutdown, and avoids missing the application's call of smc_shutdown(). Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/ Fixes: 606a63c9783a ("net/smc: Ensure the active closing peer first closes clcsock") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/r/20211126024134.45693-1-tonylu@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: Fix loop in smc_listenGuo DaXing2021-11-241-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel_listen function in smc_listen will fail when all the available ports are occupied. At this point smc->clcsock->sk->sk_data_ready has been changed to smc_clcsock_data_ready. When we call smc_listen again, now both smc->clcsock->sk->sk_data_ready and smc->clcsk_data_ready point to the smc_clcsock_data_ready function. The smc_clcsock_data_ready() function calls lsmc->clcsk_data_ready which now points to itself resulting in an infinite loop. This patch restores smc->clcsock->sk->sk_data_ready with the old value. Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers") Signed-off-by: Guo DaXing <guodaxing@huawei.com> Acked-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: Fix NULL pointer dereferencing in smc_vlan_by_tcpsk()Karsten Graul2021-11-241-17/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Coverity reports a possible NULL dereferencing problem: in smc_vlan_by_tcpsk(): 6. returned_null: netdev_lower_get_next returns NULL (checked 29 out of 30 times). 7. var_assigned: Assigning: ndev = NULL return value from netdev_lower_get_next. 1623 ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower); CID 1468509 (#1 of 1): Dereference null return value (NULL_RETURNS) 8. dereference: Dereferencing a pointer that might be NULL ndev when calling is_vlan_dev. 1624 if (is_vlan_dev(ndev)) { Remove the manual implementation and use netdev_walk_all_lower_dev() to iterate over the lower devices. While on it remove an obsolete function parameter comment. Fixes: cb9d43f67754 ("net/smc: determine vlan_id of stacked net_device") Suggested-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: Ensure the active closing peer first closes clcsockTony Lu2021-11-231-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The side that actively closed socket, it's clcsock doesn't enter TIME_WAIT state, but the passive side does it. It should show the same behavior as TCP sockets. Consider this, when client actively closes the socket, the clcsock in server enters TIME_WAIT state, which means the address is occupied and won't be reused before TIME_WAIT dismissing. If we restarted server, the service would be unavailable for a long time. To solve this issue, shutdown the clcsock in [A], perform the TCP active close progress first, before the passive closed side closing it. So that the actively closed side enters TIME_WAIT, not the passive one. Client | Server close() // client actively close | smc_release() | smc_close_active() // PEERCLOSEWAIT1 | smc_close_final() // abort or closed = 1| smc_cdc_get_slot_and_msg_send() | [A] | |smc_cdc_msg_recv_action() // ACTIVE | queue_work(smc_close_wq, &conn->close_work) | smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1 | smc_close_passive_abort_received() // only in abort | |close() // server recv zero, close | smc_release() // PROCESSABORT or APPCLOSEWAIT1 | smc_close_active() | smc_close_abort() or smc_close_final() // CLOSED | smc_cdc_get_slot_and_msg_send() // abort or closed = 1 smc_cdc_msg_recv_action() | smc_clcsock_release() queue_work(smc_close_wq, &conn->close_work) | sock_release(tcp) // actively close clc, enter TIME_WAIT smc_close_passive_work() // PEERCLOSEWAIT1 | smc_conn_free() smc_close_passive_abort_received() // CLOSED| smc_conn_free() | smc_clcsock_release() | sock_release(tcp) // passive close clc | Link: https://www.spinics.net/lists/netdev/msg780407.html Fixes: b38d732477e4 ("smc: socket closing and linkgroup cleanup") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: Clean up local struct sock variablesTony Lu2021-11-231-2/+2
| | | | | | | | | | | | | | | | | | | | There remains some variables to replace with local struct sock. So clean them up all. Fixes: 3163c5071f25 ("net/smc: use local struct sock variables consistently") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: Avoid warning of possible recursive lockingWen Gu2021-11-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Possible recursive locking is detected by lockdep when SMC falls back to TCP. The corresponding warnings are as follows: ============================================ WARNING: possible recursive locking detected 5.16.0-rc1+ #18 Tainted: G E -------------------------------------------- wrk/1391 is trying to acquire lock: ffff975246c8e7d8 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0x109/0x250 [smc] but task is already holding lock: ffff975246c8f918 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0xfe/0x250 [smc] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&ei->socket.wq.wait); lock(&ei->socket.wq.wait); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by wrk/1391: #0: ffff975246040130 (sk_lock-AF_SMC){+.+.}-{0:0}, at: smc_connect+0x43/0x150 [smc] #1: ffff975246c8f918 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0xfe/0x250 [smc] stack backtrace: Call Trace: <TASK> dump_stack_lvl+0x56/0x7b __lock_acquire+0x951/0x11f0 lock_acquire+0x27a/0x320 ? smc_switch_to_fallback+0x109/0x250 [smc] ? smc_switch_to_fallback+0xfe/0x250 [smc] _raw_spin_lock_irq+0x3b/0x80 ? smc_switch_to_fallback+0x109/0x250 [smc] smc_switch_to_fallback+0x109/0x250 [smc] smc_connect_fallback+0xe/0x30 [smc] __smc_connect+0xcf/0x1090 [smc] ? mark_held_locks+0x61/0x80 ? __local_bh_enable_ip+0x77/0xe0 ? lockdep_hardirqs_on+0xbf/0x130 ? smc_connect+0x12a/0x150 [smc] smc_connect+0x12a/0x150 [smc] __sys_connect+0x8a/0xc0 ? syscall_enter_from_user_mode+0x20/0x70 __x64_sys_connect+0x16/0x20 do_syscall_64+0x34/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The nested locking in smc_switch_to_fallback() is considered to possibly cause a deadlock because smc_wait->lock and clc_wait->lock are the same type of lock. But actually it is safe so far since there is no other place trying to obtain smc_wait->lock when clc_wait->lock is held. So the patch replaces spin_lock() with spin_lock_nested() to avoid false report by lockdep. Link: https://lkml.org/lkml/2021/11/19/962 Fixes: 2153bd1e3d3d ("Transfer remaining wait queue entries during fallback") Reported-by: syzbot+e979d3597f48262cb4ee@syzkaller.appspotmail.com Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-11-182-1/+16
|\| | | | | | | Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: Make sure the link_id is uniqueWen Gu2021-11-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | The link_id is supposed to be unique, but smcr_next_link_id() doesn't skip the used link_id as expected. So the patch fixes this. Fixes: 026c381fb477 ("net/smc: introduce link_idx for link group array") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: Transfer remaining wait queue entries during fallbackWen Gu2021-11-151-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SMC fallback is incomplete currently. There may be some wait queue entries remaining in smc socket->wq, which should be removed to clcsocket->wq during the fallback. For example, in nginx/wrk benchmark, this issue causes an all-zeros test result: server: nginx -g 'daemon off;' client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html Running 5s test @ http://11.200.15.93/index.html 1 threads and 1 connections Thread Stats Avg Stdev Max ± Stdev Latency 0.00us 0.00us 0.00us -nan% Req/Sec 0.00 0.00 0.00 -nan% 0 requests in 5.00s, 0.00B read Requests/sec: 0.00 Transfer/sec: 0.00B The reason for this all-zeros result is that when wrk used SMC to replace TCP, it added an eppoll_entry into smc socket->wq and expected to be notified if epoll events like EPOLL_IN/ EPOLL_OUT occurred on the smc socket. However, once a fallback occurred, wrk switches to use clcsocket. Now it is clcsocket->wq instead of smc socket->wq which will be woken up. The eppoll_entry remaining in smc socket->wq does not work anymore and wrk stops the test. This patch fixes this issue by removing remaining wait queue entries from smc socket->wq to clcsocket->wq during the fallback. Link: https://www.spinics.net/lists/netdev/msg779769.html Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: drop nopreempt requirement on sock_prot_inuse_add()Eric Dumazet2021-11-161-1/+1
|/ | | | | | | | | | This is distracting really, let's make this simpler, because many callers had to take care of this by themselves, even if on x86 this adds more code than really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: fix sk_refcnt underflow on linkdown and fallbackDust Li2021-11-101-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We got the following WARNING when running ab/nginx test with RDMA link flapping (up-down-up). The reason is when smc_sock fallback and at linkdown happens simultaneously, we may got the following situation: __smc_lgr_terminate() --> smc_conn_kill() --> smc_close_active_abort() smc_sock->sk_state = SMC_CLOSED sock_put(smc_sock) smc_sock was set to SMC_CLOSED and sock_put() been called when terminate the link group. But later application call close() on the socket, then we got: __smc_release(): if (smc_sock->fallback) smc_sock->sk_state = SMC_CLOSED sock_put(smc_sock) Again we set the smc_sock to CLOSED through it's already in CLOSED state, and double put the refcnt, so the following warning happens: refcount_t: underflow; use-after-free. WARNING: CPU: 5 PID: 860 at lib/refcount.c:28 refcount_warn_saturate+0x8d/0xf0 Modules linked in: CPU: 5 PID: 860 Comm: nginx Not tainted 5.10.46+ #403 Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014 RIP: 0010:refcount_warn_saturate+0x8d/0xf0 Code: 05 5c 1e b5 01 01 e8 52 25 bc ff 0f 0b c3 80 3d 4f 1e b5 01 00 75 ad 48 RSP: 0018:ffffc90000527e50 EFLAGS: 00010286 RAX: 0000000000000026 RBX: ffff8881300df2c0 RCX: 0000000000000027 RDX: 0000000000000000 RSI: ffff88813bd58040 RDI: ffff88813bd58048 RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000001 R10: ffff8881300df2c0 R11: ffffc90000527c78 R12: ffff8881300df340 R13: ffff8881300df930 R14: ffff88810b3dad80 R15: ffff8881300df4f8 FS: 00007f739de8fb80(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000a01b008 CR3: 0000000111b64003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: smc_release+0x353/0x3f0 __sock_release+0x3d/0xb0 sock_close+0x11/0x20 __fput+0x93/0x230 task_work_run+0x65/0xa0 exit_to_user_mode_prepare+0xf9/0x100 syscall_exit_to_user_mode+0x27/0x190 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This patch adds check in __smc_release() to make sure we won't do an extra sock_put() and set the socket to CLOSED when its already in CLOSED state. Fixes: 51f1de79ad8e (net/smc: replace sock_put worker by socket refcounting) Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Print function name in smcr_link_down tracepointTony Lu2021-11-051-1/+1
| | | | | | | | | | | | | | This makes the output of smcr_link_down tracepoint easier to use and understand without additional translating function's pointer address. It prints the function name with offset: <idle>-0 [000] ..s. 69.087164: smcr_link_down: lnk=00000000dab41cdc lgr=000000007d5d8e24 state=0 rc=1 dev=mlx5_0 location=smc_wr_tx_tasklet_fn+0x5ef/0x6f0 [smc] Link: https://lore.kernel.org/netdev/11f17a34-fd35-f2ec-3f20-dd0c34e55fde@linux.ibm.com/ Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Introduce tracepoint for smcr link downTony Lu2021-11-013-2/+38
| | | | | | | | | | | | SMC-R link down event is important to help us find links' issues, we should track this event, especially in the single nic mode, which means upper layer connection would be shut down. Then find out the direct link-down reason in time, not only increased the counter, also the location of the code who triggered this event. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Introduce tracepoints for tx and rx msgTony Lu2021-11-014-0/+45
| | | | | | | | | | This introduce two tracepoints for smc tx and rx msg to help us diagnosis issues of data path. These two tracepoitns don't cover the path of CORK or MSG_MORE in tx, just the top half of data path. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Introduce tracepoint for fallbackTony Lu2021-11-014-0/+59
| | | | | | | | | | | This introduces tracepoint for smc fallback to TCP, so that we can track which connection and why it fallbacks, and map the clcsocks' pointer with /proc/net/tcp to find more details about TCP connections. Compared with kprobe or other dynamic tracing, tracepoints are stable and easy to use. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-10-282-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | include/net/sock.h 7b50ecfcc6cd ("net: Rename ->stream_memory_read to ->sock_is_readable") 4c1e34c0dbff ("vsock: Enable y2038 safe timeval for timeout") drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c 0daa55d033b0 ("octeontx2-af: cn10k: debugfs for dumping LMTST map table") e77bcdd1f639 ("octeontx2-af: Display all enabled PF VF rsrc_alloc entries.") Adjacent code addition in both cases, keep both. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: Correct spelling mistake to TCPF_SYN_RECVWen Gu2021-10-281-1/+1
| | | | | | | | | | | | | | | | There should use TCPF_SYN_RECV instead of TCP_SYN_RECV. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: Fix smc_link->llc_testlink_time overflowTony Lu2021-10-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The value of llc_testlink_time is set to the value stored in net->ipv4.sysctl_tcp_keepalive_time when linkgroup init. The value of sysctl_tcp_keepalive_time is already jiffies, so we don't need to multiply by HZ, which would cause smc_link->llc_testlink_time overflow, and test_link send flood. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: stop links when their GID is removedKarsten Graul2021-10-161-0/+53
| | | | | | | | | | | | | | | | | | | | With SMC-Rv2 the GID is an IP address that can be deleted from the device. When an IB_EVENT_GID_CHANGE event is provided then iterate over all active links and check if their GID is still defined. Otherwise stop the affected link. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: add netlink support for SMC-Rv2Karsten Graul2021-10-161-25/+73
| | | | | | | | | | | | | | | | Implement the netlink support for SMC-Rv2 related attributes that are provided to user space. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: extend LLC layer for SMC-Rv2Karsten Graul2021-10-167-115/+531
| | | | | | | | | | | | | | | | | | | | | | Add support for large v2 LLC control messages in smc_llc.c. The new large work request buffer allows to combine control messages into one packet that had to be spread over several packets before. Add handling of the new v2 LLC messages. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: add v2 support to the work request layerKarsten Graul2021-10-165-21/+199
| | | | | | | | | | | | | | | | | | In the work request layer define one large v2 buffer for each link group that is used to transmit and receive large LLC control messages. Add the completion queue handling for this buffer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: retrieve v2 gid from IB deviceKarsten Graul2021-10-164-30/+95
| | | | | | | | | | | | | | | | | | In smc_ib.c, scan for RoCE devices that support UDP encapsulation. Find an eligible device and check that there is a route to the remote peer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: add v2 format of CLC decline messageKarsten Graul2021-10-162-10/+55
| | | | | | | | | | | | | | | | The CLC decline message changed with SMC-Rv2 and supports up to 4 additional diagnosis codes. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: add listen processing for SMC-Rv2Karsten Graul2021-10-164-66/+165
| | | | | | | | | | | | | | | | | | Implement the server side of the SMC-Rv2 processing. Process incoming CLC messages, find eligible devices and check for a valid route to the remote peer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: add SMC-Rv2 connection establishmentKarsten Graul2021-10-166-61/+267
| | | | | | | | | | | | | | | | | | Send a CLC proposal message, and the remote side process this type of message and determine the target GID. Check for a valid route to this GID, and complete the connection establishment. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: prepare for SMC-Rv2 connectionKarsten Graul2021-10-164-35/+113
| | | | | | | | | | | | | | | | Prepare the connection establishment with SMC-Rv2. Detect eligible RoCE cards and indicate all supported SMC modes for the connection. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: save stack space and allocate smc_init_infoKarsten Graul2021-10-161-30/+53
| | | | | | | | | | | | | | | | The struct smc_init_info grew over time, its time to save space on stack and allocate this struct dynamically. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-10-145-41/+85
|\| | | | | | | | | | | | | | | tools/testing/selftests/net/ioam6.sh 7b1700e009cc ("selftests: net: modify IOAM tests for undef bits") bf77b1400a56 ("selftests: net: Test for the IOAM encapsulation with IPv6") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: improved fix wait on already cleared linkKarsten Graul2021-10-085-41/+85
| | | | | | | | | | | | | | | | | | | | | | Commit 8f3d65c16679 ("net/smc: fix wait on already cleared link") introduced link refcounting to avoid waits on already cleared links. This patch extents and improves the refcounting to cover all remaining possible cases for this kind of error situation. Fixes: 15e1b99aadfb ("net/smc: no WR buffer wait for terminating link group") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-09-232-1/+4
|\| | | | | | | | | | | | | | | | | | | net/mptcp/protocol.c 977d293e23b4 ("mptcp: ensure tx skbs always have the MPTCP ext") efe686ffce01 ("mptcp: ensure tx skbs always have the MPTCP ext") same patch merged in both trees, keep net-next. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: fix 'workqueue leaked lock' in smc_conn_abort_workKarsten Graul2021-09-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The abort_work is scheduled when a connection was detected to be out-of-sync after a link failure. The work calls smc_conn_kill(), which calls smc_close_active_abort() and that might end up calling smc_close_cancel_work(). smc_close_cancel_work() cancels any pending close_work and tx_work but needs to release the sock_lock before and acquires the sock_lock again afterwards. So when the sock_lock was NOT acquired before then it may be held after the abort_work completes. Thats why the sock_lock is acquired before the call to smc_conn_kill() in __smc_lgr_terminate(), but this is missing in smc_conn_abort_work(). Fix that by acquiring the sock_lock first and release it after the call to smc_conn_kill(). Fixes: b286a0651e44 ("net/smc: handle incoming CDC validation message") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: add missing error check in smc_clc_prfx_set()Karsten Graul2021-09-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Coverity stumbled over a missing error check in smc_clc_prfx_set(): *** CID 1475954: Error handling issues (CHECKED_RETURN) /net/smc/smc_clc.c: 233 in smc_clc_prfx_set() >>> CID 1475954: Error handling issues (CHECKED_RETURN) >>> Calling "kernel_getsockname" without checking return value (as is done elsewhere 8 out of 10 times). 233 kernel_getsockname(clcsock, (struct sockaddr *)&addrs); Add the return code check in smc_clc_prfx_set(). Fixes: c246d942eabc ("net/smc: restructure netinfo for CLC proposal msgs") Reported-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: add generic netlink support for system EIDKarsten Graul2021-09-143-0/+80
| | | | | | | | | | | | | | | | | | | | | | | | With SMC-Dv2 users can configure if the static system EID should be used during CLC handshake, or if only user EIDs are allowed. Add generic netlink support to enable and disable the system EID, and to retrieve the system EID and its current enabled state. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: keep static copy of system EIDKarsten Graul2021-09-145-18/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | The system EID is retrieved using an registered ISM device each time when needed. This adds some unnecessary complexity at all places where the system EID is needed, but no ISM device is at hand. Simplify the code and save the system EID in a static variable in smc_ism.c. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: add support for user defined EIDsKarsten Graul2021-09-147-31/+320
|/ | | | | | | | | | | | SMC-Dv2 allows users to define EIDs which allows to create separate name spaces enabling users to cluster their SMC-Dv2 connections. Add support for user defined EIDs and extent the generic netlink interface so users can add, remove and dump EIDs. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-08-136-10/+38
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h 9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp") 9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins") 099fdeda659d ("bnxt_en: Event handler for PPS events") kernel/bpf/helpers.c include/linux/bpf-cgroup.h a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()") c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current") drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray") 2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer") MAINTAINERS 7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo") 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/smc: Correct smc link connection counter in case of smc clientGuvenc Gulce2021-08-093-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | SMC clients may be assigned to a different link after the initial connection between two peers was established. In such a case, the connection counter was not correctly set. Update the connection counter correctly when a smc client connection is assigned to a different smc link. Fixes: 07d51580ff65 ("net/smc: Add connection counters for links") Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Tested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: fix wait on already cleared linkKarsten Graul2021-08-094-7/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There can be a race between the waiters for a tx work request buffer and the link down processing that finally clears the link. Although all waiters are woken up before the link is cleared there might be waiters which did not yet get back control and are still waiting. This results in an access to a cleared wait queue head. Fix this by introducing atomic reference counting around the wait calls, and wait with the link clear processing until all waiters have finished. Move the work request layer related calls into smc_wr.c and set the link state to INACTIVE before calling smcr_link_clear() in smc_llc_srv_add_link(). Fixes: 15e1b99aadfb ("net/smc: no WR buffer wait for terminating link group") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: Allow SMC-D 1MB DMB allocationsStefan Raspl2021-08-091-15/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit a3fe3d01bd0d7 ("net/smc: introduce sg-logic for RMBs") introduced a restriction for RMB allocations as used by SMC-R. However, SMC-D does not use scatter-gather lists to back its DMBs, yet it was limited by this restriction, still. This patch exempts SMC, but limits allocations to the maximum RMB/DMB size respectively. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Remove redundant if statementsYajun Deng2021-08-052-4/+2
|/ | | | | | | | The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sock: introduce sk_error_reportAlexander Aring2021-06-291-1/+1
| | | | | | | | | This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, for example to trace socket errors. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Ensure correct state of the socket in send pathGuvenc Gulce2021-06-251-2/+3
| | | | | | | | | | | | | | When smc_sendmsg() is called before the SMC socket initialization has completed, smc_tx_sendmsg() will access un-initialized fields of the SMC socket which results in a null-pointer dereference. Fix this by checking the socket state first in smc_tx_sendmsg(). Fixes: e0e4b8fa5338 ("net/smc: Add SMC statistics support") Reported-by: syzbot+5dda108b672b54141857@syzkaller.appspotmail.com Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Fix ENODATA tests in smc_nl_get_fback_stats()Dan Carpenter2021-06-211-3/+3
| | | | | | | | | | These functions return negative ENODATA but the minus sign was left out in the tests. Fixes: f0dd7bf5e330 ("net/smc: Add netlink support for SMC fallback statistics") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>