summaryrefslogtreecommitdiffstats
path: root/fs/dlm/midcomms.c
Commit message (Collapse)AuthorAgeFilesLines
* fs: dlm: add send ack threshold and append acks to msgsAlexander Aring2023-06-141-45/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes the time when we sending an ack back to tell the other side it can free some message because it is arrived on the receiver node, due random reconnects e.g. TCP resets this is handled as well on application layer to not let DLM run into a deadlock state. The current handling has the following problems: 1. We end in situations that we only send an ack back message of 16 bytes out and no other messages. Whereas DLM has logic to combine so much messages as it can in one send() socket call. This behaviour can be discovered by "trace-cmd start -e dlm_recv" and observing the ret field being 16 bytes. 2. When processing of DLM messages will never end because we receive a lot of messages, we will not send an ack back as it happens when the processing loop ends. This patch introduces a likely and unlikely threshold case. The likely case will send an ack back on a transmit path if the threshold is triggered of amount of processed upper layer protocol. This will solve issue 1 because it will be send when another normal DLM message will be sent. It solves issue 2 because it is not part of the processing loop. There is however a unlikely case, the unlikely case has a bigger threshold and will be triggered when we only receive messages and do not sent any message back. This case avoids that the sending node will keep a lot of message for a long time as we send sometimes ack backs to tell the sender to finally release messages. The atomic cmpxchg() is there to provide a atomically ack send with reset of the upper layer protocol delivery counter. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: handle sequence numbers as atomicAlexander Aring2023-06-141-15/+25
| | | | | | | | | | | | Currently seq_next is only be read on the receive side which processed in an ordered way. The seq_send is being protected by locks. To being able to read the seq_next value on send side as well we convert it to an atomic_t value. The atomic_cmpxchg() is probably not necessary, however the atomic_inc() depends on a if coniditional and this should be handled in an atomic context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: filter ourself midcomms callsAlexander Aring2023-06-141-9/+0
| | | | | | | | | | | It makes no sense to call midcomms/lowcomms functionality for the local node as socket functionality is only required for remote nodes. This patch filters those calls in the upper layer of lockspace membership handling instead of doing it in midcomms/lowcomms layer as they should never be aware of local nodeid. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: revert check required context while closeAlexander Aring2023-06-141-3/+0
| | | | | | | | | | | | | | | | | | | | | This patch reverts commit 2c3fa6ae4d52 ("dlm: check required context while close"). The function dlm_midcomms_close(), which will call later dlm_lowcomms_close(), is called when the cluster manager tells the node got fenced which means on midcomms/lowcomms layer to disconnect the node from the cluster communication. The node can rejoin the cluster later. This patch was ensuring no new message were able to be triggered when we are in the close() function context. This was done by checking if the lockspace has been stopped. However there is a missing check that we only need to check specific lockspaces where the fenced node is member of. This is currently complicated because there is no way to easily check if a node is part of a specific lockspace without stopping the recovery. For now we just revert this commit as it is just a check to finding possible leaks of stopping lockspaces before close() is called. Cc: stable@vger.kernel.org Fixes: 2c3fa6ae4d52 ("dlm: check required context while close") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove unnecessary waker_up() callsAlexander Aring2023-01-231-2/+0
| | | | | | | | | The wake_up() is already handled inside of midcomms_node_reset() when switching the state to CLOSED state. So there is not need to call it after midcomms_node_reset() again. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: move state change into else branchAlexander Aring2023-01-231-3/+4
| | | | | | | | | | Currently we can switch at first into DLM_CLOSE_WAIT state and then do another state change if a condition is true. Instead of doing two state changes we handle the other state change inside an else branch of this condition. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove newline in log_printAlexander Aring2023-01-231-4/+4
| | | | | | | | | There is an API difference between log_print() and other printk()s to put a newline or not. This one was introduced by mistake because log_print() adds a newline. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: reduce the shutdown timeout to 5 secsAlexander Aring2023-01-231-2/+2
| | | | | | | | When a shutdown is stuck, time out after 5 seconds instead of 3 minutes. After this timeout we try a forced shutdown. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: wait until all midcomms nodes detect versionAlexander Aring2023-01-231-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | The current dlm version detection is very complex due to backwards compatablilty with earlier dlm protocol versions. It takes some time to detect if a peer node has a specific DLM version. If it's not detected, we just cut the socket connection. There could be cases where the local node has not detected the version yet, but the peer node has. In these cases, we are trying to shutdown the dlm connection with a FIN/ACK message exchange to be sure the other peer is ready to shutdown the connection on dlm application level. However this mechanism is only available on DLM protocol version 3.2 and we need to be sure the DLM version is detected before. To make it more robust we introduce a a "best effort" wait to wait for the version detection before shutdown the dlm connection. This need to be done before the kthread recoverd for recovery handling is stopped, because recovery handling will trigger enough messages to have a version detection going on. It is a corner case which was detected by modprobe dlm_locktroture module and rmmod dlm_locktorture module directly afterwards (in a looping behaviour). In practice probably nobody would leave a lockspace immediately after joining it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: ignore unexpected non dlm opts msgsAlexander Aring2023-01-231-10/+2
| | | | | | | | | | | | This patch ignores unexpected RCOM_NAMES/RCOM_STATUS messages. To be backwards compatible, those messages are not part of the new reliable DLM OPTS encapsulation header, and have their own retransmit handling using sequence number matching When we get unexpected non dlm opts messages, we should allow them and let RCOM message handling filter them out using sequence numbers. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: bring back previous shutdown handlingAlexander Aring2023-01-231-13/+7
| | | | | | | | | | | | | | | | | This patch mostly reverts commit 4f567acb0b86 ("fs: dlm: remove socket shutdown handling"). There can be situations where the dlm midcomms nodes hash and lowcomms connection hash are not equal, but we need to guarantee that the lowcomms are all closed on a last release of a dlm lockspace, when a shutdown is invoked. This patch guarantees that we always close all sockets managed by the lowcomms connection hash, and calls shutdown for the last message sent. This ensures we don't cut the socket, which could cause the peer to get a connection reset. In future we should try to merge the midcomms/lowcomms hashes into one hash and not handle both in separate hashes. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: send FIN ack back in right casesAlexander Aring2023-01-231-4/+5
| | | | | | | | | | | | | | | This patch moves to send a ack back for receiving a FIN message only when we are in valid states. In other cases and there might be a sender waiting for a ack we just let it timeout at the senders time and hopefully all other cleanups will remove the FIN message on their sending queue. As an example we should never send out an ACK being in LAST_ACK state or we cannot assume a working socket communication when we are in CLOSED state. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: move sending fin message into state change handlingAlexander Aring2023-01-231-24/+9
| | | | | | | | | | | | This patch moves the send fin handling, which should appear in a specific state change, into the state change handling while the per node state_lock is held. I experienced issues with other messages because we changed the state and a fin message was sent out in a different state. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: don't set stop rx flag after node resetAlexander Aring2023-01-231-2/+1
| | | | | | | | | | | | | | | | | Similar to the stop tx flag, the rx flag should warn about a dlm message being received at DLM_FIN state change, when we are assuming no other dlm application messages. If we receive a FIN message and we are in the state DLM_FIN_WAIT2 we call midcomms_node_reset() which puts the midcomms node into DLM_CLOSED state. Afterwards we should not set the DLM_NODE_FLAG_STOP_RX flag any more. This patch changes the setting DLM_NODE_FLAG_STOP_RX in those state changes when we receive a FIN message and we assume there will be no other dlm application messages received until we hit DLM_CLOSED state. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: fix race setting stop tx flagAlexander Aring2023-01-231-1/+1
| | | | | | | | | | | | | | | | This patch sets the stop tx flag before we commit the dlm message. This flag will report about unexpected transmissions after we send the DLM_FIN message out, which should be the last message sent. When we commit the dlm fin message, it could be that we already got an ack back and the CLOSED state change already happened. We should not set this flag when we are in CLOSED state. To avoid this race we simply set the tx flag before the state change can be in progress by moving it before dlm_midcomms_commit_mhandle(). Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: be sure to call dlm_send_queue_flush()Alexander Aring2023-01-231-0/+1
| | | | | | | | | | | | | If we release a midcomms node structure, there should be nothing left inside the dlm midcomms send queue. However, sometimes this is not true because I believe some DLM_FIN message was not acked... if we run into a shutdown timeout, then we should be sure there is no pending send dlm message inside this queue when releasing midcomms node structure. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: fix use after free in midcomms commitAlexander Aring2023-01-231-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While working on processing dlm message in softirq context I experienced the following KASAN use-after-free warning: [ 151.760477] ================================================================== [ 151.761803] BUG: KASAN: use-after-free in dlm_midcomms_commit_mhandle+0x19d/0x4b0 [ 151.763414] Read of size 4 at addr ffff88811a980c60 by task lock_torture/1347 [ 151.765284] CPU: 7 PID: 1347 Comm: lock_torture Not tainted 6.1.0-rc4+ #2828 [ 151.766778] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-3.module+el8.7.0+16134+e5908aa2 04/01/2014 [ 151.768726] Call Trace: [ 151.769277] <TASK> [ 151.769748] dump_stack_lvl+0x5b/0x86 [ 151.770556] print_report+0x180/0x4c8 [ 151.771378] ? kasan_complete_mode_report_info+0x7c/0x1e0 [ 151.772241] ? dlm_midcomms_commit_mhandle+0x19d/0x4b0 [ 151.773069] kasan_report+0x93/0x1a0 [ 151.773668] ? dlm_midcomms_commit_mhandle+0x19d/0x4b0 [ 151.774514] __asan_load4+0x7e/0xa0 [ 151.775089] dlm_midcomms_commit_mhandle+0x19d/0x4b0 [ 151.775890] ? create_message.isra.29.constprop.64+0x57/0xc0 [ 151.776770] send_common+0x19f/0x1b0 [ 151.777342] ? remove_from_waiters+0x60/0x60 [ 151.778017] ? lock_downgrade+0x410/0x410 [ 151.778648] ? __this_cpu_preempt_check+0x13/0x20 [ 151.779421] ? rcu_lockdep_current_cpu_online+0x88/0xc0 [ 151.780292] _convert_lock+0x46/0x150 [ 151.780893] convert_lock+0x7b/0xc0 [ 151.781459] dlm_lock+0x3ac/0x580 [ 151.781993] ? 0xffffffffc0540000 [ 151.782522] ? torture_stop+0x120/0x120 [dlm_locktorture] [ 151.783379] ? dlm_scan_rsbs+0xa70/0xa70 [ 151.784003] ? preempt_count_sub+0xd6/0x130 [ 151.784661] ? is_module_address+0x47/0x70 [ 151.785309] ? torture_stop+0x120/0x120 [dlm_locktorture] [ 151.786166] ? 0xffffffffc0540000 [ 151.786693] ? lockdep_init_map_type+0xc3/0x360 [ 151.787414] ? 0xffffffffc0540000 [ 151.787947] torture_dlm_lock_sync.isra.3+0xe9/0x150 [dlm_locktorture] [ 151.789004] ? torture_stop+0x120/0x120 [dlm_locktorture] [ 151.789858] ? 0xffffffffc0540000 [ 151.790392] ? lock_torture_cleanup+0x20/0x20 [dlm_locktorture] [ 151.791347] ? delay_tsc+0x94/0xc0 [ 151.791898] torture_ex_iter+0xc3/0xea [dlm_locktorture] [ 151.792735] ? torture_start+0x30/0x30 [dlm_locktorture] [ 151.793606] lock_torture+0x177/0x270 [dlm_locktorture] [ 151.794448] ? torture_dlm_lock_sync.isra.3+0x150/0x150 [dlm_locktorture] [ 151.795539] ? lock_torture_stats+0x80/0x80 [dlm_locktorture] [ 151.796476] ? do_raw_spin_lock+0x11e/0x1e0 [ 151.797152] ? mark_held_locks+0x34/0xb0 [ 151.797784] ? _raw_spin_unlock_irqrestore+0x30/0x70 [ 151.798581] ? __kthread_parkme+0x79/0x110 [ 151.799246] ? trace_preempt_on+0x2a/0xf0 [ 151.799902] ? __kthread_parkme+0x79/0x110 [ 151.800579] ? preempt_count_sub+0xd6/0x130 [ 151.801271] ? __kasan_check_read+0x11/0x20 [ 151.801963] ? __kthread_parkme+0xec/0x110 [ 151.802630] ? lock_torture_stats+0x80/0x80 [dlm_locktorture] [ 151.803569] kthread+0x192/0x1d0 [ 151.804104] ? kthread_complete_and_exit+0x30/0x30 [ 151.804881] ret_from_fork+0x1f/0x30 [ 151.805480] </TASK> [ 151.806111] Allocated by task 1347: [ 151.806681] kasan_save_stack+0x26/0x50 [ 151.807308] kasan_set_track+0x25/0x30 [ 151.807920] kasan_save_alloc_info+0x1e/0x30 [ 151.808609] __kasan_slab_alloc+0x63/0x80 [ 151.809263] kmem_cache_alloc+0x1ad/0x830 [ 151.809916] dlm_allocate_mhandle+0x17/0x20 [ 151.810590] dlm_midcomms_get_mhandle+0x96/0x260 [ 151.811344] _create_message+0x95/0x180 [ 151.811994] create_message.isra.29.constprop.64+0x57/0xc0 [ 151.812880] send_common+0x129/0x1b0 [ 151.813467] _convert_lock+0x46/0x150 [ 151.814074] convert_lock+0x7b/0xc0 [ 151.814648] dlm_lock+0x3ac/0x580 [ 151.815199] torture_dlm_lock_sync.isra.3+0xe9/0x150 [dlm_locktorture] [ 151.816258] torture_ex_iter+0xc3/0xea [dlm_locktorture] [ 151.817129] lock_torture+0x177/0x270 [dlm_locktorture] [ 151.817986] kthread+0x192/0x1d0 [ 151.818518] ret_from_fork+0x1f/0x30 [ 151.819369] Freed by task 1336: [ 151.819890] kasan_save_stack+0x26/0x50 [ 151.820514] kasan_set_track+0x25/0x30 [ 151.821128] kasan_save_free_info+0x2e/0x50 [ 151.821812] __kasan_slab_free+0x107/0x1a0 [ 151.822483] kmem_cache_free+0x204/0x5e0 [ 151.823152] dlm_free_mhandle+0x18/0x20 [ 151.823781] dlm_mhandle_release+0x2e/0x40 [ 151.824454] rcu_core+0x583/0x1330 [ 151.825047] rcu_core_si+0xe/0x20 [ 151.825594] __do_softirq+0xf4/0x5c2 [ 151.826450] Last potentially related work creation: [ 151.827238] kasan_save_stack+0x26/0x50 [ 151.827870] __kasan_record_aux_stack+0xa2/0xc0 [ 151.828609] kasan_record_aux_stack_noalloc+0xb/0x20 [ 151.829415] call_rcu+0x4c/0x760 [ 151.829954] dlm_mhandle_delete+0x97/0xb0 [ 151.830718] dlm_process_incoming_buffer+0x2fc/0xb30 [ 151.831524] process_dlm_messages+0x16e/0x470 [ 151.832245] process_one_work+0x505/0xa10 [ 151.832905] worker_thread+0x67/0x650 [ 151.833507] kthread+0x192/0x1d0 [ 151.834046] ret_from_fork+0x1f/0x30 [ 151.834900] The buggy address belongs to the object at ffff88811a980c30 which belongs to the cache dlm_mhandle of size 88 [ 151.836894] The buggy address is located 48 bytes inside of 88-byte region [ffff88811a980c30, ffff88811a980c88) [ 151.839007] The buggy address belongs to the physical page: [ 151.839904] page:0000000076cf5d62 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11a980 [ 151.841378] flags: 0x8000000000000200(slab|zone=2) [ 151.842141] raw: 8000000000000200 0000000000000000 dead000000000122 ffff8881089b43c0 [ 151.843401] raw: 0000000000000000 0000000000220022 00000001ffffffff 0000000000000000 [ 151.844640] page dumped because: kasan: bad access detected [ 151.845822] Memory state around the buggy address: [ 151.846602] ffff88811a980b00: fb fb fb fb fc fc fc fc fa fb fb fb fb fb fb fb [ 151.847761] ffff88811a980b80: fb fb fb fc fc fc fc fa fb fb fb fb fb fb fb fb [ 151.848921] >ffff88811a980c00: fb fb fc fc fc fc fa fb fb fb fb fb fb fb fb fb [ 151.850076] ^ [ 151.851085] ffff88811a980c80: fb fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb [ 151.852269] ffff88811a980d00: fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb fc [ 151.853428] ================================================================== [ 151.855618] Disabling lock debugging due to kernel taint It is accessing a mhandle in dlm_midcomms_commit_mhandle() and the mhandle was freed by a call_rcu() call in dlm_process_incoming_buffer(), dlm_mhandle_delete(). It looks like it was freed because an ack of this message was received. There is a short race between committing the dlm message to be transmitted and getting an ack back. If the ack is faster than returning from dlm_midcomms_commit_msg_3_2(), then we run into a use-after free because we still need to reference the mhandle when calling srcu_read_unlock(). To avoid that, we don't allow that mhandle to be freed between dlm_midcomms_commit_msg_3_2() and srcu_read_unlock() by using rcu read lock. We can do that because mhandle is protected by rcu handling. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: parallelize lowcomms socket handlingAlexander Aring2022-11-211-12/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is rework of lowcomms handling, the main goal was here to handle recvmsg() and sendpage() to run parallel. Parallel in two senses: 1. per connection and 2. that recvmsg()/sendpage() doesn't block each other. Currently recvmsg()/sendpage() cannot run parallel because two workqueues "dlm_recv" and "dlm_send" are ordered workqueues. That means only one work item can be executed. The amount of queue items will be increased about the amount of nodes being inside the cluster. The current two workqueues for sending and receiving can also block each other if the same connection is executed at the same time in dlm_recv and dlm_send workqueue because a per connection mutex for the socket handling. To make it more parallel we introduce one "dlm_io" workqueue which is not an ordered workqueue, the amount of workers are not limited. Due per connection flags SEND/RECV pending we schedule workers ordered per connection and per send and receive task. To get rid of the mutex blocking same workers to do socket handling we switched to a semaphore which handles socket operations as read lock and sock releases as write operations, to prevent sock_release() being called while the socket is being used. There might be more optimization removing the semaphore and replacing it with other synchronization mechanism, however due other circumstances e.g. othercon behaviour it seems complicated to doing this change. I added comments to remove the othercon handling and moving to a different synchronization mechanism as this is done. We need to do that to the next dlm major version upgrade because it is not backwards compatible with the current connect mechanism. The processing of dlm messages need to be still handled by a ordered workqueue. An dlm_process ordered workqueue was introduced which gets filled by the receive worker. This is probably the next bottleneck of DLM but the application can't currently parse dlm messages parallel. A comment was introduced to lift the workqueue context of dlm processing in a non-sleepable softirq to get messages processing done fast. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove socket shutdown handlingAlexander Aring2022-11-211-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") we have functionality like TCP offers for half-closed sockets on dlm application protocol layer. This feature is required because the cluster manager events about leaving resource memberships can be locally already occurred but other cluster nodes having a pending leaving membership over the cluster manager protocol happening. In this time the local dlm node already shutdown it's connection and don't transmit anymore any new dlm messages, but however it still needs to be able to accept dlm messages because the pending leave membership request of the cluster manager protocol which the dlm kernel implementation has no control about it. We have this functionality on the application for two reasons, the main reason is that SCTP does not support such functionality on socket layer. But we can do it inside application layer. Another small issue is that this feature is broken in the TCP world because some NAT devices does not implement such functionality correctly. This is the same reason why the reliable connection session layer in DLM exists. We give up on middle devices in the networking which sends e.g. TCP resets out. In DLM we cannot have any message dropping and we ensure it over a session layer that it can't happen. Back to the half-closed grace shutdown handling. It's not necessary anymore to do it on socket layer (which is only support for TCP sockets) because we do it on application layer. This patch removes this handling, if there are still issues then we have a problem on the application layer for such handling. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: add midcomms init/start functionsAlexander Aring2022-11-211-1/+16
| | | | | | | | | | | | This patch introduces leftovers of init, start, stop and exit functionality. The dlm application layer should always call the midcomms layer which getting aware of such event and redirect it to the lowcomms layer. Some functionality which is currently handled inside the start functionality of midcomms and lowcomms should be handled in the init functionality as it only need to be initialized once when dlm is loaded. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: add dst nodeid for msg tracingAlexander Aring2022-11-211-4/+6
| | | | | | | | | | | | | | | | | In DLM when we send a dlm message it is easy to add the lock resource name, but additional lookup is required when to trace the receive message side. The idea here is to move the lookup work to the user by using a lookup to find the right send message with recv message. As note DLM can't drop any message which is guaranteed by a special session layer. For doing the lookup a 3 tupel is required as an unique identification which is dst nodeid, src nodeid and sequence number. This patch adds the destination nodeid to the dlm message trace points. The source nodeid is given by the h_nodeid field inside the header. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use WARN_ON_ONCE() instead of WARN_ON()Alexander Aring2022-11-081-9/+9
| | | | | | | | | To not get the console spammed about WARN_ON() of invalid states in the dlm midcomms hot path handling we switch to WARN_ON_ONCE() to get it only once that there might be an issue with the midcomms state handling. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: allow different allocation context per _create_messageAlexander Aring2022-11-081-1/+1
| | | | | | | | | This patch allows to give the use control about the allocation context based on a per message basis. Currently all messages forced to be created under GFP_NOFS context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fd: dlm: trace send/recv of dlm message and rcomAlexander Aring2022-11-081-4/+41
| | | | | | | | | | | | | | | | | | | | | | This patch adds tracepoints for send and recv cases of dlm messages and dlm rcom messages. In case of send and dlm message we add the dlm rsb resource name this dlm messages belongs to. This has the advantage to follow dlm messages on a per lock basis. In case of recv message the resource name can be extracted by follow the send message sequence number. The dlm message DLM_MSG_PURGE doesn't belong to a lock request and will not set the resource name in a dlm_message trace. The same for all rcom messages. There is additional handling required for this debugging functionality which is tried to be small as possible. Also the midcomms layer gets aware of lock resource names, for now this is required to make a connection between sequence number and lock resource names. It is for debugging purpose only. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use packet in dlm_mhandleAlexander Aring2022-11-081-3/+3
| | | | | | | | | To allow more than just dereferencing the inner header we directly point to the inner dlm packet which allows us to dereference the header, rcom or message structure. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* dlm: check required context while closeAlexander Aring2022-04-061-0/+3
| | | | | | | | | | | | | | This patch adds a WARN_ON() check to validate the right context while dlm_midcomms_close() is called. Even before commit 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") in this context dlm_lowcomms_close() flushes all ongoing transmission triggered by dlm application stack. If we do that, it's required that no new message will be triggered by the dlm application stack. The function dlm_midcomms_close() is not called often so we can check if all lockspaces are in such context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* dlm: move conversion to compile timeAlexander Aring2022-04-061-10/+10
| | | | | | | | | | This patch is a cleanup to move the byte order conversion to compile time. In a simple comparison like this it's possible to move it to static values so the compiler will always convert those values at compile time. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* dlm: use __le types for dlm headerAlexander Aring2022-04-061-16/+12
| | | | | | | | | | | | | | This patch changes to use __le types directly in the dlm header structure which is casted at the right dlm message buffer positions. The main goal what is reached here is to remove sparse warnings regarding to host to little byte order conversion or vice versa. Leaving those sparse issues ignored and always do it in out/in functionality tends to leave it unknown in which byte order the variable is being handled. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* dlm: add __CHECKER__ for false positivesAlexander Aring2022-04-061-0/+10
| | | | | | | | | | | | | | | | | | This patch will adds #ifndef __CHECKER__ for false positives warnings about an imbalance lock/unlock srcu handling. Which are shown by running sparse checks: fs/dlm/midcomms.c:1065:20: warning: context imbalance in 'dlm_midcomms_get_mhandle' - wrong count at exit Using __CHECKER__ will tell sparse to ignore these sections. Those imbalances are false positive because from upper layer it is always required to call a function in sequence, e.g. if dlm_midcomms_get_mhandle() is successful there must be a dlm_midcomms_commit_mhandle() call afterwards. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: memory cache for midcomms hotpathAlexander Aring2021-12-071-6/+15
| | | | | | | | This patch will introduce a kmem cache for allocating message handles which are needed for midcomms layer to take track of lowcomms messages. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm:Remove unneeded semicolonZhang Mingyu2021-11-051-1/+1
| | | | | | | | | Eliminate the following coccinelle check warning: fs/dlm/midcomms.c:972:2-3 Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Zhang Mingyu <zhang.mingyu@zte.com.cn> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: add debugfs rawmsg send functionalityAlexander Aring2021-11-021-0/+48
| | | | | | | | | | | | | | | | | | | | | This patch adds a dlm functionality to send a raw dlm message to a specific cluster node. This raw message can be build by user space and send out by writing the message to "rawmsg" dlm debugfs file. There is a in progress scapy dlm module which provides a easy build of DLM messages in user space. For example: DLM(h_cmd=3, o_nextcmd=1, h_nodeid=1, h_lockspace=0xe4f48a18, ...) The goal is to provide an easy reproducable state to crash DLM or to fuzz the DLM kernel stack if there are possible ways to crash it. Note: that if the sequence number is zero and dlm version is not set to 3.1 the kernel will automatic will set a right sequence number, otherwise DLM stack testing is not possible. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: let handle callback data as voidAlexander Aring2021-11-021-1/+3
| | | | | | | | | This patch changes the dlm_lowcomms_new_msg() function pointer private data from "struct mhandle *" to "void *" to provide different structures than just "struct mhandle". Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: move version conversion to compile timeAlexander Aring2021-11-021-3/+3
| | | | | | | | This patch moves version conversion to little endian from a runtime variable to compile time constant. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: debug improvements print nodeidAlexander Aring2021-11-021-2/+2
| | | | | | | | This patch improves the debug output for midcomms layer by also printing out the nodeid where users counter belongs to. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: implement delayed ack handlingAlexander Aring2021-08-191-8/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes that we don't ack each message. Lowcomms will take care about to send an ack back after a bulk of messages was processed. Currently it's only when the whole receive buffer was processed, there might better positions to send an ack back but only the lowcomms implementation know when there are more data to receive. This patch has also disadvantages that we might retransmit more on errors, however this is a very rare case. Tested with make_panic on gfs2 with three nodes by running: trace-cmd record -p function -l 'dlm_send_ack' sleep 100 and trace-cmd report | wc -l Before patch: - 20548 - 21376 - 21398 After patch: - 18338 - 20679 - 19949 Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: invalid buffer access in lookup errorAlexander Aring2021-06-111-2/+17
| | | | | | | | | | | This patch will evaluate the message length if a dlm opts header can fit in before accessing it if a node lookup fails. The invalid sequence error means that the version detection failed and an unexpected message arrived. For debugging such situation the type of arrived message is important to know. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: fix race in mhandle deletionAlexander Aring2021-06-111-14/+21
| | | | | | | | | | | This patch fixes a race between mhandle deletion in case of receiving an acknowledge and flush of all pending mhandle in cases of an timeout or resetting node states. Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Reported-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: rename socket and app buffer definesAlexander Aring2021-06-021-2/+2
| | | | | | | | | | | | This patch renames DEFAULT_BUFFER_SIZE to DLM_MAX_SOCKET_BUFSIZE and LOWCOMMS_MAX_TX_BUFFER_LEN to DLM_MAX_APP_BUFSIZE as they are proper names to define what's behind those values. The DLM_MAX_SOCKET_BUFSIZE defines the maximum size of buffer which can be handled on socket layer, the DLM_MAX_APP_BUFSIZE defines the maximum size of buffer which can be handled by the DLM application layer. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: Fix spelling mistake "stucked" -> "stuck"Colin Ian King2021-05-261-2/+2
| | | | | | | There are spelling mistake in log messages. Fix these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: add midcomms debugfs functionalityAlexander Aring2021-05-251-0/+27
| | | | | | | | | | This patch adds functionality to debug midcomms per connection state inside a comms directory which is similar like dlm configfs. Currently there exists the possibility to read out two attributes which is the send queue counter and the version of each midcomms node state. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: add reliable connection if reconnectAlexander Aring2021-05-251-44/+1246
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduce to make a tcp lowcomms connection reliable even if reconnects occurs. This is done by an application layer re-transmission handling and sequence numbers in dlm protocols. There are three new dlm commands: DLM_OPTS: This will encapsulate an existing dlm message (and rcom message if they don't have an own application side re-transmission handling). As optional handling additional tlv's (type length fields) can be appended. This can be for example a sequence number field. However because in DLM_OPTS the lockspace field is unused and a sequence number is a mandatory field it isn't made as a tlv and we put the sequence number inside the lockspace id. The possibility to add optional options are still there for future purposes. DLM_ACK: Just a dlm header to acknowledge the receive of a DLM_OPTS message to it's sender. DLM_FIN: This provides a 4 way handshake for connection termination inclusive support for half-closed connections. It's provided on application layer because SCTP doesn't support half-closed sockets, the shutdown() call can interrupted by e.g. TCP resets itself and a hard logic to implement it because the othercon paradigm in lowcomms. The 4-way termination handshake also solve problems to synchronize peer EOF arrival and that the cluster manager removes the peer in the node membership handling of DLM. In some cases messages can be still transmitted in this time and we need to wait for the node membership event. To provide a reliable connection the node will retransmit all unacknowledges message to it's peer on reconnect. The receiver will then filtering out the next received message and drop all messages which are duplicates. As RCOM_STATUS and RCOM_NAMES messages are the first messages which are exchanged and they have they own re-transmission handling, there exists logic that these messages must be first. If these messages arrives we store the dlm version field. This handling is on DLM 3.1 and after this patch 3.2 the same. A backwards compatibility handling has been added which seems to work on tests without tcpkill, however it's not recommended to use DLM 3.1 and 3.2 at the same time, because DLM 3.2 tries to fix long term bugs in the DLM protocol. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: make buffer handling per msgAlexander Aring2021-05-251-2/+5
| | | | | | | | | | | | | | | | | | This patch makes the void pointer handle for lowcomms functionality per message and not per page allocation entry. A refcount handling for the handle was added to keep the message alive until the user doesn't need it anymore. There exists now a per message callback which will be called when allocating a new buffer. This callback will be guaranteed to be called according the order of the sending buffer, which can be used that the caller increments a sequence number for the dlm message handle. For transition process we cast the dlm_mhandle to dlm_msg and vice versa until the midcomms layer will implement a specific dlm_mhandle structure. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: add more midcomms hooksAlexander Aring2021-05-251-1/+30
| | | | | | | | | | | | | | | | | | | | | | This patch prepares hooks to redirect to the midcomms layer which will be used by the midcomms re-transmit handling. There exists the new concept of stateless buffers allocation and commits. This can be used to bypass the midcomms re-transmit handling. It is used by RCOM_STATUS and RCOM_NAMES messages, because they have their own ping-like re-transmit handling. As well these two messages will be used to determine the DLM version per node, because these two messages are per observation the first messages which are exchanged. Cluster manager events for node membership are added to add support for half-closed connections in cases that the peer connection get to an end of file but DLM still holds membership of the node. In this time DLM can still trigger new message which we should allow. After the cluster manager node removal event occurs it safe to close the connection. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove unaligned memory access handlingAlexander Aring2021-03-091-14/+12
| | | | | | | | | | | | | | | | | This patch removes unaligned memory access handling for receiving midcomms messages. This handling will not fix the unaligned memory access in general. All messages should be length aligned to 8 bytes, there exists cases where this isn't the case. It's part of the sending handling to not send such messages. As the sending handling itself, with the internal allocator of page buffers, can occur in unaligned memory access of dlm message fields we just ignore that problem for now as it seems this code is used by architecture which can handle it. This patch adds a comment to take care about that problem in a major bump of dlm protocol. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: check on minimum msglen sizeAlexander Aring2021-03-091-3/+4
| | | | | | | | | | This patch adds an additional check for minimum dlm header size which is an invalid dlm message and signals a broken stream. A msglen field cannot be less than the dlm header size because the field is inclusive header lengths. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: rework receive handlingAlexander Aring2020-09-291-83/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch reworks the current receive handling of dlm. As I tried to change the send handling to fix reorder issues I took a look into the receive handling and simplified it, it works as the following: Each connection has a preallocated receive buffer with a minimum length of 4096. On receive, the upper layer protocol will process all dlm message until there is not enough data anymore. If there exists "leftover" data at the end of the receive buffer because the dlm message wasn't fully received it will be copied to the begin of the preallocated receive buffer. Next receive more data will be appended to the previous "leftover" data and processing will begin again. This will remove a lot of code of the current mechanism. Inside the processing functionality we will ensure with a memmove() that the dlm message should be memory aligned. To have a dlm message always started at the beginning of the buffer will reduce some amount of memmove() calls because src and dest pointers are the same. The cluster attribute "buffer_size" becomes a new meaning, it's now the size of application layer receive buffer size. If this is changed during runtime the receive buffer will be reallocated. It's important that the receive buffer size has at minimum the size of the maximum possible dlm message size otherwise the received message cannot be placed inside the receive buffer size. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 193Thomas Gleixner2019-05-301-3/+1
| | | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this copyrighted material is made available to anyone wishing to use modify copy or redistribute it subject to the terms and conditions of the gnu general public license v 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 45 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528170027.342746075@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dlm: fix up memory allocation flagsSteven Whitehouse2008-12-231-1/+1
| | | | | | | | | | | Use ls_allocation for memory allocations, which a cluster fs sets to GFP_NOFS. Use GFP_NOFS for allocations when no lockspace struct is available. Taking dlm locks needs to avoid calling back into the cluster fs because write-out can require taking dlm locks. Cc: Christine Caulfield <ccaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* dlm: dlm_process_incoming_buffer() fixesAl Viro2008-02-041-13/+20
| | | | | | | | | | | * check that length is large enough to cover the non-variable part of message or rcom resp. (after checking that it's large enough to cover the header, of course). * kill more pointless casts Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Teigland <teigland@redhat.com>