summaryrefslogtreecommitdiffstats
path: root/net/sunrpc/xprtrdma/rpc_rdma.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6Linus Torvalds2020-08-091-21/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS server updates from Chuck Lever: "Highlights: - Support for user extended attributes on NFS (RFC 8276) - Further reduce unnecessary NFSv4 delegation recalls Notable fixes: - Fix recent krb5p regression - Address a few resource leaks and a rare NULL dereference Other: - De-duplicate RPC/RDMA error handling and other utility functions - Replace storage and display of kernel memory addresses by tracepoints" * tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6: (38 commits) svcrdma: CM event handler clean up svcrdma: Remove transport reference counting svcrdma: Fix another Receive buffer leak SUNRPC: Refresh the show_rqstp_flags() macro nfsd: netns.h: delete a duplicated word SUNRPC: Fix ("SUNRPC: Add "@len" parameter to gss_unwrap()") nfsd: avoid a NULL dereference in __cld_pipe_upcall() nfsd4: a client's own opens needn't prevent delegations nfsd: Use seq_putc() in two functions svcrdma: Display chunk completion ID when posting a rw_ctxt svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send() svcrdma: Introduce Send completion IDs svcrdma: Record Receive completion ID in svc_rdma_decode_rqst svcrdma: Introduce Receive completion IDs svcrdma: Introduce infrastructure to support completion IDs svcrdma: Add common XDR encoders for RDMA and Read segments svcrdma: Add common XDR decoders for RDMA and Read segments SUNRPC: Add helpers for decoding list discriminators symbolically svcrdma: Remove declarations for functions long removed svcrdma: Clean up trace_svcrdma_send_failed() tracepoint ...
| * svcrdma: Add common XDR encoders for RDMA and Read segmentsChuck Lever2020-07-131-11/+3
| | | | | | | | | | | | Clean up: De-duplicate some code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Add common XDR decoders for RDMA and Read segmentsChuck Lever2020-07-131-4/+1
| | | | | | | | | | | | Clean up: De-duplicate some code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Add helpers for decoding list discriminators symbolicallyChuck Lever2020-07-131-6/+6
| | | | | | | | | | | | | | | | | | Use these helpers in a few spots to demonstrate their use. The remaining open-coded discriminator checks in rpcrdma will be addressed in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | xprtrdma: fix incorrect header size calculationsColin Ian King2020-07-151-2/+2
|/ | | | | | | | | | | | | Currently the header size calculations are using an assignment operator instead of a += operator when accumulating the header size leading to incorrect sizes. Fix this by using the correct operator. Addresses-Coverity: ("Unused value") Fixes: 302d3deb2068 ("xprtrdma: Prevent inline overflow") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Fix handling of RDMA_ERROR repliesChuck Lever2020-06-221-6/+3
| | | | | | | | | | | | | | | | | | The RPC client currently doesn't handle ERR_CHUNK replies correctly. rpcrdma_complete_rqst() incorrectly passes a negative number to xprt_complete_rqst() as the number of bytes copied. Instead, set task->tk_status to the error value, and return zero bytes copied. In these cases, return -EIO rather than -EREMOTEIO. The RPC client's finite state machine doesn't know what to do with -EREMOTEIO. Additional clean ups: - Don't double-count RDMA_ERROR replies - Remove a stale comment Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@kernel.vger.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* SUNRPC: receive buffer size estimation values almost never changeChuck Lever2020-06-111-2/+2
| | | | | | | | | | | | | | | | | | Avoid unnecessary cache sloshing by placing the buffer size estimation update logic behind an atomic bit flag. The size of GSS information included in each wrapped Reply does not change during the lifetime of a GSS context. Therefore, the au_rslack and au_ralign fields need to be updated only once after establishing a fresh GSS credential. Thus a slack size update must occur after a cred is created, duplicated, renewed, or expires. I'm not sure I have this exactly right. A trace point is introduced to track updates to these variables to enable troubleshooting the problem if I missed a spot. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}Chuck Lever2020-04-201-4/+11
| | | | | | | | | | These new helpers do not return 0 on success, they return the encoded size. Thus they are not a drop-in replacement for the old helpers. Fixes: 5c266df52701 ("SUNRPC: Add encoders for list item discriminators") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2020-04-071-16/+16
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - Fix a page leak in nfs_destroy_unlinked_subrequests() - Fix use-after-free issues in nfs_pageio_add_request() - Fix new mount code constant_table array definitions - finish_automount() requires us to hold 2 refs to the mount record Features: - Improve the accuracy of telldir/seekdir by using 64-bit cookies when possible. - Allow one RDMA active connection and several zombie connections to prevent blocking if the remote server is unresponsive. - Limit the size of the NFS access cache by default - Reduce the number of references to credentials that are taken by NFS - pNFS files and flexfiles drivers now support per-layout segment COMMIT lists. - Enable partial-file layout segments in the pNFS/flexfiles driver. - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from the DS using the layouterror mechanism. Bugfixes and cleanups: - SUNRPC: Fix krb5p regressions - Don't specify NFS version in "UDP not supported" error - nfsroot: set tcp as the default transport protocol - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid() - alloc_nfs_open_context() must use the file cred when available - Fix locking when dereferencing the delegation cred - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails - Various clean ups of the NFS O_DIRECT commit code - Clean up RDMA connect/disconnect - Replace zero-length arrays with C99-style flexible arrays" * tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits) NFS: Clean up process of marking inode stale. SUNRPC: Don't start a timer on an already queued rpc task NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn() NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode() NFS: Beware when dereferencing the delegation cred NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout NFS: finish_automount() requires us to hold 2 refs to the mount record NFS: Fix a few constant_table array definitions NFS: Try to join page groups before an O_DIRECT retransmission NFS: Refactor nfs_lock_and_join_requests() NFS: Reverse the submission order of requests in __nfs_pageio_add_request() NFS: Clean up nfs_lock_and_join_requests() NFS: Remove the redundant function nfs_pgio_has_mirroring() NFS: Fix memory leaks in nfs_pageio_stop_mirroring() NFS: Fix a request reference leak in nfs_direct_write_clear_reqs() NFS: Fix use-after-free issues in nfs_pageio_add_request() NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests() NFS: Fix a page leak in nfs_destroy_unlinked_subrequests() NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio() pNFS/flexfiles: Specify the layout segment range in LAYOUTGET ...
| * xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprtChuck Lever2020-03-271-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the rpcrdma_xprt_disconnect() function so that it no longer waits for the DISCONNECTED event. This prevents blocking if the remote is unresponsive. In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is detached. Upon return from rpcrdma_xprt_disconnect(), the transport (r_xprt) is ready immediately for a new connection. The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now handled almost identically. However, because the lifetimes of rpcrdma_xprt structures and rpcrdma_ep structures are now independent, creating an rpcrdma_ep needs to take a module ref count. The ep now owns most of the hardware resources for a transport. Also, a kref is needed to ensure that rpcrdma_ep sticks around long enough for the cm_event_handler to finish. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Merge struct rpcrdma_ia into struct rpcrdma_epChuck Lever2020-03-271-16/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I eventually want to allocate rpcrdma_ep separately from struct rpcrdma_xprt so that on occasion there can be more than one ep per xprt. The new struct rpcrdma_ep will contain all the fields currently in rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings for the connection, in addition to per-connection settings negotiated with the remote. Take this opportunity to rename the existing ep fields from rep_* to re_* to disambiguate these from struct rpcrdma_rep. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | SUNRPC: Add encoders for list item discriminatorsChuck Lever2020-03-161-31/+5
|/ | | | | | | Clean up. These are taken from the client-side RPC/RDMA transport to a more global header file so they can be used elsewhere. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* xprtrdma: Allocate and map transport header buffers at connect timeChuck Lever2020-01-151-7/+3
| | | | | | | | | | | | | | | | | | | | | Currently the underlying RDMA device is chosen at transport set-up time. But it will soon be at connect time instead. The maximum size of a transport header is based on device capabilities. Thus transport header buffers have to be allocated _after_ the underlying device has been chosen (via address and route resolution); ie, in the connect worker. Thus, move the allocation of transport header buffers to the connect worker, after the point at which the underlying RDMA device has been chosen. This also means the RDMA device is available to do a DMA mapping of these buffers at connect time, instead of in the hot I/O path. Make that optimization as well. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Eliminate per-transport "max pages"Chuck Lever2020-01-151-1/+1
| | | | | | | | | | | | | | To support device hotplug and migrating a connection between devices of different capabilities, we have to guarantee that all in-kernel devices can support the same max NFS payload size (1 megabyte). This means that possibly one or two in-tree devices are no longer supported for NFS/RDMA because they cannot support 1MB rsize/wsize. The only one I confirmed was cxgb3, but it has already been removed from the kernel. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Refactor initialization of ep->rep_max_requestsChuck Lever2020-01-151-3/+3
| | | | | | | | | | | | Clean up: there is no need to keep two copies of the same value. Also, in subsequent patches, rpcrdma_ep_create() will be called in the connect worker rather than at set-up time. Minor fix: Initialize the transport's sendctx to the value based on the capabilities of the underlying device, not the maximum setting. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Eliminate ri_max_send_sgesChuck Lever2020-01-151-1/+1
| | | | | | | | Clean-up. The max_send_sge value also happens to be stored in ep->rep_attr. Let's keep just a single copy. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Replace dprintk() in rpcrdma_update_connect_private()Chuck Lever2019-10-241-4/+0
| | | | | | | | | Clean up: Use a single trace point to record each connection's negotiated inline thresholds and the computed maximum byte size of transport headers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Refine trace_xprtrdma_fixupChuck Lever2019-10-241-3/+2
| | | | | | | Slightly reduce overhead and display more useful information. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Pull up sometimesChuck Lever2019-10-241-5/+77
| | | | | | | | | | | | | | | | | | On some platforms, DMA mapping part of a page is more costly than copying bytes. Restore the pull-up code and use that when we think it's going to be faster. The heuristic for now is to pull-up when the size of the RPC message body fits in the buffer underlying the head iovec. Indeed, not involving the I/O MMU can help the RPC/RDMA transport scale better for tiny I/Os across more RDMA devices. This is because interaction with the I/O MMU is eliminated, as is handling a Send completion, for each of these small I/Os. Without the explicit unmapping, the NIC no longer needs to do a costly internal TLB shoot down for buffers that are just a handful of bytes. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Refactor rpcrdma_prepare_msg_sges()Chuck Lever2019-10-241-117/+146
| | | | | | | | | Refactor: Replace spaghetti with code that makes it plain what needs to be done for each rtype. This makes it easier to add features and optimizations. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Move the rpcrdma_sendctx::sc_wr fieldChuck Lever2019-10-241-3/+6
| | | | | | | | | | | Clean up: This field is not needed in the Send completion handler, so it can be moved to struct rpcrdma_req to reduce the size of struct rpcrdma_sendctx, and to reduce the amount of memory that is sloshed between the sending process and the Send completion process. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove rpcrdma_sendctx::sc_deviceChuck Lever2019-10-241-2/+2
| | | | | | | | Micro-optimization: Save eight bytes in a frequently allocated structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Manage MRs in context of a single connectionChuck Lever2019-10-241-8/+1
| | | | | | | | | | | | | | | | | | | | | MRs are now allocated on demand so we can safely throw them away on disconnect. This way an idle transport can disconnect and it won't pin hardware MR resources. Two additional changes: - Now that all MRs are destroyed on disconnect, there's no need to check during header marshaling if a req has MRs to recycle. Each req is sent only once per connection, and now rl_registered is guaranteed to be empty when rpcrdma_marshal_req is invoked. - Because MRs are now destroyed in a WQ_MEM_RECLAIM context, they also must be allocated in a WQ_MEM_RECLAIM context. This reduces the likelihood that device driver memory allocation will trigger memory reclaim during NFS writeback. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Close window between waking RPC senders and posting ReceivesChuck Lever2019-10-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | A recent clean up attempted to separate Receive handling and RPC Reply processing, in the name of clean layering. Unfortunately, we can't do this because the Receive Queue has to be refilled _after_ the most recent credit update from the responder is parsed from the transport header, but _before_ we wake up the next RPC sender. That is right in the middle of rpcrdma_reply_handler(). Usually this isn't a problem because current responder implementations don't vary their credit grant. The one exception is when a connection is established: the grant goes from one to a much larger number on the first Receive. The requester MUST post enough Receives right then so that any outstanding requests can be sent without risking RNR and connection loss. Fixes: 6ceea36890a0 ("xprtrdma: Refactor Receive accounting") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Initialize rb_credits in one placeChuck Lever2019-10-241-6/+36
| | | | | | | | | | | | Clean up/code de-duplication. Nit: RPC_CWNDSHIFT is incorrect as the initial value for xprt->cwnd. This mistake does not appear to have operational consequences, since the cwnd value is replaced with a valid value upon the first Receive completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Clear xprt->reestablish_timeout on closeChuck Lever2019-08-261-2/+6
| | | | | | | | | Ensure that the re-establishment delay does not grow exponentially on each good reconnect. This probably should have been part of commit 675dd90ad093 ("xprtrdma: Modernize ops->connect"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Recycle MRs after disconnectChuck Lever2019-08-261-1/+1
| | | | | | | | | | The optimization done in "xprtrdma: Simplify rpcrdma_mr_pop" was a bit too optimistic. MRs left over after a reconnect still need to be recycled, not added back to the free list, since they could be in flight or actually fully registered. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Inline XDR chunk encoder functionsChuck Lever2019-08-211-9/+12
| | | | | | | | | | | Micro-optimization: Save the cost of three function calls during transport header encoding. These were "noinline" before to generate more meaningful call stacks during debugging, but this code is now pretty stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Cache free MRs in each rpcrdma_reqChuck Lever2019-08-211-3/+8
| | | | | | | | | | | | | | Instead of a globally-contended MR free list, cache MRs in each rpcrdma_req as they are released. This means acquiring and releasing an MR will be lock-free in the common case, even outside the transport send lock. The original idea of per-rpcrdma_req MR free lists was suggested by Shirley Ma <shirley.ma@oracle.com> several years ago. I just now figured out how to make that idea work with on-demand MR allocation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Move rpcrdma_mr_get out of frwr_mapChuck Lever2019-08-201-6/+24
| | | | | | | | | | | | | Refactor: Retrieve an MR and handle error recovery entirely in rpc_rdma.c, as this is not a device-specific function. Note that since commit 89f90fe1ad8b ("SUNRPC: Allow calls to xprt_transmit() to drain the entire transmit queue"), the xprt_transmit function handles the cond_resched. The transport no longer has to do this itself. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Simplify rpcrdma_mr_popChuck Lever2019-08-201-6/+1
| | | | | | | | Clean up: rpcrdma_mr_pop call sites check if the list is empty first. Let's replace the list_empty with less costly logic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* Merge tag 'nfs-rdma-for-5.3-1' of ↵Trond Myklebust2019-07-121-85/+63
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA client updates for 5.3 New features: - Add a way to place MRs back on the free list - Reduce context switching - Add new trace events Bugfixes and cleanups: - Fix a BUG when tracing is enabled with NFSv4.1 - Fix a use-after-free in rpcrdma_post_recvs - Replace use of xdr_stream_pos in rpcrdma_marshal_req - Fix occasional transport deadlock - Fix show_nfs_errors macros, other tracing improvements - Remove RPCRDMA_REQ_F_PENDING and fr_state - Various simplifications and refactors
| * xprtrdma: Refactor chunk encodingChuck Lever2019-07-091-20/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up. Move the "not present" case into the individual chunk encoders. This improves code organization and readability. The reason for the original organization was to optimize for the case where there there are no chunks. The optimization turned out to be inconsequential, so let's err on the side of code readability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Wake RPCs directly in rpcrdma_wc_send pathChuck Lever2019-07-091-38/+23
| | | | | | | | | | | | | | | | Eliminate a context switch in the path that handles RPC wake-ups when a Receive completion has to wait for a Send completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Reduce context switching due to Local InvalidationChuck Lever2019-07-091-31/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit ba69cd122ece ("xprtrdma: Remove support for FMR memory registration"), FRWR is the only supported memory registration mode. We can take advantage of the asynchronous nature of FRWR's LOCAL_INV Work Requests to get rid of the completion wait by having the LOCAL_INV completion handler take care of DMA unmapping MRs and waking the upper layer RPC waiter. This eliminates two context switches when local invalidation is necessary. As a side benefit, we will no longer need the per-xprt deferred completion work queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Add mechanism to place MRs back on the free listChuck Lever2019-07-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a marshal operation fails, any MRs that were already set up for that request are recycled. Recycling releases MRs and creates new ones, which is expensive. Since commit f2877623082b ("xprtrdma: Chain Send to FastReg WRs") was merged, recycling FRWRs is unnecessary. This is because before that commit, frwr_map had already posted FAST_REG Work Requests, so ownership of the MRs had already been passed to the NIC and thus dealing with them had to be delayed until they completed. Since that commit, however, FAST_REG WRs are posted at the same time as the Send WR. This means that if marshaling fails, we are certain the MRs are safe to simply unmap and place back on the free list because neither the Send nor the FAST_REG WRs have been posted yet. The kernel still has ownership of the MRs at this point. This reduces the total number of MRs that the xprt has to create under heavy workloads and makes the marshaling logic less brittle. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Remove fr_stateChuck Lever2019-07-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that both the Send and Receive completions are handled in process context, it is safe to DMA unmap and return MRs to the free or recycle lists directly in the completion handlers. Doing this means rpcrdma_frwr no longer needs to track the state of each MR, meaning that a VALID or FLUSHED MR can no longer appear on an xprt's MR free list. Thus there is no longer a need to track the MR's registration state in rpcrdma_frwr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flagChuck Lever2019-07-091-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | Commit 9590d083c1bb ("xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they can be left in the pending requests queue while they are being processed without introducing a race between ->buf_free and the transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Fix occasional transport deadlockChuck Lever2019-07-091-14/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under high I/O workloads, I've noticed that an RPC/RDMA transport occasionally deadlocks (IOPS goes to zero, and doesn't recover). Diagnosis shows that the sendctx queue is empty, but when sendctxs are returned to the queue, the xprt_write_space wake-up never occurs. The wake-up logic in rpcrdma_sendctx_put_locked is racy. I noticed that both EMPTY_SCQ and XPRT_WRITE_SPACE are implemented via an atomic bit. Just one of those is sufficient. Removing EMPTY_SCQ in favor of the generic bit mechanism makes the deadlock un-reproducible. Without EMPTY_SCQ, rpcrdma_buffer::rb_flags is no longer used and is therefore removed. Unfortunately this patch does not apply cleanly to stable. If needed, someone will have to port it and test it. Fixes: 2fad659209d5 ("xprtrdma: Wait on empty sendctx queue") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_reqChuck Lever2019-07-091-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a latent bug. xdr_stream_pos works by subtracting xdr_stream::nwords from xdr_buf::len. But xdr_stream::nwords is not initialized by xdr_init_encode(). It works today only because all fields in rpcrdma_req::rl_stream are initialized to zero by rpcrdma_req_create, making the subtraction in xdr_stream_pos always a no-op. I found this issue via code inspection. It was introduced by commit 39f4cd9e9982 ("xprtrdma: Harden chunk list encoding against send buffer overflow"), but the code has changed enough since then that this fix can't be automatically applied to stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | SUNRPC: Remove the bh-safe lock requirement on xprt->transport_lockTrond Myklebust2019-07-061-2/+2
|/ | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* xprtrdma: Aggregate the inline settings in struct rpcrdma_epChuck Lever2019-04-251-15/+19
| | | | | | | | | | | | | | | | Clean up. The inline settings are actually a characteristic of the endpoint, and not related to the device. They are also modified after the transport instance is created, so they do not belong in the cdata structure either. Lastly, let's use names that are more natural to RDMA than to NFS: inline_write -> inline_send and inline_read -> inline_recv. The /proc files retain their names to avoid breaking user space. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Clean up sendctx functionsChuck Lever2019-04-251-15/+12
| | | | | | | | | | | | Minor clean-ups I've stumbled on since sendctx was merged last year. In particular, making Send completion processing more efficient appears to have a measurable impact on IOPS throughput. Note: test_and_clear_bit() returns a value, thus an explicit memory barrier is not necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Trace marshaling failuresChuck Lever2019-04-251-0/+1
| | | | | | | | Record an event when rpcrdma_marshal_req returns a non-zero return value to help track down why an xprt close might have occurred. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Clean up regbuf helpersChuck Lever2019-04-251-24/+23
| | | | | | | | | | | For code legibility, clean up the function names to be consistent with the pattern: "rpcrdma" _ object-type _ action Also rpcrdma_regbuf_alloc and rpcrdma_regbuf_free no longer have any callers outside of verbs.c, and can thus be made static. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: rpcrdma_regbuf alignmentChuck Lever2019-04-251-2/+2
| | | | | | | | | | Allocate the struct rpcrdma_regbuf separately from the I/O buffer to better guarantee the alignment of the I/O buffer and eliminate the wasted space between the rpcrdma_regbuf metadata and the buffer itself. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* SUNRPC: Avoid digging into the ATOMIC poolChuck Lever2019-04-251-1/+1
| | | | | | | | | Page allocation requests made when the SPARSE_PAGES flag is set are allowed to fail, and are not critical. No need to spend a rare resource. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* SUNRPC: Add xdr_stream::rqst fieldChuck Lever2019-02-131-2/+2
| | | | | | | | | | | | | | Having access to the controlling rpc_rqst means a trace point in the XDR code can report: - the XID - the task ID and client ID - the p_name of RPC being processed Subsequent patches will introduce such trace points. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Check inline size before providing a Write chunkChuck Lever2019-02-131-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In very rare cases, an NFS READ operation might predict that the non-payload part of the RPC Call is large. For instance, an NFSv4 COMPOUND with a large GETATTR result, in combination with a large Kerberos credential, could push the non-payload part to be several kilobytes. If the non-payload part is larger than the connection's inline threshold, the client is required to provision a Reply chunk. The current Linux client does not check for this case. There are two obvious ways to handle it: a. Provision a Write chunk for the payload and a Reply chunk for the non-payload part b. Provision a Reply chunk for the whole RPC Reply Some testing at a recent NFS bake-a-thon showed that servers can mostly handle a. but there are some corner cases that do not work yet. b. already works (it has to, to handle krb5i/p), but could be somewhat less efficient. However, I expect this scenario to be very rare -- no-one has reported a problem yet. So I'm going to implement b. Sometime later I will provide some patches to help make b. a little more efficient by more carefully choosing the Reply chunk's segment sizes to ensure the payload is optimally aligned. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Prevent leak of rpcrdma_rep objectsChuck Lever2019-01-021-0/+4
| | | | | | | | | | | | | | | | | If a reply has been processed but the RPC is later retransmitted anyway, the req->rl_reply field still contains the only pointer to the old rpcrdma rep. When the next reply comes in, the reply handler will stomp on the rl_reply field, leaking the old rep. A trace event is added to capture such leaks. This problem seems to be worsened by the restructuring of the RPC Call path in v4.20. Fully addressing this issue will require at least a re-architecture of the disconnect logic, which is not appropriate during -rc. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>