summaryrefslogtreecommitdiffstats
path: root/net/sunrpc
Commit message (Collapse)AuthorAgeFilesLines
* svcrdma: Tail iovec leaves an orphaned DMA mappingChuck Lever2016-11-184-26/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cace564f8b6260e806f5e28d7f192fd0e0c603ed upstream. The ctxt's count field is overloaded to mean the number of pages in the ctxt->page array and the number of SGEs in the ctxt->sge array. Typically these two numbers are the same. However, when an inline RPC reply is constructed from an xdr_buf with a tail iovec, the head and tail often occupy the same page, but each are DMA mapped independently. In that case, ->count equals the number of pages, but it does not equal the number of SGEs. There's one more SGE, for the tail iovec. Hence there is one more DMA mapping than there are pages in the ctxt->page array. This isn't a real problem until the server's iommu is enabled. Then each RPC reply that has content in that iovec orphans a DMA mapping that consists of real resources. krb5i and krb5p always populate that tail iovec. After a couple million sent krb5i/p RPC replies, the NFS server starts behaving erratically. Reboot is needed to clear the problem. Fixes: 9d11b51ce7c1 ("svcrdma: Fix send_reply() scatter/gather set-up") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* svcrdma: Skip put_page() when send_reply() failsChuck Lever2016-11-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | commit 9995237bba702281e0e8e677edd5bb225f4f6c30 upstream. Message from syslogd@klimt at Aug 18 17:00:37 ... kernel:page:ffffea0020639b00 count:0 mapcount:0 mapping: (null) index:0x0 Aug 18 17:00:37 klimt kernel: flags: 0x2fffff80000000() Aug 18 17:00:37 klimt kernel: page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) Aug 18 17:00:37 klimt kernel: kernel BUG at /home/cel/src/linux/linux-2.6/include/linux/mm.h:445! Aug 18 17:00:37 klimt kernel: RIP: 0010:[<ffffffffa05c21c1>] svc_rdma_sendto+0x641/0x820 [rpcrdma] send_reply() assigns its page argument as the first page of ctxt. On error, send_reply() already invokes svc_rdma_put_context(ctxt, 1); which does a put_page() on that very page. No need to do that again as svc_rdma_sendto exits. Fixes: 3e1eeb980822 ("svcrdma: Close connection when a send error occurs") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xprtrdma: Fix DMAR failure in frwr_op_map() after reconnectChuck Lever2016-11-182-16/+24
| | | | | | | | | | | | | | | | | | | | | | | | | commit 62bdf94a2049822ef8c6d4b0e83cd9c3a1663ab4 upstream. When a LOCALINV WR is flushed, the frmr is marked STALE, then frwr_op_unmap_sync DMA-unmaps the frmr's SGL. These STALE frmrs are then recovered when frwr_op_map hunts for an INVALID frmr to use. All other cases that need frmr recovery leave that SGL DMA-mapped. The FRMR recovery path unconditionally DMA-unmaps the frmr's SGL. To avoid DMA unmapping the SGL twice for flushed LOCAL_INV WRs, alter the recovery logic (rather than the hot frwr_op_unmap_sync path) to distinguish among these cases. This solution also takes care of the case where multiple LOCAL_INV WRs are issued for the same rpcrdma_req, some complete successfully, but some are flushed. Reported-by: Vasco Steinmetz <linux@kyberraum.net> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Vasco Steinmetz <linux@kyberraum.net> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xprtrdma: use complete() instead complete_all()Daniel Wagner2016-11-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5690a22d8612e1788b48b4ea53c59868589cd2db upstream. There is only one waiter for the completion, therefore there is no need to use complete_all(). Let's make that clear by using complete() instead of complete_all(). The usage pattern of the completion is: waiter context waker context frwr_op_unmap_sync() reinit_completion() ib_post_send() wait_for_completion() frwr_wc_localinv_wake() complete() Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: linux-nfs@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sunrpc: fix write space race causing stallsDavid Vrabel2016-10-281-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d48f9ce73c997573e1b512893fa6eddf353a6f69 upstream. Write space becoming available may race with putting the task to sleep in xprt_wait_for_buffer_space(). The existing mechanism to avoid the race does not work. This (edited) partial trace illustrates the problem: [1] rpc_task_run_action: task:43546@5 ... action=call_transmit [2] xs_write_space <-xs_tcp_write_space [3] xprt_write_space <-xs_write_space [4] rpc_task_sleep: task:43546@5 ... [5] xs_write_space <-xs_tcp_write_space [1] Task 43546 runs but is out of write space. [2] Space becomes available, xs_write_space() clears the SOCKWQ_ASYNC_NOSPACE bit. [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but this has not yet been queued and the wake up is lost. [4] xs_nospace() is called which calls xprt_wait_for_buffer_space() which queues task 43546. [5] The call to sk->sk_write_space() at the end of xs_nospace() (which is supposed to handle the above race) does not call xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and thus the task is not woken. Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace() so the second call to sk->sk_write_space() calls xprt_write_space(). Suggested-by: Trond Myklebust <trondmy@primarydata.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2016-09-161-2/+3
|\ | | | | | | | | | | | | | | Pull nfsd bugfix from Bruce Fields: "Fix a memory corruption bug that I introduced in 4.7" * tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux: svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
| * svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")Chuck Lever2016-09-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rsc_lookup steals the passed-in memory to avoid doing an allocation of its own, so we can't just pass in a pointer to memory that someone else is using. If we really want to avoid allocation there then maybe we should preallocate somwhere, or reference count these handles. For now we should revert. On occasion I see this on my server: kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851! kernel: invalid opcode: 0000 [#1] SMP kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15 kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015 kernel: Workqueue: events do_cache_clean [sunrpc] kernel: task: ffff8808541d8000 task.stack: ffff880854344000 kernel: RIP: 0010:[<ffffffff811e7075>] [<ffffffff811e7075>] kfree+0x155/0x180 kernel: RSP: 0018:ffff880854347d70 EFLAGS: 00010246 kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600 kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064 kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780 kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500 kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0 kernel: FS: 0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0 kernel: Stack: kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0 kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006 kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f kernel: Call Trace: kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss] kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss] kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc] kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc] kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0 kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0 kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0 kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0 kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40 kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110 kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff kernel: RIP [<ffffffff811e7075>] kfree+0x155/0x180 kernel: RSP <ffff880854347d70> kernel: ---[ end trace 3fdec044969def26 ]--- It seems to be most common after a server reboot where a client has been using a Kerberos mount, and reconnects to continue its workload. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | xprtrdma: Fix receive buffer accountingChuck Lever2016-09-062-12/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An RPC can terminate before its reply arrives, if a credential problem or a soft timeout occurs. After this happens, xprtrdma reports it is out of Receive buffers. A Receive buffer is posted before each RPC is sent, and returned to the buffer pool when a reply is received. If no reply is received for an RPC, that Receive buffer remains posted. But xprtrdma tries to post another when the next RPC is sent. If this happens a few dozen times, there are no receive buffers left to be posted at send time. I don't see a way for a transport connection to recover at that point, and it will spit warnings and unnecessarily delay RPCs on occasion for its remaining lifetime. Commit 1e465fd4ff47 ("xprtrdma: Replace send and receive arrays") removed a little bit of logic to detect this case and not provide a Receive buffer so no more buffers are posted, and then transport operation continues correctly. We didn't understand what that logic did, and it wasn't commented, so it was removed as part of the overhaul to support backchannel requests. Restore it, but be wary of the need to keep extra Receives posted to deal with backchannel requests. Fixes: 1e465fd4ff47 ("xprtrdma: Replace send and receive arrays") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | xprtrdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer exhaustion...")Chuck Lever2016-09-061-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Receive buffer exhaustion, if it were to actually occur, would be catastrophic. However, when there are no reply buffers to post, that means all of them have already been posted and are waiting for incoming replies. By design, there can never be more RPCs in flight than there are available receive buffers. A receive buffer can be left posted after an RPC exits without a received reply; say, due to a credential problem or a soft timeout. This does not result in fewer posted receive buffers than there are pending RPCs, and there is already logic in xprtrdma to deal appropriately with this case. It also looks like the "+ 2" that was removed was accidentally accommodating the number of extra receive buffers needed for receiving backchannel requests. That will need to be addressed by another patch. Fixes: 3d4cf35bd4fa ("xprtrdma: Reply buffer exhaustion can be...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | sunrpc: fix UDP memory accountingPaolo Abeni2016-09-031-1/+1
|/ | | | | | | | | | | | | | | | The commit f9b2ee714c5c ("SUNRPC: Move UDP receive data path into a workqueue context"), as a side effect, moved the skb_free_datagram() call outside the scope of the related socket lock, but UDP sockets require such lock to be held for proper memory accounting. Fix it by replacing skb_free_datagram() with skb_free_datagram_locked(). Fixes: f9b2ee714c5c ("SUNRPC: Move UDP receive data path into a workqueue context") Reported-and-tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in useChuck Lever2016-08-241-2/+2
| | | | | | | | | | | | | Using NFSv4.1 on RDMA should be safe, so broaden the new checks in rpc_create(). WARN_ON_ONCE is used, matching most other WARN call sites in clnt.c. Fixes: 39a9beab5acb ("rpc: share one xps between all backchannels") Fixes: d50039ea5ee6 ("nfsd4/rpc: move backchannel create logic...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* Merge tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2016-08-124-24/+94
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user needs multiple different security services (e.g. krb5i and krb5p). - Stable patch to fix a regression introduced by the use of SO_REUSEPORT, and that prevented the use of multiple different NFS versions to the same server. - TCP socket reconnection timer fixes. - Patch from Neil to disable the use of IPv6 temporary addresses" * tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Cap the transport reconnection timer at 1/2 lease period NFSv4: Cleanup the setting of the nfs4 lease period SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout SUNRPC: Fix reconnection timeouts NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED SUNRPC: disable the use of IPv6 temporary addresses. SUNRPC: allow for upcalls for same uid but different gss service SUNRPC: Fix up socket autodisconnect SUNRPC: Handle EADDRNOTAVAIL on connection failures
| * NFSv4: Cap the transport reconnection timer at 1/2 lease periodTrond Myklebust2016-08-051-0/+21
| | | | | | | | | | | | | | | | | | We don't want to miss a lease period renewal due to the TCP connection failing to reconnect in a timely fashion. To ensure this doesn't happen, cap the reconnection timer so that we retry the connection attempt at least every 1/2 lease period. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * SUNRPC: Limit the reconnect backoff timer to the max RPC message timeoutTrond Myklebust2016-08-052-6/+15
| | | | | | | | | | | | | | ...and ensure that we propagate it to new transports on the same client. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * SUNRPC: Fix reconnection timeoutsTrond Myklebust2016-08-051-7/+20
| | | | | | | | | | | | | | | | When the connect attempt fails and backs off, we should start the clock at the last connection attempt, not time at which we queue up the reconnect job. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * SUNRPC: disable the use of IPv6 temporary addresses.NeilBrown2016-08-051-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the net.ipv6.conf.*.use_temp_addr sysctl is set to '2', then TCP connections over IPv6 will prefer a 'private' source address. These eventually expire and become invalid, typically after a week, but the time is configurable. When the local address becomes invalid the client will not be able to receive replies from the server. Eventually the connection will timeout or break and a new connection will be established, but this can take half an hour (typically TCP connection break time). RFC 4941, which describes private IPv6 addresses, acknowledges that some applications might not work well with them and that the application may explicitly a request non-temporary (i.e. "public") address. I believe this is correct for SUNRPC clients. Without this change, a client will occasionally experience a long delay if private addresses have been enabled. The privacy offered by private addresses is of little value for an NFS server which requires client authentication. For NFSv3 this will often not be a problem because idle connections are closed after 5 minutes. For NFSv4 connections never go idle due to the period RENEW (or equivalent) request. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * SUNRPC: allow for upcalls for same uid but different gss serviceOlga Kornievskaia2016-08-051-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | It's possible to have simultaneous upcalls for the same UIDs but different GSS service. In that case, we need to allow for the upcall to gssd to proceed so that not the same context is used by two different GSS services. Some servers lock the use of context to the GSS service. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Cc: stable@vger.kernel.org # v3.9+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * SUNRPC: Fix up socket autodisconnectTrond Myklebust2016-08-021-8/+18
| | | | | | | | | | | | | | | | Ensure that we don't forget to set up the disconnection timer for the case when a connect request is fulfilled after the RPC request that initiated it has timed out or been interrupted. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * SUNRPC: Handle EADDRNOTAVAIL on connection failuresTrond Myklebust2016-08-011-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | If the connect attempt immediately fails with an EADDRNOTAVAIL error, then that means our choice of source port number was bad. This error is expected when we set the SO_REUSEPORT socket option and we have 2 sockets sharing the same source and destination address and port combinations. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Fixes: 402e23b4ed9ed ("SUNRPC: Fix stupid typo in xs_sock_set_reuseport") Cc: stable@vger.kernel.org # v4.0+
* | Merge tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2016-08-044-122/+86
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "Highlights: - Trond made a change to the server's tcp logic that allows a fast client to better take advantage of high bandwidth networks, but may increase the risk that a single client could starve other clients; a new sunrpc.svc_rpc_per_connection_limit parameter should help mitigate this in the (hopefully unlikely) event this becomes a problem in practice. - Tom Haynes added a minimal flex-layout pnfs server, which is of no use in production for now--don't build it unless you're doing client testing or further server development" * tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd: remove some dead code in nfsd_create_locked() nfsd: drop unnecessary MAY_EXEC check from create nfsd: clean up bad-type check in nfsd_create_locked nfsd: remove unnecessary positive-dentry check nfsd: reorganize nfsd_create nfsd: check d_can_lookup in fh_verify of directories nfsd: remove redundant zero-length check from create nfsd: Make creates return EEXIST instead of EACCES SUNRPC: Detect immediate closure of accepted sockets SUNRPC: accept() may return sockets that are still in SYN_RECV nfsd: allow nfsd to advertise multiple layout types nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock nfsd/blocklayout: Make sure calculate signature/designator length aligned xfs: abstract block export operations from nfsd layouts SUNRPC: Remove unused callback xpo_adjust_wspace() SUNRPC: Change TCP socket space reservation SUNRPC: Add a server side per-connection limit SUNRPC: Micro optimisation for svc_data_ready SUNRPC: Call the default socket callbacks instead of open coding SUNRPC: lock the socket while detaching it ...
| * | SUNRPC: Detect immediate closure of accepted socketsTrond Myklebust2016-08-011-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | This modification is useful for debugging issues that happen while the socket is being initialised. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: accept() may return sockets that are still in SYN_RECVTrond Myklebust2016-08-011-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're seeing traces of the following form: [10952.396347] svc: transport ffff88042ba4a 000 dequeued, inuse=2 [10952.396351] svc: tcp_accept ffff88042ba4 a000 sock ffff88042a6e4c80 [10952.396362] nfsd: connect from 10.2.6.1, port=187 [10952.396364] svc: svc_setup_socket ffff8800b99bcf00 [10952.396368] setting up TCP socket for reading [10952.396370] svc: svc_setup_socket created ffff8803eb10a000 (inet ffff88042b75b800) [10952.396373] svc: transport ffff8803eb10a000 put into queue [10952.396375] svc: transport ffff88042ba4a000 put into queue [10952.396377] svc: server ffff8800bb0ec000 waiting for data (to = 3600000) [10952.396380] svc: transport ffff8803eb10a000 dequeued, inuse=2 [10952.396381] svc_recv: found XPT_CLOSE [10952.396397] svc: svc_delete_xprt(ffff8803eb10a000) [10952.396398] svc: svc_tcp_sock_detach(ffff8803eb10a000) [10952.396399] svc: svc_sock_detach(ffff8803eb10a000) [10952.396412] svc: svc_sock_free(ffff8803eb10a000) i.e. an immediate close of the socket after initialisation. The culprit appears to be the test at the end of svc_tcp_init, which checks if the newly created socket is in the TCP_ESTABLISHED state, and immediately closes it if not. The evidence appears to suggest that the socket might still be in the SYN_RECV state at this time. The fix is to check for both states, and then to add a check in svc_tcp_state_change() to ensure we don't close the socket when it transitions into TCP_ESTABLISHED. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: Remove unused callback xpo_adjust_wspace()Trond Myklebust2016-07-131-2/+0
| | | | | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: Change TCP socket space reservationTrond Myklebust2016-07-131-43/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current server rpc tcp code attempts to predict how much writeable socket space will be available to a given RPC call before accepting it for processing. On a 40GigE network, we've found this throttles individual clients long before the network or disk is saturated. The server may handle more clients easily, but the bandwidth of individual clients is still artificially limited. Instead of trying (and failing) to predict how much writeable socket space will be available to the RPC call, just fall back to the simple model of deferring processing until the socket is uncongested. This may increase the risk of fast clients starving slower clients; in such cases, the previous patch allows setting a hard per-connection limit. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: Add a server side per-connection limitTrond Myklebust2016-07-131-3/+36
| | | | | | | | | | | | | | | | | | | | | | | | Allow the user to limit the number of requests serviced through a single connection, to help prevent faster clients from starving slower clients. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: Micro optimisation for svc_data_readyTrond Myklebust2016-07-131-2/+2
| | | | | | | | | | | | | | | | | | | | | Don't call svc_xprt_enqueue() if the XPT_DATA flag is already set. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: Call the default socket callbacks instead of open codingTrond Myklebust2016-07-131-69/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than code up our own versions of the socket callbacks, just call the defaults. This also allows us to merge svc_udp_data_ready() and svc_tcp_data_ready(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: lock the socket while detaching itTrond Myklebust2016-07-131-0/+3
| | | | | | | | | | | | | | | | | | | | | Prevent callbacks from triggering while we're detaching the socket. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: Add tracepoints for dropped and deferred requestsTrond Myklebust2016-07-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | Dropping and/or deferring requests has an impact on performance. Let's make sure we can trace those events. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | SUNRPC: Add a tracepoint for server socket out-of-space conditionsTrond Myklebust2016-07-131-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | Add a tracepoint to track when the processing of incoming RPC data gets deferred due to out-of-space issues on the outgoing transport. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | sunrpc: add gss minor status to svcauth_gss_proxy_initScott Mayhew2016-07-131-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | GSS-Proxy doesn't produce very much debug logging at all. Printing out the gss minor status will aid in troubleshooting if the GSS_Accept_sec_context upcall fails. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | sunrpc: remove 'inuse' flag from struct cache_detail.NeilBrown2016-07-131-1/+1
| | | | | | | | | | | | | | | | | | | | | This field is not currently in use. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | | Merge tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2016-07-3021-941/+864
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: Stable bugfixes: - nfs: don't create zero-length requests - several LAYOUTGET bugfixes Features: - several performance related features - more aggressive caching when we can rely on close-to-open cache consistency - remove serialisation of O_DIRECT reads and writes - optimise several code paths to not flush to disk unnecessarily. However allow for the idiosyncracies of pNFS for those layout types that need to issue a LAYOUTCOMMIT before the metadata can be updated on the server. - SUNRPC updates to the client data receive path - pNFS/SCSI support RH/Fedora dm-mpath device nodes - pNFS files/flexfiles can now use unprivileged ports when the generic NFS mount options allow it. Bugfixes: - Don't use RDMA direct data placement together with data integrity or privacy security flavours - Remove the RDMA ALLPHYSICAL memory registration mode as it has potential security holes. - Several layout recall fixes to improve NFSv4.1 protocol compliance. - Fix an Oops in the pNFS files and flexfiles connection setup to the DS - Allow retry of operations that used a returned delegation stateid - Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding - Fix writeback races in nfs4_copy_range() and nfs42_proc_deallocate()" * tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (104 commits) pNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding NFSv4: Clean up lookup of SECINFO_NO_NAME NFSv4.2: Fix warning "variable ‘stateids’ set but not used" NFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’" SUNRPC: Fix a compiler warning in fs/nfs/clnt.c pNFS: Remove redundant smp_mb() from pnfs_init_lseg() pNFS: Cleanup - do layout segment initialisation in one place pNFS: Remove redundant stateid invalidation pNFS: Remove redundant pnfs_mark_layout_returned_if_empty() pNFS: Clear the layout metadata if the server changed the layout stateid pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid() NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id pNFS: Do not set plh_return_seq for non-callback related layoutreturns pNFS: Ensure layoutreturn acts as a completion for layout callbacks pNFS: Fix CB_LAYOUTRECALL stateid verification pNFS: Always update the layout barrier seqid on LAYOUTGET pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set pNFS: Clear the layout return tracking on layout reinitialisation pNFS: LAYOUTRETURN should only update the stateid if the layout is valid nfs: don't create zero-length requests ...
| * | Merge branch 'nfs-rdma'Trond Myklebust2016-07-2412-851/+718
| |\ \
| | * | xprtrdma: fix semicolon.cocci warningskbuild test robot2016-07-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/sunrpc/xprtrdma/verbs.c:798:2-3: Unneeded semicolon Remove unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci CC: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | NFS: Don't drop CB requests with invalid principalsChuck Lever2016-07-111-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before commit 778be232a207 ("NFS do not find client in NFSv4 pg_authenticate"), the Linux callback server replied with RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB request. Let's restore that behavior so the server has a chance to do something useful about it, and provide a warning that helps admins correct the problem. Fixes: 778be232a207 ("NFS do not find client in NFSv4 ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | svc: Avoid garbage replies when pc_func() returns rpc_drop_replyChuck Lever2016-07-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an RPC program does not set vs_dispatch and pc_func() returns rpc_drop_reply, the server sends a reply anyway containing a single word containing the value RPC_DROP_REPLY (in network byte-order, of course). This is a nonsense RPC message. Fixes: 9e701c610923 ("svcrpc: simpler request dropping") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: No direct data placement with krb5i and krb5pChuck Lever2016-07-114-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Direct data placement is not allowed when using flavors that guarantee integrity or privacy. When such security flavors are in effect, don't allow the use of Read and Write chunks for moving individual data items. All messages larger than the inline threshold are sent via Long Call or Long Reply. On my systems (CX-3 Pro on FDR), for small I/O operations, the use of Long messages adds only around 5 usecs of latency in each direction. Note that when integrity or encryption is used, the host CPU touches every byte in these messages. Even if it could be used, data movement offload doesn't buy much in this case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Clean up fixup_copy_count accountingChuck Lever2016-07-111-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fixup_copy_count should count only the number of bytes copied to the page list. The head and tail are now always handled without a data copy. And the debugging at the end of rpcrdma_inline_fixup() is also no longer necessary, since copy_len will be non-zero when there is reply data in the tail (a normal and valid case). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Update only specific fields in private receive bufferChuck Lever2016-07-111-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that rpcrdma_inline_fixup() updates only two fields in rq_rcv_buf, a full memcpy of that structure to rq_private_buf is unwarranted. Updating rq_private_buf fields only where needed also better documents what is going on. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()Chuck Lever2016-07-111-28/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: rpcrdma_inline_fixup() overruns the receive page listChuck Lever2016-07-111-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the remaining length of an incoming reply is longer than the XDR buf's page_len, switch over to the tail iovec instead of copying more than page_len bytes into the page list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Chunk list encoders no longer share one rl_segments arrayChuck Lever2016-07-112-53/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, all three chunk list encoders each use a portion of the one rl_segments array in rpcrdma_req. This is because the MWs for each chunk list were preserved in rl_segments so that ro_unmap could find and invalidate them after the RPC was complete. However, now that MWs are placed on a per-req linked list as they are registered, there is no longer any information in rpcrdma_mr_seg that is shared between ro_map and ro_unmap_{sync,safe}, and thus nothing in rl_segments needs to be preserved after rpcrdma_marshal_req is complete. Thus the rl_segments array can be used now just for the needs of each rpcrdma_convert_iovs call. Once each chunk list is encoded, the next chunk list encoder is free to re-use all of rl_segments. This means all three chunk lists in one RPC request can now each encode a full size data payload with no increase in the size of rl_segments. This is a key requirement for Kerberos support, since both the Call and Reply for a single RPC transaction are conveyed via Long messages (RDMA Read/Write). Both can be large. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Place registered MWs on a per-req listChuck Lever2016-07-116-139/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of placing registered MWs sparsely into the rl_segments array, place these MWs on a per-req list. ro_unmap_{sync,safe} can then simply pull those MWs off the list instead of walking through the array. This change significantly reduces the size of struct rpcrdma_req by removing nsegs and rl_mw from every array element. As an additional clean-up, chunk co-ordinates are returned in the "*mw" output argument so they are no longer needed in every array element. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Release orphaned MRs immediatelyChuck Lever2016-07-112-12/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of leaving orphaned MRs to be released when the transport is destroyed, release them immediately. The MR free list can now be replenished if it becomes exhausted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Allocate MRs on demandChuck Lever2016-07-115-124/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Frequent MR list exhaustion can impact I/O throughput, so enough MRs are always created during transport set-up to prevent running out. This means more MRs are created than most workloads need. Commit 94f58c58c0b4 ("xprtrdma: Allow Read list and Reply chunk simultaneously") introduced support for sending two chunk lists per RPC, which consumes more MRs per RPC. Instead of trying to provision more MRs, introduce a mechanism for allocating MRs on demand. A few MRs are allocated during transport set-up to kick things off. This significantly reduces the average number of MRs per transport while allowing the MR count to grow for workloads or devices that need more MRs. FRWR with mlx4 allocated almost 400 MRs per transport before this patch. Now it starts with 32. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Chunk list encoders must not return zeroChuck Lever2016-07-113-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up, based on code audit: Remove the possibility that the chunk list XDR encoders can return zero, which would be interpreted as a NULL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Honor ->send_request API contractChuck Lever2016-07-115-24/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c93c62231cf5 ("xprtrdma: Disconnect on registration failure") added a disconnect for some RPC marshaling failures. This is needed only in a handful of cases, but it was triggering for simple stuff like temporary resource shortages. Try to straighten this out. Fix up the lower layers so they don't return -ENOMEM or other error codes that the RPC client's FSM doesn't explicitly recognize. Also fix up the places in the send_request path that do want a disconnect. For example, when ib_post_send or ib_post_recv fail, this is a sign that there is a send or receive queue resource miscalculation. That should be rare, and is a sign of a software bug. But xprtrdma can recover: disconnect to reset the transport and start over. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Reply buffer exhaustion can be catastrophicChuck Lever2016-07-111-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Not having an rpcrdma_rep at call_allocate time can be a problem. It means that send_request can't post a receive buffer to catch the RPC's reply. Possible consequences are RPC timeouts or even transport deadlock. Instead of allowing an RPC to proceed if an rpcrdma_rep is not available, return NULL to force call_allocate to wait and try again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Clean up device capability detectionChuck Lever2016-07-114-29/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Move device capability detection into memreg-specific source files. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>