summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* tree-wide: replace config_enabled() with IS_ENABLED()Masahiro Yamada2016-08-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The use of config_enabled() against config options is ambiguous. In practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the author might have used it for the meaning of IS_ENABLED(). Using IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc. makes the intention clearer. This commit replaces config_enabled() with IS_ENABLED() where possible. This commit is only touching bool config options. I noticed two cases where config_enabled() is used against a tristate option: - config_enabled(CONFIG_HWMON) [ drivers/net/wireless/ath/ath10k/thermal.c ] - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE) [ drivers/gpu/drm/gma500/opregion.c ] I did not touch them because they should be converted to IS_BUILTIN() in order to keep the logic, but I was not sure it was the authors' intention. Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Stas Sergeev <stsp@list.ru> Cc: Matt Redfearn <matt.redfearn@imgtec.com> Cc: Joshua Kinard <kumba@gentoo.org> Cc: Jiri Slaby <jslaby@suse.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: "Dmitry V. Levin" <ldv@altlinux.org> Cc: yu-cheng yu <yu-cheng.yu@intel.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Will Drewry <wad@chromium.org> Cc: Nikolay Martynov <mar.kolya@gmail.com> Cc: Huacai Chen <chenhc@lemote.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com> Cc: Rafal Milecki <zajec5@gmail.com> Cc: James Cowgill <James.Cowgill@imgtec.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Alex Smith <alex.smith@imgtec.com> Cc: Adam Buchbinder <adam.buchbinder@gmail.com> Cc: Qais Yousef <qais.yousef@imgtec.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Mikko Rapeli <mikko.rapeli@iki.fi> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Brian Norris <computersforpeace@gmail.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: "Luis R. Rodriguez" <mcgrof@do-not-panic.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Kalle Valo <kvalo@qca.qualcomm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Tony Wu <tung7970@gmail.com> Cc: Huaitong Han <huaitong.han@intel.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: Jason Cooper <jason@lakedaemon.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Gelmini <andrea.gelmini@gelma.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Rabin Vincent <rabin@rab.in> Cc: "Maciej W. Rozycki" <macro@imgtec.com> Cc: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2016-08-036-7/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix several cases of missing of_node_put() calls in various networking drivers. From Peter Chen. 2) Don't try to remove unconfigured VLANs in qed driver, from Yuval Mintz. 3) Unbalanced locking in TIPC error handling, from Wei Yongjun. 4) Fix lockups in CPDMA driver, from Grygorii Strashko. 5) More MACSEC refcount et al fixes, from Sabrina Dubroca. 6) Fix MAC address setting in r8169 during runtime suspend, from Chun-Hao Lin. 7) Various printf format specifier fixes, from Heinrich Schuchardt. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits) qed: Fail driver load in 100g MSI mode. ethernet: ti: davinci_emac: add missing of_node_put after calling of_parse_phandle ethernet: stmicro: stmmac: add missing of_node_put after calling of_parse_phandle ethernet: stmicro: stmmac: dwmac-socfpga: add missing of_node_put after calling of_parse_phandle ethernet: renesas: sh_eth: add missing of_node_put after calling of_parse_phandle ethernet: renesas: ravb_main: add missing of_node_put after calling of_parse_phandle ethernet: marvell: pxa168_eth: add missing of_node_put after calling of_parse_phandle ethernet: marvell: mvpp2: add missing of_node_put after calling of_parse_phandle ethernet: marvell: mvneta: add missing of_node_put after calling of_parse_phandle ethernet: hisilicon: hns: hns_dsaf_main: add missing of_node_put after calling of_parse_phandle ethernet: hisilicon: hns: hns_dsaf_mac: add missing of_node_put after calling of_parse_phandle ethernet: cavium: octeon: add missing of_node_put after calling of_parse_phandle ethernet: aurora: nb8800: add missing of_node_put after calling of_parse_phandle ethernet: arc: emac_main: add missing of_node_put after calling of_parse_phandle ethernet: apm: xgene: add missing of_node_put after calling of_parse_phandle ethernet: altera: add missing of_node_put 8139too: fix system hang when there is a tx timeout event. qed: Fix error return code in qed_resc_alloc() net: qlcnic: avoid superfluous assignement dsa: b53: remove redundant if ...
| * sctp: allow receiving msg when TCP-style sk is in CLOSED stateXin Long2016-07-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 141ddefce7c8 ("sctp: change sk state to CLOSED instead of CLOSING in sctp_sock_migrate") changed sk state to CLOSED if the assoc is closed when sctp_accept clones a new sk. If there is still data in sk receive queue, users will not be able to read it any more, as sctp_recvmsg returns directly if sk state is CLOSED. This patch is to add CLOSED state check in sctp_recvmsg to allow reading data from TCP-style sk with CLOSED state as what TCP does. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: allow delivering notifications after receiving SHUTDOWNXin Long2016-07-301-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to this patch, once sctp received SHUTDOWN or shutdown with RD, sk->sk_shutdown would be set with RCV_SHUTDOWN, and all events would be dropped in sctp_ulpq_tail_event(). It would cause: 1. some notifications couldn't be received by users. like SCTP_SHUTDOWN_COMP generated by sctp_sf_do_4_C(). 2. sctp would also never trigger sk_data_ready when the association was closed, making it harder to identify the end of the association by calling recvmsg() and getting an EOF. It was not convenient for kernel users. The check here should be stopping delivering DATA chunks after receiving SHUTDOWN, and stopping delivering ANY chunks after sctp_close(). So this patch is to allow notifications to enqueue into receive queue even if sk->sk_shutdown is set to RCV_SHUTDOWN in sctp_ulpq_tail_event, but if sk->sk_shutdown == RCV_SHUTDOWN | SEND_SHUTDOWN, it drops all events. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: fix the issue sctp requeue auth chunk incorrectlyXin Long2016-07-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sctp needs to queue auth chunk back when we know that we are going to generate another segment. But commit f1533cce60d1 ("sctp: fix panic when sending auth chunks") requeues the last chunk processed which is probably not the auth chunk. It causes panic when calculating the MAC in sctp_auth_calculate_hmac(), as the incorrect offset of the auth chunk in skb->data. This fix is to requeue it by using packet->auth. Fixes: f1533cce60d1 ("sctp: fix panic when sending auth chunks") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: consider recv buf for the initial window scaleSoheil Hassas Yeganeh2016-07-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_select_initial_window() intends to advertise a window scaling for the maximum possible window size. To do so, it considers the maximum of net.ipv4.tcp_rmem[2] and net.core.rmem_max as the only possible upper-bounds. However, users with CAP_NET_ADMIN can use SO_RCVBUFFORCE to set the socket's receive buffer size to values larger than net.ipv4.tcp_rmem[2] and net.core.rmem_max. Thus, SO_RCVBUFFORCE is effectively ignored by tcp_select_initial_window(). To fix this, consider the maximum of net.ipv4.tcp_rmem[2], net.core.rmem_max and socket's initial buffer space. Fixes: b0573dea1fb3 ("[NET]: Introduce SO_{SND,RCV}BUFFORCE socket options") Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ipv6: use list_move instead of list_del/list_addWei Yongjun2016-07-301-2/+1
| | | | | | | | | | | | | | Using list_move() instead of list_del() + list_add(). Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix imbalance read_unlock_bh in __tipc_nl_add_monitor()Wei Yongjun2016-07-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In the error handling case of nla_nest_start() failed read_unlock_bh() is called to unlock a lock that had not been taken yet. sparse warns about the context imbalance as the following: net/tipc/monitor.c:799:23: warning: context imbalance in '__tipc_nl_add_monitor' - different lock contexts for basic block Fixes: cf6f7e1d5109 ('tipc: dump monitor attributes') Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds2016-08-029-24/+245
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull Ceph updates from Ilya Dryomov: "The highlights are: - RADOS namespace support in libceph and CephFS (Zheng Yan and myself). The stopgaps added in 4.5 to deny access to inodes in namespaces are removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature bit is now fully supported - A large rework of the MDS cap flushing code (Zheng Yan) - Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were overly pessimistic before, bailing at the first sight of LOOKUP_RCU On top of that we've got a few CephFS bug fixes, a couple of cleanups and Arnd's workaround for a weird genksyms issue" * tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client: (34 commits) ceph: fix symbol versioning for ceph_monc_do_statfs ceph: Correctly return NXIO errors from ceph_llseek ceph: Mark the file cache as unreclaimable ceph: optimize cap flush waiting ceph: cleanup ceph_flush_snaps() ceph: kick cap flushes before sending other cap message ceph: introduce an inode flag to indicates if snapflush is needed ceph: avoid sending duplicated cap flush message ceph: unify cap flush and snapcap flush ceph: use list instead of rbtree to track cap flushes ceph: update types of some local varibles ceph: include 'follows' of pending snapflush in cap reconnect message ceph: update cap reconnect message to version 3 ceph: mount non-default filesystem by name libceph: fsmap.user subscription support ceph: handle LOOKUP_RCU in ceph_d_revalidate ceph: allow dentry_lease_is_valid to work under RCU walk ceph: clear d_fsinfo pointer under d_lock ceph: remove ceph_mdsc_lease_release ceph: don't use ->d_time ...
| * | libceph: fsmap.user subscription supportYan, Zheng2016-07-281-1/+3
| | | | | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: make sure redirect does not change namespaceYan, Zheng2016-07-281-0/+15
| | | | | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: rados pool namespace supportYan, Zheng2016-07-283-15/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add pool namesapce pointer to struct ceph_file_layout and struct ceph_object_locator. Pool namespace is used by when mapping object to PG, it's also used when composing OSD request. The namespace pointer in struct ceph_file_layout is RCU protected. So libceph can read namespace without taking lock. Signed-off-by: Yan, Zheng <zyan@redhat.com> [idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: introduce reference counted stringYan, Zheng2016-07-283-1/+114
| | | | | | | | | | | | | | | | | | | | | | | | The data structure is for storing namesapce string. It allows namespace string to be shared between cephfs inodes with same layout. This data structure can also be referenced by OSD request. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | libceph: define new ceph_file_layout structureYan, Zheng2016-07-283-8/+32
| | | | | | | | | | | | | | | | | | | | | | | | Define new ceph_file_layout structure and rename old ceph_file_layout to ceph_file_layout_legacy. This is preparation for adding namespace to ceph_file_layout structure. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | libceph: fix some missing includesIlya Dryomov2016-07-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | - decode.h needs slab.h for kmalloc() - osd_client.h needs msgpool.h for struct ceph_msgpool - msgpool.h doesn't need messenger.h Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | | Merge tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2016-07-3021-941/+864
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: Stable bugfixes: - nfs: don't create zero-length requests - several LAYOUTGET bugfixes Features: - several performance related features - more aggressive caching when we can rely on close-to-open cache consistency - remove serialisation of O_DIRECT reads and writes - optimise several code paths to not flush to disk unnecessarily. However allow for the idiosyncracies of pNFS for those layout types that need to issue a LAYOUTCOMMIT before the metadata can be updated on the server. - SUNRPC updates to the client data receive path - pNFS/SCSI support RH/Fedora dm-mpath device nodes - pNFS files/flexfiles can now use unprivileged ports when the generic NFS mount options allow it. Bugfixes: - Don't use RDMA direct data placement together with data integrity or privacy security flavours - Remove the RDMA ALLPHYSICAL memory registration mode as it has potential security holes. - Several layout recall fixes to improve NFSv4.1 protocol compliance. - Fix an Oops in the pNFS files and flexfiles connection setup to the DS - Allow retry of operations that used a returned delegation stateid - Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding - Fix writeback races in nfs4_copy_range() and nfs42_proc_deallocate()" * tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (104 commits) pNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding NFSv4: Clean up lookup of SECINFO_NO_NAME NFSv4.2: Fix warning "variable ‘stateids’ set but not used" NFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’" SUNRPC: Fix a compiler warning in fs/nfs/clnt.c pNFS: Remove redundant smp_mb() from pnfs_init_lseg() pNFS: Cleanup - do layout segment initialisation in one place pNFS: Remove redundant stateid invalidation pNFS: Remove redundant pnfs_mark_layout_returned_if_empty() pNFS: Clear the layout metadata if the server changed the layout stateid pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid() NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id pNFS: Do not set plh_return_seq for non-callback related layoutreturns pNFS: Ensure layoutreturn acts as a completion for layout callbacks pNFS: Fix CB_LAYOUTRECALL stateid verification pNFS: Always update the layout barrier seqid on LAYOUTGET pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set pNFS: Clear the layout return tracking on layout reinitialisation pNFS: LAYOUTRETURN should only update the stateid if the layout is valid nfs: don't create zero-length requests ...
| * | Merge branch 'nfs-rdma'Trond Myklebust2016-07-2422-886/+783
| |\ \
| | * | xprtrdma: fix semicolon.cocci warningskbuild test robot2016-07-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/sunrpc/xprtrdma/verbs.c:798:2-3: Unneeded semicolon Remove unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci CC: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | NFS: Don't drop CB requests with invalid principalsChuck Lever2016-07-111-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before commit 778be232a207 ("NFS do not find client in NFSv4 pg_authenticate"), the Linux callback server replied with RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB request. Let's restore that behavior so the server has a chance to do something useful about it, and provide a warning that helps admins correct the problem. Fixes: 778be232a207 ("NFS do not find client in NFSv4 ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | svc: Avoid garbage replies when pc_func() returns rpc_drop_replyChuck Lever2016-07-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an RPC program does not set vs_dispatch and pc_func() returns rpc_drop_reply, the server sends a reply anyway containing a single word containing the value RPC_DROP_REPLY (in network byte-order, of course). This is a nonsense RPC message. Fixes: 9e701c610923 ("svcrpc: simpler request dropping") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: No direct data placement with krb5i and krb5pChuck Lever2016-07-114-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Direct data placement is not allowed when using flavors that guarantee integrity or privacy. When such security flavors are in effect, don't allow the use of Read and Write chunks for moving individual data items. All messages larger than the inline threshold are sent via Long Call or Long Reply. On my systems (CX-3 Pro on FDR), for small I/O operations, the use of Long messages adds only around 5 usecs of latency in each direction. Note that when integrity or encryption is used, the host CPU touches every byte in these messages. Even if it could be used, data movement offload doesn't buy much in this case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Clean up fixup_copy_count accountingChuck Lever2016-07-111-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fixup_copy_count should count only the number of bytes copied to the page list. The head and tail are now always handled without a data copy. And the debugging at the end of rpcrdma_inline_fixup() is also no longer necessary, since copy_len will be non-zero when there is reply data in the tail (a normal and valid case). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Update only specific fields in private receive bufferChuck Lever2016-07-111-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that rpcrdma_inline_fixup() updates only two fields in rq_rcv_buf, a full memcpy of that structure to rq_private_buf is unwarranted. Updating rq_private_buf fields only where needed also better documents what is going on. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()Chuck Lever2016-07-111-28/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: rpcrdma_inline_fixup() overruns the receive page listChuck Lever2016-07-111-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the remaining length of an incoming reply is longer than the XDR buf's page_len, switch over to the tail iovec instead of copying more than page_len bytes into the page list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Chunk list encoders no longer share one rl_segments arrayChuck Lever2016-07-112-53/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, all three chunk list encoders each use a portion of the one rl_segments array in rpcrdma_req. This is because the MWs for each chunk list were preserved in rl_segments so that ro_unmap could find and invalidate them after the RPC was complete. However, now that MWs are placed on a per-req linked list as they are registered, there is no longer any information in rpcrdma_mr_seg that is shared between ro_map and ro_unmap_{sync,safe}, and thus nothing in rl_segments needs to be preserved after rpcrdma_marshal_req is complete. Thus the rl_segments array can be used now just for the needs of each rpcrdma_convert_iovs call. Once each chunk list is encoded, the next chunk list encoder is free to re-use all of rl_segments. This means all three chunk lists in one RPC request can now each encode a full size data payload with no increase in the size of rl_segments. This is a key requirement for Kerberos support, since both the Call and Reply for a single RPC transaction are conveyed via Long messages (RDMA Read/Write). Both can be large. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Place registered MWs on a per-req listChuck Lever2016-07-116-139/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of placing registered MWs sparsely into the rl_segments array, place these MWs on a per-req list. ro_unmap_{sync,safe} can then simply pull those MWs off the list instead of walking through the array. This change significantly reduces the size of struct rpcrdma_req by removing nsegs and rl_mw from every array element. As an additional clean-up, chunk co-ordinates are returned in the "*mw" output argument so they are no longer needed in every array element. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Release orphaned MRs immediatelyChuck Lever2016-07-112-12/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of leaving orphaned MRs to be released when the transport is destroyed, release them immediately. The MR free list can now be replenished if it becomes exhausted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Allocate MRs on demandChuck Lever2016-07-115-124/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Frequent MR list exhaustion can impact I/O throughput, so enough MRs are always created during transport set-up to prevent running out. This means more MRs are created than most workloads need. Commit 94f58c58c0b4 ("xprtrdma: Allow Read list and Reply chunk simultaneously") introduced support for sending two chunk lists per RPC, which consumes more MRs per RPC. Instead of trying to provision more MRs, introduce a mechanism for allocating MRs on demand. A few MRs are allocated during transport set-up to kick things off. This significantly reduces the average number of MRs per transport while allowing the MR count to grow for workloads or devices that need more MRs. FRWR with mlx4 allocated almost 400 MRs per transport before this patch. Now it starts with 32. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Chunk list encoders must not return zeroChuck Lever2016-07-113-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up, based on code audit: Remove the possibility that the chunk list XDR encoders can return zero, which would be interpreted as a NULL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Honor ->send_request API contractChuck Lever2016-07-115-24/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c93c62231cf5 ("xprtrdma: Disconnect on registration failure") added a disconnect for some RPC marshaling failures. This is needed only in a handful of cases, but it was triggering for simple stuff like temporary resource shortages. Try to straighten this out. Fix up the lower layers so they don't return -ENOMEM or other error codes that the RPC client's FSM doesn't explicitly recognize. Also fix up the places in the send_request path that do want a disconnect. For example, when ib_post_send or ib_post_recv fail, this is a sign that there is a send or receive queue resource miscalculation. That should be rare, and is a sign of a software bug. But xprtrdma can recover: disconnect to reset the transport and start over. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Reply buffer exhaustion can be catastrophicChuck Lever2016-07-111-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Not having an rpcrdma_rep at call_allocate time can be a problem. It means that send_request can't post a receive buffer to catch the RPC's reply. Possible consequences are RPC timeouts or even transport deadlock. Instead of allowing an RPC to proceed if an rpcrdma_rep is not available, return NULL to force call_allocate to wait and try again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Clean up device capability detectionChuck Lever2016-07-114-29/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Move device capability detection into memreg-specific source files. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Remove rpcrdma_map_one() and friendsChuck Lever2016-07-112-44/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: ALLPHYSICAL is gone and FMR has been converted to use scatterlists. There are no more users of these functions. This patch shrinks the size of struct rpcrdma_req by about 3500 bytes on x86_64. There is one of these structs for each RPC credit (128 credits per transport connection). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Remove ALLPHYSICAL memory registration modeChuck Lever2016-07-114-142/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL. ALLPHYSICAL advertises in the clear on the network fabric an R_key that is good for all of the client's memory. No known exploit exists, but theoretically any user on the server can use that R_key on the client's QP to read or update any part of the client's memory. ALLPHYSICAL exposes the client to server bugs, including: o base/bounds errors causing data outside the i/o buffer to be accessed o RDMA access after reply causing data corruption and/or integrity fail ALLPHYSICAL can't protect application memory regions from server update after a local signal or soft timeout has terminated an RPC. ALLPHYSICAL chunks are no larger than a page. Special cases to handle small chunks and long chunk lists have been a source of implementation complexity and bugs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Do not leak an MW during a DMA map failureChuck Lever2016-07-112-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on code audit. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Refactor MR recovery work queuesChuck Lever2016-07-115-166/+135
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I found that commit ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method"), which introduces ro_unmap_safe, never wired up the FMR recovery worker. The FMR and FRWR recovery work queues both do the same thing. Instead of setting up separate individual work queues for this, schedule a delayed worker to deal with them, since recovering MRs is not performance-critical. Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Use scatterlist for DMA mapping and unmapping under FMRChuck Lever2016-07-111-39/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The use of a scatterlist for handling DMA mapping and unmapping was recently introduced in frwr_ops.c in commit 4143f34e01e9 ("xprtrdma: Port to new memory registration API"). That commit did not make a similar update to xprtrdma's FMR support because the core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed to take a scatterlist argument. However, FMR still needs to do DMA mapping and unmapping. It appears that RDS, for example, uses a scatterlist for this, then builds the DMA addr array for the ib_map_phys_fmr call separately. I see that SRP also utilizes a scatterlist for DMA mapping. xprtrdma can do something similar. This modernization is used immediately to properly defer DMA unmapping during fmr_unmap_safe (a FIXME). It separates the DMA unmapping coordinates from the rl_segments array. This array, being part of an rpcrdma_req, is always re-used immediately when an RPC exits. A scatterlist is allocated in memory independent of the rl_segments array, so it can be preserved indefinitely (ie, until the MR invalidation and DMA unmapping can actually be done by a worker thread). The FRWR and FMR DMA mapping code are slightly different from each other now, and will diverge further when the "Check for holes" logic can be removed from FRWR (support for SG_GAP MRs). So I chose not to create helpers for the common-looking code. Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method") Suggested-by: Sagi Grimberg <sagi@lightbits.io> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Rename fields in rpcrdma_fmrChuck Lever2016-07-112-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Use the same naming convention used in other RPC/RDMA-related data structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Move init and release helpersChuck Lever2016-07-112-89/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Moving these helpers in a separate patch makes later patches more readable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Create common scatterlist fields in rpcrdma_mwChuck Lever2016-07-112-47/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: FMR is about to replace the rpcrdma_map_one code with scatterlists. Move the scatterlist fields out of the FRWR-specific union and into the generic part of rpcrdma_mw. One minor change: -EIO is now returned if FRWR registration fails. The RPC is terminated immediately, since the problem is likely due to a software bug, thus retrying likely won't help. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Remove FMRs from the unmap list after unmappingChuck Lever2016-07-111-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not remove the FMRs from this list as it processes them. Other ib_unmap_fmr() call sites are careful to remove FMRs from the list after ib_unmap_fmr() returns. Since commit 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR") fmr_op_unmap_sync passes more than one FMR to ib_unmap_fmr(), but it didn't bother to remove the FMRs from that list once the call was complete. I've noticed some instability that could be related to list tangling by the new fmr_op_unmap_sync() logic. In an abundance of caution, add some defensive logic to clean up properly after ib_unmap_fmr(). Fixes: 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | Merge branch 'sunrpc'Trond Myklebust2016-07-245-74/+130
| |\ \ \
| | * | | SUNRPC: Fix a compiler warning in fs/nfs/clnt.cTrond Myklebust2016-07-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the report: net/sunrpc/clnt.c:2580:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | SUNRPC: Fix infinite looping in rpc_clnt_iterate_for_each_xprtTrond Myklebust2016-07-161-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If there were less than 2 entries in the multipath list, then xprt_iter_next_entry_multiple() would never advance beyond the first entry, which is correct for round robin behaviour, but not for the list iteration. The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt() as we would never see the xprt == NULL condition fulfilled. Reported-by: Oleg Drokin <green@linuxhacker.ru> Fixes: 80b14d5e61ca ("SUNRPC: Add a structure to track multiple transports") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | SUNRPC: Fix suspicious enobufs issues.Trond Myklebust2016-06-131-6/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current test is racy when dealing with fast NICs. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | SUNRPC: Reduce latency when send queue is congestedTrond Myklebust2016-06-132-12/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the low latency transport workqueue to process the task that is next in line on the xprt->sending queue. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | SUNRPC: RPC transport queue must be low latencyTrond Myklebust2016-06-133-11/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rpciod can easily get congested due to the long list of queued rpc_tasks. Having the receive queue wait in turn for those tasks to complete can therefore be a bottleneck. Address the problem by separating the workqueues into: - rpciod: manages rpc_tasks - xprtiod: manages transport related work. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | SUNRPC: Consolidate xs_tcp_data_ready and xs_data_readyTrond Myklebust2016-06-131-31/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only difference between the two at this point is the reset of the connection timeout, and since everyone expect tcp ignore that value, we can just throw it into the generic function. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | SUNRPC: Small optimisation of client receiveTrond Myklebust2016-06-131-11/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do not queue the client receive work if we're still processing. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>