summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'nfsd-5.11' of git://git.linux-nfs.org/projects/cel/cel-2.6Linus Torvalds2020-12-1514-650/+1315
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Chuck Lever: "Several substantial changes this time around: - Previously, exporting an NFS mount via NFSD was considered to be an unsupported feature. With v5.11, the community has attempted to make re-exporting a first-class feature of NFSD. This would enable the Linux in-kernel NFS server to be used as an intermediate cache for a remotely-located primary NFS server, for example, even with other NFS server implementations, like a NetApp filer, as the primary. - A short series of patches brings support for multiple RPC/RDMA data chunks per RPC transaction to the Linux NFS server's RPC/RDMA transport implementation. This is a part of the RPC/RDMA spec that the other premiere NFS/RDMA implementation (Solaris) has had for a very long time, and completes the implementation of RPC/RDMA version 1 in the Linux kernel's NFS server. - Long ago, NFSv4 support was introduced to NFSD using a series of C macros that hid dprintk's and goto's. Over time, the kernel's XDR implementation has been greatly improved, but these C macros have remained and become fallow. A series of patches in this pull request completely replaces those macros with the use of current kernel XDR infrastructure. Benefits include: - More robust input sanitization in NFSD's NFSv4 XDR decoders. - Make it easier to use common kernel library functions that use XDR stream APIs (for example, GSS-API). - Align the structure of the source code with the RFCs so it is easier to learn, verify, and maintain our XDR implementation. - Removal of more than a hundred hidden dprintk() call sites. - Removal of some explicit manipulation of pages to help make the eventual transition to xdr->bvec smoother. - On top of several related fixes in 5.10-rc, there are a few more fixes to get the Linux NFSD implementation of NFSv4.2 inter-server copy up to speed. And as usual, there is a pinch of seasoning in the form of a collection of unrelated minor bug fixes and clean-ups. Many thanks to all who contributed this time around!" * tag 'nfsd-5.11' of git://git.linux-nfs.org/projects/cel/cel-2.6: (131 commits) nfsd: Record NFSv4 pre/post-op attributes as non-atomic nfsd: Set PF_LOCAL_THROTTLE on local filesystems only nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE exportfs: Add a function to return the raw output from fh_to_dentry() nfsd: close cached files prior to a REMOVE or RENAME that would replace target nfsd: allow filesystems to opt out of subtree checking nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations Revert "nfsd4: support change_attr_type attribute" nfsd4: don't query change attribute in v2/v3 case nfsd: minor nfsd4_change_attribute cleanup nfsd: simplify nfsd4_change_info nfsd: only call inode_query_iversion in the I_VERSION case nfs_common: need lock during iterate through the list NFSD: Fix 5 seconds delay when doing inter server copy NFSD: Fix sparse warning in nfs4proc.c SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall sunrpc: clean-up cache downcall nfsd: Fix message level for normal termination NFSD: Remove macros that are no longer used NFSD: Replace READ* macros in nfsd4_decode_compound() ...
| * SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcallChuck Lever2020-12-092-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | There's no need to defer allocation of pages for the receive buffer. - This upcall is quite infrequent - gssp_alloc_receive_pages() can allocate the pages with GFP_KERNEL, unlike the transport - gssp_alloc_receive_pages() knows exactly how many pages are needed Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Olga Kornievskaia <kolga@netapp.com>
| * sunrpc: clean-up cache downcallRoberto Bergantinos Corpas2020-12-091-30/+11
| | | | | | | | | | | | | | | | | | | | We can simplify code around cache_downcall unifying memory allocations using kvmalloc. This has the benefit of getting rid of cache_slow_downcall (and queue_io_mutex), and also matches userland allocation size and limits. Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * NFSD: Replace the internals of the READ_BUF() macroChuck Lever2020-11-301-0/+45
| | | | | | | | | | | | | | | | | | | | Convert the READ_BUF macro in nfs4xdr.c from open code to instead use the new xdr_stream-style decoders already in use by the encode side (and by the in-kernel NFS client implementation). Once this conversion is done, each individual NFSv4 argument decoder can be independently cleaned up to replace these macros with C code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Prepare for xdr_stream-style decoding on the server-sideChuck Lever2020-11-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A "permanent" struct xdr_stream is allocated in struct svc_rqst so that it is usable by all server-side decoders. A per-rqst scratch buffer is also allocated to handle decoding XDR data items that cross page boundaries. To demonstrate how it will be used, add the first call site for the new svcxdr_init_decode() API. As an additional part of the overall conversion, add symbolic constants for successful and failed XDR operations. Returning "0" is overloaded. Sometimes it means something failed, but sometimes it means success. To make it more clear when XDR decoding functions succeed or fail, introduce symbolic constants. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Add xdr_set_scratch_page() and xdr_reset_scratch_buffer()Chuck Lever2020-11-302-25/+5
| | | | | | | | | | | | Clean up: De-duplicate some frequently-used code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Move the svc_xdr_recvfrom() tracepointChuck Lever2020-11-301-3/+1
| | | | | | | | | | | | | | | | | | | | Commit c509f15a5801 ("SUNRPC: Split the xdr_buf event class") added display of the rqst's XID to the svc_xdr_buf_class. However, when the recvfrom tracepoint fires, rq_xid has yet to be filled in with the current XID. So it ends up recording the previous XID that was handled by that svc_rqst. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: support multiple Read chunks per RPCChuck Lever2020-11-301-14/+222
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An efficient way to handle multiple Read chunks is to post them all together and then take a single completion. This is also how the code is already structured: when the Read completion fires, all portions of the incoming RPC message are available to be assembled. The difficult problem is setting up the Read sink buffers so that the server pulls the client's data into place, making subsequent pull-up unnecessary. There are several cases: * No Read chunks. No-op. * One data item Read chunk. This is the fast case, where the inline part of the RPC-over-RDMA message becomes the head and tail, and the data item chunk is placed in buf->pages. * A Position-zero Read chunk. Treated like TCP: the Read chunk is pulled into contiguous pages. + A Position-zero Read chunk with data item chunks. Treated like TCP: all of the Read chunks are pulled into contiguous pages. + Multiple data item chunks. Treated like TCP: the inline part is copied and the data item chunks are pulled into contiguous pages. The "*" cases are already supported. This patch adds support for the "+" cases. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Use the new parsed chunk list when pulling Read chunksChuck Lever2020-11-302-68/+108
| | | | | | | | | | | | | | | | As a pre-requisite for handling multiple Read chunks in each Read list, convert svc_rdma_recv_read_chunk() to use the new parsed Read chunk list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Rename info::ri_chunklenChuck Lever2020-11-301-16/+15
| | | | | | | | | | | | | | | | I'm about to change the purpose of ri_chunklen: Instead of tracking the number of bytes in one Read chunk, it will track the total number of bytes in the Read list. Rename it for clarity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Clean up chunk tracepointsChuck Lever2020-11-302-17/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already have trace_svcrdma_decode_rseg(), which records each ingress Read segment. Instead of reporting those again when they are about to be posted as RDMA Reads, let's fire one tracepoint before posting each type of chunk. So we'll get: nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05 nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05 nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905 nfsd-1998 [002] 321.666622: svcrdma_decode_rqst: cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100 nfsd-1998 [002] 321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3 kworker/2:1H-221 [002] 321.673949: svcrdma_wc_read: cq.id=3 cid=112 status=SUCCESS (0/0x0) Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Remove chunk list pointersChuck Lever2020-11-301-7/+1
| | | | | | | | | | | | Clean up: These pointers are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Support multiple Write chunks in svc_rdma_send_reply_chunkChuck Lever2020-11-301-27/+9
| | | | | | | | | | | | | | Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts of rq_res that do not contain a result payload. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Support multiple Write chunks in svc_rdma_map_reply_msg()Chuck Lever2020-11-301-76/+98
| | | | | | | | | | | | | | | | | | | | | | Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only the parts of rq_res that do not contain a result payload. This change has been tested to confirm that it does not cause a regression in the no Write chunk and single Write chunk cases. Multiple Write chunks have not been tested. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Support multiple write chunks when pulling upChuck Lever2020-11-303-82/+129
| | | | | | | | | | | | | | | | | | When counting the number of SGEs needed to construct a Send request, do not count result payloads. And, when copying the Reply message into the pull-up buffer, result payloads are not to be copied to the Send buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Use parsed chunk lists to encode Reply transport headersChuck Lever2020-11-301-61/+44
| | | | | | | | | | | | | | | | | | Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing the egress RPC Reply transport header, use the new parsed Write list and Reply chunk, which are version- agnostic and already XDR decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Use parsed chunk lists to construct RDMA WritesChuck Lever2020-11-303-34/+36
| | | | | | | | | | | | | | | | | | Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing RDMA Writes, use the new parsed chunk lists for the Write list and Reply chunk, which are version-agnostic and already XDR-decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Use parsed chunk lists to detect reverse direction repliesChuck Lever2020-11-301-15/+14
| | | | | | | | | | | | | | | | | | | | Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Note that the XID sanity test is also removed. The XID is already looked up by the cb handler, and is rejected if it's not recognized. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Use parsed chunk lists to derive the inv_rkeyChuck Lever2020-11-301-37/+30
| | | | | | | | | | | | | | Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Add a "parsed chunk list" data structureChuck Lever2020-11-303-82/+422
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This simple data structure binds the location of each data payload inside of an RPC message to the chunk that will be used to push it to or pull it from the client. There are several benefits to this small additional overhead: * It enables support for more than one chunk in incoming Read and Write lists. * It translates the version-specific on-the-wire format into a generic in-memory structure, enabling support for multiple versions of the RPC/RDMA transport protocol. * It enables the server to re-organize a chunk list if it needs to adjust where Read chunk data lands in server memory without altering the contents of the XDR-encoded Receive buffer. Construction of these lists is done while sanity checking each incoming RPC/RDMA header. Subsequent patches will make use of the generated data structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Clean up svc_rdma_encode_reply_chunk()Chuck Lever2020-11-302-12/+14
| | | | | | | | | | | | Refactor: Match the control flow of svc_rdma_encode_write_list(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Post RDMA Writes while XDR encoding repliesChuck Lever2020-11-302-36/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only RPC/RDMA ordering requirement between RDMA Writes and RDMA Sends is that the responder must post the Writes on the Send queue before posting the Send that conveys the RPC Reply for that Write payload. The Linux NFS server implementation now has a transport method that can post result Payload Writes earlier than svc_rdma_sendto: ->xpo_result_payload() This gets RDMA Writes going earlier so they are more likely to be complete at the remote end before the Send completes. Some care must be taken with pulled-up Replies. We don't want to push the Write chunk and then send the same payload data via Send. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * NFSD: Invoke svc_encode_result_payload() in "read" NFSD encodersChuck Lever2020-11-301-17/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Have the NFSD encoders annotate the boundaries of every direct-data-placement eligible result data payload. Then change svcrdma to use that annotation instead of the xdr->page_len when handling Write chunks. For NFSv4 on RDMA, that enables the ability to recognize multiple result payloads per compound. This is a pre-requisite for supporting multiple Write chunks per RPC transaction. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Rename svc_encode_read_payload()Chuck Lever2020-11-304-14/+15
| | | | | | | | | | | | | | Clean up: "result payload" is a less confusing name for these payloads. "READ payload" reflects only the NFS usage. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Refactor the RDMA Write pathChuck Lever2020-11-301-21/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | Refactor for subsequent changes. Constify the xdr_buf argument to ensure the code here does not modify it, and to enable callers to pass in a "const struct xdr_buf *". At the same time, rename the helper functions, which emit RDMA Writes, not RDMA Sends, and add documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Const-ify the xdr_buf argumentsChuck Lever2020-11-301-2/+2
| | | | | | | | | | | | | | Clean up: Ensure the code in rw.c does not modify the argument, and enable callers to also use "const struct xdr_buf *". Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Adjust synopsis of xdr_buf_subsegment()Chuck Lever2020-11-301-3/+2
| | | | | | | | | | | | | | Clean up: This enables xdr_buf_subsegment()'s callers to pass in a const pointer to that buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * svcrdma: Catch another Reply chunk overflow caseChuck Lever2020-11-301-0/+2
| | | | | | | | | | | | | | | | When space in the Reply chunk runs out in the middle of a segment, we end up passing a zero-length SGL to rdma_rw_ctx_init(), and it oopses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | Merge tag 'net-next-5.11' of ↵Linus Torvalds2020-12-15405-10696/+12807
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - support "prefer busy polling" NAPI operation mode, where we defer softirq for some time expecting applications to periodically busy poll - AF_XDP: improve efficiency by more batching and hindering the adjacency cache prefetcher - af_packet: make packet_fanout.arr size configurable up to 64K - tcp: optimize TCP zero copy receive in presence of partial or unaligned reads making zero copy a performance win for much smaller messages - XDP: add bulk APIs for returning / freeing frames - sched: support fragmenting IP packets as they come out of conntrack - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs BPF: - BPF switch from crude rlimit-based to memcg-based memory accounting - BPF type format information for kernel modules and related tracing enhancements - BPF implement task local storage for BPF LSM - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage Protocols: - mptcp: improve multiple xmit streams support, memory accounting and many smaller improvements - TLS: support CHACHA20-POLY1305 cipher - seg6: add support for SRv6 End.DT4/DT6 behavior - sctp: Implement RFC 6951: UDP Encapsulation of SCTP - ppp_generic: add ability to bridge channels directly - bridge: Connectivity Fault Management (CFM) support as is defined in IEEE 802.1Q section 12.14. Drivers: - mlx5: make use of the new auxiliary bus to organize the driver internals - mlx5: more accurate port TX timestamping support - mlxsw: - improve the efficiency of offloaded next hop updates by using the new nexthop object API - support blackhole nexthops - support IEEE 802.1ad (Q-in-Q) bridging - rtw88: major bluetooth co-existance improvements - iwlwifi: support new 6 GHz frequency band - ath11k: Fast Initial Link Setup (FILS) - mt7915: dual band concurrent (DBDC) support - net: ipa: add basic support for IPA v4.5 Refactor: - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior - phy: add support for shared interrupts; get rid of multiple driver APIs and have the drivers write a full IRQ handler, slight growth of driver code should be compensated by the simpler API which also allows shared IRQs - add common code for handling netdev per-cpu counters - move TX packet re-allocation from Ethernet switch tag drivers to a central place - improve efficiency and rename nla_strlcpy - number of W=1 warning cleanups as we now catch those in a patchwork build bot Old code removal: - wan: delete the DLCI / SDLA drivers - wimax: move to staging - wifi: remove old WDS wifi bridging support" * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits) net: hns3: fix expression that is currently always true net: fix proc_fs init handling in af_packet and tls nfc: pn533: convert comma to semicolon af_vsock: Assign the vsock transport considering the vsock address flags af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path vsock_addr: Check for supported flag values vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag vm_sockets: Add flags field in the vsock address data structure net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context nfc: s3fwrn5: Release the nfc firmware net: vxget: clean up sparse warnings mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3 mlxsw: spectrum_router_xm: Introduce basic XM cache flushing mlxsw: reg: Add Router LPM Cache Enable Register mlxsw: reg: Add Router LPM Cache ML Delete Register mlxsw: spectrum_router_xm: Implement L-value tracking for M-index mlxsw: reg: Add XM Router M Table Register ...
| * | net: fix proc_fs init handling in af_packet and tlsYonatan Linik2020-12-142-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | proc_fs was used, in af_packet, without a surrounding #ifdef, although there is no hard dependency on proc_fs. That caused the initialization of the af_packet module to fail when CONFIG_PROC_FS=n. Specifically, proc_create_net() was used in af_packet.c, and when it fails, packet_net_init() returns -ENOMEM. It will always fail when the kernel is compiled without proc_fs, because, proc_create_net() for example always returns NULL. The calling order that starts in af_packet.c is as follows: packet_init() register_pernet_subsys() register_pernet_operations() __register_pernet_operations() ops_init() ops->init() (packet_net_ops.init=packet_net_init()) proc_create_net() It worked in the past because register_pernet_subsys()'s return value wasn't checked before this Commit 36096f2f4fa0 ("packet: Fix error path in packet_init."). It always returned an error, but was not checked before, so everything was working even when CONFIG_PROC_FS=n. The fix here is simply to add the necessary #ifdef. This also fixes a similar error in tls_proc.c, that was found by Jakub Kicinski. Fixes: d26b698dd3cd ("net/tls: add skeleton of MIB statistics") Fixes: 36096f2f4fa0 ("packet: Fix error path in packet_init") Signed-off-by: Yonatan Linik <yonatanlinik@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | af_vsock: Assign the vsock transport considering the vsock address flagsAndra Paraschiv2020-12-141-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vsock flags field can be set in the connect path (user space app) and the (listen) receive path (kernel space logic). When the vsock transport is assigned, the remote CID is used to distinguish between types of connection. Use the vsock flags value (in addition to the CID) from the remote address to decide which vsock transport to assign. For the sibling VMs use case, all the vsock packets need to be forwarded to the host, so always assign the guest->host transport if the VMADDR_FLAG_TO_HOST flag is set. For the other use cases, the vsock transport assignment logic is not changed. Changelog v3 -> v4 * Update the "remote_flags" local variable type to reflect the change of the "svm_flags" field to be 1 byte in size. v2 -> v3 * Update bitwise check logic to not compare result to the flag value. v1 -> v2 * Use bitwise operator to check the vsock flag. * Use the updated "VMADDR_FLAG_TO_HOST" flag naming. * Merge the checks for the g2h transport assignment in one "if" block. Signed-off-by: Andra Paraschiv <andraprs@amazon.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive pathAndra Paraschiv2020-12-141-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vsock flags can be set during the connect() setup logic, when initializing the vsock address data structure variable. Then the vsock transport is assigned, also considering this flags field. The vsock transport is also assigned on the (listen) receive path. The flags field needs to be set considering the use case. Set the value of the vsock flags of the remote address to the one targeted for packets forwarding to the host, if the following conditions are met: * The source CID of the packet is higher than VMADDR_CID_HOST. * The destination CID of the packet is higher than VMADDR_CID_HOST. Changelog v3 -> v4 * No changes. v2 -> v3 * No changes. v1 -> v2 * Set the vsock flag on the receive path in the vsock transport assignment logic. * Use bitwise operator for the vsock flag setup. * Use the updated "VMADDR_FLAG_TO_HOST" flag naming. Signed-off-by: Andra Paraschiv <andraprs@amazon.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | vsock_addr: Check for supported flag valuesAndra Paraschiv2020-12-141-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check if the provided flags value from the vsock address data structure includes the supported flags in the corresponding kernel version. The first byte of the "svm_zero" field is used as "svm_flags", so add the flags check instead. Changelog v3 -> v4 * New patch in v4. Signed-off-by: Andra Paraschiv <andraprs@amazon.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabledTariq Toukan2020-12-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be logically done when HW_CSUM offload is off. Fixes: 2342a8512a1e ("net: Add TLS TX offload features") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | tcp: Add logic to check for SYN w/ data in tcp_simple_retransmitAlexander Duyck2020-12-141-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG message in the case of IPv6 or a fragmentation request in the case of IPv4. This results in the socket stalling for a second or more as it does not respond to the message by retransmitting the SYN frame. Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that makes use of the entire MSS. In the case of fastopen it does, and an additional complication is that the retransmit queue doesn't contain the original frames. As a result when tcp_simple_retransmit is called and walks the list of frames in the queue it may not mark the frames as lost because both the SYN and the data packet each individually are smaller than the MSS size after the adjustment. This results in the socket being stalled until the retransmit timer kicks in and forces the SYN frame out again without the data attached. In order to resolve this we can reduce the MSS the packets are compared to in tcp_simple_retransmit to -1 for cases where we are still in the TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of the packets related to the fastopen SYN as lost. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/160780498125.3272.15437756269539236825.stgit@localhost.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed ↵Vasily Averin2020-12-141-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | packet syzbot reproduces BUG_ON in skb_checksum_help(): tun creates (bogus) skb with huge partial-checksummed area and small ip packet inside. Then ip_rcv trims the skb based on size of internal ip packet, after that csum offset points beyond of trimmed skb. Then checksum_tg() called via netfilter hook triggers BUG_ON: offset = skb_checksum_start_offset(skb); BUG_ON(offset >= skb_headlen(skb)); To work around the problem this patch forces pskb_trim_rcsum_slow() to return -EINVAL in described scenario. It allows its callers to drop such kind of packets. Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0 Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | net: bridge: Fix a warning when del bridge sysfsWang Hai2020-12-141-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got a warining report: br_sysfs_addbr: can't create group bridge4/bridge ------------[ cut here ]------------ sysfs group 'bridge' not found for kobject 'bridge4' WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group fs/sysfs/group.c:279 [inline] WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group+0x153/0x1b0 fs/sysfs/group.c:270 Modules linked in: iptable_nat ... Call Trace: br_dev_delete+0x112/0x190 net/bridge/br_if.c:384 br_dev_newlink net/bridge/br_netlink.c:1381 [inline] br_dev_newlink+0xdb/0x100 net/bridge/br_netlink.c:1362 __rtnl_newlink+0xe11/0x13f0 net/core/rtnetlink.c:3441 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500 rtnetlink_rcv_msg+0x385/0x980 net/core/rtnetlink.c:5562 netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x793/0xc80 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0x139/0x170 net/socket.c:671 ____sys_sendmsg+0x658/0x7d0 net/socket.c:2353 ___sys_sendmsg+0xf8/0x170 net/socket.c:2407 __sys_sendmsg+0xd3/0x190 net/socket.c:2440 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In br_device_event(), if the bridge sysfs fails to be added, br_device_event() should return error. This can prevent warining when removing bridge sysfs that do not exist. Fixes: bb900b27a2f4 ("bridge: allow creating bridge devices with netlink") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Tested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20201211122921.40386-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | mptcp: let MPTCP create max size skbsPaolo Abeni2020-12-141-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the xmit path of the MPTCP protocol creates smaller- than-max-size skbs, which is suboptimal for the performances. There are a few things to improve: - when coalescing to an existing skb, must clear the PUSH flag - tcp_build_frag() expect the available space as an argument. When coalescing is enable MPTCP already subtracted the to-be-coalesced skb len. We must increment said argument accordingly. Before: ./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM [...] 131072 16384 16384 30.00 24414.86 After: ./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM [...] 131072 16384 16384 30.05 28357.69 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | mptcp: pm: simplify select_local_address()Paolo Abeni2020-12-141-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to unconditionally acquire the join list lock, we can simply splice the join list into the subflow list and traverse only the latter. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | mptcp: parse and act on incoming FASTCLOSE optionFlorian Westphal2020-12-143-0/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | parse the MPTCP FASTCLOSE subtype. If provided key matches the local one, schedule the work queue to close (with tcp reset) all subflows. The MPTCP socket moves to closed state immediately. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | tcp: parse mptcp options contained in reset packetsFlorian Westphal2020-12-142-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because TCP-level resets only affect the subflow, there is a MPTCP option to indicate that the MPTCP-level connection should be closed immediately without a mptcp-level fin exchange. This is the 'MPTCP fast close option'. It can be carried on ack segments or TCP resets. In the latter case, its needed to parse mptcp options also for reset packets so that MPTCP can act accordingly. Next patch will add receive side fastclose support in MPTCP. Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | mptcp: hold mptcp socket before calling tcp_doneFlorian Westphal2020-12-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When processing options from tcp reset path its possible that tcp_done(ssk) drops the last reference on the mptcp socket which results in use-after-free. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | mptcp: use MPTCPOPT_HMAC_LEN macroGeliang Tang2020-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use the macro MPTCPOPT_HMAC_LEN instead of a constant in struct mptcp_options_received. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | mptcp: remove address when netlink flushes addrsGeliang Tang2020-12-141-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the PM netlink flushes the addresses, invoke the remove address function mptcp_nl_remove_subflow_and_signal_addr to remove the addresses and the subflows. Since this function should not be invoked under lock, move __flush_addrs out of the pernet->lock. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | mptcp: attach subflow socket to parent cgroupNicolas Rybowski2020-12-141-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has been observed that the kernel sockets created for the subflows (except the first one) are not in the same cgroup as their parents. That's because the additional subflows are created by kernel workers. This is a problem with eBPF programs attached to the parent's cgroup won't be executed for the children. But also with any other features of CGroup linked to a sk. This patch fixes this behaviour. As the subflow sockets are created by the kernel, we can't use 'mem_cgroup_sk_alloc' because of the current context being the one of the kworker. This is why we have to do low level memcg manipulation, if required. Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | net: openvswitch: fix TTL decrement exception action executionEelco Chaudron2020-12-141-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the exception actions are not processed correctly as the wrong dataset is passed. This change fixes this, including the misleading comment. In addition, a check was added to make sure we work on an IPv4 packet, and not just assume if it's not IPv6 it's IPv4. This was all tested using OVS with patch, https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639, applied and sending packets with a TTL of 1 (and 0), both with IPv4 and IPv6. Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action netlink message format") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextJakub Kicinski2020-12-1419-114/+465
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Missing dependencies in NFT_BRIDGE_REJECT, from Randy Dunlap. 2) Use atomic_inc_return() instead of atomic_add_return() in IPVS, from Yejune Deng. 3) Simplify check for overquota in xt_nfacct, from Kaixu Xia. 4) Move nfnl_acct_list away from struct net, from Miao Wang. 5) Pass actual sk in reject actions, from Jan Engelhardt. 6) Add timeout and protoinfo to ctnetlink destroy events, from Florian Westphal. 7) Four patches to generalize set infrastructure to support for multiple expressions per set element. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: netfilter: nftables: netlink support for several set element expressions netfilter: nftables: generalize set extension to support for several expressions netfilter: nftables: move nft_expr before nft_set netfilter: nftables: generalize set expressions support netfilter: ctnetlink: add timeout and protoinfo to destroy events netfilter: use actual socket sk for REJECT action netfilter: nfnl_acct: remove data from struct net netfilter: Remove unnecessary conversion to bool ipvs: replace atomic_add_return() netfilter: nft_reject_bridge: fix build errors due to code movement ==================== Link: https://lore.kernel.org/r/20201212230513.3465-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | netfilter: nftables: netlink support for several set element expressionsPablo Neira Ayuso2020-12-122-6/+143
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds three new netlink attributes to encapsulate a list of expressions per set elements: - NFTA_SET_EXPRESSIONS: this attribute provides the set definition in terms of expressions. New set elements get attached the list of expressions that is specified by this new netlink attribute. - NFTA_SET_ELEM_EXPRESSIONS: this attribute allows users to restore (or initialize) the stateful information of set elements when adding an element to the set. - NFTA_DYNSET_EXPRESSIONS: this attribute specifies the list of expressions that the set element gets when it is inserted from the packet path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: nftables: generalize set extension to support for several expressionsPablo Neira Ayuso2020-12-123-47/+168
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch replaces NFT_SET_EXPR by NFT_SET_EXT_EXPRESSIONS. This new extension allows to attach several expressions to one set element (not only one single expression as NFT_SET_EXPR provides). This patch prepares for support for several expressions per set element in the netlink userspace API. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: nftables: generalize set expressions supportPablo Neira Ayuso2020-12-122-27/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the set infrastucture allows for one single expressions per element. This patch extends the existing infrastructure to allow for up to two expressions. This is not updating the netlink API yet, this is coming as an initial preparation patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>