From 5b441ac8784c1e7f3c619f14da4c3f52e87348d5 Mon Sep 17 00:00:00 2001
From: Robert Shearman
Date: Fri, 10 Mar 2017 20:43:24 +0000
Subject: mpls: allow TTL propagation to IP packets to be configured

Provide the ability to control on a per-route basis whether the TTL
value from an MPLS packet is propagated to an IPv4/IPv6 packet when the
last label is popped as per the theoretical model in RFC 3443 through a
new route attribute, RTA_TTL_PROPAGATE which can be 0 to mean disable
propagation and 1 to mean enable propagation.

In order to provide the ability to change the behaviour for packets
arriving with IPv4/IPv6 Explicit Null labels and to provide an easy way
for a user to change the behaviour for all existing routes without
having to reprogram them, a global knob is provided. This is done
through the addition of a new per-namespace sysctl,
"net.mpls.ip_ttl_propagate", which defaults to enabled. If the
per-route attribute is set (either enabled or disabled) then it
overrides the global configuration.

Signed-off-by: Robert Shearman
Acked-by: David Ahern
Tested-by: David Ahern
Signed-off-by: David S. Miller
---
 Documentation/networking/mpls-sysctl.txt | 11 +++++++++++
 1 file changed, 11 insertions(+)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt
index 15d8d16934fd..9badd1d6685f 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -19,6 +19,17 @@ platform_labels - INTEGER
 	Possible values: 0 - 1048575
 	Default: 0

+ip_ttl_propagate - BOOL
+	Control whether TTL is propagated from the IPv4/IPv6 header to
+	the MPLS header on imposing labels and propagated from the
+	MPLS header to the IPv4/IPv6 header on popping the last label.
+
+	If disabled, the MPLS transport network will appear as a
+	single hop to transit traffic.
+
+	0 - disabled / RFC 3443 [Short] Pipe Model
+	1 - enabled / RFC 3443 Uniform Model (default)
+
 conf/<interface>/input - BOOL
 	Control whether packets can be input on this interface.

--
cgit v1.2.3


From a59166e470868d92f0813977817e99e699398af5 Mon Sep 17 00:00:00 2001
From: Robert Shearman
Date: Fri, 10 Mar 2017 20:43:25 +0000
Subject: mpls: allow TTL propagation from IP packets to be configured

Allow TTL propagation from IP packets to MPLS packets to be configured.
Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which allows the
TTL to be set in the resulting MPLS packet, with the value of 0 having
the semantics of enabling propagation of the TTL from the IP header
(i.e. non-zero values disable propagation).

Also allow the configuration to be overridden globally by reusing the
same sysctl to control whether the TTL is propagated from IP packets
into the MPLS header. If the per-LWT attribute is set then it overrides
the global configuration. If the TTL isn't propagated then a default
TTL value is used which can be configured via a new sysctl,
"net.mpls.default_ttl". This is kept separate from the configuration of
whether IP TTL propagation is enabled as it can be used in the future
when non-IP payloads are supported (i.e. where there is no payload TTL
that can be propagated).

Signed-off-by: Robert Shearman
Acked-by: David Ahern
Tested-by: David Ahern
Signed-off-by: David S. Miller
---
 Documentation/networking/mpls-sysctl.txt | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt
index 9badd1d6685f..2f24a1912a48 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -30,6 +30,14 @@ ip_ttl_propagate - BOOL
 	0 - disabled / RFC 3443 [Short] Pipe Model
 	1 - enabled / RFC 3443 Uniform Model (default)

+default_ttl - INTEGER
+	Default TTL value to use for MPLS packets where it cannot be
+	propagated from an IP header, either because one isn't present
+	or ip_ttl_propagate has been disabled.
+
+	Possible values: 1 - 255
+	Default: 255
+
 conf/<interface>/input - BOOL
 	Control whether packets can be input on this interface.

--
cgit v1.2.3


From a2f346d82bcab1927a199f475dc299c00413ec39 Mon Sep 17 00:00:00 2001
From: Hangbin Liu
Date: Mon, 20 Feb 2017 16:31:35 +0800
Subject: ipvs: fix sync_threshold description and add sync_refresh_period, sync_retries

Fix sync_threshold description which should have two values. Also add
sync_refresh_period and sync_retries based on commit 749c42b620a9
("ipvs: reduce sync rate with time thresholds").

Signed-off-by: Hangbin Liu
Signed-off-by: Simon Horman
---
 Documentation/networking/ipvs-sysctl.txt | 40 +++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 9 deletions(-)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt
index e6b1c025fdd8..7acaaa65451e 100644
--- a/Documentation/networking/ipvs-sysctl.txt
+++ b/Documentation/networking/ipvs-sysctl.txt
@@ -185,15 +185,37 @@ secure_tcp - INTEGER
 	The value definition is the same as that of drop_entry and
 	drop_packet.
-sync_threshold - INTEGER
-	default 3
-
-	It sets synchronization threshold, which is the minimum number
-	of incoming packets that a connection needs to receive before
-	the connection will be synchronized. A connection will be
-	synchronized, every time the number of its incoming packets
-	modulus 50 equals the threshold. The range of the threshold is
-	from 0 to 49.
+sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period
+	default 3 50
+
+	It sets synchronization threshold, which is the minimum number
+	of incoming packets that a connection needs to receive before
+	the connection will be synchronized. A connection will be
+	synchronized, every time the number of its incoming packets
+	modulus sync_period equals the threshold. The range of the
+	threshold is from 0 to sync_period.
+
+	When sync_period and sync_refresh_period are 0, send sync only
+	for state changes or only once when pkts matches sync_threshold
+
+sync_refresh_period - UNSIGNED INTEGER
+	default 0
+
+	In seconds, difference in reported connection timer that triggers
+	new sync message. It can be used to avoid sync messages for the
+	specified period (or half of the connection timeout if it is lower)
+	if connection state is not changed since last sync.
+
+	This is useful for normal connections with high traffic to reduce
+	sync rate. Additionally, retry sync_retries times with period of
+	sync_refresh_period/8.
+
+sync_retries - INTEGER
+	default 0
+
+	Defines sync retries with period of sync_refresh_period/8. Useful
+	to protect against loss of sync messages. The range of the
+	sync_retries is from 0 to 3.

 snat_reroute - BOOLEAN
 	0 - disabled

--
cgit v1.2.3


From 237e5722bc615c3440d8476a694448074380fa01 Mon Sep 17 00:00:00 2001
From: Hangbin Liu
Date: Mon, 20 Feb 2017 16:31:36 +0800
Subject: ipvs: Document sysctl sync_qlen_max and sync_sock_size

Document sysctl sync_qlen_max and sync_sock_size based on commit
1c003b1580e2 ("ipvs: wakeup master thread").

Signed-off-by: Hangbin Liu
Signed-off-by: Simon Horman
---
 Documentation/networking/ipvs-sysctl.txt | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt
index 7acaaa65451e..159d70b6dff3 100644
--- a/Documentation/networking/ipvs-sysctl.txt
+++ b/Documentation/networking/ipvs-sysctl.txt
@@ -217,6 +217,20 @@ sync_retries - INTEGER
 	to protect against loss of sync messages. The range of the
 	sync_retries is from 0 to 3.

+sync_qlen_max - UNSIGNED LONG
+
+	Hard limit for queued sync messages that are not sent yet. It
+	defaults to 1/32 of the memory pages but actually represents
+	number of messages. It will protect us from allocating large
+	parts of memory when the sending rate is lower than the queuing
+	rate.
+
+sync_sock_size - INTEGER
+	default 0
+
+	Configuration of SNDBUF (master) or RCVBUF (slave) socket limit.
+	Default value is 0 (preserve system defaults).
+
 snat_reroute - BOOLEAN
 	0 - disabled
 	not 0 - enabled (default)

--
cgit v1.2.3


From 24b444155b1e237accb2571ac656cf40c941a2a7 Mon Sep 17 00:00:00 2001
From: Hangbin Liu
Date: Mon, 20 Feb 2017 16:31:37 +0800
Subject: ipvs: Document sysctl sync_ports

Document sysctl sync_ports based on commit f73181c8288f ("ipvs: add
support for sync threads").

Signed-off-by: Hangbin Liu
Signed-off-by: Simon Horman
---
 Documentation/networking/ipvs-sysctl.txt | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt
index 159d70b6dff3..a6feecd467cd 100644
--- a/Documentation/networking/ipvs-sysctl.txt
+++ b/Documentation/networking/ipvs-sysctl.txt
@@ -231,6 +231,14 @@ sync_sock_size - INTEGER
 	Configuration of SNDBUF (master) or RCVBUF (slave) socket limit.
 	Default value is 0 (preserve system defaults).

+sync_ports - INTEGER
+	default 1
+
+	The number of threads that master and backup servers can use for
+	sync traffic. Every thread will use single UDP port, thread 0 will
+	use the default port 8848 while last thread will use port
+	8848+sync_ports-1.
+
 snat_reroute - BOOLEAN
 	0 - disabled
 	not 0 - enabled (default)

--
cgit v1.2.3


From 3c679cba588a46ba81a264673e192bbd3c92455b Mon Sep 17 00:00:00 2001
From: Hangbin Liu
Date: Mon, 20 Feb 2017 16:31:38 +0800
Subject: ipvs: Document sysctl pmtu_disc

Document sysctl pmtu_disc based on commit 3654e61137db ("ipvs: add
pmtu_disc option to disable IP DF for TUN packets").

Signed-off-by: Hangbin Liu
Signed-off-by: Simon Horman
---
 Documentation/networking/ipvs-sysctl.txt | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt
index a6feecd467cd..056898685d40 100644
--- a/Documentation/networking/ipvs-sysctl.txt
+++ b/Documentation/networking/ipvs-sysctl.txt
@@ -175,6 +175,14 @@ nat_icmp_send - BOOLEAN
 	for VS/NAT when the load balancer receives packets from real
 	servers but the connection entries don't exist.

+pmtu_disc - BOOLEAN
+	0 - disabled
+	not 0 - enabled (default)
+
+	By default, reject with FRAG_NEEDED all DF packets that exceed
+	the PMTU, irrespective of the forwarding method. For TUN method
+	the flag can be disabled to fragment such packets.
+
 secure_tcp - INTEGER
 	0 - disabled (default)

--
cgit v1.2.3


From 4396e46187ca5070219b81773c4e65088dac50cc Mon Sep 17 00:00:00 2001
From: Soheil Hassas Yeganeh
Date: Wed, 15 Mar 2017 16:30:46 -0400
Subject: tcp: remove tcp_tw_recycle

The tcp_tw_recycle was already broken for connections behind NAT, since
the per-destination timestamp is not monotonically increasing for
multiple machines behind a single destination address.
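
The NAT failure mode can be illustrated with a toy model. The sketch
below is not the kernel's implementation: it assumes a simplified
per-destination check that drops any SYN whose timestamp is lower than
the last one recorded for the same source address. The class and method
names are hypothetical, and the address is a documentation address
standing in for a NAT gateway.

```python
class PerDestinationTimestampCache:
    """Toy model: remember the last TCP timestamp seen per source address."""

    def __init__(self):
        self.last_seen = {}  # source IP -> last accepted timestamp value

    def accept_syn(self, src_ip, tsval):
        last = self.last_seen.get(src_ip)
        if last is not None and tsval < last:
            # Looks like an old duplicate from the "same" peer, so drop it.
            return False
        self.last_seen[src_ip] = tsval
        return True


cache = PerDestinationTimestampCache()
nat_ip = "203.0.113.10"  # one public address shared by hosts A and B

# Hosts A and B have unrelated timestamp clocks but the same source address.
results = [
    cache.accept_syn(nat_ip, 500_000),  # host A: accepted
    cache.accept_syn(nat_ip, 1_200),    # host B: lower value, dropped
    cache.accept_syn(nat_ip, 500_100),  # host A again: accepted
]
print(results)
```

In this model host B can never connect while host A keeps the cached
value ahead of B's clock, which is the behaviour the message describes
for machines sharing a single address.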

After the randomization of TCP timestamp offsets in commit 8a5bd45f6616
(tcp: randomize tcp timestamp offsets for each connection), the
tcp_tw_recycle is broken for all types of connections for the same
reason: the timestamps received from a single machine are not
monotonically increasing anymore.

Remove tcp_tw_recycle, since it is not functional. Also, remove the
PAWSPassive SNMP counter since it is only used for tcp_tw_recycle, and
simplify tcp_v4_route_req and tcp_v6_route_req since the strict
argument is only set when tcp_tw_recycle is enabled.

Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Eric Dumazet
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Cc: Lutz Vieweg
Cc: Florian Westphal
Signed-off-by: David S. Miller
---
 Documentation/networking/ip-sysctl.txt | 5 -----
 1 file changed, 5 deletions(-)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ab0230461377..ed3d0791eb27 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -640,11 +640,6 @@ tcp_tso_win_divisor - INTEGER
 	building larger TSO frames.
 	Default: 3

-tcp_tw_recycle - BOOLEAN
-	Enable fast recycling TIME-WAIT sockets. Default value is 0.
-	It should not be changed without advice/request of technical
-	experts.
-
 tcp_tw_reuse - BOOLEAN
 	Allow to reuse TIME-WAIT sockets for new connections when it is
 	safe from protocol viewpoint. Default value is 0.

--
cgit v1.2.3


From bf4e0a3db97eb882368fd82980b3b1fa0b5b9778 Mon Sep 17 00:00:00 2001
From: Nikolay Aleksandrov
Date: Thu, 16 Mar 2017 15:28:00 +0200
Subject: net: ipv4: add support for ECMP hash policy choice

This patch adds support for ECMP hash policy choice via a new sysctl
called fib_multipath_hash_policy and also adds support for L4 hashes.
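
As a rough illustration of what the policy choice changes, the sketch
below selects the fields that feed the multipath hash. It is an
assumption-laden model, not the kernel code: policy 0 is taken to hash
the source and destination addresses only, policy 1 the usual 5-tuple,
SHA-256 stands in for the kernel's hash function, and the function
names are hypothetical.

```python
import hashlib


def multipath_hash_key(policy, src, dst, proto=None, sport=None, dport=None):
    """Return the tuple of fields hashed under the given policy."""
    if policy == 0:  # layer 3: addresses only
        return (src, dst)
    if policy == 1:  # layer 4: addresses, protocol and ports
        return (src, dst, proto, sport, dport)
    raise ValueError("unknown hash policy: %r" % (policy,))


def pick_nexthop(key, nexthops):
    """Map a hash key onto one of the available next hops."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    return nexthops[int.from_bytes(digest[:4], "big") % len(nexthops)]


nexthops = ["nh-a", "nh-b"]
flow1 = dict(src="192.0.2.1", dst="198.51.100.7", proto=6, sport=40000, dport=80)
flow2 = dict(src="192.0.2.1", dst="198.51.100.7", proto=6, sport=40001, dport=80)

l3_keys = (multipath_hash_key(0, **flow1), multipath_hash_key(0, **flow2))
l4_keys = (multipath_hash_key(1, **flow1), multipath_hash_key(1, **flow2))

# Same hosts, different ports: identical L3 keys, distinct L4 keys.
print(l3_keys[0] == l3_keys[1], l4_keys[0] == l4_keys[1])
print(pick_nexthop(l3_keys[0], nexthops) == pick_nexthop(l3_keys[1], nexthops))
```

Under the layer 3 policy both flows always share a next hop, while the
layer 4 policy lets flows between the same pair of hosts spread across
paths.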

The current values for fib_multipath_hash_policy are:
 0 - layer 3 (default)
 1 - layer 4

If there's an skb hash already set and it matches the chosen policy
then it will be used instead of being calculated (currently only for
L4). In L3 mode we always calculate the hash due to the ICMP error
special case, the flow dissector's field consistentification should
handle the address order thus we can remove the address reversals.
If the skb is provided we always use it for the hash calculation,
otherwise we fall back to fl4, that is if skb is NULL fl4 has to be set.

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
---
 Documentation/networking/ip-sysctl.txt | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ed3d0791eb27..b57308e76b1d 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -73,6 +73,14 @@ fib_multipath_use_neigh - BOOLEAN
 	0 - disabled
 	1 - enabled

+fib_multipath_hash_policy - INTEGER
+	Controls which hash policy to use for multipath routes. Only valid
+	for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+	Default: 0 (Layer 3)
+	Possible values:
+	0 - Layer 3
+	1 - Layer 4
+
 route/max_size - INTEGER
 	Maximum number of routes allowed in the kernel. Increase
 	this when using large numbers of interfaces and/or routes.

--
cgit v1.2.3


From bbea124bc99df968011e76eba105fe964a4eceab Mon Sep 17 00:00:00 2001
From: Joel Scherpelz
Date: Wed, 22 Mar 2017 18:19:04 +0900
Subject: net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.

This commit adds a new sysctl accept_ra_rt_info_min_plen that defines
the minimum acceptable prefix length of Route Information Options. The
new sysctl is intended to be used together with
accept_ra_rt_info_max_plen to configure a range of acceptable prefix
lengths.
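
The combined effect of the two sysctls can be sketched as a simple
range check. This is an illustrative model only: it assumes a Route
Information Option is accepted when its prefix length lies between the
two values inclusive, the function name is hypothetical, and the
48..64 policy is an example rather than a recommended setting.

```python
import ipaddress


def rio_accepted(prefix, min_plen, max_plen):
    """Return True if the prefix length lies in [min_plen, max_plen]."""
    plen = ipaddress.ip_network(prefix).prefixlen
    return min_plen <= plen <= max_plen


# Example policy: only accept prefixes from /48 through /64.
MIN_PLEN, MAX_PLEN = 48, 64

decisions = {
    p: rio_accepted(p, MIN_PLEN, MAX_PLEN)
    for p in ("fc00::/7", "2001:db8:1::/48", "2001:db8:1:2::/64",
              "2001:db8::1/128")
}
for prefix, ok in decisions.items():
    print(prefix, "accepted" if ok else "ignored")
```

With these example values a very short prefix such as fc00::/7 and a
host route such as a /128 are both ignored, while the /48 and /64
prefixes are accepted.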
It is useful to prevent misconfigurations from unintentionally
blackholing too much of the IPv6 address space (e.g., home routers
announcing RIOs for fc00::/7, which is incorrect).

Signed-off-by: Joel Scherpelz
Acked-by: Lorenzo Colitti
Signed-off-by: David S. Miller
---
 Documentation/networking/ip-sysctl.txt | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b57308e76b1d..eaee2c8d4c00 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1461,11 +1461,20 @@ accept_ra_pinfo - BOOLEAN
 	Functional default: enabled if accept_ra is enabled.
 			    disabled if accept_ra is disabled.

+accept_ra_rt_info_min_plen - INTEGER
+	Minimum prefix length of Route Information in RA.
+
+	Route Information w/ prefix smaller than this variable shall
+	be ignored.
+
+	Functional default: 0 if accept_ra_rtr_pref is enabled.
+			    -1 if accept_ra_rtr_pref is disabled.
+
 accept_ra_rt_info_max_plen - INTEGER
 	Maximum prefix length of Route Information in RA.

-	Route Information w/ prefix larger than or equal to this
-	variable shall be ignored.
+	Route Information w/ prefix larger than this variable shall
+	be ignored.

 	Functional default: 0 if accept_ra_rtr_pref is enabled.
 			    -1 if accept_ra_rtr_pref is disabled.

--
cgit v1.2.3


From 55877012d5588ce7427919d6b869922f1a5f60bc Mon Sep 17 00:00:00 2001
From: Jacob Keller
Date: Mon, 6 Feb 2017 14:38:52 -0800
Subject: i40e: document drivers use of ntuple filters

Add documentation describing the driver's use of ethtool ntuple
filters, including the limitations that it has due to hardware, as well
as how it reads and parses the user-def data block.
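
The two encodings this documentation describes can be sketched as
follows. This is only an illustration of the documented bit layout, not
driver code, and the function names are hypothetical. It assumes the
layout given in the added text: for the action, the lower 32 bits are
the queue and the next 8 bits select the VF (0 meaning the PF, so the
VF identifier is offset by 1); for user-def, the lower 16 bits are the
flexible data and bits 16 to 31 are the offset into the payload.

```python
def decode_action(action):
    """Split a 64-bit ethtool action into (queue, vf); vf is None for the PF."""
    queue = action & 0xFFFFFFFF
    vf_field = (action >> 32) & 0xFF
    vf = None if vf_field == 0 else vf_field - 1
    return queue, vf


def decode_user_def(user_def):
    """Split the lower 32 bits of user-def into (payload offset, 2-byte data)."""
    data = user_def & 0xFFFF
    offset = (user_def >> 16) & 0xFFFF
    return offset, data


queue, vf = decode_action(0x800000002)
print("queue", queue, "vf", vf)

for value in (0x4FFFF, 0x8BEAF):
    offset, data = decode_user_def(value)
    print("offset", offset, "data", hex(data))
```

The values used are the ones from the documentation's own examples:
the action decodes to queue 2 on VF 7, and the two user-def values
decode to 0xFFFF at offset 4 and 0xBEAF at offset 8.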

Signed-off-by: Jacob Keller
Tested-by: Andrew Bowers
Signed-off-by: Jeff Kirsher
---
 Documentation/networking/i40e.txt | 72 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/i40e.txt b/Documentation/networking/i40e.txt
index a251bf4fe9c9..57e616ed10b0 100644
--- a/Documentation/networking/i40e.txt
+++ b/Documentation/networking/i40e.txt
@@ -63,6 +63,78 @@ Additional Configurations
   The latest release of ethtool can be found from
   https://www.kernel.org/pub/software/network/ethtool

+
+  Flow Director n-tuple traffic filters (FDir)
+  --------------------------------------------
+  The driver utilizes the ethtool interface for configuring ntuple filters,
+  via "ethtool -N <device> <filter>".
+
+  The sctp4, ip4, udp4, and tcp4 flow types are supported with the standard
+  fields including src-ip, dst-ip, src-port and dst-port. The driver only
+  supports fully enabling or fully masking the fields, so use of the mask
+  fields for partial matches is not supported.
+
+  Additionally, the driver supports using the action to specify filters for a
+  Virtual Function. You can specify the action as a 64-bit value, where the
+  lower 32 bits represent the queue number, while the next 8 bits represent
+  which VF. Note that 0 is the PF, so the VF identifier is offset by 1. For
+  example:
+
+    ... action 0x800000002 ...
+
+  This directs traffic for Virtual Function 7 (8 minus 1) to queue 2 of
+  that VF.
+
+  The driver also supports using the user-defined field to specify 2 bytes of
+  arbitrary data to match within the packet payload in addition to the regular
+  fields.
+  The data is specified in the lower 32 bits of the user-def field in
+  the following way:
+
+  +----------------------------+---------------------------+
+  | 31    28    24    20    16 | 15    12     8     4     0|
+  +----------------------------+---------------------------+
+  | offset into packet payload |  2 bytes of flexible data |
+  +----------------------------+---------------------------+
+
+  As an example,
+
+    ... user-def 0x4FFFF ....
+
+  means to match the value 0xFFFF 4 bytes into the packet payload. Note that
+  the offset is based on the beginning of the payload, and not the beginning
+  of the packet. Thus
+
+    flow-type tcp4 ... user-def 0x8BEAF ....
+
+  would match TCP/IPv4 packets which have the value 0xBEAF 8 bytes into the
+  TCP/IPv4 payload.
+
+  For ICMP, the hardware parses the ICMP header as 4 bytes of header and 4
+  bytes of payload, so if you want to match an ICMP frame's payload you may
+  need to add 4 to the offset in order to match the data.
+
+  Furthermore, the offset can only be up to a value of 64, as the hardware
+  will only read up to 64 bytes of data from the payload. It must also be even
+  as the flexible data is 2 bytes long and must be aligned to byte 0 of the
+  packet payload.
+
+  When programming filters, the hardware is limited to using a single input
+  set for each flow type. This means that it is an error to program two
+  different filters with the same type that don't match on the same fields.
+  Thus the second of the following two commands will fail:
+
+    ethtool -N <device> flow-type tcp4 src-ip 192.168.0.7 action 5
+    ethtool -N <device> flow-type tcp4 dst-ip 192.168.15.18 action 1
+
+  This is because the first filter will be accepted and reprogram the input
+  set for TCPv4 filters, but the second filter will be unable to reprogram the
+  input set until all the conflicting TCPv4 filters are first removed.
+
+  Note that the user-defined flexible offset is also considered part of the
+  input set and cannot be programmed separately for multiple filters of the
+  same type.
+  However, the flexible data is not part of the input set and
+  multiple filters may use the same offset but match against different data.
+
   Data Center Bridging (DCB)
   --------------------------
     DCB configuration is not currently supported.

--
cgit v1.2.3


From dddb64bcb34615bf48a2c9cb9881eb76795cc5c5 Mon Sep 17 00:00:00 2001
From: "subashab@codeaurora.org"
Date: Thu, 23 Mar 2017 13:34:16 -0600
Subject: net: Add sysctl to toggle early demux for tcp and udp

Certain systems process a significant unconnected UDP workload. It
would be preferable to disable UDP early demux for those systems and
enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system:
782 -> 788Mbps unconnected single stream UDPv4
633 -> 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecture and cache
sizes. There will not be much difference if the entire UDP hash table
is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3->v4: Store and use read once result instead of querying pointer
again incorrectly.

v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan
Suggested-by: Eric Dumazet
Cc: Stephen Hemminger
Cc: Tom Herbert
Cc: David Miller
Signed-off-by: David S. Miller
---
 Documentation/networking/ip-sysctl.txt | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index eaee2c8d4c00..b1c6500e7a8d 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -856,12 +856,21 @@ ip_dynaddr - BOOLEAN
 ip_early_demux - BOOLEAN
 	Optimize input packet processing down to one demux for
 	certain kinds of local sockets. Currently we only do this
-	for established TCP sockets.
+	for established TCP and connected UDP sockets.

 	It may add an additional cost for pure routing workloads that
 	reduces overall throughput, in such case you should disable it.
 	Default: 1

+tcp_early_demux - BOOLEAN
+	Enable early demux for established TCP sockets.
+	Default: 1
+
+udp_early_demux - BOOLEAN
+	Enable early demux for connected UDP sockets. Disable this if
+	your system could experience more unconnected load.
+	Default: 1
+
 icmp_echo_ignore_all - BOOLEAN
 	If set non-zero, then the kernel will ignore all ICMP ECHO
 	requests sent to it.

--
cgit v1.2.3


From e2989ee9746b3f2e78d1a39bbc402d884e8b8bf1 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov
Date: Sun, 23 Apr 2017 09:01:00 -0700
Subject: bpf, doc: update list of architectures that do eBPF JIT

update the list and remove 'in the future' statement,
since all still alive 64-bit architectures now do eBPF JIT.

Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller
---
 Documentation/networking/filter.txt | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 683ada5ad81d..b69b205501de 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -595,10 +595,9 @@ got from bpf_prog_create(), and 'ctx' the given context (e.g. skb pointer). All
 constraints and restrictions from bpf_check_classic() apply before a conversion
 to the new layout is being done behind the scenes!

-Currently, the classic BPF format is being used for JITing on most of the
-architectures. x86-64, aarch64 and s390x perform JIT compilation from eBPF
-instruction set, however, future work will migrate other JIT compilers as well,
-so that they will profit from the very same benefits.
+Currently, the classic BPF format is being used for JITing on most 32-bit
+architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
+compilation from eBPF instruction set.

 Some core changes of the new internal format:

--
cgit v1.2.3


From cf1ef3f0719b4dcb74810ed507e2a2540f9811b4 Mon Sep 17 00:00:00 2001
From: Wei Wang
Date: Thu, 20 Apr 2017 14:45:46 -0700
Subject: net/tcp_fastopen: Disable active side TFO in certain scenarios

Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:

https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.

C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X     S
		[retry and timeout]

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.

C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C?  X <- data ------ S
		[retry and timeout]

This is the worst failure because the client cannot detect such
behavior to mitigate the situation (such as disabling TFO). Failing to
proceed, the application (e.g., SSL library) may simply timeout and
retry with TFO again, and the process repeats indefinitely.

The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST

We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO socket not opened
on loopback has successfully received data segments from the server.
And we examine this condition during close().

The rationale behind it is that when such a firewall issue happens, the
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or the application
running on the server should close the socket as it is not able to
receive any response from the client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.

Also, add a debug sysctl:
  tcp_fastopen_blackhole_timeout_sec:
    the initial timeout to use when a firewall blackhole issue happens.
    This can be set and read.
    Setting it to 0 disables the active disable logic.

Signed-off-by: Wei Wang
Acked-by: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller
---
 Documentation/networking/ip-sysctl.txt | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b1c6500e7a8d..974ab47ae53a 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -602,6 +602,14 @@ tcp_fastopen - INTEGER
 	Note that additional client or server features are only
 	effective if the basic support (0x1 and 0x2) are enabled respectively.

+tcp_fastopen_blackhole_timeout_sec - INTEGER
+	Initial time period in seconds to disable Fastopen on active TCP sockets
+	when a TFO firewall blackhole issue happens.
+	This time period will grow exponentially when more blackhole issues
+	get detected right after Fastopen is re-enabled and will reset to
+	initial value when the blackhole issue goes away.
+	By default, it is set to 1hr.
+
 tcp_syn_retries - INTEGER
 	Number of times initial SYNs for an active TCP connection attempt
 	will be retransmitted. Should not be higher than 127. Default value

--
cgit v1.2.3


From d5066c467ee3b8eb8716e776584b3953a0bb218a Mon Sep 17 00:00:00 2001
From: Liam Beguin
Date: Mon, 1 May 2017 11:02:01 -0400
Subject: switchdev: documentation: fix whitespace issues

Figure 1 is full of whitespaces; fix it

Signed-off-by: Liam Beguin
Signed-off-by: Sylvain Lemieux
Acked-by: Ivan Vecera
Signed-off-by: David S. Miller
---
 Documentation/networking/switchdev.txt | 70 +++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 35 deletions(-)

(limited to 'Documentation/networking')

diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
index 2bbac05ab9e2..3e7b946dea27 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -13,43 +13,43 @@ an example setup using a data-center-class switch ASIC chip. Other setups
 with SR-IOV or soft switches, such as OVS, are possible.
-                             User-space tools                                  -                                                                               -       user space                   |                                          -      +-------------------------------------------------------------------+    -       kernel                       | Netlink                                  -                                    |                                          -                     +--------------+-------------------------------+          -                     |         Network stack                        |          -                     |           (Linux)                            |          -                     |                                              |          -                     +----------------------------------------------+          -                                                                               +                             User-space tools + +       user space                   | +      +-------------------------------------------------------------------+ +       kernel                       | Netlink +                                    | +                     +--------------+-------------------------------+ +                     |         Network stack                        | +                     |           (Linux)                            | +                     |                                              | +                     +----------------------------------------------+ + sw1p2 sw1p4 sw1p6 -                      sw1p1  + sw1p3 +  sw1p5 +         eth1              -                        +    |    +    |    +    |            +                -                        |    |    |    |    |    |            |                -                     +--+----+----+----+-+--+----+---+  +-----+-----+          -                     |         Switch driver         |  |    mgmt   |          -                  
   |        (this document)        |  |   driver  |          -                     |                               |  |           |          -                     +--------------+----------------+  +-----------+          -                                    |                                          -       kernel                       | HW bus (eg PCI)                          -      +-------------------------------------------------------------------+    -       hardware                     |                                          -                     +--------------+---+------------+                         -                     |         Switch device (sw1)   |                         -                     |  +----+                       +--------+                -                     |  |    v offloaded data path   | mgmt port               -                     |  |    |                       |                         -                     +--|----|----+----+----+----+---+                         -                        |    |    |    |    |    |                             -                        +    +    +    +    +    +                             +                      sw1p1  + sw1p3 +  sw1p5 +         eth1 +                        +    |    +    |    +    |            + +                        |    |    |    |    |    |            | +                     +--+----+----+----+-+--+----+---+  +-----+-----+ +                     |         Switch driver         |  |    mgmt   | +                     |        (this document)        |  |   driver  | +                     |                               |  |           | +                     +--------------+----------------+  +-----------+ +                                    | +       kernel                       | HW bus (eg PCI) +      +-------------------------------------------------------------------+ +       hardware                     | +                     +--------------+---+------------+ +     
                |         Switch device (sw1)   | +                     |  +----+                       +--------+ +                     |  |    v offloaded data path   | mgmt port +                     |  |    |                       | +                     +--|----|----+----+----+----+---+ +                        |    |    |    |    |    | +                        +    +    +    +    +    +                        p1   p2   p3   p4   p5   p6 -                                        -                             front-panel ports                                 -                                                                               + +                             front-panel ports + Fig 1. -- cgit v1.2.3