summaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* tcp: refactor to consolidate TFO passive open codeYuchung Cheng2019-05-011-27/+25
| | | | | | | | | | Use a helper to consolidate two identical code block for passive TFO. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: undo cwnd on Fast Open spurious SYNACK retransmitYuchung Cheng2019-05-011-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | This patch makes passive Fast Open reverts the cwnd to default initial cwnd (10 packets) if the SYNACK timeout is spurious. Passive Fast Open uses a full socket during handshake so it can use the existing undo logic to detect spurious retransmission by recording the first SYNACK timeout in key state variable retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket has sent some data before the timeout, the spurious timeout is detected by tcp_try_undo_recovery() in tcp_process_loss() in tcp_ack(). But if the socket has not send any data yet, tcp_ack() does not execute the undo code since no data is acknowledged. The fix is to check such case explicitly after tcp_ack() during the ACK processing in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state in case the server closes the socket before handshake completes. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: lower congestion window on Fast Open SYNACK timeoutYuchung Cheng2019-05-011-0/+3
| | | | | | | | | | | | | | | | | | TCP sender would use congestion window of 1 packet on the second SYN and SYNACK timeout except passive TCP Fast Open. This makes passive TFO too aggressive and unfair during congestion at handshake. This patch fixes this issue so TCP (fast open or not, passive or active) always conforms to the RFC6298. Note that tcp_enter_loss() is called only once during recurring timeouts. This is because during handshake, high_seq and snd_una are the same so tcp_enter_loss() would incorrect set the undo state variables multiple times. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: undo init congestion window on false SYNACK timeoutYuchung Cheng2019-05-012-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux implements RFC6298 and use an initial congestion window of 1 upon establishing the connection if the SYNACK packet is retransmitted 2 or more times. In cellular networks SYNACK timeouts are often spurious if the wireless radio was dormant or idle. Also some network path is longer than the default SYNACK timeout. In both cases falsely starting with a minimal cwnd are detrimental to performance. This patch avoids doing so when the final ACK's TCP timestamp indicates the original SYNACK was delivered. It remembers the original SYNACK timestamp when SYNACK timeout has occurred and re-uses the function to detect spurious SYN timeout conveniently. Note that a server may receives multiple SYNs from and immediately retransmits SYNACKs without any SYNACK timeout. This often happens on when the client SYNs have timed out due to wireless delay above. In this case since the server will still use the default initial congestion (e.g. 10) because tp->undo_marker is reset in tcp_init_metrics(). This is an intentional design because packets are not lost but delayed. This patch only covers regular TCP passive open. Fast Open is supported in the next patch. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: better SYNACK sent timestampYuchung Cheng2019-05-012-1/+5
| | | | | | | | | | | | | Detecting spurious SYNACK timeout using timestamp option requires recording the exact SYNACK skb timestamp. Previously the SYNACK sent timestamp was stamped slightly earlier before the skb was transmitted. This patch uses the SYNACK skb transmission timestamp directly. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: undo initial congestion window on false SYN timeoutYuchung Cheng2019-05-012-1/+17
| | | | | | | | | | | | | | | | | | | | | | | Linux implements RFC6298 and use an initial congestion window of 1 upon establishing the connection if the SYN packet is retransmitted 2 or more times. In cellular networks SYN timeouts are often spurious if the wireless radio was dormant or idle. Also some network path is longer than the default SYN timeout. Having a minimal cwnd on both cases are detrimental to TCP startup performance. This patch extends TCP undo feature (RFC3522 aka TCP Eifel) to detect spurious SYN timeout via TCP timestamps. Since tp->retrans_stamp records the initial SYN timestamp instead of first retransmission, we have to implement a different undo code additionally. The detection also must happen before tcp_ack() as retrans_stamp is reset when SYN is acknowledged. Note this patch covers both active regular and fast open. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: avoid unconditional congestion window undo on SYN retransmitYuchung Cheng2019-05-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously if an active TCP open has SYN timeout, it always undo the cwnd upon receiving the SYNACK. This is because tcp_clean_rtx_queue would reset tp->retrans_stamp when SYN is acked, which fools then tcp_try_undo_loss and tcp_packet_delayed. Addressing this issue is required to properly support undo for spurious SYN timeout. Fixing this is tricky -- for active TCP open tp->retrans_stamp records the time when the handshake starts, not the first retransmission time as the name may suggest. The simplest fix is for tcp_packet_delayed to ensure it is valid before comparing with other timestamp. One side effect of this change is active TCP Fast Open that incurred SYN timeout. Upon receiving a SYN-ACK that only acknowledged the SYN, it would immediately retransmit unacknowledged data in tcp_ack() because the data is marked lost after SYN timeout. But the retransmission would have an incorrect ack sequence number since rcv_nxt has not been updated yet tcp_rcv_synsent_state_process(), the retransmission needs to properly handed by tcp_rcv_fastopen_synack() like before. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2019-04-3010-642/+84
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2019-04-30 1) A lot of work to remove indirections from the xfrm code. From Florian Westphal. 2) Support ESP offload in combination with gso partial. From Boris Pismenny. 3) Remove some duplicated code from vti4. From Jeremy Sowden. Please note that there is merge conflict between commit: 8742dc86d0c7 ("xfrm4: Fix uninitialized memory read in _decode_session4") from the ipsec tree and commit: c53ac41e3720 ("xfrm: remove decode_session indirection from afinfo_policy") from the ipsec-next tree. The merge conflict will appear when those trees get merged during the merge window. The conflict can be solved as it is done in linux-next: https://lkml.org/lkml/2019/4/25/1207 Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * xfrm: remove unneeded export_symbolsFlorian Westphal2019-04-231-2/+1
| | | | | | | | | | | | | | None of them have any external callers, make them static. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove decode_session indirection from afinfo_policyFlorian Westphal2019-04-231-114/+0
| | | | | | | | | | | | | | | | No external dependencies, might as well handle this directly. xfrm_afinfo_policy is now 40 bytes on x86_64. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove init_path indirection from afinfo_policyFlorian Westphal2019-04-231-7/+0
| | | | | | | | | | | | | | handle this directly, its only used by ipv6. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove tos indirection from afinfo_policyFlorian Westphal2019-04-231-6/+0
| | | | | | | | | | | | | | Only used by ipv4, we can read the fl4 tos value directly instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: store xfrm_mode directly, not its addressFlorian Westphal2019-04-083-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This structure is now only 4 bytes, so its more efficient to cache a copy rather than its address. No significant size difference in allmodconfig vmlinux. With non-modular kernel that has all XFRM options enabled, this series reduces vmlinux image size by ~11kb. All xfrm_mode indirections are gone and all modes are built-in. before (ipsec-next master): text data bss dec filename 21071494 7233140 11104324 39408958 vmlinux.master after this series: 21066448 7226772 11104324 39397544 vmlinux.patched With allmodconfig kernel, the size increase is only 362 bytes, even all the xfrm config options removed in this series are modular. before: text data bss dec filename 15731286 6936912 4046908 26715106 vmlinux.master after this series: 15731492 6937068 4046908 26715468 vmlinux Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: make xfrm modes builtinFlorian Westphal2019-04-086-147/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | after previous changes, xfrm_mode contains no function pointers anymore and all modules defining such struct contain no code except an init/exit functions to register the xfrm_mode struct with the xfrm core. Just place the xfrm modes core and remove the modules, the run-time xfrm_mode register/unregister functionality is removed. Before: text data bss dec filename 7523 200 2364 10087 net/xfrm/xfrm_input.o 40003 628 440 41071 net/xfrm/xfrm_state.o 15730338 6937080 4046908 26714326 vmlinux 7389 200 2364 9953 net/xfrm/xfrm_input.o 40574 656 440 41670 net/xfrm/xfrm_state.o 15730084 6937068 4046908 26714060 vmlinux The xfrm*_mode_{transport,tunnel,beet} modules are gone. v2: replace CONFIG_INET6_XFRM_MODE_* IS_ENABLED guards with CONFIG_IPV6 ones rather than removing them. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove afinfo pointer from xfrm_modeFlorian Westphal2019-04-081-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds an EXPORT_SYMBOL for afinfo_get_rcu, as it will now be called from ipv6 in case of CONFIG_IPV6=m. This change has virtually no effect on vmlinux size, but it reduces afinfo size and allows followup patch to make xfrm modes const. v2: mark if (afinfo) tests as likely (Sabrina) re-fetch afinfo according to inner_mode in xfrm_prepare_input(). Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove output2 indirection from xfrm_modeFlorian Westphal2019-04-082-112/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | similar to previous patch: no external module dependencies, so we can avoid the indirection by placing this in the core. This change removes the last indirection from xfrm_mode and the xfrm4|6_mode_{beet,tunnel}.c modules contain (almost) no code anymore. Before: text data bss dec hex filename 3957 136 0 4093 ffd net/xfrm/xfrm_output.o 587 44 0 631 277 net/ipv4/xfrm4_mode_beet.o 649 32 0 681 2a9 net/ipv4/xfrm4_mode_tunnel.o 625 44 0 669 29d net/ipv6/xfrm6_mode_beet.o 599 32 0 631 277 net/ipv6/xfrm6_mode_tunnel.o After: text data bss dec hex filename 5359 184 0 5543 15a7 net/xfrm/xfrm_output.o 171 24 0 195 c3 net/ipv4/xfrm4_mode_beet.o 171 24 0 195 c3 net/ipv4/xfrm4_mode_tunnel.o 172 24 0 196 c4 net/ipv6/xfrm6_mode_beet.o 172 24 0 196 c4 net/ipv6/xfrm6_mode_tunnel.o v2: fold the *encap_add functions into xfrm*_prepare_output preserve (move) output2 comment (Sabrina) use x->outer_mode->encap, not inner fix a build breakage on ppc (kbuild robot) Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove input2 indirection from xfrm_modeFlorian Westphal2019-04-082-86/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No external dependencies on any module, place this in the core. Increase is about 1800 byte for xfrm_input.o. The beet helpers get added to internal header, as they can be reused from xfrm_output.c in the next patch (kernel contains several copies of them in the xfrm{4,6}_mode_beet.c files). Before: text data bss dec filename 5578 176 2364 8118 net/xfrm/xfrm_input.o 1180 64 0 1244 net/ipv4/xfrm4_mode_beet.o 171 40 0 211 net/ipv4/xfrm4_mode_transport.o 1163 40 0 1203 net/ipv4/xfrm4_mode_tunnel.o 1083 52 0 1135 net/ipv6/xfrm6_mode_beet.o 172 40 0 212 net/ipv6/xfrm6_mode_ro.o 172 40 0 212 net/ipv6/xfrm6_mode_transport.o 1056 40 0 1096 net/ipv6/xfrm6_mode_tunnel.o After: text data bss dec filename 7373 200 2364 9937 net/xfrm/xfrm_input.o 587 44 0 631 net/ipv4/xfrm4_mode_beet.o 171 32 0 203 net/ipv4/xfrm4_mode_transport.o 649 32 0 681 net/ipv4/xfrm4_mode_tunnel.o 625 44 0 669 net/ipv6/xfrm6_mode_beet.o 172 32 0 204 net/ipv6/xfrm6_mode_ro.o 172 32 0 204 net/ipv6/xfrm6_mode_transport.o 599 32 0 631 net/ipv6/xfrm6_mode_tunnel.o v2: pass inner_mode to xfrm_inner_mode_encap_remove to fix AF_UNSPEC selector breakage (bisected by Benedict Wong) Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove gso_segment indirection from xfrm_modeFlorian Westphal2019-04-083-27/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions are small and we only have versions for tunnel and transport mode for ipv4 and ipv6 respectively. Just place the 'transport or tunnel' conditional in the protocol specific function instead of using an indirection. Before: 3226 12 0 3238 net/ipv4/esp4_offload.o 7004 492 0 7496 net/ipv4/ip_vti.o 3339 12 0 3351 net/ipv6/esp6_offload.o 11294 460 0 11754 net/ipv6/ip6_vti.o 1180 72 0 1252 net/ipv4/xfrm4_mode_beet.o 428 48 0 476 net/ipv4/xfrm4_mode_transport.o 1271 48 0 1319 net/ipv4/xfrm4_mode_tunnel.o 1083 60 0 1143 net/ipv6/xfrm6_mode_beet.o 172 48 0 220 net/ipv6/xfrm6_mode_ro.o 429 48 0 477 net/ipv6/xfrm6_mode_transport.o 1164 48 0 1212 net/ipv6/xfrm6_mode_tunnel.o 15730428 6937008 4046908 26714344 vmlinux After: 3461 12 0 3473 net/ipv4/esp4_offload.o 7000 492 0 7492 net/ipv4/ip_vti.o 3574 12 0 3586 net/ipv6/esp6_offload.o 11295 460 0 11755 net/ipv6/ip6_vti.o 1180 64 0 1244 net/ipv4/xfrm4_mode_beet.o 171 40 0 211 net/ipv4/xfrm4_mode_transport.o 1163 40 0 1203 net/ipv4/xfrm4_mode_tunnel.o 1083 52 0 1135 net/ipv6/xfrm6_mode_beet.o 172 40 0 212 net/ipv6/xfrm6_mode_ro.o 172 40 0 212 net/ipv6/xfrm6_mode_transport.o 1056 40 0 1096 net/ipv6/xfrm6_mode_tunnel.o 15730424 6937008 4046908 26714340 vmlinux Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove xmit indirection from xfrm_modeFlorian Westphal2019-04-082-27/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are only two versions (tunnel and transport). The ip/ipv6 versions are only differ in sizeof(iphdr) vs ipv6hdr. Place this in the core and use x->outer_mode->encap type to call the correct adjustment helper. Before: text data bss dec filename 15730311 6937008 4046908 26714227 vmlinux After: 15730428 6937008 4046908 26714344 vmlinux (about 117 byte increase) v2: use family from x->outer_mode, not inner Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove output indirection from xfrm_modeFlorian Westphal2019-04-084-39/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Same is input indirection. Only exception: we need to export xfrm_outer_mode_output for pktgen. Increases size of vmlinux by about 163 byte: Before: text data bss dec filename 15730208 6936948 4046908 26714064 vmlinux After: 15730311 6937008 4046908 26714227 vmlinux xfrm_inner_extract_output has no more external callers, make it static. v2: add IS_ENABLED(IPV6) guard in xfrm6_prepare_output add two missing breaks in xfrm_outer_mode_output (Sabrina Dubroca) add WARN_ON_ONCE for 'call AF_INET6 related output function, but CONFIG_IPV6=n' case. make xfrm_inner_extract_output static Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: remove input indirection from xfrm_modeFlorian Westphal2019-04-083-25/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No need for any indirection or abstraction here, both functions are pretty much the same and quite small, they also have no external dependencies. xfrm_prepare_input can then be made static. With allmodconfig build, size increase of vmlinux is 25 byte: Before: text data bss dec filename 15730207 6936924 4046908 26714039 vmlinux After: 15730208 6936948 4046908 26714064 vmlinux v2: Fix INET_XFRM_MODE_TRANSPORT name in is-enabled test (Sabrina Dubroca) change copied comment to refer to transport and network header, not skb->{h,nh}, which don't exist anymore. (Sabrina) make xfrm_prepare_input static (Eyal Birger) Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: prefer family stored in xfrm_mode structFlorian Westphal2019-04-081-1/+1
| | | | | | | | | | | | | | | | | | Now that we have the family available directly in the xfrm_mode struct, we can use that and avoid one extra dereference. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: place af number into xfrm_mode structFlorian Westphal2019-04-083-15/+9
| | | | | | | | | | | | | | | | | | | | | | | | This will be useful to know if we're supposed to decode ipv4 or ipv6. While at it, make the unregister function return void, all module_exit functions did just BUG(); there is never a point in doing error checks if there is no way to handle such error. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * vti4: eliminated some duplicate code.Jeremy Sowden2019-03-241-38/+22
| | | | | | | | | | | | | | | | | | | | The ipip tunnel introduced in commit dd9ee3444014 ("vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel") largely duplicated the existing vti_input and vti_recv functions. Refactored to deduplicate the common code. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: gso partial offload supportBoris Pismenny2019-03-241-3/+7
| | | | | | | | | | | | | | | | This patch introduces support for gso partial ESP offload. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Raed Salem <raeds@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | genetlink: optionally validate strictly/dumpsJohannes Berg2019-04-272-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add options to strictly validate messages and dump messages, sometimes perhaps validating dump messages non-strictly may be required, so add an option for that as well. Since none of this can really be applied to existing commands, set the options everwhere using the following spatch: @@ identifier ops; expression X; @@ struct genl_ops ops[] = { ..., { .cmd = X, + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, ... }, ... }; For new commands one should just not copy the .validate 'opt-out' flags and thus get strict validation. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netlink: make validation more configurable for future strictnessJohannes Berg2019-04-275-30/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek2019-04-274-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2019-04-252-10/+27
|\ \ | | | | | | | | | | | | | | | Two easy cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: add sanity checks in ipv4_link_failure()Eric Dumazet2019-04-241-9/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before calling __ip_options_compile(), we need to ensure the network header is a an IPv4 one, and that it is already pulled in skb->head. RAW sockets going through a tunnel can end up calling ipv4_link_failure() with total garbage in the skb, or arbitrary lengthes. syzbot report : BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123 Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204 CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 check_memory_region_inline mm/kasan/generic.c:185 [inline] check_memory_region+0x123/0x190 mm/kasan/generic.c:191 memcpy+0x38/0x50 mm/kasan/common.c:133 memcpy include/linux/string.h:355 [inline] __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123 __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695 ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204 dst_link_failure include/net/dst.h:427 [inline] vti6_xmit net/ipv6/ip6_vti.c:514 [inline] vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553 __netdev_start_xmit include/linux/netdevice.h:4414 [inline] netdev_start_xmit include/linux/netdevice.h:4423 [inline] xmit_one net/core/dev.c:3292 [inline] dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308 __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878 dev_queue_xmit+0x18/0x20 net/core/dev.c:3911 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527 neigh_output include/net/neighbour.h:508 [inline] ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229 ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:278 [inline] ip_output+0x21f/0x670 net/ipv4/ip_output.c:405 dst_output include/net/dst.h:444 [inline] NF_HOOK include/linux/netfilter.h:289 [inline] raw_send_hdrinc net/ipv4/raw.c:432 [inline] raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663 inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0xdd/0x130 net/socket.c:661 sock_write_iter+0x27c/0x3e0 net/socket.c:988 call_write_iter include/linux/fs.h:1866 [inline] new_sync_write+0x4c7/0x760 fs/read_write.c:474 __vfs_write+0xe4/0x110 fs/read_write.c:487 vfs_write+0x20c/0x580 fs/read_write.c:549 ksys_write+0x14f/0x2d0 fs/read_write.c:599 __do_sys_write fs/read_write.c:611 [inline] __se_sys_write fs/read_write.c:608 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:608 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458c29 Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29 RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4 R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff The buggy address belongs to the page: page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x1fffc0000000000() raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2 ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 >ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ^ ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00 ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: set the tcp_min_rtt_wlen range from 0 to one dayZhangXiaoxu2019-04-171-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a UBSAN report as below: UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56 signed integer overflow: 2147483647 * 1000 cannot be represented in type 'int' CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e #1 Call Trace: <IRQ> dump_stack+0x8c/0xba ubsan_epilogue+0x11/0x60 handle_overflow+0x12d/0x170 ? ttwu_do_wakeup+0x21/0x320 __ubsan_handle_mul_overflow+0x12/0x20 tcp_ack_update_rtt+0x76c/0x780 tcp_clean_rtx_queue+0x499/0x14d0 tcp_ack+0x69e/0x1240 ? __wake_up_sync_key+0x2c/0x50 ? update_group_capacity+0x50/0x680 tcp_rcv_established+0x4e2/0xe10 tcp_v4_do_rcv+0x22b/0x420 tcp_v4_rcv+0xfe8/0x1190 ip_protocol_deliver_rcu+0x36/0x180 ip_local_deliver+0x15b/0x1a0 ip_rcv+0xac/0xd0 __netif_receive_skb_one_core+0x7f/0xb0 __netif_receive_skb+0x33/0xc0 netif_receive_skb_internal+0x84/0x1c0 napi_gro_receive+0x2a0/0x300 receive_buf+0x3d4/0x2350 ? detach_buf_split+0x159/0x390 virtnet_poll+0x198/0x840 ? reweight_entity+0x243/0x4b0 net_rx_action+0x25c/0x770 __do_softirq+0x19b/0x66d irq_exit+0x1eb/0x230 do_IRQ+0x7a/0x150 common_interrupt+0xf/0xf </IRQ> It can be reproduced by: echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen Fixes: f672258391b42 ("tcp: track min RTT using windowed min-filter") Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Change nhc_flags to unsigned charDavid Ahern2019-04-231-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nhc_flags holds the RTNH_F flags for a given nexthop (fib{6}_nh). All of the RTNH_F_ flags fit in an unsigned char, and since the API to userspace (rtnh_flags and lower byte of rtm_flags) is 1 byte it can not grow. Make nhc_flags in fib_nh_common an unsigned char and shrink the size of the struct by 8, from 56 to 48 bytes. Update the flags arguments for up netdevice events and fib_nexthop_info which determines the RTNH_F flags to return on a dump/event. The RTNH_F flags are passed in the lower byte of rtm_flags which is an unsigned int so use a temp variable for the flags to fib_nexthop_info and combine with rtm_flags in the caller. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | lwtunnel: Pass encap and encap type attributes to lwtunnel_fill_encapDavid Ahern2019-04-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, lwtunnel_fill_encap hardcodes the encap and encap type attributes as RTA_ENCAP and RTA_ENCAP_TYPE, respectively. The nexthop objects want to re-use this code but the encap attributes passed to userspace as NHA_ENCAP and NHA_ENCAP_TYPE. Since that is the only difference, change lwtunnel_fill_encap to take the attribute type as an input. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: bpfilter: dont use module_init in non-modular codePaul Gortmaker2019-04-221-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Kconfig controlling this code is: bpfilter/Kconfig:menuconfig BPFILTER bpfilter/Kconfig: bool "BPF based packet filtering framework (BPFILTER)" Since it isn't a module, we shouldn't use module_init(). Instead we use device_initcall() - which is exactly what module_init() defaults to for non-modular code/builds. We don't remove <linux/module.h> from the includes since this file does a request_module() and hence is a valid user of that header file, even though it is not modular itself. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Rename net/nexthop.h net/rtnh.hDavid Ahern2019-04-222-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The header contains rtnh_ macros so rename the file accordingly. Allows a later patch to use the nexthop.h name for the new nexthop code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: properly reset skb->truesize for tx recyclingEric Dumazet2019-04-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp sendmsg() and sendpage() normally advance skb->data_len and skb->truesize by the payload added to an skb. But sendmsg(fd, ..., MSG_ZEROCOPY) has to account for whole pages, even if a single byte of payload is used in the page. This means that we can not assume skb->truesize can be adjusted by skb->data_len. We must instead overwrite its value. Otherwise skb->truesize is too big and can hit socket sndbuf limit, especially if the skb is recycled multiple times :/ Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: rework SIOCGSTAMP ioctl handlingArnd Bergmann2019-04-191-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many socket protocol handlers, and all of those end up calling the same sock_get_timestamp()/sock_get_timestampns() helper functions, which results in a lot of duplicate code. With the introduction of 64-bit time_t on 32-bit architectures, this gets worse, as we then need four different ioctl commands in each socket protocol implementation. To simplify that, let's add a new .gettstamp() operation in struct proto_ops, and move ioctl implementation into the common sock_ioctl()/compat_sock_ioctl_trans() functions that these all go through. We can reuse the sock_get_timestamp() implementation, but generalize it so it can deal with both native and compat mode, as well as timeval and timespec structures. Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2019-04-174-35/+40
|\| | | | | | | | | | | | | | | | | Conflict resolution of af_smc.c from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: tcp_grow_window() needs to respect tcp_space()Eric Dumazet2019-04-161-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some reason, tcp_grow_window() correctly tests if enough room is present before attempting to increase tp->rcv_ssthresh, but does not prevent it to grow past tcp_space() This is causing hard to debug issues, like failing the (__tcp_select_window(sk) >= tp->rcv_wnd) test in __tcp_ack_snd_check(), causing ACK delays and possibly slow flows. Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio, we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000" after about 60 round trips, when the active side no longer sends immediate acks. This bug predates git history. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: ensure rcu_read_lock() in ipv4_link_failure()Eric Dumazet2019-04-141-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fib_compute_spec_dst() needs to be called under rcu protection. syzbot reported : WARNING: suspicious RCU usage 5.1.0-rc4+ #165 Not tainted include/linux/inetdevice.h:220 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by swapper/0/0: #0: 0000000051b67925 ((&n->timer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:170 [inline] #0: 0000000051b67925 ((&n->timer)){+.-.}, at: call_timer_fn+0xda/0x720 kernel/time/timer.c:1315 stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.0-rc4+ #165 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5162 __in_dev_get_rcu include/linux/inetdevice.h:220 [inline] fib_compute_spec_dst+0xbbd/0x1030 net/ipv4/fib_frontend.c:294 spec_dst_fill net/ipv4/ip_options.c:245 [inline] __ip_options_compile+0x15a7/0x1a10 net/ipv4/ip_options.c:343 ipv4_link_failure+0x172/0x400 net/ipv4/route.c:1195 dst_link_failure include/net/dst.h:427 [inline] arp_error_report+0xd1/0x1c0 net/ipv4/arp.c:297 neigh_invalidate+0x24b/0x570 net/core/neighbour.c:995 neigh_timer_handler+0xc35/0xf30 net/core/neighbour.c:1081 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325 expire_timers kernel/time/timer.c:1362 [inline] __run_timers kernel/time/timer.c:1681 [inline] __run_timers kernel/time/timer.c:1649 [inline] run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694 __do_softirq+0x266/0x95a kernel/softirq.c:293 invoke_softirq kernel/softirq.c:374 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:414 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: recompile ip options in ipv4_link_failureStephen Suryaputra2019-04-121-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recompile IP options since IPCB may not be valid anymore when ipv4_link_failure is called from arp_error_report. Refer to the commit 3da1ed7ac398 ("net: avoid use IPCB in cipso_v4_error") and the commit before that (9ef6b42ad6fd) for a similar issue. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | dctcp: more accurate tracking of packets deliveryEric Dumazet2019-04-111-28/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit e21db6f69a95 ("tcp: track total bytes delivered with ECN CE marks") core TCP stack does a very good job tracking ECN signals. The "sender's best estimate of CE information" Yuchung mentioned in his patch is indeed the best we can do. DCTCP can use tp->delivered_ce and tp->delivered to not duplicate the logic, and use the existing best estimate. This solves some problems, since current DCTCP logic does not deal with losses and/or GRO or ack aggregation very well. This also removes a dubious use of inet_csk(sk)->icsk_ack.rcv_mss (this should have been tp->mss_cache), and a 64 bit divide. Finally, we can see that the DCTCP logic, calling dctcp_update_alpha() for every ACK could be done differently, calling it only once per RTT. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Abdul Kabbani <akabbani@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recvLorenzo Bianconi2019-04-101-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gue tunnels run iptunnel_pull_offloads on received skbs. This can determine a possible use-after-free accessing guehdr pointer since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device) Fixes: a09a4c8dd1ec ("tunnels: Remove encapsulation offloads on decap") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2019-04-154-209/+3
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Remove the broute pseudo hook, implement this from the bridge prerouting hook instead. Now broute becomes real table in ebtables, from Florian Westphal. This also includes a size reduction patch for the bridge control buffer area via squashing boolean into bitfields and a selftest. 2) Add OS passive fingerprint version matching, from Fernando Fernandez. 3) Support for gue encapsulation for IPVS, from Jacky Hu. 4) Add support for NAT to the inet family, from Florian Westphal. This includes support for masquerade, redirect and nat extensions. 5) Skip interface lookup in flowtable, use device in the dst object. 6) Add jiffies64_to_msecs() and use it, from Li RongQing. 7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King. 8) Statify several functions, patches from YueHaibing and Florian Westphal. 9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing. 10) Merge route extension to core, also from Florian. 11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian. 12) Merge ip/ip6 masquerade extensions, from Florian. This includes netdevice notifier unification. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | netfilter: x_tables: merge ip and ipv6 masquerade modulesFlorian Westphal2019-04-113-111/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No need to have separate modules for this. before: text data bss dec filename 2038 1168 0 3206 net/ipv4/netfilter/ipt_MASQUERADE.ko 1526 1024 0 2550 net/ipv6/netfilter/ip6t_MASQUERADE.ko after: text data bss dec filename 2521 1296 0 3817 net/netfilter/xt_MASQUERADE.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | netfilter: nf_nat: merge ip/ip6 masquerade headersFlorian Westphal2019-04-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both are now implemented by nf_nat_masquerade.c, so no need to keep different headers. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | netfilter: nf_tables: merge route type into coreFlorian Westphal2019-04-083-98/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | very little code, so it really doesn't make sense to have extra modules or even a kconfig knob for this. Merge them and make functionality available unconditionally. The merge makes inet family route support trivial, so add it as well here. Before: text data bss dec hex filename 835 832 0 1667 683 nft_chain_route_ipv4.ko 870 832 0 1702 6a6 nft_chain_route_ipv6.ko 111568 2556 529 114653 1bfdd nf_tables.ko After: text data bss dec hex filename 113133 2556 529 116218 1c5fa nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | | net: fou: remove redundant code in gue_udp_recvLorenzo Bianconi2019-04-111-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove not useful protocol version check in gue_udp_recv since just gue version 0 can hit that code. Moreover remove duplicated hdrlen computation Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | fou: correct spelling of encapsulationSimon Horman2019-04-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Correct spelling of encapsulation. Found by inspection. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | ipv4: Handle RTA_GATEWAY set to 0David Ahern2019-04-102-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Govindarajulu reported a regression with Network Manager which sends an RTA_GATEWAY attribute with the address set to 0. Fixup the handling of RTA_GATEWAY to only set fc_gw_family if the gateway address is actually set. Fixes: f35b794b3b405 ("ipv4: Prepare fib_config for IPv6 gateway") Reported-by: Govindarajulu Varadarajan <govind.varadar@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>