path: root/net/ipv4
Commit message (Author, Age, Files, Lines)
...
* | ipv4: Namespaceify tcp_orphan_retries sysctl knob (Nikolay Borisov, 2016-02-07, 3 files, -9/+9)
| | | | | | | | | | Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespaceify tcp_retries2 sysctl knob (Nikolay Borisov, 2016-02-07, 4 files, -11/+12)
| | | | | | | | | | Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespaceify tcp_retries1 sysctl knob (Nikolay Borisov, 2016-02-07, 3 files, -12/+13)
| | | | | | | | | | Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespaceify tcp reordering sysctl knob (Nikolay Borisov, 2016-02-07, 5 files, -16/+17)
| | | | | | | | | | Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespaceify tcp syncookies sysctl knob (Nikolay Borisov, 2016-02-07, 5 files, -20/+18)
| | | | | | | | | | Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespaceify tcp synack retries sysctl knob (Nikolay Borisov, 2016-02-07, 4 files, -14/+11)
| | | | | | | | | | Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespaceify tcp syn retries sysctl knob (Nikolay Borisov, 2016-02-07, 4 files, -12/+15)
| | | | | | | | | | Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: tcp_cong_control helper (Yuchung Cheng, 2016-02-07, 1 file, -12/+19)
| | | | | | | | | | | | | | | | | | | | Refactor and consolidate cwnd and rate updates into a new function tcp_cong_control(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: make congestion control more robust against reordering (Yuchung Cheng, 2016-02-07, 1 file, -1/+1)
| | This change enables congestion control to update cwnd based not only on packets cumulatively acked but also on packets delivered out of order. This makes congestion control robust against packet reordering, because it may raise cwnd as long as packets are being delivered once reordering has been detected (i.e., it only cares about the number of packets delivered, not the ordering among them).
| | Signed-off-by: Yuchung Cheng <ycheng@google.com>
| | Signed-off-by: Neal Cardwell <ncardwell@google.com>
| | Signed-off-by: Eric Dumazet <ncardwell@google.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: refactor pkts acked accounting (Yuchung Cheng, 2016-02-07, 1 file, -4/+3)
| | | | | | | | | | | | | | | | | | | | A small refactoring that gets number of packets cumulatively acked from tcp_clean_rtx_queue() directly. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: new delivery accounting (Yuchung Cheng, 2016-02-07, 1 file, -6/+15)
| | This patch changes the accounting of how many packets are newly acked or sacked when the sender receives an ACK.
| |
| | The current approach basically computes
| |
| |     newly_acked_sacked = (prior_packets - prior_sacked) - (tp->packets_out - tp->sacked_out)
| |
| | where prior_packets and prior_sacked are snapshots taken at the beginning of the ACK processing.
| |
| | The new approach tracks the delivery information via a new TCP state variable "delivered", which monotonically increases as new packets are delivered in order or out of order.
| |
| | The reason for this change is that the current approach is brittle and produces negative or inaccurate estimates.
| |
| | 1) For non-SACK connections, an ACK that advances the SND.UNA could reset the DUPACK counters (tp->sacked_out) in tcp_process_loss() or tcp_fastretrans_alert(). This inflates the inflight suddenly and causes an under-estimate or even a negative estimate. Here is a real example:
| |
| |                      before   after (processing ACK)
| |        packets_out   75       73
| |        sacked_out    23       0
| |        ca state      Loss     Open
| |
| |    The old approach computes (75-23) - (73 - 0) = -21 delivered, while the new approach computes 1 delivered since it considers the 2nd-24th packets to be delivered OOO.
| |
| | 2) An MSS change would re-count packets_out and sacked_out, so the estimate is inaccurate and can even become negative. E.g., the inflight is doubled when the MSS is halved.
| |
| | 3) Spurious retransmission signaled by DSACK is not accounted for.
| |
| | The new approach is simpler and more robust. For SACK connections, tp->delivered increments as packets are being acked or sacked in SACK and ACK processing.
| |
| | For non-SACK connections, it's done in tcp_remove_reno_sacks() and tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered is incremented by the number of packets ACKed (less the current number of DUPACKs received plus one packet hole). Upon receiving a DUPACK, tp->delivered is incremented assuming one out-of-order packet is delivered.
| |
| | Upon receiving a DSACK, tp->delivered is incremented assuming one retransmission is delivered in tcp_sacktag_write_queue().
| |
| | Signed-off-by: Yuchung Cheng <ycheng@google.com>
| | Signed-off-by: Neal Cardwell <ncardwell@google.com>
| | Signed-off-by: Eric Dumazet <ncardwell@google.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
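The arithmetic in the "real example" of this commit message can be replayed in a small userspace sketch. The struct and function names below (`snd_counters`, `newly_acked_sacked_old`, `newly_delivered`) are hypothetical stand-ins, not kernel symbols; only the formula and the sample numbers come from the commit message. The second helper merely illustrates one way to read a per-ACK value from a monotonically increasing counter and is an assumption about usage, not the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced snapshot of the sender counters named in the
 * commit message; the real struct tcp_sock has many more fields. */
struct snd_counters {
    uint32_t packets_out; /* packets sent and not yet cumulatively acked */
    uint32_t sacked_out;  /* SACKed packets, or DUPACK count for non-SACK */
};

/* Old approach: difference of the in-flight estimate before and after
 * the ACK. The result is signed so a negative estimate stays visible. */
static int32_t newly_acked_sacked_old(struct snd_counters prior,
                                      struct snd_counters now)
{
    /* Sketch-level sanity check: sacked packets are a subset of packets out. */
    assert(prior.sacked_out <= prior.packets_out);
    assert(now.sacked_out <= now.packets_out);

    return (int32_t)(prior.packets_out - prior.sacked_out) -
           (int32_t)(now.packets_out - now.sacked_out);
}

/* New approach (illustrative): with a counter that only ever increases,
 * the per-ACK delivery is the difference of two readings. Unsigned
 * subtraction keeps this correct across 32-bit wraparound. */
static uint32_t newly_delivered(uint32_t delivered_before,
                                uint32_t delivered_after)
{
    return delivered_after - delivered_before;
}
```

Feeding in the before/after values from the table (75/23 and 73/0) reproduces the -21 quoted in the message, which is the kind of nonsensical estimate a monotonic counter cannot produce.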
* | tcp: move cwnd reduction after recovery state procesing (Yuchung Cheng, 2016-02-07, 1 file, -32/+28)
| | Currently the cwnd is reduced and increased in various different places. The reduction happens in various places in the recovery state processing (tcp_fastretrans_alert) while the increase happens afterward. A better sequence is to identify lost packets and update the congestion control state (icsk_ca_state) first. Then, based on the new state, raise or lower the cwnd in one central place. This makes cwnd changes easier to reason about.
| | Signed-off-by: Yuchung Cheng <ycheng@google.com>
| | Signed-off-by: Neal Cardwell <ncardwell@google.com>
| | Signed-off-by: Eric Dumazet <ncardwell@google.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: retransmit after recovery processing and congestion control (Yuchung Cheng, 2016-02-07, 1 file, -12/+46)
| | | | | | | | | | | | | | | | | | | | | | | | | | The retransmission and F-RTO transmission currently happen inside recovery state processing (tcp_fastretrans_alert) but before congestion control. This refactoring moves the logic after both s.t. we can determine how much to send (cwnd) before deciding what to send. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: fastopen: call tcp_fin() if FIN present in SYNACK (Eric Dumazet, 2016-02-06, 2 files, -1/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we acknowledge a FIN, it is not enough to ack the sequence number and queue the skb into receive queue. We also have to call tcp_fin() to properly update socket state and send proper poll() notifications. It seems we also had the problem if we received a SYN packet with the FIN flag set, but it does not seem an urgent issue, as no known implementation can do that. Fixes: 61d2bcae99f6 ("tcp: fastopen: accept data/FIN present in SYNACK message") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: do not enqueue skb with SYN flag (Eric Dumazet, 2016-02-06, 2 files, -2/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we remove the SYN flag from the skbs that tcp_fastopen_add_skb() places in socket receive queue, then we can remove the test that tcp_recvmsg() has to perform in fast path. All we have to do is to adjust SEQ in the slow path. For the moment, we place an unlikely() and output a message if we find an skb having SYN flag set. Goal would be to get rid of the test completely. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: fastopen: accept data/FIN present in SYNACK message (Eric Dumazet, 2016-02-06, 2 files, -30/+37)
|/ | | | | | | | | | | | | | | | | | | | | | | | | RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message MAY include data and/or FIN This patch adds support for the client side : If we receive a SYNACK with payload or FIN, queue the skb instead of ignoring it. Since we already support the same for SYN, we refactor the existing code and reuse it. Note we need to clone the skb, so this operation might fail under memory pressure. Sara Dickinson pointed out FreeBSD server Fast Open implementation was planned to generate such SYNACK in the future. The server side might be implemented on linux later. Reported-by: Sara Dickinson <sara@sinodun.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (Linus Torvalds, 2016-02-01, 11 files, -40/+70)
|\  Pull networking fixes from David Miller:
| |
| | "This looks like a lot but it's a mixture of regression fixes as well as fixes for longer standing issues.
| |
| |  1) Fix on-channel cancellation in mac80211, from Johannes Berg.
| |  2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables module, from Eric Dumazet.
| |  3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric Dumazet.
| |  4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is bound, from Craig Gallek.
| |  5) GRO key comparisons don't take lightweight tunnels into account, from Jesse Gross.
| |  6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric Dumazet.
| |  7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we register them, otherwise the NEWLINK netlink message is missing the proper attributes. From Thadeu Lima de Souza Cascardo.
| |  8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido Schimmel.
| |  9) Handle fragments properly in ipv4 early socket demux, from Eric Dumazet.
| | 10) Don't ignore the ifindex key specifier on ipv6 output route lookups, from Paolo Abeni"
| |
| | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
| |   tcp: avoid cwnd undo after receiving ECN
| |   irda: fix a potential use-after-free in ircomm_param_request
| |   net: tg3: avoid uninitialized variable warning
| |   net: nb8800: avoid uninitialized variable warning
| |   net: vxge: avoid unused function warnings
| |   net: bgmac: clarify CONFIG_BCMA dependency
| |   net: hp100: remove unnecessary #ifdefs
| |   net: davinci_cpdma: use dma_addr_t for DMA address
| |   ipv6/udp: use sticky pktinfo egress ifindex on connect()
| |   ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
| |   netlink: not trim skb for mmaped socket when dump
| |   vxlan: fix a out of bounds access in __vxlan_find_mac
| |   net: dsa: mv88e6xxx: fix port VLAN maps
| |   fib_trie: Fix shift by 32 in fib_table_lookup
| |   net: moxart: use correct accessors for DMA memory
| |   ipv4: ipconfig: avoid unused ic_proto_used symbol
| |   bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
| |   bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
| |   bnxt_en: Ring free response from close path should use completion ring
| |   net_sched: drr: check for NULL pointer in drr_dequeue
| |   ...
| * tcp: avoid cwnd undo after receiving ECN (Yuchung Cheng, 2016-01-29, 1 file, -2/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | RFC 4015 section 3.4 says the TCP sender MUST refrain from reversing the congestion control state when the ACK signals congestion through the ECN-Echo flag. Currently we may not always do that when prior_ssthresh is reset upon receiving ACKs with ECE marks. This patch fixes that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fib_trie: Fix shift by 32 in fib_table_lookup (Alexander Duyck, 2016-01-29, 1 file, -3/+4)
| | | | | | | | | | | | | | | | | | | | | | | | The fib_table_lookup function had a shift by 32 that triggered a UBSAN warning. This was due to the fact that I had placed the shift first and then followed it with the check for the suffix length to ignore the undefined behavior. If we reorder this so that we verify the suffix is less than 32 before shifting the value we can avoid the issue. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
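The ordering problem described here (shift first, check the width afterwards) can be illustrated outside the kernel. The sketch below is not the code from fib_table_lookup(); `KEY_BITS` and `index_exceeds_suffix` are hypothetical names, and the 32-bit key width is an assumption made for the example. It only shows the general pattern of validating the shift count before shifting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KEY_BITS 32u /* assumed key width for this sketch */

/* Returns true if `index` has any bit set at or above position `slen`.
 *
 * Shifting a 32-bit value by 32 is undefined behaviour in C, so the
 * width check must come before the shift, not after it. A suffix that
 * spans the whole key cannot be exceeded, so that case returns early. */
static bool index_exceeds_suffix(uint32_t index, unsigned int slen)
{
    assert(slen <= KEY_BITS);     /* a suffix is never longer than the key */

    if (slen >= KEY_BITS)         /* check first ...                       */
        return false;
    return (index >> slen) != 0;  /* ... then shift by at most 31          */
}
```

With the check placed first, the full-width case never reaches the shift, so tools such as UBSAN have nothing to report.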
| * ipv4: ipconfig: avoid unused ic_proto_used symbol (Arnd Bergmann, 2016-01-29, 1 file, -0/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_PROC_FS, CONFIG_IP_PNP_BOOTP, CONFIG_IP_PNP_DHCP and CONFIG_IP_PNP_RARP are all disabled, we get a warning about the ic_proto_used variable being unused: net/ipv4/ipconfig.c:146:12: error: 'ic_proto_used' defined but not used [-Werror=unused-variable] This avoids the warning, by making the definition conditional on whether a dynamic IP configuration protocol is configured. If not, we know that the value is always zero, so we can optimize away the variable and all code that depends on it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: early demux should be aware of fragments (Eric Dumazet, 2016-01-29, 1 file, -1/+4)
| | We should not assume a valid protocol header is present, as this is not the case for IPv4 fragments. Let's avoid extra cache line misses and potential bugs if we actually find a socket and incorrectly use its dst.
| | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: beware of alignments in tcp_get_info() (Eric Dumazet, 2016-01-28, 1 file, -4/+8)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With some combinations of user provided flags in netlink command, it is possible to call tcp_get_info() with a buffer that is not 8-bytes aligned. It does matter on some arches, so we need to use put_unaligned() to store the u64 fields. Current iproute2 package does not trigger this particular issue. Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info") Fixes: 977cb0ecf82e ("tcp: add pacing_rate information into tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
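The kernel's put_unaligned() is not available in userspace, but the idea behind it can be sketched with memcpy(), which places no alignment requirement on its arguments. The helper names below are hypothetical and this is only an analogy to the fix, assuming a plain byte buffer in place of the netlink attribute payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for storing a u64 at an address that may not be
 * 8-byte aligned. A direct `*(uint64_t *)dst = val` can fault or be
 * slow on some architectures; copying byte-wise avoids that. */
static void store_u64_unaligned(void *dst, uint64_t val)
{
    assert(dst != NULL);
    memcpy(dst, &val, sizeof(val));
}

/* Matching load, again without assuming any alignment of `src`. */
static uint64_t load_u64_unaligned(const void *src)
{
    uint64_t val;

    assert(src != NULL);
    memcpy(&val, src, sizeof(val));
    return val;
}
```

Compilers typically turn these fixed-size copies into a single load or store on architectures that tolerate unaligned access, so the portable form usually costs nothing there.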
| * tcp: fix tcp_mark_head_lost to check skb len before fragmenting (Neal Cardwell, 2016-01-28, 1 file, -5/+5)
| | This commit fixes a corner case in tcp_mark_head_lost() which was causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.
| |
| | tcp_mark_head_lost() was assuming that if a packet has tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of M*mss bytes, for any M < N. But with the tricky way TCP pcounts are maintained, this is not always true.
| |
| | For example, suppose the sender sends 4 1-byte packets and has the last 3 packets SACKed. It will merge the last 3 packets in the write queue into an skb with pcount = 3 and len = 3 bytes. If another recovery happens after a SACK reneging event, tcp_mark_head_lost() may attempt to split the skb assuming it has more than 2*MSS bytes.
| |
| | This sounds very counterintuitive, but as the commit description for the related commit c0638c247f55 ("tcp: don't fragment SACKed skbs in tcp_mark_head_lost()") notes, this is because tcp_shifted_skb() coalesces adjacent regions of SACKed skbs, and when doing this it preserves the sum of their packet counts in order to reflect the real-world dynamics on the wire. The c0638c247f55 commit tried to avoid problems by not fragmenting SACKed skbs, since SACKed skbs are where the non-proportionality between pcount and skb->len/mss is known to be possible. However, that commit did not handle the case where during a reneging event one of these weird SACKed skbs becomes an un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.
| |
| | The fix is to simply mark the entire skb lost when this happens. This makes the recovery slightly more aggressive in such corner cases before we detect reordering. But once we detect reordering this code path is bypassed because FACK is disabled.
| |
| | Signed-off-by: Neal Cardwell <ncardwell@google.com>
| | Signed-off-by: Yuchung Cheng <ycheng@google.com>
| | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * inet: frag: Always orphan skbs inside ip_defrag() (Joe Stringer, 2016-01-28, 2 files, -2/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Later parts of the stack (including fragmentation) expect that there is never a socket attached to frag in a frag_list, however this invariant was not enforced on all defrag paths. This could lead to the BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the end of this commit message. While the call could be added to openvswitch to fix this particular error, the head and tail of the frags list are already orphaned indirectly inside ip_defrag(), so it seems like the remaining fragments should all be orphaned in all circumstances. kernel BUG at net/ipv4/ip_output.c:586! [...] Call Trace: <IRQ> [<ffffffffa0205270>] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch] [<ffffffffa02167a7>] ovs_fragment+0xcc/0x214 [openvswitch] [<ffffffff81667830>] ? dst_discard_out+0x20/0x20 [<ffffffff81667810>] ? dst_ifdown+0x80/0x80 [<ffffffffa0212072>] ? find_bucket.isra.2+0x62/0x70 [openvswitch] [<ffffffff810e0ba5>] ? mod_timer_pending+0x65/0x210 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90 [<ffffffffa03205a2>] ? nf_conntrack_in+0x252/0x500 [nf_conntrack] [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70 [<ffffffffa02051a3>] do_output.isra.29+0xe3/0x1b0 [openvswitch] [<ffffffffa0206411>] do_execute_actions+0xe11/0x11f0 [openvswitch] [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70 [<ffffffffa0206822>] ovs_execute_actions+0x32/0xd0 [openvswitch] [<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch] [<ffffffff810b63c4>] ? 
__lock_is_held+0x54/0x70 [<ffffffffa02068a2>] ovs_execute_actions+0xb2/0xd0 [openvswitch] [<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch] [<ffffffffa0215019>] ? ovs_ct_get_labels+0x49/0x80 [openvswitch] [<ffffffffa0213a1d>] ovs_vport_receive+0x5d/0xa0 [openvswitch] [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90 [<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch] [<ffffffffa02148fc>] internal_dev_xmit+0x6c/0x140 [openvswitch] [<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch] [<ffffffff81660299>] dev_hard_start_xmit+0x2b9/0x5e0 [<ffffffff8165fc21>] ? netif_skb_features+0xd1/0x1f0 [<ffffffff81660f20>] __dev_queue_xmit+0x800/0x930 [<ffffffff81660770>] ? __dev_queue_xmit+0x50/0x930 [<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90 [<ffffffff81669876>] ? neigh_resolve_output+0x106/0x220 [<ffffffff81661060>] dev_queue_xmit+0x10/0x20 [<ffffffff816698e8>] neigh_resolve_output+0x178/0x220 [<ffffffff816a8e6f>] ? ip_finish_output2+0x1ff/0x590 [<ffffffff816a8e6f>] ip_finish_output2+0x1ff/0x590 [<ffffffff816a8cee>] ? ip_finish_output2+0x7e/0x590 [<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0 [<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0 [<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80 [<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340 [<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190 [<ffffffff816ab4c0>] ip_output+0x70/0x110 [<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80 [<ffffffff816aa9f9>] ip_local_out+0x39/0x70 [<ffffffff816abf89>] ip_send_skb+0x19/0x40 [<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40 [<ffffffff816df21a>] icmp_push_reply+0xea/0x120 [<ffffffff816df93d>] icmp_reply.constprop.23+0x1ed/0x230 [<ffffffff816df9ce>] icmp_echo.part.21+0x4e/0x50 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70 [<ffffffff810d5f9e>] ? 
rcu_read_lock_held+0x5e/0x70 [<ffffffff816dfa06>] icmp_echo+0x36/0x70 [<ffffffff816e0d11>] icmp_rcv+0x271/0x450 [<ffffffff816a4ca7>] ip_local_deliver_finish+0x127/0x3a0 [<ffffffff816a4bc1>] ? ip_local_deliver_finish+0x41/0x3a0 [<ffffffff816a5160>] ip_local_deliver+0x60/0xd0 [<ffffffff816a4b80>] ? ip_rcv_finish+0x560/0x560 [<ffffffff816a46fd>] ip_rcv_finish+0xdd/0x560 [<ffffffff816a5453>] ip_rcv+0x283/0x3e0 [<ffffffff810b6302>] ? match_held_lock+0x192/0x200 [<ffffffff816a4620>] ? inet_del_offload+0x40/0x40 [<ffffffff8165d062>] __netif_receive_skb_core+0x392/0xae0 [<ffffffff8165e68e>] ? process_backlog+0x8e/0x230 [<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90 [<ffffffff8165d7c8>] __netif_receive_skb+0x18/0x60 [<ffffffff8165e678>] process_backlog+0x78/0x230 [<ffffffff8165e6dd>] ? process_backlog+0xdd/0x230 [<ffffffff8165e355>] net_rx_action+0x155/0x400 [<ffffffff8106b48c>] __do_softirq+0xcc/0x420 [<ffffffff816a8e87>] ? ip_finish_output2+0x217/0x590 [<ffffffff8178e78c>] do_softirq_own_stack+0x1c/0x30 <EOI> [<ffffffff8106b88e>] do_softirq+0x4e/0x60 [<ffffffff8106b948>] __local_bh_enable_ip+0xa8/0xb0 [<ffffffff816a8eb0>] ip_finish_output2+0x240/0x590 [<ffffffff816a9a31>] ? ip_do_fragment+0x831/0x8a0 [<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0 [<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0 [<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80 [<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340 [<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190 [<ffffffff816ab4c0>] ip_output+0x70/0x110 [<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80 [<ffffffff816aa9f9>] ip_local_out+0x39/0x70 [<ffffffff816abf89>] ip_send_skb+0x19/0x40 [<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40 [<ffffffff816d55d3>] raw_sendmsg+0x7d3/0xc30 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90 [<ffffffff816e7557>] ? inet_sendmsg+0xc7/0x1d0 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70 [<ffffffff816e759a>] inet_sendmsg+0x10a/0x1d0 [<ffffffff816e7495>] ? 
inet_sendmsg+0x5/0x1d0 [<ffffffff8163e398>] sock_sendmsg+0x38/0x50 [<ffffffff8163ec5f>] ___sys_sendmsg+0x25f/0x270 [<ffffffff811aadad>] ? handle_mm_fault+0x8dd/0x1320 [<ffffffff8178c147>] ? _raw_spin_unlock+0x27/0x40 [<ffffffff810529b2>] ? __do_page_fault+0x1e2/0x460 [<ffffffff81204886>] ? __fget_light+0x66/0x90 [<ffffffff8163f8e2>] __sys_sendmsg+0x42/0x80 [<ffffffff8163f932>] SyS_sendmsg+0x12/0x20 [<ffffffff8178cb17>] entry_SYSCALL_64_fastpath+0x12/0x6f Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 RIP [<ffffffff816a9a92>] ip_do_fragment+0x892/0x8a0 RSP <ffff88006d603170> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4+ipv6: Make INET*_ESP select CRYPTO_ECHAINIV (Thomas Egerer, 2016-01-25, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ESP algorithms using CBC mode require echainiv. Hence INET*_ESP have to select CRYPTO_ECHAINIV in order to work properly. This solves the issues caused by a misconfiguration as described in [1]. The original approach, patching crypto/Kconfig was turned down by Herbert Xu [2]. [1] https://lists.strongswan.org/pipermail/users/2015-December/009074.html [2] http://marc.info/?l=linux-crypto-vger&m=145224655809562&w=2 Signed-off-by: Thomas Egerer <hakke_007@gmx.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix NULL deref in tcp_v4_send_ack() (Eric Dumazet, 2016-01-21, 1 file, -5/+8)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neal reported crashes with this stack trace : RIP: 0010:[<ffffffff8c57231b>] tcp_v4_send_ack+0x41/0x20f ... CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0 ... [<ffffffff8c57258e>] tcp_v4_reqsk_send_ack+0xa5/0xb4 [<ffffffff8c1a7caa>] tcp_check_req+0x2ea/0x3e0 [<ffffffff8c19e420>] tcp_rcv_state_process+0x850/0x2500 [<ffffffff8c1a6d21>] tcp_v4_do_rcv+0x141/0x330 [<ffffffff8c56cdb2>] sk_backlog_rcv+0x21/0x30 [<ffffffff8c098bbd>] tcp_recvmsg+0x75d/0xf90 [<ffffffff8c0a8700>] inet_recvmsg+0x80/0xa0 [<ffffffff8c17623e>] sock_aio_read+0xee/0x110 [<ffffffff8c066fcf>] do_sync_read+0x6f/0xa0 [<ffffffff8c0673a1>] SyS_read+0x1e1/0x290 [<ffffffff8c5ca262>] system_call_fastpath+0x16/0x1b The problem here is the skb we provide to tcp_v4_send_ack() had to be parked in the backlog of a new TCP fastopen child because this child was owned by the user at the time an out of window packet arrived. Before queuing a packet, TCP has to set skb->dev to NULL as the device could disappear before packet is removed from the queue. Fix this issue by using the net pointer provided by the socket (being a timewait or a request socket). IPv6 is immune to the bug : tcp_v6_send_response() already gets the net pointer from the socket if provided. Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path") Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: diag: support v4mapped sockets in inet_diag_find_one_icsk() (Eric Dumazet, 2016-01-20, 1 file, -7/+14)
| | | | | | | | | | | | | | | | | | | | Lorenzo reported that we could not properly find v4mapped sockets in inet_diag_find_one_icsk(). This patch fixes the issue. Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * udp: fix potential infinite loop in SO_REUSEPORT logic (Eric Dumazet, 2016-01-19, 1 file, -11/+21)
| | Using a combination of connected and un-connected sockets, Dmitry was able to trigger soft lockups with his fuzzer.
| |
| | The problem is that sockets in the SO_REUSEPORT array might have different scores.
| |
| | Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into the soreuseport array, with a score which is smaller than the score of first socket sk1 found in hash table (I am speaking of the regular UDP hash table), if sk1 had the connect() done, giving a +8 to its score.
| |
| |     hash bucket [X] -> sk1 -> sk2 -> NULL
| |
| |     sk1 score = 14 (because it did a connect())
| |     sk2 score = 6
| |
| | SO_REUSEPORT fast selection is an optimization. If it turns out the score of the selected socket does not match score of first socket, just fallback to old SO_REUSEPORT logic instead of trying to be too smart.
| |
| | Normal SO_REUSEPORT users do not mix different kind of sockets, as this mechanism is used for load balance traffic.
| |
| | Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
| | Reported-by: Dmitry Vyukov <dvyukov@google.com>
| | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | Cc: Craig Gallek <kraigatgoog@gmail.com>
| | Acked-by: Craig Gallek <kraig@google.com>
| | Signed-off-by: David S. Miller <davem@davemloft.net>
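The fallback rule stated in this message can be modelled in a few lines. The sketch below uses a hypothetical `sock_stub` type and `accept_fast_pick` helper; it is not the UDP lookup code, only an illustration of "accept the fast-path pick when its score matches the first socket's score, otherwise signal that the caller should use the older per-socket scan".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket with a precomputed match score. */
struct sock_stub {
    int score;
};

/* Returns `picked` if it scores the same as `first` (the socket that
 * triggered the fast path), or NULL to tell the caller to fall back to
 * the slower, score-every-socket selection. Retrying the fast path on a
 * mismatch is what this rule is meant to rule out. */
static const struct sock_stub *
accept_fast_pick(const struct sock_stub *first, const struct sock_stub *picked)
{
    assert(first != NULL);

    if (picked != NULL && picked->score == first->score)
        return picked;
    return NULL;
}
```

With the scores from the diagram (sk1 = 14, sk2 = 6), a fast-path pick of sk2 is rejected and the caller would continue with the regular scan.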
* | tree wide: use kvfree() than conditional kfree()/vfree() (Tetsuo Handa, 2016-01-22, 1 file, -3/+1)
| | There are many locations that do
| |
| |     if (memory_was_allocated_by_vmalloc)
| |         vfree(ptr);
| |     else
| |         kfree(ptr);
| |
| | but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems.
| |
| | Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
| | Acked-by: Michal Hocko <mhocko@suse.com>
| | Acked-by: Jan Kara <jack@suse.com>
| | Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
| | Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
| | Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
| | Acked-by: David Rientjes <rientjes@google.com>
| | Cc: "Luck, Tony" <tony.luck@intel.com>
| | Cc: Oleg Drokin <oleg.drokin@intel.com>
| | Cc: Boris Petkov <bp@suse.de>
| | Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | net: drop tcp_memcontrol.c (Vladimir Davydov, 2016-01-20, 4 files, -203/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions and mem_cgroup->tcp_mem init/destroy stuff. This doesn't belong to network subsys. Let's move it to memcontrol.c. This also allows us to reuse generic code for handling legacy memcg files. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM (Johannes Weiner, 2016-01-20, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2 interface. This also makes legacy-only code sections stand out better. [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: memcontrol: drop unused @css argument in memcg_init_kmem (Johannes Weiner, 2016-01-20, 1 file, -1/+1)
|/  This series adds accounting of the historical "kmem" memory consumers to the cgroup2 memory controller.
|
|   These consumers include the dentry cache, the inode cache, kernel stack pages, and a few others that are pointed out in patch 7/8. The footprint of these consumers is directly tied to userspace activity in common workloads, and so they have to be part of the minimally viable configuration in order to present a complete feature to our users.
|
|   The cgroup2 interface of the memory controller is far from complete, but this series, along with the socket memory accounting series, provides the final semantic changes for the existing memory knobs in the cgroup2 interface, which is scheduled for initial release in the next merge window.
|
|   This patch (of 8):
|
|   Remove unused css argument from memcg_init_kmem()
|
|   Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
|   Acked-by: Michal Hocko <mhocko@suse.com>
|   Cc: Tejun Heo <tj@kernel.org>
|   Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
|   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2016-01-151-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "A quick set of bug fixes after the initial networking merge: 1) Netlink multicast group storage allocator only was tested with nr_groups equal to 1, make it work for other values too. From Matti Vaittinen. 2) Check build_skb() return value in macb and hip04_eth drivers, from Weidong Wang. 3) Don't leak x25_asy on x25_asy_open() failure. 4) More DMA map/unmap fixes in 3c59x from Neil Horman. 5) Don't clobber IP skb control block during GSO segmentation, from Konstantin Khlebnikov. 6) ECN helpers for ipv6 don't fixup the checksum, from Eric Dumazet. 7) Fix SKB segment utilization estimation in xen-netback, from David Vrabel. 8) Fix lockdep splat in bridge addrlist handling, from Nikolay Aleksandrov" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits) bgmac: Fix reversed test of build_skb() return value.
bridge: fix lockdep addr_list_lock false positive splat net: smsc: Add support h8300 xen-netback: free queues after freeing the net device xen-netback: delete NAPI instance when queue fails to initialize xen-netback: use skb to determine number of required guest Rx requests net: sctp: Move sequence start handling into sctp_transport_get_idx() ipv6: update skb->csum when CE mark is propagated net: phy: turn carrier off on phy attach net: macb: clear interrupts when disabling them sctp: support to lookup with ep+paddr in transport rhashtable net: hns: fixes no syscon error when init mdio dts: hisi: fixes no syscon fault when init mdio net: preserve IP control block during GSO segmentation fsl/fman: Delete one function call "put_device" in dtsec_config() hip04_eth: fix missing error handle for build_skb failed 3c59x: fix another page map/single unmap imbalance 3c59x: balance page maps and unmaps x25_asy: Free x25_asy on x25_asy_open() failure. mlxsw: fix SWITCHDEV_OBJ_ID_PORT_MDB ...
| * net: preserve IP control block during GSO segmentationKonstantin Khlebnikov2016-01-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | skb_gso_segment() uses the skb control block during segmentation. This patch adds 32 bytes of room for the previous control block, which will be copied into all resulting segments. This patch fixes a kernel crash when fragmenting forwarded packets. Fragmentation requires a valid IP CB in the skb for clearing IP options. The patch also removes the custom save/restore in the ovs code, which is now redundant. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
* | mm: memcontrol: switch to the updated jump-label APIJohannes Weiner2016-01-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | According to <linux/jump_label.h> the direct use of struct static_key is deprecated. Update the socket and slab accounting code accordingly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David S. Miller <davem@davemloft.net> Reported-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: memcontrol: generalize the socket accounting jump labelJohannes Weiner2016-01-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | The unified hierarchy memory controller is going to use this jump label as well to control the networking callbacks. Move it to the memory controller code and give it a more generic name. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | net: tcp_memcontrol: simplify linkage between socket and page counterJohannes Weiner2016-01-143-49/+29
| | | | | | | | | | | | | | | | | | | | | | | | There won't be any separate counters for socket memory consumed by protocols other than TCP in the future. Remove the indirection and link sockets directly to their owning memory cgroup. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | net: tcp_memcontrol: sanitize tcp memory accounting callbacksJohannes Weiner2016-01-141-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There won't be a tcp control soft limit, so integrating the memcg code into the global skmem limiting scheme complicates things unnecessarily. Replace this with simple and clear charge and uncharge calls--hidden behind a jump label--to account skb memory. Note that this is not purely aesthetic: as a result of shoehorning the per-memcg code into the same memory accounting functions that handle the global level, the old code would compare the per-memcg consumption against the smaller of the per-memcg limit and the global limit. This allowed the total consumption of multiple sockets to exceed the global limit, as long as the individual sockets stayed within bounds. After this change, the code will always compare the per-memcg consumption to the per-memcg limit, and the global consumption to the global limit, and thus close this loophole. Without a soft limit, the per-memcg memory pressure state in sockets is generally questionable. However, we did it until now, so we continue to enter it when the hard limit is hit, and packets are dropped, to let other sockets in the cgroup know that they shouldn't grow their transmit windows, either. However, keep it simple in the new callback model and leave memory pressure lazily when the next packet is accepted (as opposed to doing it synchronously when packets are processed). When packets are dropped, network performance will already be in the toilet, so that should be a reasonable trade-off. As described above, consumption is now checked on the per-memcg level and the global level separately. Likewise, memory pressure states are maintained on both the per-memcg level and the global level, and a socket is considered under pressure when either level asserts as much.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
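The two-level check described in the commit message above can be sketched in plain C. This is a minimal user-space illustration, not the kernel code: `struct counter`, `charge_ok()` and `under_pressure()` are hypothetical names, and the sketch assumes a charge is simply refused when either level would exceed its own limit.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page counter: current usage and hard limit. */
struct counter {
    long usage;
    long limit;
};

/*
 * Each level is compared against its own limit only: per-memcg usage
 * against the per-memcg limit, global usage against the global limit.
 */
static bool charge_ok(const struct counter *memcg,
                      const struct counter *global, long pages)
{
    return memcg->usage + pages <= memcg->limit &&
           global->usage + pages <= global->limit;
}

/* A socket counts as under pressure when either level reports pressure. */
static bool under_pressure(bool memcg_pressure, bool global_pressure)
{
    return memcg_pressure || global_pressure;
}
```

Keeping the two comparisons separate is what prevents one level's limit from being silently substituted for the other's.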
* | net: tcp_memcontrol: simplify the per-memcg limit accessJohannes Weiner2016-01-141-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | tcp_memcontrol replicates the global sysctl_mem limit array per cgroup, but it only ever sets these entries to the value of the memory_allocated page_counter limit. Use the latter directly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | net: tcp_memcontrol: remove dead per-memcg count of allocated socketsJohannes Weiner2016-01-141-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The number of allocated sockets is used for calculations in the soft limit phase, where packets are accepted but the socket is under memory pressure. Since there is no soft limit phase in tcp_memcontrol, and memory pressure is only entered when packets are already dropped, this is actually dead code. Remove it. As this is the last user of parent_cg_proto(), remove that too. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-labelJohannes Weiner2016-01-142-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the jump-label from sock_update_memcg() and sock_release_memcg() to the callsite, and so eliminate those function calls when socket accounting is not enabled. This also eliminates the need for dummy functions because the calls will be optimized away if the Kconfig options are not enabled. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | memcg: do not allow to disable tcp accounting after limit is setVladimir Davydov2016-01-141-12/+5
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former is never cleared while the latter can be cleared by unsetting the limit. This allows disabling tcp socket accounting for new sockets after it was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still guaranteeing that memcg_socket_limit_enabled static key will be decremented on memcg destruction. This functionality looks dubious, because it is not clear what a use case would be. By enabling tcp accounting a user accepts the price. If they then find the performance degradation unacceptable, they can always restart their workload with tcp accounting disabled. It does not seem there is any need to flip it while the workload is running. Besides, it contradicts how the kmem accounting API works: writing whatever to memory.kmem.limit_in_bytes enables kmem accounting for the cgroup in question, after which it cannot be disabled. Therefore one might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just enables socket accounting w/o limiting it, which might be useful by itself, but it isn't true. Since this API peculiarity is not documented anywhere, I propose to drop it. This will allow simplifying the code by dropping cg_proto->flags. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-01-114-6/+10
|\ | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/bonding/bond_main.c drivers/net/ethernet/mellanox/mlxsw/spectrum.h drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c The bond_main.c and mellanox switch conflicts were cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * udp: disallow UFO for sockets with SO_NO_CHECK optionMichal Kubeček2016-01-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit acf8dd0a9d0b ("udp: only allow UFO for packets from SOCK_DGRAM sockets") disallows UFO for packets sent from raw sockets. We need to do the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if for a bit different reason: while such socket would override the CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and bad offloading flags warning is triggered in __skb_gso_segment(). In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow UFO for packets sent by sockets with UDP_NO_CHECK6_TX option. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Tested-by: Shannon Nelson <shannon.nelson@intel.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp_yeah: don't set ssthresh below 2Neal Cardwell2016-01-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno and CUBIC, per RFC 5681 (equation 4). tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh value if the intended reduction is as big or bigger than the current cwnd. Congestion control modules should never return a zero or negative ssthresh. A zero ssthresh generally results in a zero cwnd, causing the connection to stall. A negative ssthresh value will be interpreted as a u32 and will set a target cwnd for PRR near 4 billion. Oleksandr Natalenko reported that a system using tcp_yeah with ECN could see a warning about a prior_cwnd of 0 in tcp_cwnd_reduction(). Testing verified that this was due to tcp_yeah_ssthresh() misbehaving in this way. Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
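The failure mode in the tcp_yeah commit above, and the effect of a floor of 2, can be illustrated with a small stand-alone helper. This is a sketch only: `ssthresh_with_floor()` is a hypothetical name and does not reproduce the kernel's tcp_yeah_ssthresh(); it assumes the new threshold is computed as the current cwnd minus some reduction.

```c
#include <stdint.h>

/*
 * Hypothetical helper: subtract a reduction from cwnd, but never return
 * less than 2. The subtraction is done in a wider signed type so that a
 * reduction as big as or bigger than cwnd gives a value <= 0 instead of
 * wrapping around to a huge unsigned number.
 */
static uint32_t ssthresh_with_floor(uint32_t cwnd, uint32_t reduction)
{
    int64_t target = (int64_t)cwnd - (int64_t)reduction;

    return target < 2 ? 2u : (uint32_t)target;
}
```

Without the clamp, a reduction equal to cwnd would yield 0, and a larger one would be read as an unsigned value close to 4 billion, which are the two cases the commit message describes.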
| * udp: restrict offloads to one namespaceHannes Frederic Sowa2016-01-102-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | udp tunnel offloads tend to aggregate datagrams based on inner headers. gro engine gets notified by tunnel implementations about possible offloads. The match is solely based on the port number. Imagine a tunnel bound to port 53, the offloading will look into all DNS packets and tries to aggregate them based on the inner data found within. This could lead to data corruption and malformed DNS packets. While this patch minimizes the problem and helps an administrator to find the issue by querying ip tunnel/fou, a better way would be to match on the specific destination ip address so if a user space socket is bound to the same address it will conflict. Cc: Tom Herbert <tom@herbertland.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespecify the tcp_keepalive_intvl sysctl knobNikolay Borisov2016-01-103-8/+8
| | | | | | | | | | | | | | | | This is the final part required to namespaceify the tcp keep alive mechanism. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespecify tcp_keepalive_probes sysctl knobNikolay Borisov2016-01-103-8/+8
| | | | | | | | | | | | | | | | This is required to have full tcp keepalive mechanism namespace support. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Namespaceify tcp_keepalive_time sysctl knobNikolay Borisov2016-01-103-8/+9
| | | | | | | | | | | | | | | | | | | | Different net namespaces might have different requirements as to the keepalive time of tcp sockets. This might be required in cases where different firewall rules are in place which require tcp timeout sockets to be increased/decreased independently of the host. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
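What "namespaceifying" a knob means for the keepalive series above can be sketched as follows: the value moves from a single global variable into per-namespace state, and lookups go through the namespace a socket belongs to. All type and function names here (`struct netns_ipv4_demo`, `struct net_demo`, `keepalive_time_for()`) are hypothetical, and the per-socket override is an assumption added for illustration.

```c
/* Hypothetical per-namespace IPv4 settings. */
struct netns_ipv4_demo {
    int sysctl_tcp_keepalive_time; /* seconds */
};

/* Hypothetical network namespace holding its own copy of the knob. */
struct net_demo {
    struct netns_ipv4_demo ipv4;
};

/*
 * Resolve the keepalive time for a socket: a non-zero per-socket value
 * wins, otherwise the value of the socket's own namespace is used
 * rather than a host-wide global.
 */
static int keepalive_time_for(const struct net_demo *net, int sock_override)
{
    return sock_override ? sock_override
                         : net->ipv4.sysctl_tcp_keepalive_time;
}
```

Two namespaces initialized with different values then give different defaults to their sockets without affecting each other.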
* | ipv4: eliminate lock count warnings in ping.cLance Richardson2016-01-081-0/+2
| | | | | | | | | | | | | | | | Add lock release/acquire annotations to ping_seq_start() and ping_seq_stop() to satisfy sparse. Signed-off-by: Lance Richardson <lrichard@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
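The kind of annotation the ping.c commit above refers to can be sketched in user space. The kernel spells these macros `__acquires()` and `__releases()`; the sketch uses the hypothetical names `ANNOT_ACQUIRES` and `ANNOT_RELEASES` to avoid reserved identifiers, and assumes the usual pattern where they expand to a sparse `context` attribute under `__CHECKER__` and to nothing for a normal compiler. The lock is a plain flag, not a real lock.

```c
#ifdef __CHECKER__
/* Seen only by sparse: the function enters with 0 and exits with 1 held. */
# define ANNOT_ACQUIRES(x) __attribute__((context(x, 0, 1)))
/* Seen only by sparse: the function enters with 1 and exits with 0 held. */
# define ANNOT_RELEASES(x) __attribute__((context(x, 1, 0)))
#else
# define ANNOT_ACQUIRES(x)
# define ANNOT_RELEASES(x)
#endif

/* Hypothetical lock state standing in for a real lock. */
static int demo_lock_held;

/* Returns with the lock held, so it is annotated as acquiring it. */
static void demo_seq_start(void) ANNOT_ACQUIRES(demo_lock_held)
{
    demo_lock_held = 1;
}

/* Drops the lock taken by demo_seq_start(), so it is annotated as releasing it. */
static void demo_seq_stop(void) ANNOT_RELEASES(demo_lock_held)
{
    demo_lock_held = 0;
}
```

The annotations change nothing at run time; they only tell the static checker that the lock imbalance inside each function is intentional.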