summaryrefslogtreecommitdiffstats
path: root/net/ipv4/ip_output.c
Commit message (Collapse)AuthorAgeFilesLines
* sk_buff: add skb extension infrastructureFlorian Westphal2018-12-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds an optional extension infrastructure, with ispec (xfrm) and bridge netfilter as first users. objdiff shows no changes if kernel is built without xfrm and br_netfilter support. The third (planned future) user is Multipath TCP which is still out-of-tree. MPTCP needs to map logical mptcp sequence numbers to the tcp sequence numbers used by individual subflows. This DSS mapping is read/written from tcp option space on receive and written to tcp option space on transmitted tcp packets that are part of and MPTCP connection. Extending skb_shared_info or adding a private data field to skb fclones doesn't work for incoming skb, so a different DSS propagation method would be required for the receive side. mptcp has same requirements as secpath/bridge netfilter: 1. extension memory is released when the sk_buff is free'd. 2. data is shared after cloning an skb (clone inherits extension) 3. adding extension to an skb will COW the extension buffer if needed. The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the mapping for tx and rx processing. Two new members are added to sk_buff: 1. 'active_extensions' byte (filling a hole), telling which extensions are available for this skb. This has two purposes. a) avoids the need to initialize the pointer. b) allows to "delete" an extension by clearing its bit value in ->active_extensions. While it would be possible to store the active_extensions byte in the extension struct instead of sk_buff, there is one problem with this: When an extension has to be disabled, we can always clear the bit in skb->active_extensions. But in case it would be stored in the extension buffer itself, we might have to COW it first, if we are dealing with a cloned skb. On kmalloc failure we would be unable to turn an extension off. 2. extension pointer, located at the end of the sk_buff. If the active_extensions byte is 0, the pointer is undefined, it is not initialized on skb allocation. This adds extra code to skb clone and free paths (to deal with refcount/free of extension area) but this replaces similar code that manages skb->nf_bridge and skb->sp structs in the followup patches of the series. It is possible to add support for extensions that are not preseved on clones/copies. To do this, it would be needed to define a bitmask of all extensions that need copy/cow semantics, and change __skb_ext_copy() to check ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set ->active_extensions to 0 on the new clone. This isn't done here because all extensions that get added here need the copy/cow semantics. v2: Allocate entire extension space using kmem_cache. Upside is that this allows better tracking of used memory, downside is that we will allocate more space than strictly needed in most cases (its unlikely that all extensions are active/needed at same time for same skb). The allocated memory (except the small extension header) is not cleared, so no additonal overhead aside from memory usage. Avoid atomic_dec_and_test operation on skb_ext_put() by using similar trick as kfree_skbmem() does with fclone_ref: If recount is 1, there is no concurrent user and we can free right away. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip: silence udp zerocopy smatch false positiveWillem de Bruijn2018-12-081-1/+2
| | | | | | | | | | | | | | extra_uref is used in __ip(6)_append_data only if uarg is set. Smatch sees that the variable is passed to sock_zerocopy_put_abort. This function accesses it only when uarg is set, but smatch cannot infer this. Make this dependency explicit. Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: elide zerocopy operation in hot pathWillem de Bruijn2018-12-031-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info. Release of its last reference triggers a completion notification. The TCP stack in tcp_sendmsg_locked holds an extra ref independent of the skbs, because it can build, send and free skbs within its loop, possibly reaching refcount zero and freeing the ubuf_info too soon. The UDP stack currently also takes this extra ref, but does not need it as all skbs are sent after return from __ip(6)_append_data. Avoid the extra refcount_inc and refcount_dec_and_test, and generally the sock_zerocopy_put in the common path, by passing the initial reference to the first skb. This approach is taken instead of initializing the refcount to 0, as that would generate error "refcount_t: increment on 0" on the next skb_zcopy_set. Changes v3 -> v4 - Move skb_zcopy_set below the only kfree_skb that might cause a premature uarg destroy before skb_zerocopy_put_abort - Move the entire skb_shinfo assignment block, to keep that cacheline access in one place Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: msg_zerocopyWillem de Bruijn2018-12-031-1/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and interpret flag MSG_ZEROCOPY. This patch was previously part of the zerocopy RFC patchsets. Zerocopy is not effective at small MTU. With segmentation offload building larger datagrams, the benefit of page flipping outweights the cost of generating a completion notification. tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on test patch and making skb_orphan_frags_rx same as skb_orphan_frags: ipv4 udp -t 1 tx=191312 (11938 MB) txc=0 zc=n rx=191312 (11938 MB) ipv4 udp -z -t 1 tx=304507 (19002 MB) txc=304507 zc=y rx=304507 (19002 MB) ok ipv6 udp -t 1 tx=174485 (10888 MB) txc=0 zc=n rx=174485 (10888 MB) ipv6 udp -z -t 1 tx=294801 (18396 MB) txc=294801 zc=y rx=294801 (18396 MB) ok Changes v1 -> v2 - Fixup reverse christmas tree violation v2 -> v3 - Split refcount avoidance optimization into separate patch - Fix refcount leak on error in fragmented case (thanks to Paolo Abeni for pointing this one out!) - Fix refcount inc on zero - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data. This is needed since commit 5cf4a8532c99 ("tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: always initialize pagedlenWillem de Bruijn2018-11-241-1/+2
| | | | | | | | | | | | | | | In ip packet generation, pagedlen is initialized for each skb at the start of the loop in __ip(6)_append_data, before label alloc_new_skb. Depending on compiler options, code can be generated that jumps to this label, triggering use of an an uninitialized variable. In practice, at -O2, the generated code moves the initialization below the label. But the code should not rely on that for correctness. Fixes: 15e36f5b8e98 ("udp: paged allocation with gso") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add and use skb_mark_not_on_list().David S. Miller2018-09-101-2/+2
| | | | | | | | | An SKB is not on a list if skb->next is NULL. Codify this convention into a helper function and use it where we are dequeueing an SKB and need to mark it as such. Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-07-241-0/+2
|\
| * ip: hash fragments consistentlyPaolo Abeni2018-07-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The skb hash for locally generated ip[v6] fragments belonging to the same datagram can vary in several circumstances: * for connected UDP[v6] sockets, the first fragment get its hash via set_owner_w()/skb_set_hash_from_sk() * for unconnected IPv6 UDPv6 sockets, the first fragment can get its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if auto_flowlabel is enabled For the following frags the hash is usually computed via skb_get_hash(). The above can cause OoO for unconnected IPv6 UDPv6 socket: in that scenario the egress tx queue can be selected on a per packet basis via the skb hash. It may also fool flow-oriented schedulers to place fragments belonging to the same datagram in different flows. Fix the issue by copying the skb hash from the head frag into the others at fragmentation time. Before this commit: perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8" netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n & perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1 perf script probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0 probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0 After this commit: probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0 probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0 Fixes: b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit") Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip: unconditionally set cork gso_sizeWillem de Bruijn2018-07-071-2/+1
| | | | | | | | | | | | | | | | | | Now that ipc(6)->gso_size is correctly initialized in all callers of ip(6)_setup_cork, it is safe to unconditionally pass it to the cork. Link: http://lkml.kernel.org/r/20180619164752.143249-1-willemdebruijn.kernel@gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6Willem de Bruijn2018-07-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly after modification by __sock_cmsg_send, by calling sock_tx_timestamp. The IPv4 and IPv6 paths do this conversion differently. In IPv4, the individual protocols that support tx timestamps call this function and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is called in __ip6_append_data. There is no need to store both tx_flags and ts_flags in the cookie as one is derived from the other. Convert when setting up the cork and remove the redundant field. This is similar to IPv6, only have the conversion happen only once per datagram, in ip(6)_setup_cork. Also change __ip6_append_data to match __ip_append_data. Only update tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is redundant: only valid protocols can have non-zero cork->tx_flags. After this change the IPv4 and IPv6 logic is the same. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: ipcm_cookie initializersWillem de Bruijn2018-07-071-5/+1
| | | | | | | | | | | | | | | | | | Initialize the cookie in one location to reduce code duplication and avoid bugs from inconsistent initialization, such as that fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp"). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: Hook into time based transmissionJesus Sanchez-Palencia2018-07-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | Add a transmit_time field to struct inet_cork, then copy the timestamp from the CMSG cookie at ip_setup_cork() so we can safely copy it into the skb later during __ip_make_skb(). For the raw fast path, just perform the copy at raw_send_hdrinc(). Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: add __ip_queue_xmit() that supports tos paramXin Long2018-07-041-4/+5
|/ | | | | | | | | | | | This patch introduces __ip_queue_xmit(), through which the callers can pass tos param into it without having to set inet->tos. For ipv6, ip6_xmit() already allows passing tclass parameter. It's needed when some transport protocol doesn't use inet->tos, like sctp's per transport dscp, which will be added in next patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip: limit use of gso_size to udpWillem de Bruijn2018-06-201-1/+2
| | | | | | | | | | | | | | | | | The ipcm(6)_cookie field gso_size is set only in the udp path. The ip layer copies this to cork only if sk_type is SOCK_DGRAM. This check proved too permissive. Ping and l2tp sockets have the same type. Limit to sockets of type SOCK_DGRAM and protocol IPPROTO_UDP to exclude ping sockets. v1 -> v2 - remove irrelevant whitespace changes Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-05-211-1/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | S390 bpf_jit.S is removed in net-next and had changes in 'net', since that code isn't used any more take the removal. TLS data structures split the TX and RX components in 'net-next', put the new struct members from the bug fix in 'net' into the RX part. The 'net-next' tree had some reworking of how the ERSPAN code works in the GRE tunneling code, overlapping with a one-line headroom calculation fix in 'net'. Overlapping changes in __sock_map_ctx_update_elem(), keep the bits that read the prog members via READ_ONCE() into local variables before using them. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: test tailroom before appending to linear skbWillem de Bruijn2018-05-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Device features may change during transmission. In particular with corking, a device may toggle scatter-gather in between allocating and writing to an skb. Do not unconditionally assume that !NETIF_F_SG at write time implies that the same held at alloc time and thus the skb has sufficient tailroom. This issue predates git history. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Add mark for TIMEWAIT socketsJon Maxwell2018-05-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This version has some suggestions by Eric Dumazet: - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP races. - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark" statement. - Factorize code as sk_fullsock() check is not necessary. Aidan McGurn from Openwave Mobility systems reported the following bug: "Marked routing is broken on customer deployment. Its effects are large increase in Uplink retransmissions caused by the client never receiving the final ACK to their FINACK - this ACK misses the mark and routes out of the incorrect route." Currently marks are added to sk_buffs for replies when the "fwmark_reflect" sysctl is enabled. But not for TW sockets that had sk->sk_mark set via setsockopt(SO_MARK..). Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location. Then progate this so that the skb gets sent with the correct mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that netfilter rules are still honored. Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2018-05-071-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Minor conflict in ip_output.c, overlapping changes to the body of an if() statement. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | udp: enable UDP checksum offload for ESPJacek Kalwas2018-04-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case NIC has support for ESP TX CSUM offload skb->ip_summed is not set to CHECKSUM_PARTIAL which results in checksum calculated by SW. Fix enables ESP TX CSUM for UDP by extending condition with check for NETIF_F_HW_ESP_TX_CSUM. Signed-off-by: Jacek Kalwas <jacek.kalwas@intel.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | udp: paged allocation with gsoWillem de Bruijn2018-04-261-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | When sending large datagrams that are later segmented, store data in page frags to avoid copying from linear in skb_segment. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | udp: generate gso with UDP_SEGMENTWillem de Bruijn2018-04-261-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support generic segmentation offload for udp datagrams. Callers can concatenate and send at once the payload of multiple datagrams with the same destination. To set segment size, the caller sets socket option UDP_SEGMENT to the length of each discrete payload. This value must be smaller than or equal to the relevant MTU. A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a per send call basis. Total byte length may then exceed MTU. If not an exact multiple of segment size, the last segment will be shorter. The implementation adds a gso_size field to the udp socket, ip(v6) cmsg cookie and inet_cork structure to be able to set the value at setsockopt or cmsg time and to work with both lockless and corked paths. Initial benchmark numbers show UDP GSO about as expensive as TCP GSO. tcp tso 3197 MB/s 54232 msg/s 54232 calls/s 6,457,754,262 cycles tcp gso 1765 MB/s 29939 msg/s 29939 calls/s 11,203,021,806 cycles tcp without tso/gso * 739 MB/s 12548 msg/s 12548 calls/s 11,205,483,630 cycles udp 876 MB/s 14873 msg/s 624666 calls/s 11,205,777,429 cycles udp gso 2139 MB/s 36282 msg/s 36282 calls/s 11,204,374,561 cycles [*] after reverting commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on") Measured total system cycles ('-a') for one core while pinning both the network receive path and benchmark process to that core: perf stat -a -C 12 -e cycles \ ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4 Note the reduction in calls/s with GSO. Bytes per syscall drops increases from 1470 to 61818. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | udp: expose inet cork to udpWillem de Bruijn2018-04-261-9/+8
| |/ |/| | | | | | | | | | | | | | | | | | | UDP segmentation offload needs access to inet_cork in the udp layer. Pass the struct to ip(6)_make_skb instead of allocating it on the stack in that function itself. This patch is a noop otherwise. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Fix one possible memleak in ip_setup_corkGao Feng2018-04-161-3/+5
|/ | | | | | | | | | | | It would allocate memory in this function when the cork->opt is NULL. But the memory isn't freed if failed in the latter rt check, and return error directly. It causes the memleak if its caller is ip_make_skb which also doesn't free the cork->opt when meet a error. Now move the rt check ahead to avoid the memleak. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: avoid unneeded atomic operation in ip*_append_data()Paolo Abeni2018-04-041-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | After commit 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()") and commit 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()"), when transmitting sub MTU datagram, an addtional, unneeded atomic operation is performed in ip*_append_data() to update wmem_alloc: in the above condition the delta is 0. The above cause small but measurable performance regression in UDP xmit tput test with packet size below MTU. This change avoids such overhead updating wmem_alloc only if wmem_alloc_delta is non zero. The error path is left intentionally unmodified: it's a slow path and simplicity is preferred to performances. Fixes: 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()") Fixes: 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()Eric Dumazet2018-04-011-5/+12
| | | | | | | | | | | | | | While testing my inet defrag changes, I found that the senders could spend ~20% of cpu cycles in skb_set_owner_w() updating sk->sk_wmem_alloc for every fragment they cook. The solution to this problem is to use alloc_skb() instead of sock_wmalloc() and manually perform a single sk_wmem_alloc change. Similar change for IPv6 is provided in following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: rename skb_gso_validate_mtu -> skb_gso_validate_network_lenDaniel Axtens2018-03-041-1/+1
| | | | | | | | | | | | | | | If you take a GSO skb, and split it into packets, will the network length (L3 headers + L4 headers + payload) of those packets be small enough to fit within a given MTU? skb_gso_validate_mtu gives you the answer to that question. However, we recently added to add a way to validate the MAC length of a split GSO skb (L2+L3+L4+payload), and the names get confusing, so rename skb_gso_validate_mtu to skb_gso_validate_network_len Signed-off-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: remove unreachable ufo branchesWillem de Bruijn2017-08-221-8/+4
| | | | | | | | | | | | | | Remove two references to ufo in the udp send path that are no longer reachable now that ufo has been removed. Commit 85f1bd9a7b5a ("udp: consistently apply ufo or fragmentation") is a fix to ufo. It is safe to revert what remains of it. Also, no skb can enter ip_append_page with skb_is_gso true now that skb_shinfo(skb)->gso_type is no longer set in ip_append_page/_data. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip/options: explicitly provide net ns to __ip_options_echo()Paolo Abeni2017-08-061-1/+1
| | | | | | | | | | | | | | | | __ip_options_echo() uses the current network namespace, and currently retrives it via skb->dst->dev. This commit adds an explicit 'net' argument to __ip_options_echo() and update all the call sites to provide it, usually via a simpler sock_net(). After this change, __ip_options_echo() no more needs to access skb->dst and we can drop a couple of hack to preserve such info in the rx path. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-07-211-4/+4
|\
| * ipv4: ip_do_fragment: fix headroom testsVasily Averin2017-07-151-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some time ago David Woodhouse reported skb_under_panic when we try to push ethernet header to fragmented ipv6 skbs. It was fixed for ipv6 by Florian Westphal in commit 1d325d217c7f ("ipv6: ip6_fragment: fix headroom tests and skb leak") However similar problem still exist in ipv4. It does not trigger skb_under_panic due paranoid check in ip_finish_output2, however according to Alexey Kuznetsov current state is abnormal and ip_fragment should be fixed too. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: Stop generating UFO packets.David S. Miller2017-07-171-76/+0
|/ | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* net: reflect mark on tcp syn ack packetsJamal Hadi Salim2017-07-041-1/+2
| | | | | | | | | | SYN-ACK responses on a server in response to a SYN from a client did not get the injected skb mark that was tagged on the SYN packet. Fixes: 84f39b08d786 ("net: support marking accepting TCP sockets") Reviewed-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert sock.sk_wmem_alloc from atomic_t to refcount_tReshetova, Elena2017-07-011-3/+3
| | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: account for current skb length when deciding about UFOMichal Kubeček2017-06-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our customer encountered stuck NFS writes for blocks starting at specific offsets w.r.t. page boundary caused by networking stack sending packets via UFO enabled device with wrong checksum. The problem can be reproduced by composing a long UDP datagram from multiple parts using MSG_MORE flag: sendto(sd, buff, 1000, MSG_MORE, ...); sendto(sd, buff, 1000, MSG_MORE, ...); sendto(sd, buff, 3000, 0, ...); Assume this packet is to be routed via a device with MTU 1500 and NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(), this condition is tested (among others) to decide whether to call ip_ufo_append_data(): ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb)) At the moment, we already have skb with 1028 bytes of data which is not marked for GSO so that the test is false (fragheaderlen is usually 20). Thus we append second 1000 bytes to this skb without invoking UFO. Third sendto(), however, has sufficient length to trigger the UFO path so that we end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb() uses udp_csum() to calculate the checksum but that assumes all fragments have correct checksum in skb->csum which is not true for UFO fragments. When checking against MTU, we need to add skb->len to length of new segment if we already have a partially filled skb and fragheaderlen only if there isn't one. In the IPv6 case, skb can only be null if this is the first segment so that we have to use headersize (length of the first IPv6 header) rather than fragheaderlen (length of IPv6 header of further fragments) for skb == NULL. Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach") Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: avoid ufo handling on IP payload compression packetsAlexey Kodanev2017-03-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c146066ab802 ("ipv4: Don't use ufo handling on later transformed packets") and commit f89c56ce710a ("ipv6: Don't use ufo handling on later transformed packets") added a check that 'rt->dst.header_len' isn't zero in order to skip UFO, but it doesn't include IPcomp in transport mode where it equals zero. Packets, after payload compression, may not require further fragmentation, and if original length exceeds MTU, later compressed packets will be transmitted incorrectly. This can be reproduced with LTP udp_ipsec.sh test on veth device with enabled UFO, MTU is 1500 and UDP payload is 2000: * IPv4 case, offset is wrong + unnecessary fragmentation udp_ipsec.sh -p comp -m transport -s 2000 & tcpdump -ni ltp_ns_veth2 ... IP (tos 0x0, ttl 64, id 45203, offset 0, flags [+], proto Compressed IP (108), length 49) 10.0.0.2 > 10.0.0.1: IPComp(cpi=0x1000) IP (tos 0x0, ttl 64, id 45203, offset 1480, flags [none], proto UDP (17), length 21) 10.0.0.2 > 10.0.0.1: ip-proto-17 * IPv6 case, sending small fragments udp_ipsec.sh -6 -p comp -m transport -s 2000 & tcpdump -ni ltp_ns_veth2 ... IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108) payload length: 37) fd00::2 > fd00::1: IPComp(cpi=0x1000) IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108) payload length: 21) fd00::2 > fd00::1: IPComp(cpi=0x1000) Fix it by checking 'rt->dst.xfrm' pointer to 'xfrm_state' struct, skip UFO if xfrm is set. So the new check will include both cases: IPcomp and IPsec. Fixes: c146066ab802 ("ipv4: Don't use ufo handling on later transformed packets") Fixes: f89c56ce710a ("ipv6: Don't use ufo handling on later transformed packets") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: rename dst_neigh_output back to neigh_outputJulian Anastasov2017-02-111-1/+1
| | | | | | | | | | After the dst->pending_confirm flag was removed, we do not need anymore to provide dst arg to dst_neigh_output. So, rename it to neigh_output as before commit 5110effee8fd ("net: Do delayed neigh confirmation."). Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TPJulian Anastasov2017-02-071-0/+6
| | | | | | | | | | | | | | | | | | | When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. The datagram protocols can use MSG_CONFIRM to confirm the neighbour. When used with MSG_PROBE we do not reach the code where neighbour is confirmed, so we have to do the same slow lookup by using the dst_confirm_neigh() helper. When MSG_PROBE is not used, ip_append_data/ip6_append_data will set the skb flag dst_pending_confirm. Reported-by: YueHaibing <yuehaibing@huawei.com> Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.") Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add dst_pending_confirm flag to skbuffJulian Anastasov2017-02-071-1/+4
| | | | | | | | | | | | | Add new skbuff flag to allow protocols to confirm neighbour. When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. Add sock_confirm_neigh() helper to confirm the neighbour and use it for IPv4, IPv6 and VRF before dst_neigh_output. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix mark propagation with fwmark_reflect enabledPau Espin Pedrol2017-01-091-0/+1
| | | | | | | | | | | | | Otherwise, RST packets generated by the TCP stack for non-existing sockets always have mark 0. The mark from the original packet is assigned to the netns_ipv4/6 socket used to send the response so that it can get copied into the response skb when the socket sends it. Fixes: e110861f8609 ("net: add a sysctl to reflect the fwmark on replies") Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds2016-12-241-1/+1
| | | | | | | | | | | | | This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipv4: Should use consistent conditional judgement for ip fragment in ↵zheng li2016-12-201-1/+1
| | | | | | | | | | | | | | | | | | | | | __ip_append_data and ip_finish_output There is an inconsistent conditional judgement in __ip_append_data and ip_finish_output functions, the variable length in __ip_append_data just include the length of application's payload and udp header, don't include the length of ip header, but in ip_finish_output use (skb->len > ip_skb_dst_mtu(skb)) as judgement, and skb->len include the length of ip header. That causes some particular application's udp payload whose length is between (MTU - IP Header) and MTU were fragmented by ip_fragment even though the rst->dev support UFO feature. Add the length of ip header to length in __ip_append_data to keep consistent conditional judgement as ip_finish_output for ip fragment. Signed-off-by: Zheng Li <james.z.li@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-linus' of ↵Linus Torvalds2016-12-161-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: - more ->d_init() stuff (work.dcache) - pathname resolution cleanups (work.namei) - a few missing iov_iter primitives - copy_from_iter_full() and friends. Either copy the full requested amount, advance the iterator and return true, or fail, return false and do _not_ advance the iterator. Quite a few open-coded callers converted (and became more readable and harder to fuck up that way) (work.iov_iter) - several assorted patches, the big one being logfs removal * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: logfs: remove from tree vfs: fix put_compat_statfs64() does not handle errors namei: fold should_follow_link() with the step into not-followed link namei: pass both WALK_GET and WALK_MORE to should_follow_link() namei: invert WALK_PUT logics namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link() namei: saner calling conventions for mountpoint_last() namei.c: get rid of user_path_parent() switch getfrag callbacks to ..._full() primitives make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success [iov_iter] new primitives - copy_from_iter_full() and friends don't open-code file_inode() ceph: switch to use of ->d_init() ceph: unify dentry_operations instances lustre: switch to use of ->d_init()
| * switch getfrag callbacks to ..._full() primitivesAl Viro2016-12-051-2/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-12-031-0/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Couple conflicts resolved here: 1) In the MACB driver, a bug fix to properly initialize the RX tail pointer properly overlapped with some changes to support variable sized rings. 2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix overlapping with a reorganization of the driver to support ACPI, OF, as well as PCI variants of the chip. 3) In 'net' we had several probe error path bug fixes to the stmmac driver, meanwhile a lot of this code was cleaned up and reorganized in 'net-next'. 4) The cls_flower classifier obtained a helper function in 'net-next' called __fl_delete() and this overlapped with Daniel Borkamann's bug fix to use RCU for object destruction in 'net'. It also overlapped with Jiri's change to guard the rhashtable_remove_fast() call with a check against tc_skip_sw(). 5) In mlx4, a revert bug fix in 'net' overlapped with some unrelated changes in 'net-next'. 6) In geneve, a stale header pointer after pskb_expand_head() bug fix in 'net' overlapped with a large reorganization of the same code in 'net-next'. Since the 'net-next' code no longer had the bug in question, there was nothing to do other than to simply take the 'net-next' hunks. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: Set skb->protocol properly for local outputEli Cooper2016-12-021-0/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When xfrm is applied to TSO/GSO packets, it follows this path: xfrm_output() -> xfrm_output_gso() -> skb_gso_segment() where skb_gso_segment() relies on skb->protocol to function properly. This patch sets skb->protocol to ETH_P_IP before dst_output() is called, fixing a bug where GSO packets sent through a sit tunnel are dropped when xfrm is involved. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4, ipv6: run cgroup eBPF egress programsDaniel Mack2016-11-251-2/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the cgroup associated with the receiving socket has an eBPF programs installed, run them from ip_output(), ip6_output() and ip_mc_output(). From mentioned functions we have two socket contexts as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn()."). We explicitly need to use sk instead of skb->sk here, since otherwise the same program would run multiple times on egress when encap devices are involved, which is not desired in our case. eBPF programs used in this context are expected to either return 1 to let the packet pass, or != 1 to drop them. The programs have access to the skb through bpf_skb_load_bytes(), and the payload starts at the network headers (L3). Note that cgroup_bpf_run_filter() is stubbed out as static inline nop for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the feature is unused. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix bogus cast in skb_pagelen() and use unsigned variablesAlexey Dobriyan2016-11-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) cast to "int" is unnecessary: u8 will be promoted to int before decrementing, small positive numbers fit into "int", so their values won't be changed during promotion. Once everything is int including loop counters, signedness doesn't matter: 32-bit operations will stay 32-bit operations. But! Someone tried to make this loop smart by making everything of the same type apparently in an attempt to optimise it. Do the optimization, just differently. Do the cast where it matters. :^) 2) frag size is unsigned entity and sum of fragments sizes is also unsigned. Make everything unsigned, leave no MOVSX instruction behind. add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4) function old new delta skb_cow_data 835 834 -1 ip_do_fragment 2549 2548 -1 ip6_fragment 3130 3128 -2 Total: Before=154865032, After=154865028, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-11-151-10/+15
|\| | | | | | | | | | | | | Several cases of bug fixes in 'net' overlapping other changes in 'net-next-. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: update comment to document GSO fragmentation cases.Lance Richardson2016-11-101-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | This is a follow-up to commit 9ee6c5dc816a ("ipv4: allow local fragmentation in ip_finish_output_gso()"), updating the comment documenting cases in which fragmentation is needed for egress GSO packets. Suggested-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: tcp response should set oif only if it is L3 masterDavid Ahern2016-11-091-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lorenzo noted an Android unit test failed due to e0d56fdd7342: "The expectation in the test was that the RST replying to a SYN sent to a closed port should be generated with oif=0. In other words it should not prefer the interface where the SYN came in on, but instead should follow whatever the routing table says it should do." Revert the change to ip_send_unicast_reply and tcp_v6_send_response such that the oif in the flow is set to the skb_iif only if skb_iif is an L3 master. Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls") Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>