summaryrefslogtreecommitdiffstats
path: root/include/net/ipv6.h
Commit message (Collapse)AuthorAgeFilesLines
* drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packetsBen Hutchings2014-10-301-0/+2
| | | | | | | | | | | UFO is now disabled on all drivers that work with virtio net headers, but userland may try to send UFO/IPv6 packets anyway. Instead of sending with ID=0, we should select identifiers on their behalf (as we used to). Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Fixes: 916e4cf46d02 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data") Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted()Eric Dumazet2014-09-281-1/+2
| | | | | | | | | ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm that it needs. Lets not assume this, as TCP stack might use a different place. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: add sysctl_mld_qrv to configure query robustness variableHannes Frederic Sowa2014-09-041-0/+1
| | | | | | | | | | | | | | | | | | | This patch adds a new sysctl_mld_qrv knob to configure the mldv1/v2 query robustness variable. It specifies how many retransmit of unsolicited mld retransmit should happen. Admins might want to tune this on lossy links. Also reset mld state on interface down/up, so we pick up new sysctl settings during interface up event. IPv6 certification requests this knob to be available. I didn't make this knob netns specific, as it is mostly a setting in a physical environment and should be per host. Cc: Flavio Leitner <fbl@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: frag: don't account number of fragment queuesFlorian Westphal2014-07-271-5/+0
| | | | | | | | | | | | | The 'nqueues' counter is protected by the lru list lock, once thats removed this needs to be converted to atomic counter. Given this isn't used for anything except for reporting it to userspace via /proc, just remove it. We still report the memory currently used by fragment reassembly queues. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: frag: constify match, hashfn and constructor argumentsFlorian Westphal2014-07-271-2/+2
| | | | | Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: clean up some sparse endianness warnings in ipv6.hJeff Layton2014-07-161-6/+11
| | | | | | | | | | | | | | | | | | sparse is throwing warnings when building sunrpc modules due to some endianness shenanigans in ipv6.h. Specifically: CHECK net/sunrpc/addr.c include/net/ipv6.h:573:17: warning: restricted __be64 degrades to integer include/net/ipv6.h:577:34: warning: restricted __be32 degrades to integer include/net/ipv6.h:573:17: warning: restricted __be64 degrades to integer include/net/ipv6.h:577:34: warning: restricted __be32 degrades to integer Sprinkle some endianness fixups to silence them. These should all get fixed up at compile time, so I don't think this will add any extra work to be done at runtime. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: provide stubs for ip6_set_txhash and ip6_make_flowlabelFlorian Fainelli2014-07-081-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") introduced ip6_make_flowlabel, while commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit") introduced ip6_set_txhash. ip6_set_tx_hash() uses sk_v6_daddr which references __sk_common.skc_v6_daddr from struct sock_common, which is gated with IS_ENABLED(CONFIG_IPV6). ip6_make_flowlabel() uses the ipv6 member from struct net which is also gated with IS_ENABLED(CONFIG_IPV6). When CONFIG_IPV6 is disabled, we will hit a build failure that looks like this when the compiler attempts inlining these functions: CC [M] drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.o In file included from include/net/inet_sock.h:27:0, from include/net/ip.h:30, from drivers/net/ethernet/broadcom/cnic.c:37: include/net/ipv6.h: In function 'ip6_set_txhash': include/net/sock.h:327:33: error: 'struct sock_common' has no member named 'skc_v6_daddr' #define sk_v6_daddr __sk_common.skc_v6_daddr ^ include/net/ipv6.h:696:49: note: in expansion of macro 'sk_v6_daddr' keys.dst = (__force __be32)ipv6_addr_hash(&sk->sk_v6_daddr); ^ In file included from include/net/inetpeer.h:15:0, from include/net/route.h:28, from include/net/ip.h:31, from drivers/net/ethernet/broadcom/cnic.c:37: include/net/ipv6.h: In function 'ip6_make_flowlabel': include/net/ipv6.h:706:37: error: 'struct net' has no member named 'ipv6' if (!flowlabel && (autolabel || net->ipv6.sysctl.auto_flowlabels)) { ^ Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") Fixes: b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Implement automatic flow label generation on transmitTom Herbert2014-07-071-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Automatically generate flow labels for IPv6 packets on transmit. The flow label is computed based on skb_get_hash. The flow label will only automatically be set when it is zero otherwise (i.e. flow label manager hasn't set one). This supports the transmit side functionality of RFC 6438. Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this functionality per socket. By default, auto flowlabels are disabled to avoid possible conflicts with flow label manager, however if this feature proves useful we may want to enable it by default. It should also be noted that FreeBSD has already implemented automatic flow labels (including the sysctl and socket option). In FreeBSD, automatic flow labels default to enabled. Performance impact: Running super_netperf with 200 flows for TCP_RR and UDP_RR for IPv6. Note that in UDP case, __skb_get_hash will be called for every packet with explains slight regression. In the TCP case the hash is saved in the socket so there is no regression. Automatic flow labels disabled: TCP_RR: 86.53% CPU utilization 127/195/322 90/95/99% latencies 1.40498e+06 tps UDP_RR: 90.70% CPU utilization 118/168/243 90/95/99% latencies 1.50309e+06 tps Automatic flow labels enabled: TCP_RR: 85.90% CPU utilization 128/199/337 90/95/99% latencies 1.40051e+06 UDP_RR 92.61% CPU utilization 115/164/236 90/95/99% latencies 1.4687e+06 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Save TX flow hash in sock and set in skbuf on xmitTom Herbert2014-07-071-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a connected socket we can precompute the flow hash for setting in skb->hash on output. This is a performance advantage over calculating the skb->hash for every packet on the connection. The computation is done using the common hash algorithm to be consistent with computations done for packets of the connection in other states where thers is no socket (e.g. time-wait, syn-recv, syn-cookies). This patch adds sk_txhash to the sock structure. inet_set_txhash and ip6_set_txhash functions are added which are called from points in TCP and UDP where socket moves to established state. skb_set_hash_from_sk is a function which sets skb->hash from the sock txhash value. This is called in UDP and TCP transmit path when transmitting within the context of a socket. Tested: ran super_netperf with 200 TCP_RR streams over a vxlan interface (in this case skb_get_hash called on every TX packet to create a UDP source port). Before fix: 95.02% CPU utilization 154/256/505 90/95/99% latencies 1.13042e+06 tps Time in functions: 0.28% skb_flow_dissect 0.21% __skb_get_hash After fix: 94.95% CPU utilization 156/254/485 90/95/99% latencies 1.15447e+06 Neither __skb_get_hash nor skb_flow_dissect appear in perf Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inetpeer: get rid of ip_id_countEric Dumazet2014-06-021-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ideally, we would need to generate IP ID using a per destination IP generator. linux kernels used inet_peer cache for this purpose, but this had a huge cost on servers disabling MTU discovery. 1) each inet_peer struct consumes 192 bytes 2) inetpeer cache uses a binary tree of inet_peer structs, with a nominal size of ~66000 elements under load. 3) lookups in this tree are hitting a lot of cache lines, as tree depth is about 20. 4) If server deals with many tcp flows, we have a high probability of not finding the inet_peer, allocating a fresh one, inserting it in the tree with same initial ip_id_count, (cf secure_ip_id()) 5) We garbage collect inet_peer aggressively. IP ID generation do not have to be 'perfect' Goal is trying to avoid duplicates in a short period of time, so that reassembly units have a chance to complete reassembly of fragments belonging to one message before receiving other fragments with a recycled ID. We simply use an array of generators, and a Jenkin hash using the dst IP as a key. ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it belongs (it is only used from this file) secure_ip_id() and secure_ipv6_id() no longer are needed. Rename ip_select_ident_more() to ip_select_ident_segs() to avoid unnecessary decrement/increment of the number of segments. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add a sysctl to reflect the fwmark on repliesLorenzo Colitti2014-05-131-0/+3
| | | | | | | | | | | | | | | | | | Kernel-originated IP packets that have no user socket associated with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.) are emitted with a mark of zero. Add a sysctl to make them have the same mark as the packet they are replying to. This allows an administrator that wishes to do so to use mark-based routing, firewalling, etc. for these replies by marking the original packets inbound. Tested using user-mode linux: - ICMP/ICMPv6 echo replies and errors. - TCP RST packets (IPv4 and IPv6). Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv6: Introduce ip6_sk_dst_hoplimit.Lorenzo Colitti2014-04-301-0/+19
| | | | | | | | This replaces 6 identical code snippets with a call to a new static inline function. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: add a sock pointer to dst->output() path.Eric Dumazet2014-04-151-1/+1
| | | | | | | | | | | | | | | | In the dst->output() path for ipv4, the code assumes the skb it has to transmit is attached to an inet socket, specifically via ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the provider of the packet is an AF_PACKET socket. The dst->output() method gets an additional 'struct sock *sk' parameter. This needs a cascade of changes so that this parameter can be propagated from vxlan to final consumer. Fixes: 8f646c922d55 ("vxlan: keep original skb ownership") Reported-by: lucien xin <lucien.xin@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: protect protocols not handling ipv4 from v4 connection/bind attemptsHannes Frederic Sowa2014-01-211-0/+2
| | | | | | | | | | | | | Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow connecting and binding to them. sendmsg logic does already check msg->name for this but must trust already connected sockets which could be set up for connection to ipv4 address family. Per-socket flag ipv6only is of no use here, as it is under users control by setsockopt. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: add a flag to get the flow label used remotlyFlorent Fourcot2014-01-191-1/+2
| | | | | | | | | | | | This information is already available via IPV6_FLOWINFO of IPV6_2292PKTOPTIONS, and them a filtering to get the flow label information. But it is probably logical and easier for users to add this here, and to control both sent/received flow label values with the IPV6_FLOWLABEL_MGR option. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: move IPV6_TCLASS_SHIFT into ipv6.h and define a helperLi RongQing2014-01-151-0/+5
| | | | | | | | | Two places defined IPV6_TCLASS_SHIFT, so we should move it into ipv6.h, and use this macro as possible. And define ip6_tclass helper to return tclass Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: namespace cleanupsstephen hemminger2014-01-011-4/+0
| | | | | | | | | | | | | | | | | | Running 'make namespacecheck' shows: net/ipv6/route.o ipv6_route_table_template rt6_bind_peer net/ipv6/icmp.o icmpv6_route_lookup ipv6_icmp_table_template This addresses some of those warnings by: * make icmpv6_route_lookup static * move inline's out of ip6_route.h since only used into route.c * move rt6_bind_peer into route.c Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2013-12-191-4/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2013-12-19 1) Use the user supplied policy index instead of a generated one if present. From Fan Du. 2) Make xfrm migration namespace aware. From Fan Du. 3) Make the xfrm state and policy locks namespace aware. From Fan Du. 4) Remove ancient sleeping when the SA is in acquire state, we now queue packets to the policy instead. This replaces the sleeping code. 5) Remove FLOWI_FLAG_CAN_SLEEP. This was used to notify xfrm about the posibility to sleep. The sleeping code is gone, so remove it. 6) Check user specified spi for IPComp. Thr spi for IPcomp is only 16 bit wide, so check for a valid value. From Fan Du. 7) Export verify_userspi_info to check for valid user supplied spi ranges with pfkey and netlink. From Fan Du. 8) RFC3173 states that if the total size of a compressed payload and the IPComp header is not smaller than the size of the original payload, the IP datagram must be sent in the original non-compressed form. These packets are dropped by the inbound policy check because they are not transformed. Document the need to set 'level use' for IPcomp to receive such packets anyway. From Fan Du. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Remove FLOWI_FLAG_CAN_SLEEPSteffen Klassert2013-12-061-4/+2
| | | | | | | | | | | | | | | | FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility to sleep until the needed states are resolved. This code is gone, so FLOWI_FLAG_CAN_SLEEP is not needed anymore. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | ipv6: add ip6_flowlabel helperFlorent Fourcot2013-12-091-0/+5
| | | | | | | | | | | | | | | | And use it if possible. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: move IPV6_TCLASS_MASK definition in ipv6.hFlorent Fourcot2013-12-091-0/+1
| | | | | | | | | | | | Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | xen-netback: fix fragment detection in checksum setupPaul Durrant2013-12-051-1/+2
|/ | | | | | | | | | | | | | | | | | | The code to detect fragments in checksum_setup() was missing for IPv4 and too eager for IPv6. (It transpires that Windows seems to send IPv6 packets with a fragment header even if they are not a fragment - i.e. offset is zero, and M bit is not set). This patch also incorporates a fix to callers of maybe_pull_tail() where skb->network_header was being erroneously added to the length argument. Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> cc: David Miller <davem@davemloft.net> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu ↵Hannes Frederic Sowa2013-11-231-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | functions Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") conditionally updated addr_len if the msg_name is written to. The recv_error and rxpmtu functions relied on the recvmsg functions to set up addr_len before. As this does not happen any more we have to pass addr_len to those functions as well and set it to the size of the corresponding sockaddr length. This broke traceroute and such. Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") Reported-by: Brad Spengler <spender@grsecurity.net> Reported-by: Tom Labanowski Cc: mpb <mpb.mail@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: enable IPV6_FLOWLABEL_MGR for getsockoptFlorent Fourcot2013-11-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | It is already possible to set/put/renew a label with IPV6_FLOWLABEL_MGR and setsockopt. This patch add the possibility to get information about this label (current value, time before expiration, etc). It helps application to take decision for a renew or a release of the label. v2: * Add spin_lock to prevent race condition * return -ENOENT if no result found * check if flr_action is GET v3: * move the spin_lock to protect only the relevant code Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: split inet6_hash_frag for netfilter and initialize secrets with ↵Hannes Frederic Sowa2013-10-231-2/+0
| | | | | | | | | | | | | | | | | net_get_random_once Defer the fragmentation hash secret initialization for IPv6 like the previous patch did for IPv4. Because the netfilter logic reuses the hash secret we have to split it first. Thus introduce a new nf_hash_frag function which takes care to seed the hash secret. Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: split inet6_ehashfn to hash functions per compilation unitHannes Frederic Sowa2013-10-191-2/+2
| | | | | | | | | | | This patch splits the inet6_ehashfn into separate ones in ipv6/inet6_hashtables.o and ipv6/udp.o to ease the introduction of seperate secrets keys later. Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip*.h: Remove extern from function prototypesJoe Perches2013-09-211-149/+119
| | | | | | | | | | | | | There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: move ip6_dst_hoplimit() into core kernelCong Wang2013-08-311-0/+2
| | | | | | | | It will be used by vxlan, and may not be inlined. Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add NEXTHDR_SCTP to ipv6.hJoe Stringer2013-08-231-0/+1
| | | | | Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Jesse Gross <jesse@nicira.com>
* ipv6: Convert use of typedef ctl_table to struct ctl_tableJoe Perches2013-06-191-2/+2
| | | | | | | This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv6: Add IPv6 support to the ping socket.Lorenzo Colitti2013-05-251-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the ability to send ICMPv6 echo requests without a raw socket. The equivalent ability for ICMPv4 was added in 2011. Instead of having separate code paths for IPv4 and IPv6, make most of the code in net/ipv4/ping.c dual-stack and only add a few IPv6-specific bits (like the protocol definition) to a new net/ipv6/ping.c. Hopefully this will reduce divergence and/or duplication of bugs in the future. Caveats: - Setting options via ancillary data (e.g., using IPV6_PKTINFO to specify the outgoing interface) is not yet supported. - There are no separate security settings for IPv4 and IPv6; everything is controlled by /proc/net/ipv4/ping_group_range. - The proc interface does not yet display IPv6 ping sockets properly. Tested with a patched copy of ping6 and using raw socket calls. Compiles and works with all of CONFIG_IPV6={n,m,y}. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handlingHannes Frederic Sowa2013-03-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Hello! After patch 1 got accepted to net-next I will also send a patch to netfilter-devel to make the corresponding changes to the netfilter reassembly logic. Thanks, Hannes -- >8 -- [PATCH 2/2] ipv6: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling This patch also ensures that INET_ECN_CE is propagated if one fragment had the codepoint set. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: introdcue __ipv6_addr_needs_scope_id and ipv6_iface_scope_id helper ↵Hannes Frederic Sowa2013-03-081-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | functions __ipv6_addr_needs_scope_id checks if an ipv6 address needs to supply a 'sin6_scope_id != 0'. 'sin6_scope_id != 0' was enforced in case of link-local addresses. To support interface-local multicast these checks had to be enhanced and are now consolidated into these new helper functions. v2: a) migrated to struct ipv6_addr_props v3: a) reverted changes for ipv6_addr_props b) test for address type instead of comparing scope v4: a) unchanged Suggested-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6 flowlabel: add __rcu annotationsEric Dumazet2013-03-071-4/+4
| | | | | | | | | Commit 18367681a10b (ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.) omitted proper __rcu annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: use a stronger hash for tcpEric Dumazet2013-02-211-0/+12
| | | | | | | | | | | | | | | | | | It looks like its possible to open thousands of TCP IPv6 sessions on a server, all landing in a single slot of TCP hash table. Incoming packets have to lookup sockets in a very long list. We should hash all bits from foreign IPv6 addresses, using a salt and hash mix, not a simple XOR. inet6_ehashfn() can also separately use the ports, instead of xoring them. Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.YOSHIFUJI Hideaki / 吉藤英明2013-01-301-0/+1
| | | | | Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6 flowlabel: Convert hash list to RCU.YOSHIFUJI Hideaki / 吉藤英明2013-01-301-0/+1
| | | | | Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: frag helper functions for mem limit trackingJesper Dangaard Brouer2013-01-291-1/+1
| | | | | | | | | | | | | This change is primarily a preparation to ease the extension of memory limit tracking. The change does reduce the number atomic operation, during freeing of a frag queue. This does introduce a some performance improvement, as these atomic operations are at the core of the performance problems seen on NUMA systems. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Unshare ip6_nd_hdr() and change return type to void.YOSHIFUJI Hideaki / 吉藤英明2013-01-211-7/+0
| | | | | | | | | - move ip6_nd_hdr() to its users' source files. In net/ipv6/mcast.c, it will be called ip6_mc_hdr(). - make return type to void since this function never fails. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: fix ipv6_prefix_equal64_half mask conversionFabio Baltieri2013-01-171-1/+1
| | | | | | | | | | | | Fix the 64bit optimized version of ipv6_prefix_equal to convert the bitmask to network byte order only after the bit-shift. The bug was introduced in: 3867517 ipv6: 64bit version of ipv6_prefix_equal(). Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: increase fragment memory usage limitsJesper Dangaard Brouer2013-01-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Increase the amount of memory usage limits for incomplete IP fragments. Arguing for new thresh high/low values: High threshold = 4 MBytes Low threshold = 3 MBytes The fragmentation memory accounting code, tries to account for the real memory usage, by measuring both the size of frag queue struct (inet_frag_queue (ipv4:ipq/ipv6:frag_queue)) and the SKB's truesize. We want to be able to handle/hold-on-to enough fragments, to ensure good performance, without causing incomplete fragments to hurt scalability, by causing the number of inet_frag_queue to grow too much (resulting longer searches for frag queues). For IPv4, how much memory does the largest frag consume. Maximum size fragment is 64K, which is approx 44 fragments with MTU(1500) sized packets. Sizeof(struct ipq) is 200. A 1500 byte packet results in a truesize of 2944 (not 2048 as I first assumed) (44*2944)+200 = 129736 bytes The current default high thresh of 262144 bytes, is obviously problematic, as only two 64K fragments can fit in the queue at the same time. How many 64K fragment can we fit into 4 MBytes: 4*2^20/((44*2944)+200) = 32.34 fragment in queues An attacker could send a separate/distinct fake fragment packets per queue, causing us to allocate one inet_frag_queue per packet, and thus attacking the hash table and its lists. How many frag queue do we need to store, and given a current hash size of 64, what is the average list length. Using one MTU sized fragment per inet_frag_queue, each consuming (2944+200) 3144 bytes. 4*2^20/(2944+200) = 1334 frag queues -> 21 avg list length An attack could send small fragments, the smallest packet I could send resulted in a truesize of 896 bytes (I'm a little surprised by this). 4*2^20/(896+200) = 3827 frag queues -> 59 avg list length When increasing these number, we also need to followup with improvements, that is going to help scalability. Simply increasing the hash size, is not enough as the current implementation does not have a per hash bucket locking. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Fix endianess warning in ip6_flow_hdr().YOSHIFUJI Hideaki2013-01-161-1/+1
| | | | | | | | | | Commit 3e4e4c1f ("ipv6: Introduce ip6_flow_hdr() to fill version, tclass and flowlabel.) uses ntohl(), which should be htonl(). Found by Fengguang Wu <fengguang.wu@intel.com>. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: 64bit version of ipv6_prefix_equal().YOSHIFUJI Hideaki / 吉藤英明2013-01-141-0/+26
| | | | | Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Remove __ipv6_prefix_equal().YOSHIFUJI Hideaki / 吉藤英明2013-01-141-10/+5
| | | | | | | | ipv6_prefix_equal() just casts its arguments and it is the only user of __ipv6_prefix_equal(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: 64bit version of ipv6_addr_set().YOSHIFUJI Hideaki / 吉藤英明2013-01-141-4/+22
| | | | | Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: 64bit version of ipv6_addr_v4mapped().YOSHIFUJI Hideaki / 吉藤英明2013-01-141-2/+7
| | | | | Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: 64bit version of ipv6_addr_loopback().YOSHIFUJI Hideaki / 吉藤英明2013-01-141-0/+6
| | | | | Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: 64bit version of ipv6_addr_diff().YOSHIFUJI Hideaki / 吉藤英明2013-01-141-1/+28
| | | | | | | | | | | | Introduce __ipv6_addr_diff64() to to find the first different bit between two addresses on 64bit architectures. 32bit version is still available as __ipv6_addr_diff32(), and __ipv6_addr_diff() automatically selects appropriate version. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Introduce ip6_flowinfo() to extract flowinfo (tclass + flowlabel).YOSHIFUJI Hideaki / 吉藤英明2013-01-131-0/+5
| | | | | Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Introduce ip6_flow_hdr() to fill version, tclass and flowlabel.YOSHIFUJI Hideaki / 吉藤英明2013-01-131-0/+9
| | | | | | | | | | This is not only for readability but also for optimization. What we do here is to build the 32bit word at the beginning of the ipv6 header (the "ip6_flow" virtual member of struct ip6_hdr in RFC3542) and we do not need to read the tclass portion of the target buffer. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>