summaryrefslogtreecommitdiffstats
path: root/include/net/dst.h
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-05-261-1/+7
|\ | | | | | | | | | | | | Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net' restricting a HW workaround alongside cleanups in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: add reference counting to metricsEric Dumazet2017-05-261-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andrey Konovalov reported crashes in ipv4_mtu() I could reproduce the issue with KASAN kernels, between 10.246.7.151 and 10.246.7.152 : 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & 2) At the same time run following loop : while : do ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 done Cong Wang attempted to add back rt->fi in commit 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") but this proved to add some issues that were complex to solve. Instead, I suggested to add a refcount to the metrics themselves, being a standalone object (in particular, no reference to other objects) I tried to make this patch as small as possible to ease its backport, instead of being super clean. Note that we believe that only ipv4 dst need to take care of the metric refcount. But if this is wrong, this patch adds the basic infrastructure to extend this to other families. Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang for his efforts on this problem. Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: make struct dst_entry::dev first memberAlexey Dobriyan2017-05-181-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | struct dst_entry::dev is used most often. Move it so it can be accessed without imm8 offset on x86_64. add/remove: 0/0 grow/shrink: 9/239 up/down: 52/-413 (-361) function old new delta dst_rcu_free 126 138 +12 fnhe_flush_routes 211 219 +8 rt_set_nexthop 747 754 +7 rt_cache_route 85 91 +6 rt6_release 209 215 +6 dst_release 107 111 +4 dst_destroy_rcu 29 33 +4 dn_dst_check_expire 329 333 +4 dn_insert_route 484 485 +1 xfrm_resolve_and_create_bundle 2991 2990 -1 ... ip_route_me_harder 1163 1157 -6 __ip_append_data.isra 2730 2724 -6 ip6_forward 3052 3045 -7 callforward_do_filter 659 651 -8 dst_gc_task 571 549 -22 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: rename dst_neigh_output back to neigh_outputJulian Anastasov2017-02-111-12/+0
| | | | | | | | | | After the dst->pending_confirm flag was removed, we do not need anymore to provide dst arg to dst_neigh_output. So, rename it to neigh_output as before commit 5110effee8fd ("net: Do delayed neigh confirmation."). Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: pending_confirm is not used anymoreJulian Anastasov2017-02-071-12/+2
| | | | | | | | | | | | | When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. As last step, we can remove the pending_confirm flag. Reported-by: YueHaibing <yuehaibing@huawei.com> Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.") Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add confirm_neigh method to dst_opsJulian Anastasov2017-02-071-0/+7
| | | | | | | | | | | | | | Add confirm_neigh method to dst_ops and use it from IPv4 and IPv6 to lookup and confirm the neighbour. Its usage via the new helper dst_confirm_neigh() should be restricted to MSG_PROBE users for performance reasons. For XFRM prefer the last tunnel address, if present. With help from Steffen Klassert. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* route: move lwtunnel state to a single placeJiri Benc2016-04-251-4/+1
| | | | | | | | | | | | | | | | | | Commit 751a587ac9f9 ("route: fix breakage after moving lwtunnel state") moved lwtstate to the end of dst_entry for 32bit archs. This makes it share the cacheline with __refcnt which had an unkown effect on performance. For this reason, the pointer was kept in place for 64bit archs. However, later performance measurements showed this is of no concern. It turns out that every performance sensitive path that accesses lwtstate accesses also struct rtable or struct rt6_info which share the same cache line. Thus, to get rid of a few #ifdefs, move the field to the end of the struct also for 64bit. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bpf, dst: add and use dst_tclassid helperDaniel Borkmann2016-03-181-0/+12
| | | | | | | | | We can just add a small helper dst_tclassid() for retrieving the dst->tclassid value. It makes the code a bit better in that we can get rid of the ifdef from filter.c by moving this into the header. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fix IP early demux racesEric Dumazet2015-12-141-0/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | David Wilder reported crashes caused by dst reuse. <quote David> I am seeing a crash on a distro V4.2.3 kernel caused by a double release of a dst_entry. In ipv4_dst_destroy() the call to list_empty() finds a poisoned next pointer, indicating the dst_entry has already been removed from the list and freed. The crash occurs 18 to 24 hours into a run of a network stress exerciser. </quote> Thanks to his detailed report and analysis, we were able to understand the core issue. IP early demux can associate a dst to skb, after a lookup in TCP/UDP sockets. When socket cache is not properly set, we want to store into sk->sk_dst_cache the dst for future IP early demux lookups, by acquiring a stable refcount on the dst. Problem is this acquisition is simply using an atomic_inc(), which works well, unless the dst was queued for destruction from dst_release() noticing dst refcount went to zero, if DST_NOCACHE was set on dst. We need to make sure current refcount is not zero before incrementing it, or risk double free as David reported. This patch, being a stable candidate, adds two new helpers, and use them only from IP early demux problematic paths. It might be possible to merge in net-next skb_dst_force() and skb_dst_force_safe(), but I prefer having the smallest patch for stable kernels : Maybe some skb_dst_force() callers do not expect skb->dst can suddenly be cleared. Can probably be backported back to linux-3.6 kernels Reported-by: David J. Wilder <dwilder@us.ibm.com> Tested-by: David J. Wilder <dwilder@us.ibm.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* dst: Pass net into dst->outputEric W. Biederman2015-10-081-4/+4
| | | | | | | | The network namespace is already passed into dst_output pass it into dst->output lwt->output and friends. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Pass net into dst_output and remove dst_output_okfnEric W. Biederman2015-10-081-5/+1
| | | | | | | Replace dst_output_okfn with dst_output Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: constify ip_route_output_flow() socket argumentEric Dumazet2015-09-251-4/+5
| | | | | | | | | | Very soon, TCP stack might call inet_csk_route_req(), which calls inet_csk_route_req() with an unlocked listener socket, so we need to make sure ip_route_output_flow() is not trying to change any field from its socket argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netfilter: Pass net into okfnEric W. Biederman2015-09-171-0/+4
| | | | | | | | | | | | | | | | | | This is immediately motivated by the bridge code that chains functions that call into netfilter. Without passing net into the okfns the bridge code would need to guess about the best expression for the network namespace to process packets in. As net is frequently one of the first things computed in continuation functions after netfilter has done it's job passing in the desired network namespace is in many cases a code simplification. To support this change the function dst_output_okfn is introduced to simplify passing dst_output as an okfn. For the moment dst_output_okfn just silently drops the struct net. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Merge dst_output and dst_output_skEric W. Biederman2015-09-171-5/+1
| | | | | | | | | Add a sock paramter to dst_output making dst_output_sk superfluous. Add a skb->sk parameter to all of the callers of dst_output Have the callers of dst_output_sk call dst_output. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: use dctcp if enabled on the route to the initiatorDaniel Borkmann2015-08-311-0/+6
| | | | | | | | | | | | | | | | | | | | | | Currently, the following case doesn't use DCTCP, even if it should: A responder has f.e. Cubic as system wide default, but for a specific route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The initiating host then uses DCTCP as congestion control, but since the initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok, and we have to fall back to Reno after 3WHS completes. We were thinking on how to solve this in a minimal, non-intrusive way without bloating tcp_ecn_create_request() needlessly: lets cache the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0) is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows to only do a single metric feature lookup inside tcp_ecn_create_request(). Joint work with Florian Westphal. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* dst: Add __skb_dst_copy() variationJoe Stringer2015-08-271-2/+7
| | | | | | | | | This variation on skb_dst_copy() doesn't require two skbs. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* route: fix breakage after moving lwtunnel stateJiri Benc2015-08-231-2/+5
| | | | | | | | | | | | | | | | __recnt and related fields need to be in its own cacheline for performance reasons. Commit 61adedf3e3f1 ("route: move lwtunnel state to dst_entry") broke that on 32bit archs, causing BUILD_BUG_ON in dst_hold to be triggered. This patch fixes the breakage by moving the lwtunnel state to the end of dst_entry on 32bit archs. Unfortunately, this makes it share the cacheline with __refcnt and may affect performance, thus further patches may be needed. Reported-by: kbuild test robot <fengguang.wu@intel.com> Fixes: 61adedf3e3f1 ("route: move lwtunnel state to dst_entry") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* route: move lwtunnel state to dst_entryJiri Benc2015-08-201-1/+2
| | | | | | | | | | | | | | | Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* dst: Metadata destinationsThomas Graf2015-07-211-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Introduces a new dst_metadata which enables to carry per packet metadata between forwarding and processing elements via the skb->dst pointer. The structure is set up to be a union. Thus, each separate type of metadata requires its own dst instance. If demand arises to carry multiple types of metadata concurrently, metadata dst entries can be made stackable. The metadata dst entry is refcnt'ed as expected for now but a non reference counted use is possible if the reference is forced before queueing the skb. In order to allow allocating dsts with variable length, the existing dst_alloc() is split into a dst_alloc() and dst_init() function. The existing dst_init() function to initialize the subsystem is being renamed to dst_subsys_init() to make it clear what is what. The check before ip_route_input() is changed to ignore metadata dsts and drop the dst inside the routing function thus allowing to interpret metadata in a later commit. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: make skb_dst_pop routine staticYing Xue2015-05-121-12/+0
| | | | | | | | As xfrm_output_one() is the only caller of skb_dst_pop(), we should make skb_dst_pop() localized. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Remove DST_METRICS_FORCE_OVERWRITE and _rt6i_peerMartin KaFai Lau2015-05-011-6/+0
| | | | | | | | | | | | | | | _rt6i_peer is no longer needed after the last patch, 'ipv6: Stop rt6_info from using inet_peer's metrics'. DST_METRICS_FORCE_OVERWRITE is added by commit e5fd387ad5b3 ("ipv6: do not overwrite inetpeer metrics prematurely"). Since inetpeer is no longer used for metrics, this bit is also not needed. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm: release dst_orig in case of error in xfrm_lookup()huaibin Wang2015-02-121-0/+1
| | | | | | | | | | | | | | dst_orig should be released on error. Function like __xfrm_route_forward() expects that behavior. Since a recent commit, xfrm_lookup() may also be called by xfrm_lookup_route(), which expects the opposite. Let's introduce a new flag (XFRM_LOOKUP_KEEP_DST_REF) to tell what should be done in case of error. Fixes: f92ee61982d("xfrm: Generate blackhole routes only from route lookup functions") Signed-off-by: huaibin Wang <huaibin.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* xfrm: Generate queueing routes only from route lookup functionsSteffen Klassert2014-09-161-0/+1
| | | | | | | | | | | | | | | | Currently we genarate a queueing route if we have matching policies but can not resolve the states and the sysctl xfrm_larval_drop is disabled. Here we assume that dst_output() is called to kill the queued packets. Unfortunately this assumption is not true in all cases, so it is possible that these packets leave the system unwanted. We fix this by generating queueing routes only from the route lookup functions, here we can guarantee a call to dst_output() afterwards. Fixes: a0073fe18e71 ("xfrm: Add a state resolution packet queue") Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* xfrm: Generate blackhole routes only from route lookup functionsSteffen Klassert2014-09-161-1/+14
| | | | | | | | | | | | | | | | Currently we genarate a blackhole route route whenever we have matching policies but can not resolve the states. Here we assume that dst_output() is called to kill the balckholed packets. Unfortunately this assumption is not true in all cases, so it is possible that these packets leave the system unwanted. We fix this by generating blackhole routes only from the route lookup functions, here we can guarantee a call to dst_output() afterwards. Fixes: 2774c131b1d ("xfrm: Handle blackhole route creation via afinfo.") Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* ipv4: add a sock pointer to dst->output() path.Eric Dumazet2014-04-151-3/+11
| | | | | | | | | | | | | | | | In the dst->output() path for ipv4, the code assumes the skb it has to transmit is attached to an inet socket, specifically via ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the provider of the packet is an AF_PACKET socket. The dst->output() method gets an additional 'struct sock *sk' parameter. This needs a cascade of changes so that this parameter can be propagated from vxlan to final consumer. Fixes: 8f646c922d55 ("vxlan: keep original skb ownership") Reported-by: lucien xin <lucien.xin@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: do not overwrite inetpeer metrics prematurelyMichal Kubeček2014-03-271-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an IPv6 host route with metrics exists, an attempt to add a new route for the same target with different metrics fails but rewrites the metrics anyway: 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 rto_min lock 1s 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1500 RTNETLINK answers: File exists 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 rto_min lock 1.5s This is caused by all IPv6 host routes using the metrics in their inetpeer (or the shared default). This also holds for the new route created in ip6_route_add() which shares the metrics with the already existing route and thus ip6_route_add() rewrites the metrics even if the new route ends up not being used at all. Another problem is that old metrics in inetpeer can reappear unexpectedly for a new route, e.g. 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000 12sp0:~ # ip route del fec0::1 12sp0:~ # ip route add fec0::1 dev eth0 12sp0:~ # ip route change fec0::1 dev eth0 hoplimit 10 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 hoplimit 10 rto_min lock 1s Resolve the first problem by moving the setting of metrics down into fib6_add_rt2node() to the point we are sure we are inserting the new route into the tree. Second problem is addressed by introducing new flag DST_METRICS_FORCE_OVERWRITE which is set for a new host route in ip6_route_add() and makes ipv6_cow_metrics() always overwrite the metrics in inetpeer (even if they are not "new"); it is reset after that. v5: use a flag in _metrics member rather than one in flags v4: fix a typo making a condition always true (thanks to Hannes Frederic Sowa) v3: rewritten based on David Miller's idea to move setting the metrics (and allocation in non-host case) down to the point we already know the route is to be inserted. Also rebased to net-next as it is quite late in the cycle. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: remove now unused flag DST_NOPEERHannes Frederic Sowa2014-03-061-4/+3
| | | | | | | | | | | | | | | Commit e688a604807647 ("net: introduce DST_NOPEER dst flag") introduced DST_NOPEER because because of crashes in ipv6_select_ident called from udp6_ufo_fragment. Since commit 916e4cf46d0204 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data") we don't call ipv6_select_ident any more from ip6_ufo_append_data, thus this flag lost its purpose and can be removed. Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add utility functions to clear rxhashTom Herbert2013-12-171-3/+2
| | | | | | | | | | | | | | | | In several places 'skb->rxhash = 0' is being done to clear the rxhash value in an skb. This does not clear l4_rxhash which could still be set so that the rxhash wouldn't be recalculated on subsequent call to skb_get_rxhash. This patch adds an explict function to clear all the rxhash related information in the skb properly. skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as l4_rxhash. Fixed up places where 'skb->rxhash = 0' was being called. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-10-231-0/+12
|\ | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/qmi_wwan.c include/net/dst.h Trivial merge conflicts, both were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: dst: provide accessor function to dst->xfrmVlad Yasevich2013-10-171-0/+12
| | | | | | | | | | | | | | | | | | dst->xfrm is conditionally defined. Provide accessor funtion that is always available. Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | dst.h: Remove extern from function prototypesJoe Perches2013-09-201-13/+12
|/ | | | | | | | | | | | | There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tunnels: harmonize cleanup done on skb on rx pathNicolas Dichtel2013-09-041-5/+7
| | | | | | | | The goal of this patch is to harmonize cleanup done on a skbuff on rx path. Before this patch, behaviors were different depending of the tunnel type. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bugZhouyi Zhou2013-03-151-2/+4
| | | | | | | | | | | | | | | | | When neighbour table is full, dst_neigh_lookup/dst_neigh_lookup_skb will return -ENOBUFS which is absolutely non zero, while all the code in kernel which use above functions assume failure only on zero return which will cause panic. (for example: : https://bugzilla.kernel.org/show_bug.cgi?id=54731). This patch corrects above error with smallest changes to kernel source code and also correct two return value check missing bugs in drivers/infiniband/hw/cxgb4/cm.c Tested on my x86_64 SMP machine Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: fix race condition regarding dst->expires and dst->from.YOSHIFUJI Hideaki / 吉藤英明2013-02-201-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric Dumazet wrote: | Some strange crashes happen in rt6_check_expired(), with access | to random addresses. | | At first glance, it looks like the RTF_EXPIRES and | stuff added in commit 1716a96101c49186b | (ipv6: fix problem with expired dst cache) | are racy : same dst could be manipulated at the same time | on different cpus. | | At some point, our stack believes rt->dst.from contains a dst pointer, | while its really a jiffie value (as rt->dst.expires shares the same area | of memory) | | rt6_update_expires() should be fixed, or am I missing something ? | | CC Neil because of https://bugzilla.redhat.com/show_bug.cgi?id=892060 Because we do not have any locks for dst_entry, we cannot change essential structure in the entry; e.g., we cannot change reference to other entity. To fix this issue, split 'from' and 'expires' field in dst_entry out of union. Once it is 'from' is assigned in the constructor, keep the reference until the very last stage of the life time of the object. Of course, it is unsafe to change 'from', so make rt6_set_from simple just for fresh entries. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Neil Horman <nhorman@tuxdriver.com> CC: Gao Feng <gaofeng@cn.fujitsu.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reported-by: Steinar H. Gunderson <sesse@google.com> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm: Add a state resolution packet queueSteffen Klassert2013-02-061-0/+1
| | | | | | | | | | | | | | | | | As the default, we blackhole packets until the key manager resolves the states. This patch implements a packet queue where IPsec packets are queued until the states are resolved. We generate a dummy xfrm bundle, the output routine of the returned route enqueues the packet to a per policy queue and arms a timer that checks for state resolution when dst_output() is called. Once the states are resolved, the packets are sent out of the queue. If the states are not resolved after some time, the queue is flushed. This patch keeps the defaut behaviour to blackhole packets as long as we have no states. To enable the packet queue the sysctl xfrm_larval_drop must be switched off. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2012-08-221-1/+1
|\
| * net: force dst_default_metrics to const sectionEric Dumazet2012-08-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While investigating on network performance problems, I found this little gem : $ nm -v vmlinux | grep -1 dst_default_metrics ffffffff82736540 b busy.46605 ffffffff82736560 B dst_default_metrics ffffffff82736598 b dst_busy_list Apparently, declaring a const array without initializer put it in (writeable) bss section, in middle of possibly often dirtied cache lines. Since we really want dst_default_metrics be const to avoid any possible false sharing and catch any buggy writes, I force a null initializer. ffffffff818a4c20 R dst_default_metrics Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: output path optimizationsEric Dumazet2012-08-071-3/+7
|/ | | | | | | | | | | | | | | | | | | | | | 1) Avoid dirtying neighbour's confirmed field. TCP workloads hits this cache line for each incoming ACK. Lets write n->confirmed only if there is a jiffie change. 2) Optimize neigh_hh_output() for the common Ethernet case, were hh_len is less than 16 bytes. Replace the memcpy() call by two inlined 64bit load/stores on x86_64. Bench results using udpflood test, with -C option (MSG_CONFIRM flag added to sendto(), to reproduce the n->confirmed dirtying on UDP) 24 threads doing 1.000.000 UDP sendto() on dummy device, 4 runs. before : 2.247s, 2.235s, 2.247s, 2.318s after : 1.884s, 1.905s, 1.891s, 1.895s Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Kill routes during PMTU/redirect updates.David S. Miller2012-07-201-0/+1
| | | | | | | Mark them obsolete so there will be a re-lookup to fetch the FIB nexthop exception info. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Document dst->obsolete better.David S. Miller2012-07-201-1/+13
| | | | | | | | | Add a big comment explaining how the field works, and use defines instead of magic constants for the values assigned to it. Suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Kill set_dst_metric_rtt().David S. Miller2012-07-101-6/+0
| | | | | | No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Kill dst->_neighbour, accessors, and final uses.David S. Miller2012-07-051-16/+1
| | | | | | No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add optional SKB arg to dst_ops->neigh_lookup().David S. Miller2012-07-051-1/+7
| | | | | | | Causes the handler to use the daddr in the ipv4/ipv6 header when the route gateway is unspecified (local subnet). Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Do delayed neigh confirmation.David S. Miller2012-07-051-8/+21
| | | | | | | | | | | | | | When a dst_confirm() happens, mark the confirmation as pending in the dst. Then on the next packet out, when we have the neigh in-hand, do the update. This removes the dependency in dst_confirm() of dst's having an attached neigh. While we're here, remove the explicit 'dst' NULL check, all except 2 or 3 call sites ensure it's not NULL. So just fix those cases up. Signed-off-by: David S. Miller <davem@davemloft.net>
* include/net/dst.h: neaten asterisk placementEldad Zack2012-06-161-9/+8
| | | | | | | Fix code style - place the asterisk where it belongs. Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: fix incorrect ipsec fragmentGao feng2012-05-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | Since commit ad0081e43a "ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed" the fragment of packets is incorrect. because tunnel mode needs IPsec headers and trailer for all fragments, while on transport mode it is sufficient to add the headers to the first fragment and the trailer to the last. so modify mtu and maxfraglen base on ipsec mode and if fragment is first or last. with my test,it work well(every fragment's size is the mtu) and does not trigger slow fragment path. Changes from v1: though optimization, mtu_prev and maxfraglen_prev can be delete. replace xfrm mode codes with dst_entry's new frag DST_XFRM_TUNNEL. add fuction ip6_append_data_mtu to make codes clearer. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* set fake_rtable's dst to NULL to avoid kernel OopsPeter Huang (Peng)2012-04-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | bridge: set fake_rtable's dst to NULL to avoid kernel Oops when bridge is deleted before tap/vif device's delete, kernel may encounter an oops because of NULL reference to fake_rtable's dst. Set fake_rtable's dst to NULL before sending packets out can solve this problem. v4 reformat, change br_drop_fake_rtable(skb) to {} v3 enrich commit header v2 introducing new flag DST_FAKE_RTABLE to dst_entry struct. [ Use "do { } while (0)" for nop br_drop_fake_rtable() implementation -DaveM ] Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Peter Huang <peter.huangpeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: fix problem with expired dst cacheGao feng2012-04-131-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the ipv6 dst cache which copy from the dst generated by ICMPV6 RA packet. this dst cache will not check expire because it has no RTF_EXPIRES flag. So this dst cache will always be used until the dst gc run. Change the struct dst_entry,add a union contains new pointer from and expires. When rt6_info.rt6i_flags has no RTF_EXPIRES flag,the dst.expires has no use. we can use this field to point to where the dst cache copy from. The dst.from is only used in IPV6. rt6_check_expired check if rt6_info.dst.from is expired. ip6_rt_copy only set dst.from when the ort has flag RTF_ADDRCONF and RTF_DEFAULT.then hold the ort. ip6_dst_destroy release the ort. Add some functions to operate the RTF_EXPIRES flag and expires(from) together. and change the code to use these new adding functions. Changes from v5: modify ip6_route_add and ndisc_router_discovery to use new adding functions. Only set dst.from when the ort has flag RTF_ADDRCONF and RTF_DEFAULT.then hold the ort. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* BUG: headers with BUG/BUG_ON etc. need linux/bug.hPaul Gortmaker2012-03-041-0/+1
| | | | | | | | | | | | | If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any other BUG variant in a static inline (i.e. not in a #define) then that header really should be including <linux/bug.h> and not just expecting it to be implicitly present. We can make this change risk-free, since if the files using these headers didn't have exposure to linux/bug.h already, they would have been causing compile failures/warnings. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2011-12-231-0/+1
|\ | | | | | | | | | | | | | | | | | | Conflicts: net/bluetooth/l2cap_core.c Just two overlapping changes, one added an initialization of a local variable, and another change added a new local variable. Signed-off-by: David S. Miller <davem@davemloft.net>