summaryrefslogtreecommitdiffstats
path: root/net/ipv6/ip6_fib.c
Commit message (Collapse)AuthorAgeFilesLines
...
| * net/ipv6: Simplify route replace and appending into multipath routeDavid Ahern2018-05-221-86/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bring consistency to ipv6 route replace and append semantics. Remove rt6_qualify_for_ecmp which is just guess work. It fails in 2 cases: 1. can not replace a route with a reject route. Existing code appends a new route instead of replacing the existing one. 2. can not have a multipath route where a leg uses a dev only nexthop Existing use cases affected by this change: 1. adding a route with existing prefix and metric using NLM_F_CREATE without NLM_F_APPEND or NLM_F_EXCL (ie., what iproute2 calls 'prepend'). Existing code auto-determines that the new nexthop can be appended to an existing route to create a multipath route. This change breaks that by requiring the APPEND flag for the new route to be added to an existing one. Instead the prepend just adds another route entry. 2. route replace. Existing code replaces first matching multipath route if new route is multipath capable and fallback to first matching non-ECMP route (reject or dev only route) in case one isn't available. New behavior replaces first matching route. (Thanks to Ido for spotting this one) Note: Newer iproute2 is needed to display multipath routes with a dev-only nexthop. This is due to a bug in iproute2 and parsing nexthops. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Add fib6_lookupDavid Ahern2018-05-111-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules, but returns a FIB entry, fib6_info, rather than a dst based rt6_info. fib6_lookup is any where from 140% (MULTIPLE_TABLES config disabled) to 60% faster than any of the dst based lookup methods (without custom rules) and 25% faster with custom rules (e.g., l3mdev rule). Since the lookup function has a completely different signature, fib6_rule_action is split into 2 paths: the existing one is renamed __fib6_rule_action and a new one for the fib6_info path is added. fib6_rule_action decides which to call based on the lookup_ptr. If it is fib6_table_lookup then the new path is taken. Caller must hold rcu lock as no reference is taken on the returned fib entry. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * net/ipv6: Rename fib6_lookup to fib6_node_lookupDavid Ahern2018-05-111-6/+8
| | | | | | | | | | | | | | | | | | | | Rename fib6_lookup to fib6_node_lookup to better reflect what it returns. The fib6_lookup name will be used in a later patch for an IPv6 equivalent to IPv4's fib_lookup. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * net/ipv6: rename rt6_next to fib6_nextDavid Ahern2018-05-041-13/+13
| | | | | | | | | | | | | | | | This slipped through the cracks in the followup set to the fib6_info flip. Rename rt6_next to fib6_next. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Make from in rt6_info rcu protectedDavid Ahern2018-04-211-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a dst entry is created from a fib entry, the 'from' in rt6_info is set to the fib entry. The 'from' reference is used most notably for cookie checking - making sure stale dst entries are updated if the fib entry is changed. When a fib entry is deleted, the pcpu routes on it are walked releasing the fib6_info reference. This is needed for the fib6_info cleanup to happen and to make sure all device references are released in a timely manner. There is a race window when a FIB entry is deleted and the 'from' on the pcpu route is dropped and the pcpu route hits a cookie check. Handle this race using rcu on from. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Move release of fib6_info from pcpu routes to helperDavid Ahern2018-04-211-18/+23
| | | | | | | | | | | | | | Code move only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Remove fib6_idevDavid Ahern2018-04-191-2/+0
| | | | | | | | | | | | | | | | fib6_idev can be obtained from __in6_dev_get on the nexthop device rather than caching it in the fib6_info. Remove it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Rename fib6_info struct elementsDavid Ahern2018-04-191-85/+85
| | | | | | | | | | | | | | | | | | | | | | Change the prefix for fib6_info struct elements from rt6i_ to fib6_. rt6i_pcpu and rt6i_exception_bucket are left as is given that they point to rt6_info entries. Rename only; not functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Remove unused code and variables for rt6_infoDavid Ahern2018-04-171-22/+0
| | | | | | | | | | | | | | | | Drop unneeded elements from rt6_info struct and rearrange layout to something more relevant for the data path. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Flip FIB entries to fib6_infoDavid Ahern2018-04-171-58/+58
| | | | | | | | | | | | | | | | Convert all code paths referencing a FIB entry from rt6_info to fib6_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: separate handling of FIB entries from dst based routesDavid Ahern2018-04-171-19/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Last step before flipping the data type for FIB entries: - use fib6_info_alloc to create FIB entries in ip6_route_info_create and addrconf_dst_alloc - use fib6_info_release in place of dst_release, ip6_rt_put and rt6_release - remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt - when purging routes, drop per-cpu routes - replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release - use rt->from since it points to the FIB entry - drop references to exception bucket, fib6_metrics and per-cpu from dst entries (those are relevant for fib entries only) Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: introduce fib6_info struct and helpersDavid Ahern2018-04-171-0/+60
| | | | | | | | | | | | | | Add fib6_info struct and alloc, destroy, hold and release helpers. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Cleanup exception and cache route handlingDavid Ahern2018-04-171-15/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IPv6 FIB will only contain FIB entries with exception routes added to the FIB entry. Once this transformation is complete, FIB lookups will return a fib6_info with the lookup functions still returning a dst based rt6_info. The current code uses rt6_info for both paths and overloads the rt6_info variable usually called 'rt'. This patch introduces a new 'f6i' variable name for the result of the FIB lookup and keeps 'rt' as the dst based return variable. 'f6i' becomes a fib6_info in a later patch which is why it is introduced as f6i now; avoids the additional churn in the later patch. In addition, remove RTF_CACHE and dst checks from fib6 add and delete since they can not happen now and will never happen after the data type flip. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Add fib6_null_entryDavid Ahern2018-04-171-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | ip6_null_entry will stay a dst based return for lookups that fail to match an entry. Add a new fib6_null_entry which constitutes the root node and leafs for fibs. Replace existing references to ip6_null_entry with the new fib6_null_entry when dealing with FIBs. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: move expires into rt6_infoDavid Ahern2018-04-171-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add expires to rt6_info for FIB entries, and add fib6 helpers to manage it. Data path use of dst.expires remains. The transition is fairly straightforward: when working with fib entries, rt->dst.expires is just rt->expires, rt6_clean_expires is replaced with fib6_clean_expires, rt6_set_expires becomes fib6_set_expires, and rt6_check_expired becomes fib6_check_expired, where the fib6 versions are added by this patch. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: move metrics from dst to rt6_infoDavid Ahern2018-04-171-44/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to IPv4, add fib metrics to the fib struct, which at the moment is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics into dst by reference using refcount. To make the transition: - add dst_metrics to rt6_info. Default to dst_default_metrics if no metrics are passed during route add. No need for a separate pmtu entry; it can reference the MTU slot in fib6_metrics - ip6_convert_metrics allocates memory in the FIB entry and uses ip_metrics_convert to copy from netlink attribute to metrics entry - the convert metrics call is done in ip6_route_info_create simplifying the route add path + fib6_commit_metrics and fib6_copy_metrics and the temporary mx6_config are no longer needed - add fib6_metric_set helper to change the value of a metric in the fib entry since dst_metric_set can no longer be used - cow_metrics for IPv6 can drop to dst_cow_metrics_generic - rt6_dst_from_metrics_check is no longer needed - rt6_fill_node needs the FIB entry and dst as separate arguments to keep compatibility with existing output. Current dst address is renamed to dest. (to be consistent with IPv4 rt6_fill_node really should be split into 2 functions similar to fib_dump_info and rt_fill_info) - rt6_fill_node no longer needs the temporary metrics variable Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Move nexthop data to fib6_nhDavid Ahern2018-04-171-2/+4
| | | | | | | | | | | | | | | | | | | | Introduce fib6_nh structure and move nexthop related data from rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or lwtstate from a FIB lookup perspective are converted to use fib6_nh; datapath references to dst version are left as is. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Pass net to fib6_update_sernumDavid Ahern2018-04-171-2/+1
| | | | | | | | | | | | | | | | Pass net namespace to fib6_update_sernum. It can not be marked const as fib6_new_sernum will change ipv6.fib6_sernum. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | proc: introduce proc_create_net{,_data}Christoph Hellwig2018-05-161-17/+1
|/ | | | | | | | | Variants of proc_create{,_data} that directly take a struct seq_operations and deal with network namespaces in ->open and ->release. All callers of proc_create + seq_open_net converted over, and seq_{open,release}_net are removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de>
* net/ipv6: Move call_fib6_entry_notifiers up for route addsDavid Ahern2018-03-291-4/+12
| | | | | | | | | | | Move call to call_fib6_entry_notifiers for new IPv6 routes to right before the insertion into the FIB. At this point notifier handlers can decide the fate of the new route with a clean path to delete the potential new entry if the notifier returns non-0. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Drop pernet_operations::asyncKirill Tkhai2018-03-271-1/+0
| | | | | | | | Synchronous pernet_operations are not allowed anymore. All are asynchronous. So, drop the structure member. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/ipv6: Pass skb to route lookupDavid Ahern2018-03-041-1/+2
| | | | | | | | | | | | | IPv6 does path selection for multipath routes deep in the lookup functions. The next patch adds L4 hash option and needs the skb for the forward path. To get the skb to the relevant FIB lookup functions it needs to go through the fib rules layer, so add a lookup_data argument to the fib_lookup_arg struct. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Convert fib6_net_ops, ipv6_addr_label_ops and ip6_segments_opsKirill Tkhai2018-02-191-0/+1
| | | | | | | | | | | | | | | | | These pernet_operations register and unregister tables and lists for packets forwarding. All of the entities are per-net. Init methods makes simple initializations, and since net is not visible for foreigners at the time it is working, it can't race with anything. Exit method is executed when there are only local devices, and there mustn't be packets in-flight. Also, it looks like no one pernet_operations want to send ipv6 packets to foreign net. The same reasons are for ipv6_addr_label_ops and ip6_segments_ops. So, we are able to mark all them as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-01-191-2/+8
|\ | | | | | | | | | | | | | | | | | | | | The BPF verifier conflict was some minor contextual issue. The TUN conflict was less trivial. Cong Wang fixed a memory leak of tfile->tx_array in 'net'. This is an skb_array. But meanwhile in net-next tun changed tfile->tx_arry into tfile->tx_ring which is a ptr_ring. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: don't let tb6_root node share routes with other nodeWei Wang2018-01-181-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit 4512c43eac7e, if we add a route to the subtree of tb6_root which does not have any route attached to it yet, the current code will let tb6_root and the node in the subtree share the same route. This could cause problem cause tb6_root has RTN_INFO flag marked and the tree repair and clean up code will not work properly. This commit makes sure tb6_root->leaf points back to null_entry instead of sharing route with other node. It fixes the following syzkaller reported issue: BUG: KASAN: use-after-free in ipv6_prefix_equal include/net/ipv6.h:540 [inline] BUG: KASAN: use-after-free in fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618 Read of size 8 at addr ffff8801bc043498 by task syz-executor5/19819 CPU: 1 PID: 19819 Comm: syz-executor5 Not tainted 4.15.0-rc7+ #186 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 print_address_description+0x73/0x250 mm/kasan/report.c:252 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x25b/0x340 mm/kasan/report.c:409 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 ipv6_prefix_equal include/net/ipv6.h:540 [inline] fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618 fib6_add+0x5fa/0x1540 net/ipv6/ip6_fib.c:1214 __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1003 ip6_route_add+0x141/0x190 net/ipv6/route.c:2790 ipv6_route_ioctl+0x4db/0x6b0 net/ipv6/route.c:3299 inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520 sock_do_ioctl+0x65/0xb0 net/socket.c:958 sock_ioctl+0x2c2/0x440 net/socket.c:1055 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 SYSC_ioctl fs/ioctl.c:701 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 entry_SYSCALL_64_fastpath+0x23/0x9a RIP: 0033:0x452ac9 RSP: 002b:00007fd42b321c58 EFLAGS: 00000212 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 000000000071bea0 RCX: 0000000000452ac9 RDX: 0000000020fd7000 RSI: 000000000000890b RDI: 0000000000000013 RBP: 000000000000049e R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000212 R12: 00000000006f4f70 R13: 00000000ffffffff R14: 00007fd42b3226d4 R15: 0000000000000000 Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route") Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-01-111-9/+29
|\| | | | | | | | | | | | | | | | | | | | | | | | | BPF alignment tests got a conflict because the registers are output as Rn_w instead of just Rn in net-next, and in net a fixup for a testcase prohibits logical operations on pointers before using them. Also, we should attempt to patch BPF call args if JIT always on is enabled. Instead, if we fail to JIT the subprogs we should pass an error back up and fail immediately. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: remove null_entry before adding default routeWei Wang2018-01-091-9/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the current code, when creating a new fib6 table, tb6_root.leaf gets initialized to net->ipv6.ip6_null_entry. If a default route is being added with rt->rt6i_metric = 0xffffffff, fib6_add() will add this route after net->ipv6.ip6_null_entry. As null_entry is shared, it could cause problem. In order to fix it, set fn->leaf to NULL before calling fib6_add_rt2node() when trying to add the first default route. And reset fn->leaf to null_entry when adding fails or when deleting the last default route. syzkaller reported the following issue which is fixed by this commit: WARNING: suspicious RCU usage 4.15.0-rc5+ #171 Not tainted ----------------------------- net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 4 locks held by swapper/0/0: #0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline] #0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310 #1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline] #1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007 #2: (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560 #3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline] #3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948 stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585 fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701 fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949 fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline] fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016 fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033 call_timer_fn+0x228/0x820 kernel/time/timer.c:1320 expire_timers kernel/time/timer.c:1357 [inline] __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 invoke_softirq kernel/softirq.c:365 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:540 [inline] smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904 </IRQ> Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Calculate hash thresholds for IPv6 nexthopsIdo Schimmel2018-01-101-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before we convert IPv6 to use hash-threshold instead of modulo-N, we first need each nexthop to store its region boundary in the hash function's output space. The boundary is calculated by dividing the output space equally between the different active nexthops. That is, nexthops that are not dead or linkdown. The boundaries are rebalanced whenever a nexthop is added or removed to a multipath route and whenever a nexthop becomes active or inactive. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-01-091-15/+20
|\|
| * ipv6: fix general protection fault in fib6_add()Wei Wang2018-01-041-15/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In fib6_add(), pn could be NULL if fib6_add_1() failed to return a fib6 node. Checking pn != fn before accessing pn->leaf makes sure pn is not NULL. This fixes the following GPF reported by syzkaller: general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 3201 Comm: syzkaller001778 Not tainted 4.15.0-rc5+ #151 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:fib6_add+0x736/0x15a0 net/ipv6/ip6_fib.c:1244 RSP: 0018:ffff8801c7626a70 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000020 RCX: ffffffff84794465 RDX: 0000000000000004 RSI: ffff8801d38935f0 RDI: 0000000000000282 RBP: ffff8801c7626da0 R08: 1ffff10038ec4c35 R09: 0000000000000000 R10: ffff8801c7626c68 R11: 0000000000000000 R12: 00000000fffffffe R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000009 FS: 0000000000000000(0000) GS:ffff8801db200000(0063) knlGS:0000000009b70840 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 0000000020be1000 CR3: 00000001d585a006 CR4: 00000000001606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1006 ip6_route_multipath_add+0xd14/0x16c0 net/ipv6/route.c:3833 inet6_rtm_newroute+0xdc/0x160 net/ipv6/route.c:3957 rtnetlink_rcv_msg+0x733/0x1020 net/core/rtnetlink.c:4411 netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2408 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4423 netlink_unicast_kernel net/netlink/af_netlink.c:1275 [inline] netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1301 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1864 sock_sendmsg_nosec net/socket.c:636 [inline] sock_sendmsg+0xca/0x110 net/socket.c:646 sock_write_iter+0x31a/0x5d0 net/socket.c:915 call_write_iter include/linux/fs.h:1772 [inline] do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653 do_iter_write+0x154/0x540 fs/read_write.c:932 compat_writev+0x225/0x420 fs/read_write.c:1246 do_compat_writev+0x115/0x220 fs/read_write.c:1267 C_SYSC_writev fs/read_write.c:1278 [inline] compat_SyS_writev+0x26/0x30 fs/read_write.c:1274 do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline] do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389 entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:125 Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Take table lock outside of sernum update functionIdo Schimmel2018-01-071-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The next patch is going to allow dead routes to remain in the FIB tree in certain situations. When this happens we need to be sure to bump the sernum of the nodes where these are stored so that potential copies cached in sockets are invalidated. The function that performs this update assumes the table lock is not taken when it is invoked, but that will not be the case when it is invoked by the tree walker. Have the function assume the lock is taken and make the single caller take the lock itself. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Export sernum update functionIdo Schimmel2018-01-071-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are going to allow dead routes to stay in the FIB tree (e.g., when they are part of a multipath route, directly connected route with no carrier) and revive them when their nexthop device gains carrier or when it is put administratively up. This is equivalent to the addition of the route to the FIB tree and we should therefore take care of updating the sernum of all the parent nodes of the node where the route is stored. Otherwise, we risk sockets caching and using sub-optimal dst entries. Export the function that performs the above, so that it could be invoked from fib6_ifup() later on. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Teach tree walker to skip multipath routesIdo Schimmel2018-01-071-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As explained in previous patch, fib6_ifdown() needs to consider the state of all the sibling routes when a multipath route is traversed. This is done by evaluating all the siblings when the first sibling in a multipath route is traversed. If the multipath route does not need to be flushed (e.g., not all siblings are dead), then we should just skip the multipath route as our work is done. Have the tree walker jump to the last sibling when it is determined that the multipath route needs to be skipped. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Remove redundant route flushing during namespace dismantleIdo Schimmel2018-01-071-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By the time fib6_net_exit() is executed all the netdevs in the namespace have been either unregistered or pushed back to the default namespace. That is because pernet subsys operations are always ordered before pernet device operations and therefore invoked after them during namespace dismantle. Thus, all the routing tables in the namespace are empty by the time fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed. This allows us to simplify the condition in fib6_ifdown() as it's only ever called with an actual netdev. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: remove __rtnl_registerFlorian Westphal2017-12-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | This removes __rtnl_register and switches callers to either rtnl_register or rtnl_register_module. Also, rtnl_register() will now print an error if memory allocation failed rather than panic the kernel. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Move rt6_next from dst_entry into ipv6 route structure.David Miller2017-11-301-13/+13
|/ | | | | Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
* treewide: setup_timer() -> timer_setup() (2 field)Kees Cook2017-11-211-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This converts all remaining setup_timer() calls that use a nested field to reach a struct timer_list. Coccinelle does not have an easy way to match multiple fields, so a new script is needed to change the matches of "&_E->_timer" into "&_E->_field1._timer" in all the rules. spatch --very-quiet --all-includes --include-headers \ -I ./arch/x86/include -I ./arch/x86/include/generated \ -I ./include -I ./arch/x86/include/uapi \ -I ./arch/x86/include/generated/uapi -I ./include/uapi \ -I ./include/generated/uapi --include ./include/linux/kconfig.h \ --dir . \ --cocci-file ~/src/data/timer_setup-2fields.cocci @fix_address_of depends@ expression e; @@ setup_timer( -&(e) +&e , ...) // Update any raw setup_timer() usages that have a NULL callback, but // would otherwise match change_timer_function_usage, since the latter // will update all function assignments done in the face of a NULL // function initialization in setup_timer(). @change_timer_function_usage_NULL@ expression _E; identifier _field1; identifier _timer; type _cast_data; @@ ( -setup_timer(&_E->_field1._timer, NULL, _E); +timer_setup(&_E->_field1._timer, NULL, 0); | -setup_timer(&_E->_field1._timer, NULL, (_cast_data)_E); +timer_setup(&_E->_field1._timer, NULL, 0); | -setup_timer(&_E._field1._timer, NULL, &_E); +timer_setup(&_E._field1._timer, NULL, 0); | -setup_timer(&_E._field1._timer, NULL, (_cast_data)&_E); +timer_setup(&_E._field1._timer, NULL, 0); ) @change_timer_function_usage@ expression _E; identifier _field1; identifier _timer; struct timer_list _stl; identifier _callback; type _cast_func, _cast_data; @@ ( -setup_timer(&_E->_field1._timer, _callback, _E); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, &_callback, _E); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, _callback, (_cast_data)_E); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, &_callback, (_cast_data)_E); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, (_cast_func)_callback, _E); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, (_cast_func)&_callback, _E); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, (_cast_func)_callback, (_cast_data)_E); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, (_cast_func)&_callback, (_cast_data)_E); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, _callback, (_cast_data)_E); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, _callback, (_cast_data)&_E); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, &_callback, (_cast_data)_E); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, &_callback, (_cast_data)&_E); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, (_cast_func)_callback, (_cast_data)_E); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, (_cast_func)_callback, (_cast_data)&_E); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, (_cast_func)&_callback, (_cast_data)_E); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, (_cast_func)&_callback, (_cast_data)&_E); +timer_setup(&_E._field1._timer, _callback, 0); | _E->_field1._timer@_stl.function = _callback; | _E->_field1._timer@_stl.function = &_callback; | _E->_field1._timer@_stl.function = (_cast_func)_callback; | _E->_field1._timer@_stl.function = (_cast_func)&_callback; | _E._field1._timer@_stl.function = _callback; | _E._field1._timer@_stl.function = &_callback; | _E._field1._timer@_stl.function = (_cast_func)_callback; | _E._field1._timer@_stl.function = (_cast_func)&_callback; ) // callback(unsigned long arg) @change_callback_handle_cast depends on change_timer_function_usage@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._field1; identifier change_timer_function_usage._timer; type _origtype; identifier _origarg; type _handletype; identifier _handle; @@ void _callback( -_origtype _origarg +struct timer_list *t ) { ( ... when != _origarg _handletype *_handle = -(_handletype *)_origarg; +from_timer(_handle, t, _field1._timer); ... when != _origarg | ... when != _origarg _handletype *_handle = -(void *)_origarg; +from_timer(_handle, t, _field1._timer); ... when != _origarg | ... when != _origarg _handletype *_handle; ... when != _handle _handle = -(_handletype *)_origarg; +from_timer(_handle, t, _field1._timer); ... when != _origarg | ... when != _origarg _handletype *_handle; ... when != _handle _handle = -(void *)_origarg; +from_timer(_handle, t, _field1._timer); ... when != _origarg ) } // callback(unsigned long arg) without existing variable @change_callback_handle_cast_no_arg depends on change_timer_function_usage && !change_callback_handle_cast@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._field1; identifier change_timer_function_usage._timer; type _origtype; identifier _origarg; type _handletype; @@ void _callback( -_origtype _origarg +struct timer_list *t ) { + _handletype *_origarg = from_timer(_origarg, t, _field1._timer); + ... when != _origarg - (_handletype *)_origarg + _origarg ... when != _origarg } // Avoid already converted callbacks. @match_callback_converted depends on change_timer_function_usage && !change_callback_handle_cast && !change_callback_handle_cast_no_arg@ identifier change_timer_function_usage._callback; identifier t; @@ void _callback(struct timer_list *t) { ... } // callback(struct something *handle) @change_callback_handle_arg depends on change_timer_function_usage && !match_callback_converted && !change_callback_handle_cast && !change_callback_handle_cast_no_arg@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._field1; identifier change_timer_function_usage._timer; type _handletype; identifier _handle; @@ void _callback( -_handletype *_handle +struct timer_list *t ) { + _handletype *_handle = from_timer(_handle, t, _field1._timer); ... } // If change_callback_handle_arg ran on an empty function, remove // the added handler. @unchange_callback_handle_arg depends on change_timer_function_usage && change_callback_handle_arg@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._field1; identifier change_timer_function_usage._timer; type _handletype; identifier _handle; identifier t; @@ void _callback(struct timer_list *t) { - _handletype *_handle = from_timer(_handle, t, _field1._timer); } // We only want to refactor the setup_timer() data argument if we've found // the matching callback. This undoes changes in change_timer_function_usage. @unchange_timer_function_usage depends on change_timer_function_usage && !change_callback_handle_cast && !change_callback_handle_cast_no_arg && !change_callback_handle_arg@ expression change_timer_function_usage._E; identifier change_timer_function_usage._field1; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type change_timer_function_usage._cast_data; @@ ( -timer_setup(&_E->_field1._timer, _callback, 0); +setup_timer(&_E->_field1._timer, _callback, (_cast_data)_E); | -timer_setup(&_E._field1._timer, _callback, 0); +setup_timer(&_E._field1._timer, _callback, (_cast_data)&_E); ) // If we fixed a callback from a .function assignment, fix the // assignment cast now. @change_timer_function_assignment depends on change_timer_function_usage && (change_callback_handle_cast || change_callback_handle_cast_no_arg || change_callback_handle_arg)@ expression change_timer_function_usage._E; identifier change_timer_function_usage._field1; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type _cast_func; typedef TIMER_FUNC_TYPE; @@ ( _E->_field1._timer.function = -_callback +(TIMER_FUNC_TYPE)_callback ; | _E->_field1._timer.function = -&_callback +(TIMER_FUNC_TYPE)_callback ; | _E->_field1._timer.function = -(_cast_func)_callback; +(TIMER_FUNC_TYPE)_callback ; | _E->_field1._timer.function = -(_cast_func)&_callback +(TIMER_FUNC_TYPE)_callback ; | _E._field1._timer.function = -_callback +(TIMER_FUNC_TYPE)_callback ; | _E._field1._timer.function = -&_callback; +(TIMER_FUNC_TYPE)_callback ; | _E._field1._timer.function = -(_cast_func)_callback +(TIMER_FUNC_TYPE)_callback ; | _E._field1._timer.function = -(_cast_func)&_callback +(TIMER_FUNC_TYPE)_callback ; ) // Sometimes timer functions are called directly. Replace matched args. @change_timer_function_calls depends on change_timer_function_usage && (change_callback_handle_cast || change_callback_handle_cast_no_arg || change_callback_handle_arg)@ expression _E; identifier change_timer_function_usage._field1; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type _cast_data; @@ _callback( ( -(_cast_data)_E +&_E->_field1._timer | -(_cast_data)&_E +&_E._field1._timer | -_E +&_E->_field1._timer ) ) // If a timer has been configured without a data argument, it can be // converted without regard to the callback argument, since it is unused. @match_timer_function_unused_data@ expression _E; identifier _field1; identifier _timer; identifier _callback; @@ ( -setup_timer(&_E->_field1._timer, _callback, 0); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, _callback, 0L); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E->_field1._timer, _callback, 0UL); +timer_setup(&_E->_field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, _callback, 0); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, _callback, 0L); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_E._field1._timer, _callback, 0UL); +timer_setup(&_E._field1._timer, _callback, 0); | -setup_timer(&_field1._timer, _callback, 0); +timer_setup(&_field1._timer, _callback, 0); | -setup_timer(&_field1._timer, _callback, 0L); +timer_setup(&_field1._timer, _callback, 0); | -setup_timer(&_field1._timer, _callback, 0UL); +timer_setup(&_field1._timer, _callback, 0); | -setup_timer(_field1._timer, _callback, 0); +timer_setup(_field1._timer, _callback, 0); | -setup_timer(_field1._timer, _callback, 0L); +timer_setup(_field1._timer, _callback, 0); | -setup_timer(_field1._timer, _callback, 0UL); +timer_setup(_field1._timer, _callback, 0); ) @change_callback_unused_data depends on match_timer_function_unused_data@ identifier match_timer_function_unused_data._callback; type _origtype; identifier _origarg; @@ void _callback( -_origtype _origarg +struct timer_list *unused ) { ... when != _origarg } Signed-off-by: Kees Cook <keescook@chromium.org>
* net: Add extack to fib_notifier_infoDavid Ahern2017-11-011-6/+9
| | | | | | | | | | | Add extack to fib_notifier_info and plumb through stack to call_fib_rule_notifiers, call_fib_entry_notifiers and call_fib6_entry_notifiers. This allows notifer handlers to return messages to user. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: mark expected switch fall-throughsGustavo A. R. Silva2017-10-181-0/+4
| | | | | | | | | | | In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Notice that in some cases I placed the "fall through" comment on its own line, which is what GCC is expecting to find. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: check fn before doing FIB6_SUBTREE(fn)Wei Wang2017-10-161-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In fib6_locate(), we need to first make sure fn is not NULL before doing FIB6_SUBTREE(fn) to avoid crash. This fixes the following static checker warning: net/ipv6/ip6_fib.c:1462 fib6_locate() warn: variable dereferenced before check 'fn' (see line 1459) net/ipv6/ip6_fib.c 1458 if (src_len) { 1459 struct fib6_node *subtree = FIB6_SUBTREE(fn); ^^^^^^^^^^^^^^^^ We shifted this dereference 1460 1461 WARN_ON(saddr == NULL); 1462 if (fn && subtree) ^^ before the check for NULL. 1463 fn = fib6_locate_1(subtree, saddr, src_len, 1464 offsetof(struct rt6_info, rt6i_src) Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: use rcu_dereference_bh() in ipv6_route_seq_next()Wei Wang2017-10-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch replaces rcu_deference() with rcu_dereference_bh() in ipv6_route_seq_next() to avoid the following warning: [ 19.431685] WARNING: suspicious RCU usage [ 19.433451] 4.14.0-rc3-00914-g66f5d6c #118 Not tainted [ 19.435509] ----------------------------- [ 19.437267] net/ipv6/ip6_fib.c:2259 suspicious rcu_dereference_check() usage! [ 19.440790] [ 19.440790] other info that might help us debug this: [ 19.440790] [ 19.444734] [ 19.444734] rcu_scheduler_active = 2, debug_locks = 1 [ 19.447757] 2 locks held by odhcpd/3720: [ 19.449480] #0: (&p->lock){+.+.}, at: [<ffffffffb1231f7d>] seq_read+0x3c/0x333 [ 19.452720] #1: (rcu_read_lock_bh){....}, at: [<ffffffffb1d2b984>] ipv6_route_seq_start+0x5/0xfd [ 19.456323] [ 19.456323] stack backtrace: [ 19.458812] CPU: 0 PID: 3720 Comm: odhcpd Not tainted 4.14.0-rc3-00914-g66f5d6c #118 [ 19.462042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 19.465414] Call Trace: [ 19.466788] dump_stack+0x86/0xc0 [ 19.468358] lockdep_rcu_suspicious+0xea/0xf3 [ 19.470183] ipv6_route_seq_next+0x71/0x164 [ 19.471963] seq_read+0x244/0x333 [ 19.473522] proc_reg_read+0x48/0x67 [ 19.475152] ? proc_reg_write+0x67/0x67 [ 19.476862] __vfs_read+0x26/0x10b [ 19.478463] ? __might_fault+0x37/0x84 [ 19.480148] vfs_read+0xba/0x146 [ 19.481690] SyS_read+0x51/0x8e [ 19.483197] do_int80_syscall_32+0x66/0x15a [ 19.484969] entry_INT80_compat+0x32/0x50 [ 19.486707] RIP: 0023:0xf7f0be8e [ 19.488244] RSP: 002b:00000000ffa75d04 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 [ 19.491431] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 0000000008056068 [ 19.493886] RDX: 0000000000001000 RSI: 0000000008056008 RDI: 0000000000001000 [ 19.496331] RBP: 00000000000001ff R08: 0000000000000000 R09: 0000000000000000 [ 19.498768] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 19.501217] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Reported-by: Xiaolong Ye <xiaolong.ye@intel.com> Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: take care of rt6_statsWei Wang2017-10-071-18/+24
| | | | | | | | | | | | | | | | | Currently, most of the rt6_stats are not hooked up correctly. As the last part of this patch series, hook up all existing rt6_stats and add one new stat fib_rt_uncache to indicate the number of routes in the uncached list. For details of the stats, please refer to the comments added in include/net/ip6_fib.h. Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified under a lock. So atomic_t is used for them. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: replace rwlock with rcu and spinlock in fib6_tableWei Wang2017-10-071-161/+244
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With all the preparation work before, we are now ready to replace rwlock with rcu and spinlock in fib6_table. That means now all fib6_node in fib6_table are protected by rcu. And when freeing fib6_node, call_rcu() is used to wait for the rcu grace period before releasing the memory. When accessing fib6_node, corresponding rcu APIs need to be used. And all previous sessions protected by the write lock will now be protected by the spin lock per table. All previous sessions protected by read lock will now be protected by rcu_read_lock(). A couple of things to note here: 1. As part of the work of replacing rwlock with rcu, the linked list of fn->leaf now has to be rcu protected as well. So both fn->leaf and rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are used when manipulating them. 2. For fn->rr_ptr, first of all, it also needs to be rcu protected now and is tagged with __rcu and rcu APIs are used in corresponding places. Secondly, fn->rr_ptr is changed in rt6_select() which is a reader thread. This makes the issue a bit complicated. We think a valid solution for it is to let rt6_select() grab the tb6_lock if it decides to change it. As it is not in the normal operation and only happens when there is no valid neighbor cache for the route, we think the performance impact should be low. 3. fib6_walk_continue() has to be called with tb6_lock held even in the route dumping related functions, e.g. inet6_dump_fib(), fib6_tables_dump() and ipv6_route_seq_ops. It is because fib6_walk_continue() makes modifications to the walker structure, and so are fib6_repair_tree() and fib6_del_route(). In order to do proper syncing between them, we need to let fib6_walk_continue() hold the lock. We may be able to do further improvement on the way we do the tree walk to get rid of the need for holding the spin lock. But not for now. 4. When fib6_del_route() removes a route from the tree, we no longer mark rt->dst.rt6_next to NULL to make simultaneous reader be able to further traverse the list with rcu. However, rt->dst.rt6_next is only valid within this same rcu period. No one should access it later. 5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be performed before we publish this route (either by linking it to fn->leaf or insert it in the list pointed by fn->leaf) just to be safe because as soon as we publish the route, some read thread will be able to access it. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: check fn->leaf before it is usedWei Wang2017-10-071-5/+18
| | | | | | | | | | | | | | | | | | | If rwlock is replaced with rcu and spinlock, it is possible that the reader thread will see fn->leaf as NULL in the following scenarios: 1. fib6_add() is in progress and we have already inserted a new node but not yet inserted the route. 2. fib6_del_route() is in progress and we have already set fn->leaf to NULL but not yet freed the node because of rcu grace period. This patch makes sure all the reader threads check fn->leaf first before using it. And together with later patch to grab rcu_read_lock() and rcu_dereference() fn->leaf, it makes sure reader threads are safe when accessing fn->leaf. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: update fn_sernum after route is inserted to treeWei Wang2017-10-071-18/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | fib6_add() logic currently calls fib6_add_1() to figure out what node should be used for the newly added route and then call fib6_add_rt2node() to insert the route to the node. And during the call of fib6_add_1(), fn_sernum is updated for all nodes that share the same prefix as the new route. This does not have issue in the current code because reader thread will not be able to access the tree while writer thread is inserting new route to it. However, it is not the case once we transition to use RCU. Reader thread could potentially see the new fn_sernum before the new route is inserted. As a result, reader thread's route lookup will return a stale route with the new fn_sernum. In order to solve this issue, we remove all the update of fn_sernum in fib6_add_1(), and instead, introduce a new function that updates fn_sernum for all related nodes and call this functions once the route is successfully inserted to the tree. Also, smp_wmb() is used after a route is successfully inserted into the fib tree and right before the updated of fn->sernum. And smp_rmb() is used right after fn->sernum is accessed in rt6_get_cookie_safe(). This is to guarantee that when the reader thread sees the new fn->sernum, the new route is already inserted in the tree in memory. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: don't release rt->rt6i_pcpu memory during rt6_release()Wei Wang2017-10-071-3/+0
| | | | | | | | | | | | | | | After rwlock is replaced with rcu and spinlock, route lookup can happen simultanously with route deletion. This patch removes the call to free_percpu(rt->rt6i_pcpu) from rt6_release() to avoid the race condition between rt6_release() and rt6_get_pcpu_route(). And as free_percpu(rt->rt6i_pcpu) is already called in ip6_dst_destroy() after the rcu grace period, it is safe to do this change. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: hook up exception table to store dst cacheWei Wang2017-10-071-78/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit makes use of the exception hash table implementation to store dst caches created by pmtu discovery and ip redirect into the hash table under the rt_info and no longer inserts these routes into fib6 tree. This makes the fib6 tree only contain static configured routes and could now be protected by rcu instead of a rw lock. With this change, in the route lookup related functions, after finding the rt6_info with the longest prefix, we also need to search for the exception table before doing backtracking. In the route delete function, if the route being deleted is not a dst cache, deletion of this route also need to flush the whole hash table under it. If it is a dst cache, then only delete the cached dst in the hash table. Note: for fib6_walk_continue() function, w->root now is always pointing to a root node considering that fib6_prune_clones() is removed from the code. So we add a WARN_ON() msg to make sure w->root always points to a root node and also removed the update of w->root in fib6_repair_tree(). This is a prerequisite for later patch because we don't need to make w->root as rcu protected when replacing rwlock with RCU. Also, we remove all prune related variables as it is no longer used. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: prepare fib6_locate() for exception tableWei Wang2017-10-071-7/+23
| | | | | | | | | | | | | | | | | | | fib6_locate() is used to find the fib6_node according to the passed in prefix address key. It currently tries to find the fib6_node with the exact match of the passed in key. However, when we move cached routes into the exception table, fib6_locate() will fail to find the fib6_node for it as the cached routes will be stored in the exception table under the fib6_node with the longest prefix match of the cache's dst addr key. This commit adds a new parameter to let the caller specify if it needs exact match or longest prefix match. Right now, all callers still does exact match when calling fib6_locate(). It will be changed in later commit where exception table is hooked up to store cached routes. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: prepare fib6_age() for exception tableWei Wang2017-10-071-17/+9
| | | | | | | | | | | | | | | If all dst cache entries are stored in the exception table under the main route, we have to go through them during fib6_age() when doing garbage collecting. Introduce a new function rt6_age_exception() which goes through all dst entries in the exception table and remove those entries that are expired. This function is called in fib6_age() so that all dst caches are also garbage collected. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: introduce a new function fib6_update_sernum()Wei Wang2017-10-071-0/+14
| | | | | | | | | | | | This function takes a route as input and tries to update the sernum in the fib6_node this route is associated with. It will be used in later commit when adding a cached route into the exception table under that route. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>