summaryrefslogtreecommitdiffstats
path: root/include/net
Commit message (Collapse)AuthorAgeFilesLines
* net: use indirect call wrappers at GRO transport layerPaolo Abeni2018-12-151-0/+7
| | | | | | | | | | | | | | | | | | This avoids an indirect call in the receive path for TCP and UDP packets. TCP takes precedence on UDP, so that we have a single additional conditional in the common case. When IPV6 is build as module, all gro symbols except UDPv6 are builtin, while the latter belong to the ipv6 module, so we need some special care. v1 -> v2: - adapted to INDIRECT_CALL_ changes v2 -> v3: - fix build issue with CONFIG_IPV6=m Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use indirect call wrappers at GRO network layerPaolo Abeni2018-12-151-0/+2
| | | | | | | | | | | | | | This avoids an indirect calls for L3 GRO receive path, both for ipv4 and ipv6, if the latter is not compiled as a module. Note that when IPv6 is compiled as builtin, it will be checked first, so we have a single additional compare for the more common path. v1 -> v2: - adapted to INDIRECT_CALL_ changes Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* neighbor: Improve neighbour struct layoutDavid Ahern2018-12-151-2/+2
| | | | | | | | Move arp_queue_len_bytes ahead of arp_queue to remove two 4-byte holes. Ensure ha element is always 8-byte aligned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* neighbor: Move neigh_update_ext_learned to core fileDavid Ahern2018-12-141-18/+0
| | | | | | | | | neigh_update_ext_learned has one caller in neighbour.c so does not need to be defined in the header. Move it and in the process remove the intialization of ndm_flags and just set it based on the flags check. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: fold tcf_block_cb_call() into tc_setup_cb_call()Cong Wang2018-12-141-2/+2
| | | | | | | | | | | | | | | After commit 69bd48404f25 ("net/sched: Remove egdev mechanism"), tc_setup_cb_call() is nearly identical to tcf_block_cb_call(), so we can just fold tcf_block_cb_call() into tc_setup_cb_call() and remove its unused parameter 'exts'. Fixes: 69bd48404f25 ("net/sched: Remove egdev mechanism") Cc: Oz Shlomo <ozsh@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: switchdev: Add extack to switchdev_handle_port_obj_add() callbackPetr Machata2018-12-121-2/+4
| | | | | | | | | | | | | | | Drivers use switchdev_handle_port_obj_add() to handle recursive descent through lower devices. Change this function prototype to take add_cb that itself takes an extack argument. Decode extack from switchdev_notifier_port_obj_info and pass it to add_cb. Update mlxsw and ocelot drivers which use this helper. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: switchdev: Add extack to struct switchdev_notifier_infoPetr Machata2018-12-121-2/+11
| | | | | | | | | | | | | In order to pass extack to the drivers that need it, add an extack field to struct switchdev_notifier_info, and an extack argument to the function call_switchdev_blocking_notifiers(). Also add a helper function switchdev_notifier_info_to_extack(). Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: switchdev: Add extack argument to switchdev_port_obj_add()Petr Machata2018-12-121-2/+4
| | | | | | | | | | | | | After the previous patch, bridge driver has extack argument available to pass to switchdev. Therefore extend switchdev_port_obj_add() with this argument, updating all callers, and passing the argument through to switchdev_port_obj_notify(). Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: Remove egdev mechanismOz Shlomo2018-12-101-30/+0
| | | | | | | | | | | | The egdev mechanism was replaced by the TC indirect block notifications platform. Signed-off-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Cc: John Hurley <john.hurley@netronome.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
* net: Add netif_is_gretap()/netif_is_ip6gretap()Oz Shlomo2018-12-101-2/+11
| | | | | | | | | | | | | | | Changed the is_gretap_dev and is_ip6gretap_dev logic from structure comparison to string comparison of the rtnl_link_ops kind field. This approach aligns with the current identification methods and function names of vxlan and geneve network devices. Convert mlxsw to use these helpers and use them in downstream mlx5 patch. Signed-off-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-12-093-5/+30
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several conflicts, seemingly all over the place. I used Stephen Rothwell's sample resolutions for many of these, if not just to double check my own work, so definitely the credit largely goes to him. The NFP conflict consisted of a bug fix (moving operations past the rhashtable operation) while chaning the initial argument in the function call in the moved code. The net/dsa/master.c conflict had to do with a bug fix intermixing of making dsa_master_set_mtu() static with the fixing of the tagging attribute location. cls_flower had a conflict because the dup reject fix from Or overlapped with the addition of port range classifiction. __set_phy_supported()'s conflict was relatively easy to resolve because Andrew fixed it in both trees, so it was just a matter of taking the net-next copy. Or at least I think it was :-) Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup() intermixed with changes on how the sdif and caller_net are calculated in these code paths in net-next. The remaining BPF conflicts were largely about the addition of the __bpf_md_ptr stuff in 'net' overlapping with adjustments and additions to the relevant data structure where the MD pointer macros are used. Signed-off-by: David S. Miller <davem@davemloft.net>
| * neighbour: Avoid writing before skb->head in neigh_hh_output()Stefano Brivio2018-12-071-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While skb_push() makes the kernel panic if the skb headroom is less than the unaligned hardware header size, it will proceed normally in case we copy more than that because of alignment, and we'll silently corrupt adjacent slabs. In the case fixed by the previous patch, "ipv6: Check available headroom in ip6_xmit() even without options", we end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware header and write 16 bytes, starting 2 bytes before the allocated buffer. Always check we're not writing before skb->head and, if the headroom is not enough, warn and drop the packet. v2: - instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet (Eric Dumazet) - if we avoid the panic, though, we need to explicitly check the headroom before the memcpy(), otherwise we'll have corrupted slabs on a running kernel, after we warn - use __skb_push() instead of skb_push(), as the headroom check is already implemented here explicitly (Eric Dumazet) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: frag_point sanity checkJakub Audykowicz2018-12-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If for some reason an association's fragmentation point is zero, sctp_datamsg_from_user will try to endlessly try to divide a message into zero-sized chunks. This eventually causes kernel panic due to running out of memory. Although this situation is quite unlikely, it has occurred before as reported. I propose to add this simple last-ditch sanity check due to the severity of the potential consequences. Signed-off-by: Jakub Audykowicz <jakub.audykowicz@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: kfree_rcu asocXin Long2018-12-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences a transport's asoc under rcu_read_lock while asoc is freed not after a grace period, which leads to a use-after-free panic. This patch fixes it by calling kfree_rcu to make asoc be freed after a grace period. Note that only the asoc's memory is delayed to free in the patch, it won't cause sk to linger longer. Thanks Neil and Marcelo to make this clear. Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable") Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new transport") Reported-by: syzbot+0b05d8aa7cb185107483@syzkaller.appspotmail.com Reported-by: syzbot+aad231d51b1923158444@syzkaller.appspotmail.com Suggested-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | neighbor: Improve garbage collectionDavid Ahern2018-12-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing garbage collection algorithm has a number of problems: 1. The gc algorithm will not evict PERMANENT entries as those entries are managed by userspace, yet the existing algorithm walks the entire hash table which means it always considers PERMANENT entries when looking for entries to evict. In some use cases (e.g., EVPN) there can be tens of thousands of PERMANENT entries leading to wasted CPU cycles when gc kicks in. As an example, with 32k permanent entries, neigh_alloc has been observed taking more than 4 msec per invocation. 2. Currently, when the number of neighbor entries hits gc_thresh2 and the last flush for the table was more than 5 seconds ago gc kicks in walks the entire hash table evicting *all* entries not in PERMANENT or REACHABLE state and not marked as externally learned. There is no discriminator on when the neigh entry was created or if it just moved from REACHABLE to another NUD_VALID state (e.g., NUD_STALE). It is possible for entries to be created or for established neighbor entries to be moved to STALE (e.g., an external node sends an ARP request) right before the 5 second window lapses: -----|---------x|----------|----- t-5 t t+5 If that happens those entries are evicted during gc causing unnecessary thrashing on neighbor entries and userspace caches trying to track them. Further, this contradicts the description of gc_thresh2 which says "Entries older than 5 seconds will be cleared". One workaround is to make gc_thresh2 == gc_thresh3 but that negates the whole point of having separate thresholds. 3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries when gc_thresh2 is exceeded is over kill and contributes to trashing especially during startup. This patch addresses these problems as follows: 1. Use of a separate list_head to track entries that can be garbage collected along with a separate counter. PERMANENT entries are not added to this list. The gc_thresh parameters are only compared to the new counter, not the total entries in the table. The forced_gc function is updated to only walk this new gc_list looking for entries to evict. 2. Entries are added to the list head at the tail and removed from the front. 3. Entries are only evicted if they were last updated more than 5 seconds ago, adhering to the original intent of gc_thresh2. 4. Forced gc is stopped once the number of gc_entries drops below gc_thresh2. 5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped when allocating a new neighbor for a PERMANENT entry. By extension this means there are no explicit limits on the number of PERMANENT entries that can be created, but this is no different than FIB entries or FDB entries. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | vxlan: Add vxlan_fdb_clear_offload()Petr Machata2018-12-071-0/+6
| | | | | | | | | | | | | | | | | | | | When a driver unoffloads all FDB entries en bloc, it's inefficient to send the switchdev notification one by one. Add a helper that walks the FDB table, unsetting the offload flag on RDST with a given VNI. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | vxlan: Add vxlan_fdb_replay()Petr Machata2018-12-071-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | When a VXLAN device becomes relevant to a driver (such as when it is attached to an offloaded bridge), the driver will generally need to walk the existing FDB entries and offload them. Add a function vxlan_fdb_replay() to call a given notifier block for each FDB entry with a given VNI. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: dsa: Add overhead to tag protocol ops.Andrew Lunn2018-12-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | Each DSA tag protocol needs to add additional headers to the Ethernet frame in order to direct it towards a specific switch egress port. It must also remove the head from a frame received from a switch. Indicate the maximum size of these headers in the tag protocol ops structure, so the core can take these overheads into account. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWATEric Dumazet2018-12-042-7/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12 as a step to enable bigger tcp sndbuf limits. It works reasonably well, but the following happens : Once the limit is reached, TCP stack generates an [E]POLLOUT event for every incoming ACK packet. This causes a high number of context switches. This patch implements the strategy David Miller added in sock_def_write_space() : - If TCP socket has a notsent_lowat constraint of X bytes, allow sendmsg() to fill up to X bytes, but send [E]POLLOUT only if number of notsent bytes is below X/2 This considerably reduces TCP_NOTSENT_LOWAT overhead, while allowing to keep the pipe full. Tested: 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM A:/# cat /proc/sys/net/ipv4/tcp_wmem 4096 262144 64000000 A:/# super_netperf 100 -H B -l 1000 -- -K bbr & A:/# grep TCP /proc/net/sockstat TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/ A:/# vmstat 5 5 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 256220672 13532 694976 0 0 10 0 28 14 0 1 99 0 0 2 0 0 256320016 13532 698480 0 0 512 0 715901 5927 0 10 90 0 0 0 0 0 256197232 13532 700992 0 0 735 13 771161 5849 0 11 89 0 0 1 0 0 256233824 13532 703320 0 0 512 23 719650 6635 0 11 89 0 0 2 0 0 256226880 13532 705780 0 0 642 4 775650 6009 0 12 88 0 0 A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat A:/# grep TCP /proc/net/sockstat TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow A:/# vmstat 5 5 # check that context switches have not inflated too much. procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 2 0 0 260386512 13592 662148 0 0 10 0 17 14 0 1 99 0 0 0 0 0 260519680 13592 604184 0 0 512 13 726843 12424 0 10 90 0 0 1 1 0 260435424 13592 598360 0 0 512 25 764645 12925 0 10 90 0 0 1 0 0 260855392 13592 578380 0 0 512 7 722943 13624 0 11 88 0 0 1 0 0 260445008 13592 601176 0 0 614 34 772288 14317 0 10 90 0 0 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | l3mdev: add function to retreive upper masterAlexis Bauvin2018-12-031-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Existing functions to retreive the l3mdev of a device did not walk the master chain to find the upper master. This patch adds a function to find the l3mdev, even indirect through e.g. a bridge: +----------+ | | | vrf-blue | | | +----+-----+ | | +----+-----+ | | | br-blue | | | +----+-----+ | | +----+-----+ | | | eth0 | | | +----------+ This will properly resolve the l3mdev of eth0 to vrf-blue. Signed-off-by: Alexis Bauvin <abauvin@scaleway.com> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: Amine Kherbouche <akherbouche@scaleway.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | udp_tunnel: add config option to bind to a deviceAlexis Bauvin2018-12-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | UDP tunnel sockets are always opened unbound to a specific device. This patch allow the socket to be bound on a custom device, which incidentally makes UDP tunnels VRF-aware if binding to an l3mdev. Signed-off-by: Alexis Bauvin <abauvin@scaleway.com> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com> Tested-by: Amine Kherbouche <akherbouche@scaleway.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add 'fw_load_policy' generic parameterShalom Toledo2018-12-031-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many drivers load the device's firmware image during the initialization flow either from the flash or from the disk. Currently this option is not controlled by the user and the driver decides from where to load the firmware image. 'fw_load_policy' gives the ability to control this option which allows the user to choose between different loading policies supported by the driver. This parameter can be useful while testing and/or debugging the device. For example, testing a firmware bug fix. Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: reorder flowi_common fields to avoid holesPaolo Abeni2018-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the flowi* structures are used and memsetted by server functions in critical path. Currently flowi_common has a couple of holes that we can eliminate reordering the struct fields. As a side effect, both flowi4 and flowi6 shrink by 8 bytes. Before: pahole -EC flowi_common struct flowi_common { // ... /* size: 40, cachelines: 1, members: 10 */ /* sum members: 32, holes: 1, sum holes: 4 */ /* padding: 4 */ /* last cacheline: 40 bytes */ }; pahole -EC flowi6 struct flowi6 { // ... /* size: 88, cachelines: 2, members: 6 */ /* padding: 4 */ /* last cacheline: 24 bytes */ }; pahole -EC flowi4 struct flowi4 { // ... /* size: 56, cachelines: 1, members: 4 */ /* padding: 4 */ /* last cacheline: 56 bytes */ }; After: struct flowi_common { // ... /* size: 32, cachelines: 1, members: 10 */ /* last cacheline: 32 bytes */ }; struct flowi6 { // ... /* size: 80, cachelines: 2, members: 6 */ /* padding: 4 */ /* last cacheline: 16 bytes */ }; struct flowi4 { // ... /* size: 48, cachelines: 1, members: 4 */ /* padding: 4 */ /* last cacheline: 48 bytes */ }; Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: md5: add tcp_md5_needed jump labelEric Dumazet2018-11-301-3/+15
| | | | | | | | | | | | | | | | | | Most linux hosts never setup TCP MD5 keys. We can avoid a cache line miss (accessing tp->md5ig_info) on RX and TX using a jump label. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: make tcp_space() aware of socket backlogEric Dumazet2018-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jean-Louis Dupond reported poor iscsi TCP receive performance that we tracked to backlog drops. Apparently we fail to send window updates reflecting the fact that we are under stress. Note that we might lack a proper window increase when backlog is fully processed, since __release_sock() clears sk->sk_backlog.len _after_ all skbs have been processed. This should not matter in practice. If we had a significant load through socket backlog, we are in a dangerous situation. Reported-by: Jean-Louis Dupond <jean-louis@dupond.be> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Tested-by: Jean-Louis Dupond<jean-louis@dupond.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: hint compiler about sack flowsEric Dumazet2018-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Tell the compiler that most TCP flows are using SACK these days. There is no need to add the unlikely() clause in tcp_is_reno(), the compiler is able to infer it. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/flow_dissector: correct comments on enum flow_dissector_key_idEdward Cree2018-11-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | There are no such structs flow_dissector_key_flow_vlan or flow_dissector_key_flow_tags, the actual structs used are struct flow_dissector_key_vlan and struct flow_dissector_key_tags. So correct the comments against FLOW_DISSECTOR_KEY_VLAN, FLOW_DISSECTOR_KEY_FLOW_LABEL and FLOW_DISSECTOR_KEY_CVLAN to refer to those. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-11-282-2/+2
|\| | | | | | | | | | | | | Trivial conflict in net/core/filter.c, a locally computed 'sdif' is now an argument to the function. Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2018-11-282-2/+2
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Disable BH while holding list spinlock in nf_conncount, from Taehee Yoo. 2) List corruption in nf_conncount, also from Taehee. 3) Fix race that results in leaving around an empty list node in nf_conncount, from Taehee Yoo. 4) Proper chain handling for inactive chains from the commit path, from Florian Westphal. This includes a selftest for this. 5) Do duplicate rule handles when replacing rules, also from Florian. 6) Remove net_exit path in xt_RATEEST that results in splat, from Taehee. 7) Possible use-after-free in nft_compat when releasing extensions. From Florian. 8) Memory leak in xt_hashlimit, from Taehee. 9) Call ip_vs_dst_notifier after ipv6_dev_notf, from Xin Long. 10) Fix cttimeout with udplite and gre, from Florian. 11) Preserve oif for IPv6 link-local generated traffic from mangle table, from Alin Nastac. 12) Missing error handling in masquerade notifiers, from Taehee Yoo. 13) Use mutex to protect registration/unregistration of masquerade extensions in order to prevent a race, from Taehee. 14) Incorrect condition check in tree_nodes_free(), also from Taehee. 15) Fix chain counter leak in rule replacement path, from Taehee. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * netfilter: add missing error handling code for register functionsTaehee Yoo2018-11-272-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | register_{netdevice/inetaddr/inet6addr}_notifier may return an error value, this patch adds the code to handle these error paths. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-11-241-0/+12
|\| |
| * | Revert "sctp: remove sctp_transport_pmtu_check"Xin Long2018-11-191-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 22d7be267eaa8114dcc28d66c1c347f667d7878a. The dst's mtu in transport can be updated by a non sctp place like in xfrm where the MTU information didn't get synced between asoc, transport and dst, so it is still needed to do the pmtu check in sctp_packet_config. Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | switchdev: Replace port obj add/del SDO with a notificationPetr Machata2018-11-231-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of this field from all clients, which were migrated to use switchdev notification in the previous patches. Add a new function switchdev_port_obj_notify() that sends the switchdev notifications SWITCHDEV_PORT_OBJ_ADD and _DEL. Update switchdev_port_obj_del_now() to dispatch to this new function. Drop __switchdev_port_obj_add() and update switchdev_port_obj_add() likewise. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | switchdev: Add helpers to aid traversal through lower devicesPetr Machata2018-11-231-0/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the transition from switchdev operations to notifier chain (which will take place in following patches), the onus is on the driver to find its own devices below possible layer of LAG or other uppers. The logic to do so is fairly repetitive: each driver is looking for its own devices among the lowers of the notified device. For those that it finds, it calls a handler. To indicate that the event was handled, struct switchdev_notifier_port_obj_info.handled is set. The differences lie only in what constitutes an "own" device and what handler to call. Therefore abstract this logic into two helpers, switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If a driver only supports physical ports under a bridge device, it will simply avoid this layer of indirection. One area where this helper diverges from the current switchdev behavior is the case of mixed lowers, some of which are switchdev ports and some of which are not. Previously, such scenario would fail with -EOPNOTSUPP. The helper could do that for lowers for which the passed-in predicate doesn't hold. That would however break the case that switchdev ports from several different drivers are stashed under one master, a scenario that switchdev currently happily supports. Therefore tolerate any and all unknown netdevices, whether they are backed by a switchdev driver or not. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DELPetr Machata2018-11-231-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An offloading driver may need to have access to switchdev events on ports that aren't directly under its control. An example is a VXLAN port attached to a bridge offloaded by a driver. The driver needs to know about VLANs configured on the VXLAN device. However the VXLAN device isn't stashed between the bridge and a front-panel-port device (such as is the case e.g. for LAG devices), so the usual switchdev ops don't reach the driver. VXLAN is likely not the only device type like this: in theory any L2 tunnel device that needs offloading will prompt requirement of this sort. This falsifies the assumption that only the lower devices of a front panel port need to be notified to achieve flawless offloading. A way to fix this is to give up the notion of port object addition / deletion as a switchdev operation, which assumes somewhat tight coupling between the message producer and consumer. And instead send the message over a notifier chain. To that end, introduce two new switchdev notifier types, SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types communicate the same event as the corresponding switchdev op, except in a form of a notification. struct switchdev_notifier_port_obj_info was added to carry the fields that the switchdev op carries. An additional field, handled, will be used to communicate back to switchdev that the event has reached an interested party, which will be important for the two-phase commit. The two switchdev operations themselves are kept in place. Following patches first convert individual clients to the notifier protocol, and only then are the operations removed. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | switchdev: Add a blocking notifier chainPetr Machata2018-11-231-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In general one can't assume that a switchdev notifier is called in a non-atomic context, and correspondingly, the switchdev notifier chain is an atomic one. However, port object addition and deletion messages are delivered from a process context. Even the MDB addition messages, whose delivery is scheduled from atomic context, are queued and the delivery itself takes place in blocking context. For VLAN messages in particular, keeping the blocking nature is important for error reporting. Therefore introduce a blocking notifier chain and related service functions to distribute the notifications for which a blocking context can be assumed. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): SanitizePetr Machata2018-11-231-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The two macros SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB() expand to a container_of() call, yielding an appropriate container of their sole argument. However, due to a name collision, the first argument, i.e. the contained object pointer, is not the only one to get expanded. The third argument, which is a structure member name, and should be kept literal, gets expanded as well. The only safe way to use these two macros is therefore to name the local variable passed to them "obj". To fix this, rename the sole argument of the two macros from "obj" (which collides with the member name) to "OBJ". Additionally, instead of passing "OBJ" to container_of() verbatim, parenthesize it, so that a comma in the passed-in expression doesn't pollute the container_of() invocation. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | vxlan: Add hardware FDB learningPetr Machata2018-11-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to allow devices to signal learning events to VXLAN, introduce two new switchdev messages: SWITCHDEV_VXLAN_FDB_ADD_TO_BRIDGE and SWITCHDEV_VXLAN_FDB_DEL_TO_BRIDGE. Listen to these notifications in the vxlan driver. The FDB entries learned this way have an NTF_EXT_LEARNED flag, and only entries marked as such can be unlearned by the _DEL_ event. They are also immediately marked as offloaded. This is the same behavior that the bridge driver observes. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | vxlan: Mark user-added FDB entriesPetr Machata2018-11-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The VXLAN driver needs to differentiate between FDB entries learned by the VXLAN driver, and those added by the user. The latter ones shouldn't be taken over by external learning events. This is in accordance with bridge behavior. Therefore, extend the flags bitfield to 16 bits and add a new private NTF flag to mark the user-added entries. This seems preferable to adding a dedicated boolean, because passing the flag, unlike passing e.g. a true, makes it clear what the meaning of the bit is. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: sched: cls_u32: add res to offload informationJakub Kicinski2018-11-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of egress offloads the class/flowid assigned by the filter may be very important for offloaded Qdisc selection. Provide this info to drivers. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: sched: gred: support reporting stats from offloadsJakub Kicinski2018-11-191-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow drivers which offload GRED to report back statistics. Since A lot of GRED stats is fairly ad hoc in nature pass to drivers the standard struct gnet_stats_basic/gnet_stats_queue pairs, and untangle the values in the core. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: sched: gred: add basic Qdisc offloadJakub Kicinski2018-11-191-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add basic offload for the GRED Qdisc. Inform the drivers any time Qdisc or virtual queue configuration changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | sctp: rename enum sctp_event to sctp_event_typeXin Long2018-11-192-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | sctp_event is a structure name defined in RFC for sockopt SCTP_EVENT. To avoid the conflict, rename it. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | sctp: add subscribe per asocXin Long2018-11-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | The member subscribe should be per asoc, so that sockopt SCTP_EVENT in the next patch can subscribe a event from one asoc only. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | sctp: define subscribe in sctp_sock as __u16Xin Long2018-11-192-16/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The member subscribe in sctp_sock is used to indicate to which of the events it is subscribed, more like a group of flags. So it's better to be defined as __u16 (2 bytpes), instead of struct sctp_event_subscribe (13 bytes). Note that sctp_event_subscribe is an UAPI struct, used on sockopt calls, and thus it will not be removed. This patch only changes the internal storage of the flags. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-11-191-1/+2
|\| |
| * | rxrpc: Fix life checkDavid Howells2018-11-151-1/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The life-checking function, which is used by kAFS to make sure that a call is still live in the event of a pending signal, only samples the received packet serial number counter; it doesn't actually provoke a change in the counter, rather relying on the server to happen to give us a packet in the time window. Fix this by adding a function to force a ping to be transmitted. kAFS then keeps track of whether there's been a stall, and if so, uses the new function to ping the server, resetting the timeout to allow the reply to come back. If there's a stall, a ping and the call is *still* stalled in the same place after another period, then the call will be aborted. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Fixes: f4d15fb6f99a ("rxrpc: Provide functions for allowing cleaner handling of signals") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: align gnet_stats_basic_cpu structEric Dumazet2018-11-171-1/+1
| | | | | | | | | | | | | | | | This structure is small (12 or 16 bytes depending on 64bit or 32bit kernels), but we do not want it spanning two cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: clean up STATE_TRACEYafang Shao2018-11-161-12/+0
| | | | | | | | | | | | | | | | | | | | Currently we can use bpf or tcp tracepoint to conveniently trace the tcp state transition at the run time. So we don't need to do this stuff at the compile time anymore. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: get rid of __tcp_checksum_complete()Cong Wang2018-11-151-6/+1
| | | | | | | | | | | | | | | | | | __tcp_checksum_complete() is 100% same with __skb_checksum_complete() and there is no other caller except tcp_checksum_complete(). So, just use __skb_checksum_complete() there. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>