summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* net/act_pedit: Support using offset relative to the conventional network headersAmir Vadai2017-02-103-16/+208
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend pedit to enable the user setting offset relative to network headers. This change would enable to work with more complex header schemes (vs the simple IPv4 case) where setting a fixed offset relative to the network header is not enough. After this patch, the action has information about the exact header type and field inside this header. This information could be used later on for hardware offloading of pedit. Backward compatibility was being kept: 1. Old kernel <-> new userspace 2. New kernel <-> old userspace 3. add rule using new userspace <-> dump using old userspace 4. add rule using old userspace <-> dump using new userspace When using the extended api, new netlink attributes are being used. This way, operation will fail in (1) and (3) - and no malformed rule be added or dumped. Of course, new user space that doesn't need the new functionality can use the old netlink attributes and operation will succeed. Since action can support both api's, (2) should work, and it is easy to write the new user space to have (4) work. The action is having a strict check that only header types and commands it can handle are accepted. This way future additions will be much easier. Usage example: $ tc filter add dev enp0s9 protocol ip parent ffff: \ flower \ ip_proto tcp \ dst_port 80 \ action pedit munge tcp dport set 8080 pipe \ action mirred egress redirect dev veth0 Will forward tcp port whose original dest port is 80, while modifying the destination port to 8080. Signed-off-by: Amir Vadai <amir@vadai.me> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/skbuff: Introduce skb_mac_offset()Amir Vadai2017-02-101-0/+5
| | | | | | | | Introduce skb_mac_offset() that could be used to get mac header offset. Signed-off-by: Amir Vadai <amir@vadai.me> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'mlxsw-offload-mc-flood'David S. Miller2017-02-105-54/+198
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jiri Pirko says: ==================== mlxsw: Offload MC flood for unregister MC Nogah says: When multicast is enabled, the Linux bridge floods unregistered multicast packets only to ports connected to a multicast router. Devices capable of offloading the Linux bridge need to be made aware of such ports, for proper flooding behavior. On the other hand, when multicast is disabled, such packets should be flooded to all ports. This patchset aims to fix that, by offloading the multicast state and the list of multicast router ports. The first 3 patches adds switchdev attributes to offload this data. The rest of the patchset add implementation for handling this data in the mlxsw driver. The effects this data has on the MDB (namely, when the multicast is disabled the MDB should be considered as invalid, and when it is enabled, a packet that is flooded by it should also be flooded to the multicast routers ports) is subject of future work. Testing of this patchset included: Sending 3 mc packets streams, LL, register and unregistered, and checking that they reached only to the ports that should have received them. The configs were: mc disabled, mc without mc router ports and mc with fixed router port. It was checked for vlan aware bridge, vlan unaware bridge and vlan unaware bridge with another vlan unaware bridge on the same machine ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum: Update mc_disabled flag by switchdev attrNogah Frankel2017-02-101-0/+28
| | | | | | | | | | | | | | | | | | | | Add a function to update mc_disabled from switchdev attr SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum: Extend port_orig_get for bridge devicesNogah Frankel2017-02-101-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | The function mlxsw_sp_port_orig_get returns the vport from the physical port if needed, based on the original device. This patch addresses the case where the original device is a bridge. If it is vlan unaware bridge, it returns the matching vport. If it is vlan aware bridge, there is no matching vport, and it returns the original port. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum: Add an option to flood mc by mc_router_portNogah Frankel2017-02-103-3/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The decision whether to flood a multicast packet to a port dependent on three flags: mc_disabled, mc_router_port, mc_flood. If mc_disabled is on, the port will be flooded according to mc_flood, otherwise, according to mc_router_port. To accomplish that, add those flags into the mlxsw_sp_port struct and update the mc flood table accordingly. Update mc_router_port by switchdev attribute SWITCHDEV_ATTR_ID_PORT_MC_ROUTER_PORT. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum: Separate bc and mc floodsNogah Frankel2017-02-103-13/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | Break the bm (broadcast-multicast) into two tables, one for broadcast (and link local multicast that behaves like bc) and one for unknown multicasts. Add a bool into mlxsw_sp_port named mc_flood that reflect the value this port should have in the mc flood table (currently, always 1); Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum: Change max vfidNogah Frankel2017-02-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | A user that wants many bridges will use 1.Q bridge which are scalable. One can have as many 1.Q bridges as vfids. This patch sets their number to 1k, which is a reasonably large number. This change is done here because the next patches will add a new flood table, and without it, it will increase the overall size of the flood tables dramatically. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum: Make port flood update more genericNogah Frankel2017-02-101-13/+13
| | | | | | | | | | | | | | | | | | | | | | Currently, there is a per port flood update function only for the UC table. Make the function more generic by changing the table type to be an input. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum: Break flood set func to be per tableNogah Frankel2017-02-101-20/+34
| | | | | | | | | | | | | | | | | | | | | | Currently, the flood set function can't operate on only one table, but sets both uc_flood and mb_flood together. This patch creates a function that sets the flood state per table. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * switchdev: bridge: Offload mc router portsNogah Frankel2017-02-102-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | Offload the mc router ports list, whenever it is being changed. It is done because in some cases mc packets needs to be flooded to all the ports in this list. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: mcast: Merge the mc router ports deletions to one functionNogah Frankel2017-02-101-15/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | There are three places where a port gets deleted from the mc router port list. This patch join the actual deletion to one function. It will be helpful for later patch that will offload changes in the mc router ports list. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * switchdev: bridge: Offload multicast disabledNogah Frankel2017-02-102-0/+18
|/ | | | | | | | | | | | | Offload multicast disabled flag, for more accurate mc flood behavior: When it is on, the mdb should be ignored. When it is off, unregistered mc packets should be flooded to mc router ports. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'sched-cls_api-small-cleanup'David S. Miller2017-02-1015-104/+126
|\ | | | | | | | | | | | | | | | | | | | | | | | | Jiri Pirko says: ==================== sched: cls_api: small cleanup This patchset makes couple of things in cls_api code a bit nicer and easier for reader to digest. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * sched: check negative err value to safe one level of indentJiri Pirko2017-02-101-13/+9
| | | | | | | | | | | | | | | | | | | | As it is more common, check err for !0. That allows to safe one level of indentation and makes the code easier to read. Also, make 'next' variable global in function as it is used twice. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sched: add missing curly braces in else branch in tc_ctl_tfilterJiri Pirko2017-02-101-1/+2
| | | | | | | | | | | | | | | | Curly braces need to be there, for stylistic reasons. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sched: move err set right before goto errout in tc_ctl_tfilterJiri Pirko2017-02-101-10/+19
| | | | | | | | | | | | | | | | This makes the reader to know right away what is the error value. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sched: push TC filter protocol creation into a separate functionJiri Pirko2017-02-101-51/+59
| | | | | | | | | | | | | | | | | | Make the long function tc_ctl_tfilter a little bit shorter and easier to read. Also make the creation of filter proto symmetric to destruction. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sched: move tcf_proto_destroy and tcf_destroy_chain helpers into cls_apiJiri Pirko2017-02-1015-26/+34
| | | | | | | | | | | | | | | | Creation is done in this file, move destruction to be at the same place. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sched: rename tcf_destroy to tcf_destroy_protoJiri Pirko2017-02-103-7/+7
|/ | | | | | | | | This function destroys TC filter protocol, not TC filter. So name it accordingly. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'mlxsw-identical-routes-handling'David S. Miller2017-02-103-200/+489
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jiri Pirko says: ==================== mlxsw: Identical routes handling Ido says: The kernel can store several FIB aliases that share the same prefix and length. These aliases can differ in other parameters such as TOS and metric, which are taken into account during lookup. Offloading devices might not have the same flexibility, allowing only a single route with the same prefix and length to be reflected. mlxsw is one such device. This patchset aims to correctly handle this situation in the mlxsw driver. The first four patches introduce small changes in the IPv4 FIB code, so that listeners of the FIB notification chain will be able to correctly handle identical routes. The last three patches build on top of previous work and introduce the necessary changes in the mlxsw driver. The biggest change is the introduction of a FIB node, where identical routes are chained, instead of a primitive reference counting. This is explained in detail in the fifth patch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum_router: Add support for route replaceIdo Schimmel2017-02-101-7/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | Upon the reception of an ENTRY_REPLACE notification, resolve the FIB node corresponding to the prefix and length and insert the new route before the first matching entry. Since the notification also signals the deletion of the replaced route, delete it from the driver's cache. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum_router: Add support for route appendIdo Schimmel2017-02-101-6/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a new route is appended, it's placed after existing routes sharing the same parameters (prefix, length, table ID, TOS and priority). While the device supports only one route with the same prefix and length in a single table, it's important to correctly place the appended route in the driver's cache, as when a route is deleted the next one is programmed into the device. Following the reception of an ENTRY_APPEND notification, resolve the FIB node corresponding to the prefix and length and correctly place the new entry in its entry list. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: spectrum_router: Correctly handle identical routesIdo Schimmel2017-02-101-178/+403
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the device, routes are indexed in a routing table based on the prefix and its length. This is in contrast to the kernel's FIB where several FIB aliases can exist with these parameters being identical. In such cases, the routes will be sorted by table ID (LOCAL first, then MAIN), TOS and finally priority (metric). During lookup, these routes will be evaluated in order. In case the packet's TOS field is non-zero and a FIB alias with a matching TOS is found, then it's selected. Otherwise, the lookup defaults to the route with TOS 0 (if it exists). However, if the requested scope is narrower than the one found, then the lookup continues. To best reflect the kernel's datapath we should take the above into account. Given a prefix and its length, the reflected route will always be the first one in the FIB alias list. However, if the route has a non-zero TOS then its action will be converted to trap instead of forward, since we currently don't support TOS-based routing. If this turns out to be a real issue, we can add support for that using policy-based switching. The route's scope can be effectively ignored as any packet being routed by the device would've been looked-up using the widest scope (UNIVERSE). To achieve that we need to do two changes. Firstly, we need to create another struct (FIB node) that will hold the list of FIB entries sharing the same prefix and length. This struct will be hashed using these two parameters. Secondly, we need to change the route reflection to match the above logic, so that the first FIB entry in the list will be programmed into the device while the rest will remain in the driver's cache in case of subsequent changes. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: fib: Add events for FIB replace and appendIdo Schimmel2017-02-102-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The FIB notification chain currently uses the NLM_F_{REPLACE,APPEND} flags to signal routes being replaced or appended. Instead of using netlink flags for in-kernel notifications we can simply introduce two new events in the FIB notification chain. This has the added advantage of making the API cleaner, thereby making it clear that these events should be supported by listeners of the notification chain. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: fib: Send notification before deleting FIB aliasIdo Schimmel2017-02-101-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a FIB alias is replaced following NLM_F_REPLACE, the ENTRY_ADD notification is sent after the reference on the previous FIB info was dropped. This is problematic as potential listeners might need to access it in their notification blocks. Solve this by sending the notification prior to the deletion of the replaced FIB alias. This is consistent with ENTRY_DEL notifications. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: fib: Send deletion notification with actual FIB alias typeIdo Schimmel2017-02-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a FIB alias is removed, a notification is sent using the type passed from user space - can be RTN_UNSPEC - instead of the actual type of the removed alias. This is problematic for listeners of the FIB notification chain, as several FIB aliases can exist with matching parameters, but the type. Solve this by passing the actual type of the removed FIB alias. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: fib: Only flush FIB aliases belonging to currently flushed tableIdo Schimmel2017-02-101-1/+2
|/ | | | | | | | | | | | | | | | | | | | In case the MAIN table is flushed and its trie is shared with the LOCAL table, then we might be flushing FIB aliases belonging to the latter. This can lead to FIB_ENTRY_DEL notifications sent with the wrong table ID. The above doesn't affect current listeners, as the table ID is ignored during entry deletion, but this will change later in the patchset. When flushing a particular table, skip any aliases belonging to a different one. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Patrick McHardy <kaber@trash.net> Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'openvswitch-Conntrack-integration-improvements'David S. Miller2017-02-098-113/+420
|\
| * openvswitch: Pack struct sw_flow_key.Jarno Rajahalme2017-02-094-34/+39
| | | | | | | | | | | | | | | | | | | | | | | | struct sw_flow_key has two 16-bit holes. Move the most matched conntrack match fields there. In some typical cases this reduces the size of the key that needs to be hashed into half and into one cache line. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Add force commit.Jarno Rajahalme2017-02-092-2/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stateful network admission policy may allow connections to one direction and reject connections initiated in the other direction. After policy change it is possible that for a new connection an overlapping conntrack entry already exists, where the original direction of the existing connection is opposed to the new connection's initial packet. Most importantly, conntrack state relating to the current packet gets the "reply" designation based on whether the original direction tuple or the reply direction tuple matched. If this "directionality" is wrong w.r.t. to the stateful network admission policy it may happen that packets in neither direction are correctly admitted. This patch adds a new "force commit" option to the OVS conntrack action that checks the original direction of an existing conntrack entry. If that direction is opposed to the current packet, the existing conntrack entry is deleted and a new one is subsequently created in the correct direction. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Add original direction conntrack tuple to sw_flow_key.Jarno Rajahalme2017-02-098-47/+246
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the fields of the conntrack original direction 5-tuple to struct sw_flow_key. The new fields are initially marked as non-existent, and are populated whenever a conntrack action is executed and either finds or generates a conntrack entry. This means that these fields exist for all packets that were not rejected by conntrack as untrackable. The original tuple fields in the sw_flow_key are filled from the original direction tuple of the conntrack entry relating to the current packet, or from the original direction tuple of the master conntrack entry, if the current conntrack entry has a master. Generally, expected connections of connections having an assigned helper (e.g., FTP), have a master conntrack entry. The main purpose of the new conntrack original tuple fields is to allow matching on them for policy decision purposes, with the premise that the admissibility of tracked connections reply packets (as well as original direction packets), and both direction packets of any related connections may be based on ACL rules applying to the master connection's original direction 5-tuple. This also makes it easier to make policy decisions when the actual packet headers might have been transformed by NAT, as the original direction 5-tuple represents the packet headers before any such transformation. When using the original direction 5-tuple the admissibility of return and/or related packets need not be based on the mere existence of a conntrack entry, allowing separation of admission policy from the established conntrack state. While existence of a conntrack entry is required for admission of the return or related packets, policy changes can render connections that were initially admitted to be rejected or dropped afterwards. If the admission of the return and related packets was based on mere conntrack state (e.g., connection being in an established state), a policy change that would make the connection rejected or dropped would need to find and delete all conntrack entries affected by such a change. When using the original direction 5-tuple matching the affected conntrack entries can be allowed to time out instead, as the established state of the connection would not need to be the basis for packet admission any more. It should be noted that the directionality of related connections may be the same or different than that of the master connection, and neither the original direction 5-tuple nor the conntrack state bits carry this information. If needed, the directionality of the master connection can be stored in master's conntrack mark or labels, which are automatically inherited by the expected related connections. The fact that neither ARP nor ND packets are trackable by conntrack allows mutual exclusion between ARP/ND and the new conntrack original tuple fields. Hence, the IP addresses are overlaid in union with ARP and ND fields. This allows the sw_flow_key to not grow much due to this patch, but it also means that we must be careful to never use the new key fields with ARP or ND packets. ARP is easy to distinguish and keep mutually exclusive based on the ethernet type, but ND being an ICMPv6 protocol requires a bit more attention. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Inherit master's labels.Jarno Rajahalme2017-02-091-14/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We avoid calling into nf_conntrack_in() for expected connections, as that would remove the expectation that we want to stick around until we are ready to commit the connection. Instead, we do a lookup in the expectation table directly. However, after a successful expectation lookup we have set the flow key label field from the master connection, whereas nf_conntrack_in() does not do this. This leads to master's labels being inherited after an expectation lookup, but those labels not being inherited after the corresponding conntrack action with a commit flag. This patch resolves the problem by changing the commit code path to also inherit the master's labels to the expected connection. Resolving this conflict in favor of inheriting the labels allows more information be passed from the master connection to related connections, which would otherwise be much harder if the 32 bits in the connmark are not enough. Labels can still be set explicitly, so this change only affects the default values of the labels in presense of a master connection. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Refactor labels initialization.Jarno Rajahalme2017-02-091-42/+62
| | | | | | | | | | | | | | | | | | | | Refactoring conntrack labels initialization makes changes in later patches easier to review. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Simplify labels length logic.Jarno Rajahalme2017-02-091-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 23014011ba42 ("netfilter: conntrack: support a fixed size of 128 distinct labels"), the size of conntrack labels extension has fixed to 128 bits, so we do not need to check for labels sizes shorter than 128 at run-time. This patch simplifies labels length logic accordingly, but allows the conntrack labels size to be increased in the future without breaking the build. In the event of conntrack labels increasing in size OVS would still be able to deal with the 128 first label bits. Suggested-by: Joe Stringer <joe@ovn.org> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Unionize ovs_key_ct_label with a u32 array.Jarno Rajahalme2017-02-092-9/+14
| | | | | | | | | | | | | | | | | | | | | | | | Make the array of labels in struct ovs_key_ct_label an union, adding a u32 array of the same byte size as the existing u8 array. It is faster to loop through the labels 32 bits at the time, which is also the alignment of netlink attributes. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Do not trigger events for unconfirmed connections.Jarno Rajahalme2017-02-091-6/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Receiving change events before the 'new' event for the connection has been received can be confusing. Avoid triggering change events for setting conntrack mark or labels before the conntrack entry has been confirmed. Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark") Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.Jarno Rajahalme2017-02-091-2/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The conntrack lookup for existing connections fails to invert the packet 5-tuple for NATted packets, and therefore fails to find the existing conntrack entry. Conntrack only stores 5-tuples for incoming packets, and there are various situations where a lookup on a packet that has already been transformed by NAT needs to be made. Looking up an existing conntrack entry upon executing packet received from the userspace is one of them. This patch fixes ovs_ct_find_existing() to invert the packet 5-tuple for the conntrack lookup whenever the packet has already been transformed by conntrack from its input form as evidenced by one of the NAT flags being set in the conntrack state metadata. Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Fix comments for skb->_nfctJarno Rajahalme2017-02-091-7/+7
|/ | | | | | | | | | Fix comments referring to skb 'nfct' and 'nfctinfo' fields now that they are combined into '_nfct'. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'ena-bug-fixes'David S. Miller2017-02-096-83/+172
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Netanel Belgazal says: ==================== Bug Fixes in ENA driver Changes from V3: * Rebase patchset to master and solve merge conflicts. * Remove redundant bug fix (fix error handling when probe fails) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: update driver version to 1.1.2Netanel Belgazal2017-02-091-1/+1
| | | | | | | | | | Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: change condition for host attribute configurationNetanel Belgazal2017-02-092-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | Move the host info config to be the first admin command that is executed. This change require the driver to remove the 'feature check' from host info configuration flow. The check is removed since the supported features bitmask field is retrieved only after calling ENA_ADMIN_DEVICE_ATTRIBUTES admin command. If set host info is not supported an error will be returned by the device. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: change driver's default timeoutsNetanel Belgazal2017-02-092-5/+5
| | | | | | | | | | | | | | The timeouts were too agressive and sometimes cause false alarms. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: reduce the severity of ena printoutsNetanel Belgazal2017-02-092-13/+28
| | | | | | | | | | Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: use READ_ONCE to access completion descriptorsNetanel Belgazal2017-02-092-4/+5
| | | | | | | | | | | | | | | | Completion descriptors are accessed from the driver and from the device. To avoid reading the old value, use READ_ONCE macro. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: use napi_complete_done() return valueNetanel Belgazal2017-02-091-15/+29
| | | | | | | | | | | | | | Do not unamsk interrupts if we are in busy poll mode. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: fix potential access to freed memory during device resetNetanel Belgazal2017-02-091-13/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the ena driver detects that the device is not behave as expected, it tries to reset the device. The reset flow calls ena_down, which will frees all the resources the driver allocates and then it will reset the device. This flow can cause memory corruption if the device is still writes to the driver's memory space. To overcome this potential race, move the reset before the device resources are freed. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: refactor ena_get_stats64 to be atomic context safeNetanel Belgazal2017-02-093-15/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ndo_get_stat64() can be called from atomic context, but the current implementation sends an admin command to retrieve the statistics from the device. This admin command can sleep. This patch re-factors the implementation of ena_get_stats64() to use the {rx,tx}bytes/count from the driver's inner counters, and to obtain the rx drop counter from the asynchronous keep alive (heart bit) event. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: fix NULL dereference when removing the driver after device reset failedNetanel Belgazal2017-02-091-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If for some reason the device stops responding, and the device reset failes to recover the device, the mmio register read data structure will not be reinitialized. On driver removal, the driver will also try to reset the device, but this time the mmio data structure will be NULL. To solve this issue, perform the device reset in the remove function only if the device is runnig. Crash log 54.240382] BUG: unable to handle kernel NULL pointer dereference at (null) [ 54.244186] IP: [<ffffffffc067de5a>] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv] [ 54.244186] PGD 0 [ 54.244186] Oops: 0002 [#1] SMP [ 54.244186] Modules linked in: ena_drv(OE-) snd_hda_codec_generic kvm_intel kvm crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_intel aes_x86_64 snd_hda_controller lrw gf128mul cirrus glue_helper ablk_helper ttm snd_hda_codec drm_kms_helper cryptd snd_hwdep drm snd_pcm pvpanic snd_timer syscopyarea sysfillrect snd parport_pc sysimgblt serio_raw soundcore i2c_piix4 mac_hid lp parport psmouse floppy [ 54.244186] CPU: 5 PID: 1841 Comm: rmmod Tainted: G OE 3.16.0-031600-generic #201408031935 [ 54.244186] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 54.244186] task: ffff880135852880 ti: ffff8800bb640000 task.ti: ffff8800bb640000 [ 54.244186] RIP: 0010:[<ffffffffc067de5a>] [<ffffffffc067de5a>] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv] [ 54.244186] RSP: 0018:ffff8800bb643d50 EFLAGS: 00010083 [ 54.244186] RAX: 000000000000deb0 RBX: 0000000000030d40 RCX: 0000000000000003 [ 54.244186] RDX: 0000000000000202 RSI: 0000000000000058 RDI: ffffc90000775104 [ 54.244186] RBP: ffff8800bb643d88 R08: 0000000000000000 R09: cf00000000000000 [ 54.244186] R10: 0000000fffffffe0 R11: 0000000000000001 R12: 0000000000000000 [ 54.244186] R13: ffffc90000765000 R14: ffffc90000775104 R15: 00007fca1fa98090 [ 54.244186] FS: 00007fca1f1bd740(0000) GS:ffff88013fd40000(0000) knlGS:0000000000000000 [ 54.244186] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 54.244186] CR2: 0000000000000000 CR3: 00000000b9cf6000 CR4: 00000000001406e0 [ 54.244186] Stack: [ 54.244186] 0000000000000202 0000005800000286 ffffc90000765000 ffffc90000765000 [ 54.244186] ffff880135f6b000 ffff8800b9360000 00007fca1fa98090 ffff8800bb643db8 [ 54.244186] ffffffffc0680b3d ffff8800b93608c0 ffffc90000765000 ffff880135f6b000 [ 54.244186] Call Trace: [ 54.244186] [<ffffffffc0680b3d>] ena_com_dev_reset+0x1d/0x1b0 [ena_drv] [ 54.244186] [<ffffffffc0678497>] ena_remove+0xa7/0x130 [ena_drv] [ 54.244186] [<ffffffff813d4df6>] pci_device_remove+0x46/0xc0 [ 54.244186] [<ffffffff814c3b7f>] __device_release_driver+0x7f/0xf0 [ 54.244186] [<ffffffff814c4738>] driver_detach+0xc8/0xd0 [ 54.244186] [<ffffffff814c3969>] bus_remove_driver+0x59/0xd0 [ 54.244186] [<ffffffff814c4fde>] driver_unregister+0x2e/0x60 [ 54.244186] [<ffffffff810f0a80>] ? show_refcnt+0x40/0x40 [ 54.244186] [<ffffffff813d4ec3>] pci_unregister_driver+0x23/0xa0 [ 54.244186] [<ffffffffc068413f>] ena_cleanup+0x10/0xed1 [ena_drv] [ 54.244186] [<ffffffff810f3a47>] SyS_delete_module+0x157/0x1e0 [ 54.244186] [<ffffffff81014fb7>] ? do_notify_resume+0xc7/0xd0 [ 54.244186] [<ffffffff81793fad>] system_call_fastpath+0x1a/0x1f [ 54.244186] Code: c3 4d 8d b5 04 01 01 00 4c 89 f7 e8 e1 5a 11 c1 48 89 45 c8 41 0f b7 85 00 01 01 00 8d 48 01 66 2d 52 21 66 41 89 8d 00 01 01 00 <66> 41 89 04 24 0f b7 45 d4 89 45 d0 89 c1 41 0f b7 85 00 01 01 [ 54.244186] RIP [<ffffffffc067de5a>] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv] [ 54.244186] RSP <ffff8800bb643d50> [ 54.244186] CR2: 0000000000000000 [ 54.244186] ---[ end trace 18dd9889b6497810 ]--- Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ena: fix RSS default hash configurationNetanel Belgazal2017-02-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | ENA default hash configures IPv4_frag hash twice instead of configure non-IP packets. The bug caused IPv4 fragmented packets to be calculated based on L2 source and destination address instead of L3 source and destination. IPv4 packets can reach to the wrong Rx queue. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>