summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2015-11-1731-318/+437
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix list tests in netfilter ingress support, from Florian Westphal. 2) Fix reversal of input and output interfaces in ingress hook invocation, from Pablo Neira Ayuso. 3) We have a use after free in r8169, caught by Dave Jones, fixed by Francois Romieu. 4) Splice use-after-free fix in AF_UNIX frmo Hannes Frederic Sowa. 5) Three ipv6 route handling bug fixes from Martin KaFai Lau: a) Don't create clone routes not managed by the fib6 tree b) Don't forget to check expiration of DST_NOCACHE routes. c) Handle rt->dst.from == NULL properly. 6) Several AF_PACKET fixes wrt transport header setting and SKB protocol setting, from Daniel Borkmann. 7) Fix thunder driver crash on shutdown, from Pavel Fedin. 8) Several Mellanox driver fixes (max MTU calculations, use of correct DMA unmap in TX path, etc.) from Saeed Mahameed, Tariq Toukan, Doron Tsur, Achiad Shochat, Eran Ben Elisha, and Noa Osherovich. 9) Several mv88e6060 DSA driver fixes (wrong bit definitions for certain registers, etc.) from Neil Armstrong. 10) Make sure to disable preemption while updating per-cpu stats of ip tunnels, from Jason A. Donenfeld. 11) Various ARM64 bpf JIT fixes, from Yang Shi. 12) Flush icache properly in ARM JITs, from Daniel Borkmann. 13) Fix masking of RX and TX interrupts in ravb driver, from Masaru Nagai. 14) Fix netdev feature propagation for devices not implementing ->ndo_set_features(). From Nikolay Aleksandrov. 15) Big endian fix in vmxnet3 driver, from Shrikrishna Khare. 16) RAW socket code increments incorrect SNMP counters, fix from Ben Cartwright-Cox. 17) IPv6 multicast SNMP counters are bumped twice, fix from Neil Horman. 18) Fix handling of VLAN headers on stacked devices when REORDER is disabled. From Vlad Yasevich. 19) Fix SKB leaks and use-after-free in ipvlan and macvlan drivers, from Sabrina Dubroca. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits) MAINTAINERS: Update Mellanox's Eth NIC driver entries net/core: revert "net: fix __netdev_update_features return.." and add comment af_unix: take receive queue lock while appending new skb rtnetlink: fix frame size warning in rtnl_fill_ifinfo net: use skb_clone to avoid alloc_pages failure. packet: Use PAGE_ALIGNED macro packet: Don't check frames_per_block against negative values net: phy: Use interrupts when available in NOLINK state phy: marvell: Add support for 88E1540 PHY arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS macvlan: fix leak in macvlan_handle_frame ipvlan: fix use after free of skb ipvlan: fix leak in ipvlan_rcv_frame vlan: Do not put vlan headers back on bridge and macvlan ports vlan: Fix untag operations of stacked vlans with REORDER_HEADER off via-velocity: unconditionally drop frames with bad l2 length ipg: Remove ipg driver dl2k: Add support for IP1000A-based cards snmp: Remove duplicate OUTMCAST stat increment net: thunder: Check for driver data in nicvf_remove() ...
| * net/core: revert "net: fix __netdev_update_features return.." and add commentNikolay Aleksandrov2015-11-171-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 00ee59271777 ("net: fix __netdev_update_features return on ndo_set_features failure") and adds a comment explaining why it's okay to return a value other than 0 upon error. Some drivers might actually change flags and return an error so it's better to fire a spurious notification rather than miss these. CC: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * af_unix: take receive queue lock while appending new skbHannes Frederic Sowa2015-11-171-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While possibly in future we don't necessarily need to use sk_buff_head.lock this is a rather larger change, as it affects the af_unix fd garbage collector, diag and socket cleanups. This is too much for a stable patch. For the time being grab sk_buff_head.lock without disabling bh and irqs, so don't use locked skb_queue_tail. Fixes: 869e7c62486e ("net: af_unix: implement stream sendpage support") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Reported-by: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rtnetlink: fix frame size warning in rtnl_fill_ifinfoHannes Frederic Sowa2015-11-171-122/+152
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following warning: CC net/core/rtnetlink.o net/core/rtnetlink.c: In function ‘rtnl_fill_ifinfo’: net/core/rtnetlink.c:1308:1: warning: the frame size of 2864 bytes is larger than 2048 bytes [-Wframe-larger-than=] } ^ by splitting up the huge rtnl_fill_ifinfo into some smaller ones, so we don't have the huge frame allocations at the same time. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: use skb_clone to avoid alloc_pages failure.Martin Zhang2015-11-171-1/+1
| | | | | | | | | | | | | | | | | | | | 1. new skb only need dst and ip address(v4 or v6). 2. skb_copy may need high order pages, which is very rare on long running server. Signed-off-by: Junwei Zhang <linggao.zjw@alibaba-inc.com> Signed-off-by: Martin Zhang <martinbj2008@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: Use PAGE_ALIGNED macroTobias Klauser2015-11-171-1/+1
| | | | | | | | | | | | | | Use PAGE_ALIGNED(...) instead of open-coding it. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: Don't check frames_per_block against negative valuesTobias Klauser2015-11-171-2/+2
| | | | | | | | | | | | | | | | | | rb->frames_per_block is an unsigned int, thus can never be negative. Also fix spacing in the calculation of frames_per_block. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * vlan: Do not put vlan headers back on bridge and macvlan portsVlad Yasevich2015-11-171-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a vlan is configured with REORDER_HEADER set to 0, the vlan header is put back into the packet and makes it appear that the vlan header is still there even after it's been processed. This posses a problem for bridge and macvlan ports. The packets passed to those device may be forwarded and at the time of the forward, vlan headers end up being unexpectedly present. With the patch, we make sure that we do not put the vlan header back (when REORDER_HEADER is 0) if a bridge or macvlan has been configured on top of the vlan device. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * vlan: Fix untag operations of stacked vlans with REORDER_HEADER offVlad Yasevich2015-11-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we have multiple stacked vlan devices all of which have turned off REORDER_HEADER flag, the untag operation does not locate the ethernet addresses correctly for nested vlans. The reason is that in case of REORDER_HEADER flag being off, the outer vlan headers are put back and the mac_len is adjusted to account for the presense of the header. Then, the subsequent untag operation, for the next level vlan, always use VLAN_ETH_HLEN to locate the begining of the ethernet header and that ends up being a multiple of 4 bytes short of the actuall beginning of the mac header (the multiple depending on the how many vlan encapsulations ethere are). As a reslult, if there are multiple levles of vlan devices with REODER_HEADER being off, the recevied packets end up being dropped. To solve this, we use skb->mac_len as the offset. The value is always set on receive path and starts out as a ETH_HLEN. The value is also updated when the vlan header manupations occur so we know it will be correct. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * snmp: Remove duplicate OUTMCAST stat incrementNeil Horman2015-11-161-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | the OUTMCAST stat is double incremented, getting bumped once in the mcast code itself, and again in the common ip output path. Remove the mcast bump, as its not needed Validated by the reporter, with good results Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Claus Jensen <claus.jensen@microsemi.com> CC: Claus Jensen <claus.jensen@microsemi.com> CC: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/core: use netdev name in warning if no parentBjørn Mork2015-11-161-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent flaw in the netdev feature setting resulted in warnings like this one from VLAN interfaces: WARNING: CPU: 1 PID: 4975 at net/core/dev.c:2419 skb_warn_bad_offload+0xbc/0xcb() : caps=(0x00000000001b5820, 0x00000000001b5829) len=2782 data_len=0 gso_size=1348 gso_type=16 ip_summed=3 The ":" is supposed to be preceded by a driver name, but in this case it is an empty string since the device has no parent. There are many types of network devices without a parent. The anonymous warnings for these devices can be hard to debug. Log the network device name instead in these cases to assist further debugging. This is mostly similar to how __netdev_printk() handles orphan devices. Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
| * af_unix: don't append consumed skbs to sk_receive_queueHannes Frederic Sowa2015-11-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case multiple writes to a unix stream socket race we could end up in a situation where we pre-allocate a new skb for use in unix_stream_sendpage but have to free it again in the locked section because another skb has been appended meanwhile, which we must use. Accidentally we didn't clear the pointer after consuming it and so we touched freed memory while appending it to the sk_receive_queue. So, clear the pointer after consuming the skb. This bug has been found with syzkaller (http://github.com/google/syzkaller) by Dmitry Vyukov. Fixes: 869e7c62486e ("net: af_unix: implement stream sendpage support") Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * raw: increment correct SNMP counters for ICMP messagesBen Cartwright-Cox2015-11-161-3/+5
| | | | | | | | | | | | | | | | | | | | | | Sending ICMP packets with raw sockets ends up in the SNMP counters logging the type as the first byte of the IPv4 header rather than the ICMP header. This is fixed by adding the IP Header Length to the casting into a icmphdr struct. Signed-off-by: Ben Cartwright-Cox <ben@benjojo.co.uk> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix __netdev_update_features return on ndo_set_features failureNikolay Aleksandrov2015-11-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | If ndo_set_features fails __netdev_update_features() will return -1 but this is wrong because it is expected to return 0 if no features were changed (see netdev_update_features()), which will cause a netdev notifier to be called without any actual changes. Fix this by returning 0 if ndo_set_features fails. Fixes: 6cb6a27c45ce ("net: Call netdev_features_change() from netdev_update_features()") CC: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix feature changes on devices without ndo_set_featuresNikolay Aleksandrov2015-11-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When __netdev_update_features() was updated to ensure some features are disabled on new lower devices, an error was introduced for devices which don't have the ndo_set_features() method set. Before we'll just set the new features, but now we return an error and don't set them. Fix this by returning the old behaviour and setting err to 0 when ndo_set_features is not present. Fixes: e7868a85e1b2 ("net/core: ensure features get disabled on new lower devs") CC: Jarod Wilson <jarod@redhat.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Ido Schimmel <idosch@mellanox.com> CC: Sander Eikelenboom <linux@eikelenboom.it> CC: Andy Gospodarek <gospo@cumulusnetworks.com> CC: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com> Reviewed-by: Jarod Wilson <jarod@redhat.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Dave Young <dyoung@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * switchdev: bridge: Check return code is not EOPNOTSUPPIdo Schimmel2015-11-162-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When NET_SWITCHDEV=n, switchdev_port_attr_set simply returns EOPNOTSUPP. In this case we should not emit errors and warnings to the kernel log. Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Fixes: 0bc05d585d38 ("switchdev: allow caller to explicitly request attr_set as deferred") Fixes: 6ac311ae8bfb ("Adding switchdev ageing notification on port bridged") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipvs: use skb_to_full_sk() helperEric Dumazet2015-11-151-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | SYNACK packets might be attached to request sockets. Use skb_to_full_sk() helper to avoid illegal accesses to inet_sk(skb->sk) Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: ensure proper barriers in lockless contextsEric Dumazet2015-11-155-23/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Some functions access TCP sockets without holding a lock and might output non consistent data, depending on compiler and or architecture. tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ... Introduce sk_state_load() and sk_state_store() to fix the issues, and more clearly document where this lack of locking is happening. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: translate host order to network order when setting a hmacidlucien2015-11-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | now sctp auth cannot work well when setting a hmacid manually, which is caused by that we didn't use the network order for hmacid, so fix it by adding the transformation in sctp_auth_ep_set_hmacs. even we set hmacid with the network order in userspace, it still can't work, because of this condition in sctp_auth_ep_set_hmacs(): if (id > SCTP_AUTH_HMAC_ID_MAX) return -EOPNOTSUPP; so this wasn't working before and thus it won't break compatibility. Fixes: 65b07e5d0d09 ("[SCTP]: API updates to suport SCTP-AUTH extensions.") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: fix tpacket_snd max frame lenDaniel Borkmann2015-11-151-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since it's introduction in commit 69e3c75f4d54 ("net: TX_RING and packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW side. When used with SOCK_DGRAM only, the size_max > dev->mtu + reserve check should have reserve as 0, but currently, this is unconditionally set (in it's original form as dev->hard_header_len). I think this is not correct since tpacket_fill_skb() would then take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM, the extra VLAN_HLEN could be possible in both cases. Presumably, the reserve code was copied from packet_snd(), but later on missed the check. Make it similar as we have it in packet_snd(). Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: infer protocol from ethernet header if unsetDaniel Borkmann2015-11-151-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case no struct sockaddr_ll has been passed to packet socket's sendmsg() when doing a TX_RING flush run, then skb->protocol is set to po->num instead, which is the protocol passed via socket(2)/bind(2). Applications only xmitting can go the path of allocating the socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the TX_RING with sll_protocol of 0. That way, register_prot_hook() is neither called on creation nor on bind time, which saves cycles when there's no interest in capturing anyway. That leaves us however with po->num 0 instead and therefore the TX_RING flush run sets skb->protocol to 0 as well. Eric reported that this leads to problems when using tools like trafgen over bonding device. I.e. the bonding's hash function could invoke the kernel's flow dissector, which depends on skb->protocol being properly set. In the current situation, all the traffic is then directed to a single slave. Fix it up by inferring skb->protocol from the Ethernet header when not set and we have ARPHRD_ETHER device type. This is only done in case of SOCK_RAW and where we have a dev->hard_header_len length. In case of ARPHRD_ETHER devices, this is guaranteed to cover ETH_HLEN, and therefore being accessed on the skb after the skb_store_bits(). Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: only allow extra vlan len on ethernet devicesDaniel Borkmann2015-11-151-35/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Packet sockets can be used by various net devices and are not really restricted to ARPHRD_ETHER device types. However, when currently checking for the extra 4 bytes that can be transmitted in VLAN case, our assumption is that we generally probe on ARPHRD_ETHER devices. Therefore, before looking into Ethernet header, check the device type first. This also fixes the issue where non-ARPHRD_ETHER devices could have no dev->hard_header_len in TX_RING SOCK_RAW case, and thus the check would test unfilled linear part of the skb (instead of non-linear). Fixes: 57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.") Fixes: 52f1454f629f ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: always probe for transport headerDaniel Borkmann2015-11-151-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We concluded that the skb_probe_transport_header() should better be called unconditionally. Avoiding the call into the flow dissector has also not really much to do with the direct xmit mode. While it seems that only virtio_net code makes use of GSO from non RX/TX ring packet socket paths, we should probe for a transport header nevertheless before they hit devices. Reference: http://thread.gmane.org/gmane.linux.network/386173/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: do skb_probe_transport_header when we actually have dataDaniel Borkmann2015-11-151-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In tpacket_fill_skb() commit c1aad275b029 ("packet: set transport header before doing xmit") and later on 40893fd0fd4e ("net: switch to use skb_probe_transport_header()") was probing for a transport header on the skb from a ring buffer slot, but at a time, where the skb has _not even_ been filled with data yet. So that call into the flow dissector is pretty useless. Lets do it after we've set up the skb frags. Fixes: c1aad275b029 ("packet: set transport header before doing xmit") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Check rt->dst.from for the DST_NOCACHE routeMartin KaFai Lau2015-11-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All DST_NOCACHE rt6_info used to have rt->dst.from set to its parent. After commit 8e3d5be73681 ("ipv6: Avoid double dst_free"), DST_NOCACHE is also set to rt6_info which does not have a parent (i.e. rt->dst.from is NULL). This patch catches the rt->dst.from == NULL case. Fixes: 8e3d5be73681 ("ipv6: Avoid double dst_free") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Check expire on DST_NOCACHE routeMartin KaFai Lau2015-11-151-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the expires of the DST_NOCACHE rt can be set during the ip6_rt_update_pmtu(), we also need to consider the expires value when doing ip6_dst_check(). This patches creates __rt6_check_expired() to only check the expire value (if one exists) of the current rt. In rt6_dst_from_check(), it adds __rt6_check_expired() as one of the condition check. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Avoid creating RTF_CACHE from a rt that is not managed by fib6 treeMartin KaFai Lau2015-11-151-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1272571 The setup has a IPv4 GRE tunnel running in a IPSec. The bug happens when ndisc starts sending router solicitation at the gre interface. The simplified oops stack is like: __lock_acquire+0x1b2/0x1c30 lock_acquire+0xb9/0x140 _raw_write_lock_bh+0x3f/0x50 __ip6_ins_rt+0x2e/0x60 ip6_ins_rt+0x49/0x50 ~~~~~~~~ __ip6_rt_update_pmtu.part.54+0x145/0x250 ip6_rt_update_pmtu+0x2e/0x40 ~~~~~~~~ ip_tunnel_xmit+0x1f1/0xf40 __gre_xmit+0x7a/0x90 ipgre_xmit+0x15a/0x220 dev_hard_start_xmit+0x2bd/0x480 __dev_queue_xmit+0x696/0x730 dev_queue_xmit+0x10/0x20 neigh_direct_output+0x11/0x20 ip6_finish_output2+0x21f/0x770 ip6_finish_output+0xa7/0x1d0 ip6_output+0x56/0x190 ~~~~~~~~ ndisc_send_skb+0x1d9/0x400 ndisc_send_rs+0x88/0xc0 ~~~~~~~~ The rt passed to ip6_rt_update_pmtu() is created by icmp6_dst_alloc() and it is not managed by the fib6 tree, so its rt6i_table == NULL. When __ip6_rt_update_pmtu() creates a RTF_CACHE clone, the newly created clone also has rt6i_table == NULL and it causes the ip6_ins_rt() oops. During pmtu update, we only want to create a RTF_CACHE clone from a rt which is currently managed (or owned) by the fib6 tree. It means either rt->rt6i_node != NULL or rt is a RTF_PCPU clone. It is worth to note that rt6i_table may not be NULL even it is not (yet) managed by the fib6 tree (e.g. addrconf_dst_alloc()). Hence, rt6i_node is a better check instead of rt6i_table. Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reported-by: Chris Siebenmann <cks-rhbugzilla@cs.toronto.edu> Cc: Chris Siebenmann <cks-rhbugzilla@cs.toronto.edu> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * af-unix: fix use-after-free with concurrent readers while splicingHannes Frederic Sowa2015-11-151-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During splicing an af-unix socket to a pipe we have to drop all af-unix socket locks. While doing so we allow another reader to enter unix_stream_read_generic which can read, copy and finally free another skb. If exactly this skb is just in process of being spliced we get a use-after-free report by kasan. First, we must make sure to not have a free while the skb is used during the splice operation. We simply increment its use counter before unlocking the reader lock. Stream sockets have the nice characteristic that we don't care about zero length writes and they never reach the peer socket's queue. That said, we can take the UNIXCB.consumed field as the indicator if the skb was already freed from the socket's receive queue. If the skb was fully consumed after we locked the reader side again we know it has been dropped by a second reader. We indicate a short read to user space and abort the current splice operation. This bug has been found with syzkaller (http://github.com/google/syzkaller) by Dmitry Vyukov. Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets") Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2015-11-1212-99/+123
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree. This large batch that includes fixes for ipset, netfilter ingress, nf_tables dynamic set instantiation and a longstanding Kconfig dependency problem. More specifically, they are: 1) Add missing check for empty hook list at the ingress hook, from Florian Westphal. 2) Input and output interface are swapped at the ingress hook, reported by Patrick McHardy. 3) Resolve ipset extension alignment issues on ARM, patch from Jozsef Kadlecsik. 4) Fix bit check on bitmap in ipset hash type, also from Jozsef. 5) Release buckets when all entries have expired in ipset hash type, again from Jozsef. 6) Oneliner to initialize conntrack tuple object in the PPTP helper, otherwise the conntrack lookup may fail due to random bits in the structure holes, patch from Anthony Lineham. 7) Silence a bogus gcc warning in nfnetlink_log, from Arnd Bergmann. 8) Fix Kconfig dependency problems with TPROXY, socket and dup, also from Arnd. 9) Add __netdev_alloc_pcpu_stats() to allow creating percpu counters from atomic context, this is required by the follow up fix for nf_tables. 10) Fix crash from the dynamic set expression, we have to add new clone operation that should be defined when a simple memcpy is not enough. This resolves a crash when using per-cpu counters with new Patrick McHardy's flow table nft support. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * netfilter: nf_tables: add clone interface to expression operationsPablo Neira Ayuso2015-11-102-10/+44
| | | | | | | | | | | | | | | | | | | | | | | | With the conversion of the counter expressions to make it percpu, we need to clone the percpu memory area, otherwise we crash when using counters from flow tables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: fix xt_TEE and xt_TPROXY dependenciesArnd Bergmann2015-11-101-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kconfig is too smart for its own good: a Kconfig line that states select NF_DEFRAG_IPV6 if IP6_NF_IPTABLES means that if IP6_NF_IPTABLES is set to 'm', then NF_DEFRAG_IPV6 will also be set to 'm', regardless of the state of the symbol from which it is selected. When the xt_TEE driver is built-in and nothing else forces NF_DEFRAG_IPV6 to be built-in, this causes a link-time error: net/built-in.o: In function `tee_tg6': net/netfilter/xt_TEE.c:46: undefined reference to `nf_dup_ipv6' This works around that behavior by changing the dependency to 'if IP6_NF_IPTABLES != n', which is interpreted as boolean expression rather than a tristate and causes the NF_DEFRAG_IPV6 symbol to be built-in as well. The bug only occurs once in thousands of 'randconfig' builds and does not really impact real users. From inspecting the other surrounding Kconfig symbols, I am guessing that NETFILTER_XT_TARGET_TPROXY and NETFILTER_XT_MATCH_SOCKET have the same issue. If not, this change should still be harmless. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nfnetlink_log: work around uninitialized variable warningArnd Bergmann2015-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After a recent (correct) change, gcc started warning about the use of the 'flags' variable in nfulnl_recv_config() net/netfilter/nfnetlink_log.c: In function 'nfulnl_recv_config': net/netfilter/nfnetlink_log.c:320:14: warning: 'flags' may be used uninitialized in this function [-Wmaybe-uninitialized] net/netfilter/nfnetlink_log.c:828:6: note: 'flags' was declared here The warning first shows up in ARM s3c2410_defconfig with gcc-4.3 or higher (including 5.2.1, which is the latest version I checked) I tried working around it by rearranging the code but had no success with that. As a last resort, this initializes the variable to zero, which shuts up the warning, but means that we don't get a warning if the code is ever changed in a way that actually causes the variable to be used without first being written. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 8cbc870829ec ("netfilter: nfnetlink_log: validate dependencies to avoid breaking atomicity") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: Fix removal of GRE expectation entries created by PPTPAnthony Lineham2015-11-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | The uninitialized tuple structure caused incorrect hash calculation and the lookup failed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=106441 Signed-off-by: Anthony Lineham <anthony.lineham@alliedtelesis.co.nz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: ipset: Fix hash type expire: release empty hash bucket blockJozsef Kadlecsik2015-11-071-4/+9
| | | | | | | | | | | | | | | | | | When all entries are expired/all slots are empty, release the bucket. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Fix hash:* type expirationJozsef Kadlecsik2015-11-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Incorrect index was used when the data blob was shrinked at expiration, which could lead to falsely expired entries and memory leak when the comment extension was used too. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Fix extension alignmentJozsef Kadlecsik2015-11-077-79/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The data extensions in ipset lacked the proper memory alignment and thus could lead to kernel crash on several architectures. Therefore the structures have been reorganized and alignment attributes added where needed. The patch was tested on armv7h by Gerhard Wiesinger and on x86_64, sparc64 by Jozsef Kadlecsik. Reported-by: Gerhard Wiesinger <lists@wiesinger.com> Tested-by: Gerhard Wiesinger <lists@wiesinger.com> Tested-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2015-11-135-87/+93
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There are several patches from Ilya fixing RBD allocation lifecycle issues, a series adding a nocephx_sign_messages option (and associated bug fixes/cleanups), several patches from Zheng improving the (directory) fsync behavior, a big improvement in IO for direct-io requests when striping is enabled from Caifeng, and several other small fixes and cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: clear msg->con in ceph_msg_release() only libceph: add nocephx_sign_messages option libceph: stop duplicating client fields in messenger libceph: drop authorizer check from cephx msg signing routines libceph: msg signing callouts don't need con argument libceph: evaluate osd_req_op_data() arguments only once ceph: make fsync() wait unsafe requests that created/modified inode ceph: add request to i_unsafe_dirops when getting unsafe reply libceph: introduce ceph_x_authorizer_cleanup() ceph: don't invalidate page cache when inode is no longer used rbd: remove duplicate calls to rbd_dev_mapping_clear() rbd: set device_type::release instead of device::release rbd: don't free rbd_dev outside of the release callback rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails libceph: use local variable cursor instead of &msg->cursor libceph: remove con argument in handle_reply() ceph: combine as many iovec as possile into one OSD request ceph: fix message length computation ceph: fix a comment typo rbd: drop null test before destroy functions
| * | | libceph: clear msg->con in ceph_msg_release() onlyIlya Dryomov2015-11-022-28/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | libceph: add nocephx_sign_messages optionIlya Dryomov2015-11-023-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support for message signing was merged into 3.19, along with nocephx_require_signatures option. But, all that option does is allow the kernel client to talk to clusters that don't support MSG_AUTH feature bit. That's pretty useless, given that it's been supported since bobtail. Meanwhile, if one disables message signing on the server side with "cephx sign messages = false", it becomes impossible to use the kernel client since it expects messages to be signed if MSG_AUTH was negotiated. Add nocephx_sign_messages option to support this use case. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | libceph: stop duplicating client fields in messengerIlya Dryomov2015-11-022-22/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | supported_features and required_features serve no purpose at all, while nocrc and tcp_nodelay belong to ceph_options::flags. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | libceph: drop authorizer check from cephx msg signing routinesIlya Dryomov2015-11-021-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | I don't see a way for auth->authorizer to be NULL in ceph_x_sign_message() or ceph_x_check_message_signature(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | libceph: msg signing callouts don't need con argumentIlya Dryomov2015-11-022-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can use msg->con instead - at the point we sign an outgoing message or check the signature on the incoming one, msg->con is always set. We wouldn't know how to sign a message without an associated session (i.e. msg->con == NULL) and being able to sign a message using an explicitly provided authorizer is of no use. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | libceph: evaluate osd_req_op_data() arguments only onceIoana Ciornei2015-11-021-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes the osd_req_op_data() macro to not evaluate arguments more than once in order to follow the kernel coding style. Signed-off-by: Ioana Ciornei <ciorneiioana@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> [idryomov@gmail.com: changelog, formatting] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | libceph: introduce ceph_x_authorizer_cleanup()Ilya Dryomov2015-11-022-12/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ae385eaf24dc ("libceph: store session key in cephx authorizer") introduced ceph_x_authorizer::session_key, but didn't update all the exit/error paths. Introduce ceph_x_authorizer_cleanup() to encapsulate ceph_x_authorizer cleanup and switch to it. This fixes ceph_x_destroy(), which currently always leaks key and ceph_x_build_authorizer() error paths. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * | | libceph: use local variable cursor instead of &msg->cursorShraddha Barke2015-11-021-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use local variable cursor in place of &msg->cursor in read_partial_msg_data() and write_partial_msg_data(). Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | libceph: remove con argument in handle_reply()Shraddha Barke2015-11-021-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Since handle_reply() does not use its con argument, remove it. Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | | | Merge tag 'nfsd-4.4' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2015-11-113-28/+78
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "Apologies for coming a little late in the merge window. Fortunately this is another fairly quiet one: Mainly smaller bugfixes and cleanup. We're still finding some bugs from the breakup of the big NFSv4 state lock in 3.17 -- thanks especially to Andrew Elble and Jeff Layton for tracking down some of the remaining races" * tag 'nfsd-4.4' of git://linux-nfs.org/~bfields/linux: svcrpc: document lack of some memory barriers nfsd: fix race with open / open upgrade stateids nfsd: eliminate sending duplicate and repeated delegations nfsd: remove recurring workqueue job to clean DRC SUNRPC: drop stale comment in svc_setup_socket() nfsd: ensure that seqid morphing operations are atomic wrt to copies nfsd: serialize layout stateid morphing operations nfsd: improve client_has_state to check for unused openowners nfsd: fix clid_inuse on mount with security change sunrpc/cache: make cache flushing more reliable. nfsd: move include of state.h from trace.c to trace.h sunrpc: avoid warning in gss_key_timeout lockd: get rid of reference-counted NSM RPC clients SUNRPC: Use MSG_SENDPAGE_NOTLAST when calling sendpage() lockd: create NSM handles per net namespace nfsd: switch unsigned char flags in svc_fh to bools nfsd: move svc_fh->fh_maxsize to just after fh_handle nfsd: drop null test before destroy functions nfsd: serialize state seqid morphing operations
| * | | svcrpc: document lack of some memory barriersJ. Bruce Fields2015-11-101-6/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're missing memory barriers in net/sunrpc/svcsock.c in some spots we'd expect them. But it doesn't appear they're necessary in our case, and this is likely a hot path--for now just document the odd behavior. Kosuke Tatsukawa found this issue while looking through the linux source code for places calling waitqueue_active() before wake_up*(), but without preceding memory barriers, after sending a patch to fix a similar issue in drivers/tty/n_tty.c (Details about the original issue can be found here: https://lkml.org/lkml/2015/9/28/849). Reported-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | SUNRPC: drop stale comment in svc_setup_socket()Stefan Hajnoczi2015-11-101-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The svc_setup_socket() function does set the send and receive buffer sizes, so the comment is out-of-date: Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | sunrpc/cache: make cache flushing more reliable.Neil Brown2015-10-231-13/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The caches used to store sunrpc authentication information can be flushed by writing a timestamp to a file in /proc. This timestamp has a one-second resolution and any entry in cache that was last_refreshed *before* that time is treated as expired. This is problematic as it is not possible to reliably flush the cache without interrupting NFS service. If the current time is written to the "flush" file, any entry that was added since the current second started will still be treated as valid. If one second beyond than the current time is written to the file then no entries can be valid until the second ticks over. This will mean that no NFS request will be handled for up to 1 second. To resolve this issue we make two changes: 1/ treat an entry as expired if the timestamp when it was last_refreshed is before *or the same as* the expiry time. This means that current code which writes out the current time will now flush the cache reliably. 2/ when a new entry in added to the cache - set the last_refresh timestamp to 1 second *beyond* the current flush time, when that not in the past. This ensures that newly added entries will always be valid. Now that we have a very reliable way to flush the cache, and also since we are using "since-boot" timestamps which are monotonic, change cache_purge() to set the smallest future flush_time which will work, and leave it there: don't revert to '1'. Also disable the setting of the 'flush_time' far into the future. That has never been useful and is now awkward as it would cause last_refresh times to be strange. Finally: if a request is made to set the 'flush_time' to the current second, assume the intent is to flush the cache and advance it, if necessary, to 1 second beyond the current 'flush_time' so that all active entries will be deemed to be expired. As part of this we need to add a 'cache_detail' arg to cache_init() and cache_fresh_locked() so they can find the current ->flush_time. Signed-off-by: NeilBrown <neilb@suse.com> Reported-by: Olaf Kirch <okir@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>