summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* netdevice: move netdev_cap_txqueue for shared usage to headerDaniel Borkmann2014-02-171-0/+20
| | | | | | | | | In order to allow users to invoke netdev_cap_txqueue, it needs to be moved into netdevice.h header file. While at it, also add kernel doc header to document the API. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netdevice: add queue selection fallback handler for ndo_select_queueDaniel Borkmann2014-02-171-3/+6
| | | | | | | | | | | | | | | | | | Add a new argument for ndo_select_queue() callback that passes a fallback handler. This gets invoked through netdev_pick_tx(); fallback handler is currently __netdev_pick_tx() as most drivers invoke this function within their customized implementation in case for skbs that don't need any special handling. This fallback handler can then be replaced on other call-sites with different queue selection methods (e.g. in packet sockets, pktgen etc). This also has the nice side-effect that __netdev_pick_tx() is then only invoked from netdev_pick_tx() and export of that function to modules can be undone. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sctp: Fix a_rwnd/rwnd management to reflect real state of the ↵Matija Glavinic Pecotic2014-02-171-13/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | receiver's buffer Implementation of (a)rwnd calculation might lead to severe performance issues and associations completely stalling. These problems are described and solution is proposed which improves lksctp's robustness in congestion state. 1) Sudden drop of a_rwnd and incomplete window recovery afterwards Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data), but size of sk_buff, which is blamed against receiver buffer, is not accounted in rwnd. Theoretically, this should not be the problem as actual size of buffer is double the amount requested on the socket (SO_RECVBUF). Problem here is that this will have bad scaling for data which is less then sizeof sk_buff. E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion of traffic of this size (less then 100B). An example of sudden drop and incomplete window recovery is given below. Node B exhibits problematic behavior. Node A initiates association and B is configured to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp message in 4G (LTE) network). On B data is left in buffer by not reading socket in userspace. Lets examine when we will hit pressure state and declare rwnd to be 0 for scenario with above stated parameters (rwnd == 10000, chunk size == 43, each chunk is sent in separate sctp packet) Logic is implemented in sctp_assoc_rwnd_decrease: socket_buffer (see below) is maximum size which can be held in socket buffer (sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count) A simple expression is given for which it will be examined after how many packets for above stated parameters we enter pressure state: We start by condition which has to be met in order to enter pressure state: socket_buffer < currently_alloced; currently_alloced is represented as size of sctp packets received so far and not yet delivered to userspace. x is the number of chunks/packets (since there is no bundling, and each chunk is delivered in separate packet, we can observe each chunk also as sctp packet, and what is important here, having its own sk_buff): socket_buffer < x*each_sctp_packet; each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is twice the amount of initially requested size of socket buffer, which is in case of sctp, twice the a_rwnd requested: 2*rwnd < x*(payload+sizeof(struc sk_buff)); sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000 and each payload size is 43 20000 < x(43+190); x > 20000/233; x ~> 84; After ~84 messages, pressure state is entered and 0 rwnd is advertised while received 84*43B ~= 3612B sctp data. This is why external observer notices sudden drop from 6474 to 0, as it will be now shown in example: IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017] IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839] IP A.34340 > B.12345: sctp (1) [COOKIE ECHO] IP B.12345 > A.34340: sctp (1) [COOKIE ACK] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0] <...> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18] --> Sudden drop IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] At this point, rwnd_press stores current rwnd value so it can be later restored in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This condition is not met since rwnd, after it hit 0, must first reach rwnd_press by adding amount which is read from userspace. Let us observe values in above example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the amount of actual sctp data currently waiting to be delivered to userspace is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed only for sctp data, which is ~3500. Condition is never met, and when userspace reads all data, rwnd stays on 3569. IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0] --> At this point userspace read everything, rwnd recovered only to 3569 IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0] Reproduction is straight forward, it is enough for sender to send packets of size less then sizeof(struct sk_buff) and receiver keeping them in its buffers. 2) Minute size window for associations sharing the same socket buffer In case multiple associations share the same socket, and same socket buffer (sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one of the associations can permanently drop rwnd of other association(s). Situation will be typically observed as one association suddenly having rwnd dropped to size of last packet received and never recovering beyond that point. Different scenarios will lead to it, but all have in common that one of the associations (let it be association from 1)) nearly depleted socket buffer, and the other association blames socket buffer just for the amount enough to start the pressure. This association will enter pressure state, set rwnd_press and announce 0 rwnd. When data is read by userspace, similar situation as in 1) will occur, rwnd will increase just for the size read by userspace but rwnd_press will be high enough so that association doesn't have enough credit to reach rwnd_press and restore to previous state. This case is special case of 1), being worse as there is, in the worst case, only one packet in buffer for which size rwnd will be increased. Consequence is association which has very low maximum rwnd ('minute size', in our case down to 43B - size of packet which caused pressure) and as such unusable. Scenario happened in the field and labs frequently after congestion state (link breaks, different probabilities of packet drop, packet reordering) and with scenario 1) preceding. Here is given a deterministic scenario for reproduction: >From node A establish two associations on the same socket, with rcvbuf_policy being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1 repeat scenario from 1), that is, bring it down to 0 and restore up. Observe scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered', bring it down close to 0, as in just one more packet would close it. This has as a consequence that association number 2 is able to receive (at least) one more packet which will bring it in pressure state. E.g. if association 2 had rwnd of 10000, packet received was 43, and we enter at this point into pressure, rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will increase for 43, but conditions to restore rwnd to original state, just as in 1), will never be satisfied. --> Association 1, between A.y and B.12345 IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569] IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613] IP A.55915 > B.12345: sctp (1) [COOKIE ECHO] IP B.12345 > A.55915: sctp (1) [COOKIE ACK] --> Association 2, between A.z and B.12346 IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173] IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240] IP A.55915 > B.12346: sctp (1) [COOKIE ECHO] IP B.12346 > A.55915: sctp (1) [COOKIE ACK] --> Deplete socket buffer by sending messages of size 43B over association 1 IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] <...> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0] --> Sudden drop on 1 IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Here userspace read, rwnd 'recovered' to 3698, now deplete again using association 1 so there is place in buffer for only one more packet IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0] --> Socket buffer is almost depleted, but there is space for one more packet, send them over association 2, size 43B IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18] IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Immediate drop IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Read everything from the socket, both association recover up to maximum rwnd they are capable of reaching, note that association 1 recovered up to 3698, and association 2 recovered only to 43 IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0] IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18] IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0] A careful reader might wonder why it is necessary to reproduce 1) prior reproduction of 2). It is simply easier to observe when to send packet over association 2 which will push association into the pressure state. Proposed solution: Both problems share the same root cause, and that is improper scaling of socket buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while calculating rwnd is not possible due to fact that there is no linear relationship between amount of data blamed in increase/decrease with IP packet in which payload arrived. Even in case such solution would be followed, complexity of the code would increase. Due to nature of current rwnd handling, slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is entered is rationale, but it gives false representation to the sender of current buffer space. Furthermore, it implements additional congestion control mechanism which is defined on implementation, and not on standard basis. Proposed solution simplifies whole algorithm having on mind definition from rfc: o Receiver Window (rwnd): This gives the sender an indication of the space available in the receiver's inbound buffer. Core of the proposed solution is given with these lines: sctp_assoc_rwnd_update: if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0) asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1; else asoc->rwnd = 0; We advertise to sender (half of) actual space we have. Half is in the braces depending whether you would like to observe size of socket buffer as SO_RECVBUF or twice the amount, i.e. size is the one visible from userspace, that is, from kernelspace. In this way sender is given with good approximation of our buffer space, regardless of the buffer policy - we always advertise what we have. Proposed solution fixes described problems and removes necessity for rwnd restoration algorithm. Finally, as proposed solution is simplification, some lines of code, along with some bytes in struct sctp_association are saved. Version 2 of the patch addressed comments from Vlad. Name of the function is set to be more descriptive, and two parts of code are changed, in one removing the superfluous call to sctp_assoc_rwnd_update since call would not result in update of rwnd, and the other being reordering of the code in a way that call to sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2 in a way that existing function is not reordered/copied in line, but it is correctly called. Thanks Vlad for suggesting. Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ip, ipv6: handle gso skbs in forwarding pathFlorian Westphal2014-02-131-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Marcelo Ricardo Leitner reported problems when the forwarding link path has a lower mtu than the incoming one if the inbound interface supports GRO. Given: Host <mtu1500> R1 <mtu1200> R2 Host sends tcp stream which is routed via R1 and R2. R1 performs GRO. In this case, the kernel will fail to send ICMP fragmentation needed messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu checks in forward path. Instead, Linux tries to send out packets exceeding the mtu. When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does not fragment the packets when forwarding, and again tries to send out packets exceeding R1-R2 link mtu. This alters the forwarding dstmtu checks to take the individual gso segment lengths into account. For ipv6, we send out pkt too big error for gso if the individual segments are too big. For ipv4, we either send icmp fragmentation needed, or, if the DF bit is not set, perform software segmentation and let the output path create fragments when the packet is leaving the machine. It is not 100% correct as the error message will contain the headers of the GRO skb instead of the original/segmented one, but it seems to work fine in my (limited) tests. Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid sofware segmentation. However it turns out that skb_segment() assumes skb nr_frags is related to mss size so we would BUG there. I don't want to mess with it considering Herbert and Eric disagree on what the correct behavior should be. Hannes Frederic Sowa notes that when we would shrink gso_size skb_segment would then also need to deal with the case where SKB_MAX_FRAGS would be exceeded. This uses sofware segmentation in the forward path when we hit ipv4 non-DF packets and the outgoing link mtu is too small. Its not perfect, but given the lack of bug reports wrt. GRO fwd being broken this is a rare case anyway. Also its not like this could not be improved later once the dust settles. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: core: introduce netif_skb_dev_featuresFlorian Westphal2014-02-131-1/+6
| | | | | | | | Will be used by upcoming ipv4 forward path change that needs to determine feature mask using skb->dst->dev instead of skb->dev. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2014-02-1111-20/+103
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) Fix flexcan build on big endian, from Arnd Bergmann 2) Correctly attach cpsw to GPIO bitbang MDIO drive, from Stefan Roese 3) udp_add_offload has to use GFP_ATOMIC since it can be invoked from non-sleepable contexts. From Or Gerlitz 4) vxlan_gro_receive() does not iterate over all possible flows properly, fix also from Or Gerlitz 5) CAN core doesn't use a proper SKB destructor when it hooks up sockets to SKBs. Fix from Oliver Hartkopp 6) ip_tunnel_xmit() can use an uninitialized route pointer, fix from Eric Dumazet 7) Fix address family assignment in IPVS, from Michal Kubecek 8) Fix ath9k build on ARM, from Sujith Manoharan 9) Make sure fail_over_mac only applies for the correct bonding modes, from Ding Tianhong 10) The udp offload code doesn't use RCU correctly, from Shlomo Pongratz 11) Handle gigabit features properly in generic PHY code, from Florian Fainelli 12) Don't blindly invoke link operations in rtnl_link_get_slave_info_data_size, they are optional. Fix from Fernando Luis Vazquez Cao 13) Add USB IDs for Netgear Aircard 340U, from Bjørn Mork 14) Handle netlink packet padding properly in openvswitch, from Thomas Graf 15) Fix oops when deleting chains in nf_tables, from Patrick McHardy 16) Fix RX stalls in xen-netback driver, from Zoltan Kiss 17) Fix deadlock in mac80211 stack, from Emmanuel Grumbach 18) inet_nlmsg_size() forgets to consider ifa_cacheinfo, fix from Geert Uytterhoeven 19) tg3_change_mtu() can deadlock, fix from Nithin Sujir 20) Fix regression in setting SCTP local source addresses on accepted sockets, caused by some generic ipv6 socket changes. Fix from Matija Glavinic Pecotic 21) IPPROTO_* must be pure defines, otherwise module aliases don't get constructed properly. Fix from Jan Moskyto 22) IPV6 netconsole setup doesn't work properly unless an explicit source address is specified, fix from Sabrina Dubroca 23) Use __GFP_NORETRY for high order skb page allocations in sock_alloc_send_pskb and skb_page_frag_refill. From Eric Dumazet 24) Fix a regression added in netconsole over bridging, from Cong Wang 25) TCP uses an artificial offset of 1ms for SRTT, but this doesn't jive well with TCP pacing which needs the SRTT to be accurate. Fix from Eric Dumazet 26) Several cases of missing header file includes from Rashika Kheria 27) Add ZTE MF667 device ID to qmi_wwan driver, from Raymond Wanyoike 28) TCP Small Queues doesn't handle nonagle properly in some corner cases, fix from Eric Dumazet 29) Remove extraneous read_unlock in bond_enslave, whoops. From Ding Tianhong 30) Fix 9p trans_virtio handling of vmalloc buffers, from Richard Yao * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (136 commits) 6lowpan: fix lockdep splats alx: add missing stats_lock spinlock init 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers bonding: remove unwanted bond lock for enslave processing USB2NET : SR9800 : One chip USB2.0 USB2NET SR9800 Device Driver Support tcp: tsq: fix nonagle handling bridge: Prevent possible race condition in br_fdb_change_mac_address bridge: Properly check if local fdb entry can be deleted when deleting vlan bridge: Properly check if local fdb entry can be deleted in br_fdb_delete_by_port bridge: Properly check if local fdb entry can be deleted in br_fdb_change_mac_address bridge: Fix the way to check if a local fdb entry can be deleted bridge: Change local fdb entries whenever mac address of bridge device changes bridge: Fix the way to find old local fdb entries in br_fdb_change_mac_address bridge: Fix the way to insert new local fdb entries in br_fdb_changeaddr bridge: Fix the way to find old local fdb entries in br_fdb_changeaddr tcp: correct code comment stating 3 min timeout for FIN_WAIT2, we only do 1 min net: vxge: Remove unused device pointer net: qmi_wwan: add ZTE MF667 3c59x: Remove unused pointer in vortex_eisa_cleanup() net: fix 'ip rule' iif/oif device rename ...
| * net: Move prototype declaration to header file include/net/net_namespace.h ↵Rashika Kheria2014-02-091-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | from net/ipx/af_ipx.c Move prototype declaration of function to header file include/net/net_namespace.h from net/ipx/af_ipx.c because they are used by more than one file. This eliminates the following warning in net/ipx/sysctl_net_ipx.c: net/ipx/sysctl_net_ipx.c:33:6: warning: no previous prototype for ‘ipx_register_sysctl’ [-Wmissing-prototypes] net/ipx/sysctl_net_ipx.c:38:6: warning: no previous prototype for ‘ipx_unregister_sysctl’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Move prototype declaration to header file include/net/datalink.h from ↵Rashika Kheria2014-02-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/ipx/af_ipx.c Move prototype declarations of function to header file include/net/datalink.h from net/ipx/af_ipx.c because they are used by more than one file. This eliminates the following warning in net/ipx/pe2.c: net/ipx/pe2.c:20:24: warning: no previous prototype for ‘make_EII_client’ [-Wmissing-prototypes] net/ipx/pe2.c:32:6: warning: no previous prototype for ‘destroy_EII_client’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Move prototype declaration to header file include/net/ipx.h from ↵Rashika Kheria2014-02-091-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/ipx/af_ipx.c Move prototype declaration of functions to header file include/net/ipx.h from net/ipx/af_ipx.c because they are used by more than one file. This eliminates the following warning in net/ipx/ipx_route.c:33:19: warning: no previous prototype for ‘ipxrtr_lookup’ [-Wmissing-prototypes] net/ipx/ipx_route.c:52:5: warning: no previous prototype for ‘ipxrtr_add_route’ [-Wmissing-prototypes] net/ipx/ipx_route.c:94:6: warning: no previous prototype for ‘ipxrtr_del_routes’ [-Wmissing-prototypes] net/ipx/ipx_route.c:149:5: warning: no previous prototype for ‘ipxrtr_route_skb’ [-Wmissing-prototypes] net/ipx/ipx_route.c:171:5: warning: no previous prototype for ‘ipxrtr_route_packet’ [-Wmissing-prototypes] net/ipx/ipx_route.c:261:5: warning: no previous prototype for ‘ipxrtr_ioctl’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Move prototype declaration to include/net/ipx.h from net/ipx/ipx_route.cRashika Kheria2014-02-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Move prototype definition of function to header file include/net/ipx.h from net/ipx/ipx_route.c because they are used by more than one file. This eliminates the following warning from net/ipx/af_ipx.c: net/ipx/af_ipx.c:193:23: warning: no previous prototype for ‘ipxitf_find_using_net’ [-Wmissing-prototypes] net/ipx/af_ipx.c:577:5: warning: no previous prototype for ‘ipxitf_send’ [-Wmissing-prototypes] net/ipx/af_ipx.c:1219:8: warning: no previous prototype for ‘ipx_cksum’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Move prototype declaration to header file include/net/dn.h from ↵Rashika Kheria2014-02-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/decnet/af_decnet.c Move prototype declaration of functions to header file include/net/dn.h from net/decnet/af_decnet.c because they are used by more than one file. This eliminates the following warning in net/decnet/af_decnet.c: net/decnet/sysctl_net_decnet.c:354:6: warning: no previous prototype for ‘dn_register_sysctl’ [-Wmissing-prototypes] net/decnet/sysctl_net_decnet.c:359:6: warning: no previous prototype for ‘dn_unregister_sysctl’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Move prototype declaration to appropriate header file from ↵Rashika Kheria2014-02-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | decnet/af_decnet.c Move prototype declaration of functions to header file include/net/dn_route.h from net/decnet/af_decnet.c because it is used by more than one file. This eliminates the following warning in net/decnet/dn_route.c: net/decnet/dn_route.c:629:5: warning: no previous prototype for ‘dn_route_rcv’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2014-02-093-4/+32
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/nftables/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes, mostly nftables fixes, most relevantly they are: * Fix a crash in the h323 conntrack NAT helper due to expectation list corruption, from Alexey Dobriyan. * A couple of RCU race fixes for conntrack, one manifests by hitting BUG_ON in nf_nat_setup_info() and the destroy path, patches from Andrey Vagin and me. * Dump direction attribute in nft_ct only if it is set, from Arturo Borrero. * Fix IPVS bug in its own connection tracking system that may lead to copying only 4 bytes of the IPv6 address when initializing the ip_vs_conn object, from Michal Kubecek. * Fix -EBUSY errors in nftables when deleting the rules, chain and tables in a row due mixture of asynchronous and synchronous object releasing, from me. * Three fixes for the nf_tables set infrastructure when using intervals and mappings, from me. * Four patches to fixing the nf_tables log, reject and ct expressions from the new inet table, from Patrick McHardy. * Fix memory overrun in the map that is used to dynamically allocate names from anonymous sets, also from Patrick. * Fix a potential oops if you dump a set with NFPROTO_UNSPEC and a table name, from Patrick McHardy. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * netfilter: nf_tables: fix racy rule deletionPablo Neira Ayuso2014-02-061-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We may lost race if we flush the rule-set (which happens asynchronously via call_rcu) and we try to remove the table (that userspace assumes to be empty). Fix this by recovering synchronous rule and chain deletion. This was introduced time ago before we had no batch support, and synchronous rule deletion performance was not good. Now that we have the batch support, we can just postpone the purge of old rule in a second step in the commit phase. All object deletions are synchronous after this patch. As a side effect, we save memory as we don't need rcu_head per rule anymore. Cc: Patrick McHardy <kaber@trash.net> Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: add reject module for NFPROTO_INETPatrick McHardy2014-02-061-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | Add a reject module for NFPROTO_INET. It does nothing but dispatch to the AF-specific modules based on the hook family. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc partsPatrick McHardy2014-02-061-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the nft_reject module depends on symbols from ipv6. This is wrong since no generic module should force IPv6 support to be loaded. Split up the module into AF-specific and a generic part. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: add AF specific expression supportPatrick McHardy2014-02-061-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | For the reject module, we need to add AF-specific implementations to get rid of incorrect module dependencies. Try to load an AF-specific module first and fall back to generic modules. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_conntrack: don't release a conntrack with non-zero refcntPablo Neira Ayuso2014-02-051-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With this patch, the conntrack refcount is initially set to zero and it is bumped once it is added to any of the list, so we fulfill Eric's golden rule which is that all released objects always have a refcount that equals zero. Andrey Vagin reports that nf_conntrack_free can't be called for a conntrack with non-zero ref-counter, because it can race with nf_conntrack_find_get(). A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero ref-counter says that this conntrack is used. So when we release a conntrack with non-zero counter, we break this assumption. CPU1 CPU2 ____nf_conntrack_find() nf_ct_put() destroy_conntrack() ... init_conntrack __nf_conntrack_alloc (set use = 1) atomic_inc_not_zero(&ct->use) (use = 2) if (!l4proto->new(ct, skb, dataoff, timeouts)) nf_conntrack_free(ct); (use = 2 !!!) ... __nf_conntrack_alloc (set use = 1) if (!nf_ct_key_equal(h, tuple, zone)) nf_ct_put(ct); (use = 0) destroy_conntrack() /* continue to work with CT */ After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get" another bug was triggered in destroy_conntrack(): <4>[67096.759334] ------------[ cut here ]------------ <2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211! ... <4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C --------------- 2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB <4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>] [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack] <4>[67096.760255] Call Trace: <4>[67096.760255] [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30 <4>[67096.760255] [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack] <4>[67096.760255] [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack] <4>[67096.760255] [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4] <4>[67096.760255] [<ffffffff81484419>] nf_iterate+0x69/0xb0 <4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20 <4>[67096.760255] [<ffffffff814845d4>] nf_hook_slow+0x74/0x110 <4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20 <4>[67096.760255] [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910 <4>[67096.760255] [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130 <4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20 <4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20 <4>[67096.760255] [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0 <4>[67096.760255] [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140 <4>[67096.760255] [<ffffffff81444f97>] sock_sendmsg+0x117/0x140 <4>[67096.760255] [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60 <4>[67096.760255] [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20 <4>[67096.760255] [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40 <4>[67096.760255] [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80 <4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20 <4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20 <4>[67096.760255] [<ffffffff814457c9>] sys_sendto+0x139/0x190 <4>[67096.760255] [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200 <4>[67096.760255] [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290 <4>[67096.760255] [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210 <4>[67096.760255] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5 I have reused the original title for the RFC patch that Andrey posted and most of the original patch description. Cc: Eric Dumazet <edumazet@google.com> Cc: Andrew Vagin <avagin@parallels.com> Cc: Florian Westphal <fw@strlen.de> Reported-by: Andrew Vagin <avagin@parallels.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Andrew Vagin <avagin@parallels.com>
| * | inet: defines IPPROTO_* needed for module alias generationJan Moskyto Matejka2014-02-061-16/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit cfd280c91253 ("net: sync some IP headers with glibc") changed a set of define's to an enum (with no explanation why) which introduced a bug in module mip6 where aliases are generated using the IPPROTO_* defines; mip6 doesn't load if require_module called with the aliases from xfrm_get_type(). Reverting this change back to define's to fix the aliases. modinfo mip6 (before this change) alias: xfrm-type-10-IPPROTO_DSTOPTS alias: xfrm-type-10-IPPROTO_ROUTING modinfo mip6 (after this change) alias: xfrm-type-10-43 alias: xfrm-type-10-60 Signed-off-by: Jan Moskyto Matejka <mq@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: ethoc: set up MII management bus clockMax Filippov2014-02-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MII management bus clock is derived from the MAC clock by dividing it by MIIMODER register CLKDIV field value. This value may need to be set up in case it is undefined or its default value is too high (and communication with PHY is too slow) or too low (and communication with PHY is impossible). The value of CLKDIV is not specified directly, but is derived from the MAC clock for the default MII management bus frequency of 2.5MHz. The MAC clock may be specified in the platform data, or in the 'clocks' device tree attribute. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | can: add destructor for self generated skbsOliver Hartkopp2014-01-301-0/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Self generated skbuffs in net/can/bcm.c are setting a skb->sk reference but no explicit destructor which is enforced since Linux 3.11 with commit 376c7311bdb6 (net: add a temporary sanity check in skb_orphan()). This patch adds some helper functions to make sure that a destructor is properly defined when a sock reference is assigned to a CAN related skb. To create an unshared skb owned by the original sock a common helper function has been introduced to replace open coded functions to create CAN echo skbs. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-by: Andre Naujoks <nautsch2@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | smp.h: fix x86+cpu.c sparse warnings about arch nonboot CPU callsPaul Gortmaker2014-02-101-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use what we already do for arch_disable_smp_support() to fix these: arch/x86/kernel/smpboot.c:1155:6: warning: symbol 'arch_enable_nonboot_cpus_begin' was not declared. Should it be static? arch/x86/kernel/smpboot.c:1160:6: warning: symbol 'arch_enable_nonboot_cpus_end' was not declared. Should it be static? kernel/cpu.c:512:13: warning: symbol 'arch_enable_nonboot_cpus_begin' was not declared. Should it be static? kernel/cpu.c:516:13: warning: symbol 'arch_enable_nonboot_cpus_end' was not declared. Should it be static? Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2014-02-091-1/+7
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes, both -stable fodder. The O_SYNC bug is fairly old..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix a kmap leak in virtio_console fix O_SYNC|O_APPEND syncing the wrong range on write()
| * | | fix O_SYNC|O_APPEND syncing the wrong range on write()Al Viro2014-02-091-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support) when sync_page_range() had been introduced; generic_file_write{,v}() correctly synced pos_after_write - written .. pos_after_write - 1 but generic_file_aio_write() synced pos_before_write .. pos_before_write + written - 1 instead. Which is not the same thing with O_APPEND, obviously. A couple of years later correct variant had been killed off when everything switched to use of generic_file_aio_write(). All users of generic_file_aio_write() are affected, and the same bug has been copied into other instances of ->aio_write(). The fix is trivial; the only subtle point is that generic_write_sync() ought to be inlined to avoid calculations useless for the majority of calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2014-02-082-1/+11
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Peter Anvin: "Quite a varied little collection of fixes. Most of them are relatively small or isolated; the biggest one is Mel Gorman's fixes for TLB range flushing. A couple of AMD-related fixes (including not crashing when given an invalid microcode image) and fix a crash when compiled with gcov" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, microcode, AMD: Unify valid container checks x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y x86/efi: Allow mapping BGRT on x86-32 x86: Fix the initialization of physnode_map x86, cpu hotplug: Fix stack frame warning in check_irq_vectors_for_cpu_disable() x86/intel/mid: Fix X86_INTEL_MID dependencies arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge x86: mm: change tlb_flushall_shift for IvyBridge x86/mm: Eliminate redundant page table walk during TLB range flushing x86/mm: Clean up inconsistencies when flushing TLB ranges mm, x86: Account for TLB flushes only when debugging x86/AMD/NB: Fix amd_set_subcaches() parameter type x86/quirks: Add workaround for AMD F16h Erratum792 x86, doc, kconfig: Fix dud URL for Microcode data
| * | | | Merge tag 'efi-urgent' into x86/urgentH. Peter Anvin2014-02-07567-3840/+14954
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | * Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit). Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | Merge branch 'linus' into x86/urgentIngo Molnar2014-01-2550-340/+737
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge in the x86 changes to apply a fix. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | mm, x86: Account for TLB flushes only when debuggingMel Gorman2014-01-252-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm: vmstats: tlb flush counters") to cause overhead problems. The counters are undeniably useful but how often do we really need to debug TLB flush related issues? It does not justify taking the penalty everywhere so make it a debugging option. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | | swap: add a simple detector for inappropriate swapin readaheadShaohua Li2014-02-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a patch to improve swap readahead algorithm. It's from Hugh and I slightly changed it. Hugh's original changelog: swapin readahead does a blind readahead, whether or not the swapin is sequential. This may be ok on harddisk, because large reads have relatively small costs, and if the readahead pages are unneeded they can be reclaimed easily - though, what if their allocation forced reclaim of useful pages? But on SSD devices large reads are more expensive than small ones: if the readahead pages are unneeded, reading them in caused significant overhead. This patch adds very simplistic random read detection. Stealing the PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages() simply looks at readahead's current success rate, and narrows or widens its readahead window accordingly. There is little science to its heuristic: it's about as stupid as can be whilst remaining effective. The table below shows elapsed times (in centiseconds) when running a single repetitive swapping load across a 1000MB mapping in 900MB ram with 1GB swap (the harddisk tests had taken painfully too long when I used mem=500M, but SSD shows similar results for that). Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch which Shaohua showed to be defective; HughNew this Nov 14 patch, with page_cluster as usual at default of 3 (8-page reads); HughPC4 this same patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0 (1-page reads: no readahead). HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for sequential access to the mapping, cycling five times around; Rand for the same number of random touches. Anon for a MAP_PRIVATE anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs. One weakness of Shaohua's vma/anon_vma approach was that it did not optimize Shmem: seen below. Konstantin's approach was perhaps mistuned, 50% slower on Seq: did not compete and is not shown below. HDD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0 Seq Anon 73921 76210 75611 76904 78191 121542 Seq Shmem 73601 73176 73855 72947 74543 118322 Rand Anon 895392 831243 871569 845197 846496 841680 Rand Shmem 1058375 1053486 827935 764955 764376 756489 SSD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0 Seq Anon 24634 24198 24673 25107 21614 70018 Seq Shmem 24959 24932 25052 25703 22030 69678 Rand Anon 43014 26146 28075 25989 26935 25901 Rand Shmem 45349 45215 28249 24268 24138 24332 These tests are, of course, two extremes of a very simple case: under heavier mixed loads I've not yet observed any consistent improvement or degradation, and wider testing would be welcome. Shaohua Li: Test shows Vanilla is slightly better in sequential workload than Hugh's patch. I observed with Hugh's patch sometimes the readahead size is shrinked too fast (from 8 to 1 immediately) in sequential workload if there is no hit. And in such case, continuing doing readahead is good actually. I don't prepare a sophisticated algorithm for the sequential workload because so far we can't guarantee sequential accessed pages are swap out sequentially. So I slightly change Hugh's heuristic - don't shrink readahead size too fast. Here is my test result (unit second, 3 runs average): Vanilla Hugh New Seq 356 370 360 Random 4525 2447 2444 Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'. The first part is running a random workload (till around 1200 of the x-axis) and the second part is running a sequential workload. swapin and swapout throughput are almost identical in steady state in both workloads. These are expected behavior. while in Vanilla, swapin is much bigger than swapout especially in random workload (because wrong readahead). Original patches by: Shaohua Li and Konstantin Khlebnikov. [fengguang.wu@intel.com: swapin_nr_pages() can be static] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | Merge tag 'stable/for-linus-3.14-rc1-tag' of ↵Linus Torvalds2014-02-051-6/+2
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull Xen fixes from Konrad Rzeszutek Wilk: "Bug-fixes: - Revert "xen/grant-table: Avoid m2p_override during mapping" as it broke Xen ARM build. - Fix CR4 not being set on AP processors in Xen PVH mode" * tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/pvh: set CR4 flags for APs Revert "xen/grant-table: Avoid m2p_override during mapping"
| * | | | | | Revert "xen/grant-table: Avoid m2p_override during mapping"Konrad Rzeszutek Wilk2014-02-031-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 08ece5bb2312b4510b161a6ef6682f37f4eac8a1. As it breaks ARM builds and needs more attention on the ARM side. Acked-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | | | | | | Merge git://git.infradead.org/users/willy/linux-nvmeLinus Torvalds2014-02-052-1/+16
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe driver update from Matthew Wilcox: "Looks like I missed the merge window ... but these are almost all bugfixes anyway (the ones that aren't have been baking for months)" * git://git.infradead.org/users/willy/linux-nvme: NVMe: Namespace use after free on surprise removal NVMe: Correct uses of INIT_WORK NVMe: Include device and queue numbers in interrupt name NVMe: Add a pci_driver shutdown method NVMe: Disable admin queue on init failure NVMe: Dynamically allocate partition numbers NVMe: Async IO queue deletion NVMe: Surprise removal handling NVMe: Abort timed out commands NVMe: Schedule reset for failed controllers NVMe: Device resume error handling NVMe: Cache dev->pci_dev in a local pointer NVMe: Fix lockdep warnings NVMe: compat SG_IO ioctl NVMe: remove deprecated IRQF_DISABLED NVMe: Avoid shift operation when writing cq head doorbell
| * | | | | | | NVMe: Abort timed out commandsKeith Busch2014-01-272-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Send nvme abort command to io requests that have timed out on an initialized device. If the command is not returned after another timeout, schedule the controller for reset. Signed-off-by: Keith Busch <keith.busch@intel.com> [fix endianness issues] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | | | | NVMe: Schedule reset for failed controllersKeith Busch2014-01-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Schedules a controller reset when it indicates it has a failed status. If the device does not become ready after a reset, the pci device will be scheduled for removal. Signed-off-by: Keith Busch <keith.busch@intel.com> [fixed checkpatch issue] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | | | | NVMe: Device resume error handlingKeith Busch2013-12-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds controller error handling on resume power management. If the device fails to initialize, the device is queued for a reset. If the reset fails, a thread is spawned to remove the pci device. If the device resumes as "busy", the device is responding to admin commands but will not create IO queues. In this case, we need to remove the gendisks and free the IO queues since they can't be used and may be holding bios in their lists. From testing, the dma pools require a pci device so this had to change the pci driver 'remove' to release the dma resources in line with that call instead of after all references to the device are released. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | | | | NVMe: compat SG_IO ioctlKeith Busch2013-12-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For 32-bit versions of sg3-utils running on a 64-bit system. This is mostly a copy from the relevent portions of fs/compat_ioctl.c, with slight modifications for going through block_device_operations. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> [fixed up CONFIG_COMPAT=n build problems] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | | | | NVMe: Avoid shift operation when writing cq head doorbellHaiyan Hu2013-11-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes the type of dev->db_stride to unsigned and changes the value stored there to be 1 << the current value. Then there is less calculation to be done at completion time. Signed-off-by: Haiyan Hu <huhaiyan@huawei.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
* | | | | | | | execve: use 'struct filename *' for executable name passingLinus Torvalds2014-02-053-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes 'do_execve()' to get the executable name as a 'struct filename', and to free it when it is done. This is what the normal users want, and it simplifies and streamlines their error handling. The controlled lifetime of the executable name also fixes a use-after-free problem with the trace_sched_process_exec tracepoint: the lifetime of the passed-in string for kernel users was not at all obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize the pathname allocation lifetime with the execve() having finished, which in turn meant that the trace point that happened after mm_release() of the old process VM ended up using already free'd memory. To solve the kernel string lifetime issue, this simply introduces "getname_kernel()" that works like the normal user-space getname() function, except with the source coming from kernel memory. As Oleg points out, this also means that we could drop the tcomm[] array from 'struct linux_binprm', since the pathname lifetime now covers setup_new_exec(). That would be a separate cleanup. Reported-by: Igor Zhbanov <i.zhbanov@samsung.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | | Merge tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2014-02-041-0/+2
|\ \ \ \ \ \ \ \ | |_|_|_|_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights: - Fix NFSv3 acl regressions - Fix NFSv4 memory corruption due to slot table abuse in nfs4_proc_open_confirm - nfs4_destroy_session must call rpc_destroy_waitqueue" * tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: fs: get_acl() must be allowed to return EOPNOTSUPP NFSv3: Fix return value of nfs3_proc_setacls NFSv3: Remove unused function nfs3_proc_set_default_acl NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueue NFSv4: Fix memory corruption in nfs4_proc_open_confirm nfs: fix setting of ACLs on file creation.
| * | | | | | | NFSv4: Fix memory corruption in nfs4_proc_open_confirmTrond Myklebust2014-02-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfs41_wake_and_assign_slot() relies on the task->tk_msg.rpc_argp and task->tk_msg.rpc_resp always pointing to the session sequence arguments. nfs4_proc_open_confirm tries to pull a fast one by reusing the open sequence structure, thus causing corruption of the NFSv4 slot table. Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | | | | | | | Merge branch 'slab/next' of ↵Linus Torvalds2014-02-021-4/+4
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "Random bug fixes that have accumulated in my inbox over the past few months" * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: mm: Fix warning on make htmldocs caused by slab.c mm: slub: work around unneeded lockdep warning mm: sl[uo]b: fix misleading comments slub: Fix possible format string bug. slub: use lockdep_assert_held slub: Fix calculation of cpu slabs slab.h: remove duplicate kmalloc declaration and fix kernel-doc warnings
| * | | | | | | | mm: sl[uo]b: fix misleading commentsDave Hansen2014-01-311-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On x86, SLUB creates and handles <=8192-byte allocations internally. It passes larger ones up to the allocator. Saying "up to order 2" is, at best, ambiguous. Is that order-1? Or (order-2 bytes)? Make it more clear. SLOB commits a similar sin. It *handles* page-size requests, but the comment says that it passes up "all page size and larger requests". SLOB also swaps around the order of the very-similarly-named KMALLOC_SHIFT_HIGH and KMALLOC_SHIFT_MAX #defines. Make it consistent with the order of the other two allocators. Cc: Matt Mackall <mpm@selenic.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| * | | | | | | | slab.h: remove duplicate kmalloc declaration and fix kernel-doc warningsRandy Dunlap2013-12-181-56/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix kernel-doc warning for duplicate definition of 'kmalloc': Documentation/DocBook/kernel-api.xml:9483: element refentry: validity error : ID API-kmalloc already defined <refentry id="API-kmalloc"> Also combine the kernel-doc info from the 2 kmalloc definitions into one block and remove the "see kcalloc" comment since kmalloc now contains the @flags info. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
* | | | | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2014-02-011-1/+1
|\ \ \ \ \ \ \ \ \ | |_|/ / / / / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Several obvious fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Fix mountpoint reference leakage in linkat hfsplus: use xattr handlers for removexattr Typo in compat_sys_lseek() declaration fs/super.c: sync ro remount after blocking writers vfs: unexport the getname() symbol
| * | | | | | | | Typo in compat_sys_lseek() declarationStephan Springl2014-01-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | | | | | Merge tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2014-01-311-0/+1
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights: - Fix several races in nfs_revalidate_mapping - NFSv4.1 slot leakage in the pNFS files driver - Stable fix for a slot leak in nfs40_sequence_done - Don't reject NFSv4 servers that support ACLs with only ALLOW aces" * tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: initialize the ACL support bits to zero. NFSv4.1: Cleanup NFSv4.1: Clean up nfs41_sequence_done NFSv4: Fix a slot leak in nfs40_sequence_done NFSv4.1 free slot before resending I/O to MDS nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING NFS: Fix races in nfs_revalidate_mapping sunrpc: turn warn_gssd() log message into a dprintk() NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping nfs: handle servers that support only ALLOW ACE type.
| * | | | | | | | | NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mappingJeff Layton2014-01-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a possible race in how the nfs_invalidate_mapping function is handled. Currently, we go and invalidate the pages in the file and then clear NFS_INO_INVALID_DATA. The problem is that it's possible for a stale page to creep into the mapping after the page was invalidated (i.e., via readahead). If another writer comes along and sets the flag after that happens but before invalidate_inode_pages2 returns then we could clear the flag without the cache having been properly invalidated. So, we must clear the flag first and then invalidate the pages. Doing this however, opens another race: It's possible to have two concurrent read() calls that end up in nfs_revalidate_mapping at the same time. The first one clears the NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping. Just before calling that though, the other task races in, checks the flag and finds it cleared. At that point, it trusts that the mapping is good and gets the lock on the page, allowing the read() to be satisfied from the cache even though the data is no longer valid. These effects are easily manifested by running diotest3 from the LTP test suite on NFS. That program does a series of DIO writes and buffered reads. The operations are serialized and page-aligned but the existing code fails the test since it occasionally allows a read to come out of the cache incorrectly. While mixing direct and buffered I/O isn't recommended, I believe it's possible to hit this in other ways that just use buffered I/O, though that situation is much harder to reproduce. The problem is that the checking/clearing of that flag and the invalidation of the mapping really need to be atomic. Fix this by serializing concurrent invalidations with a bitlock. At the same time, we also need to allow other places that check NFS_INO_INVALID_DATA to check whether we might be in the middle of invalidating the file, so fix up a couple of places that do that to look for the new NFS_INO_INVALIDATING flag. Doing this requires us to be careful not to set the bitlock unnecessarily, so this code only does that if it believes it will be doing an invalidation. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | | | | | | | | | Merge branch 'for-next' of ↵Linus Torvalds2014-01-316-9/+89
|\ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "The highlights this round include: - add support for SCSI Referrals (Hannes) - add support for T10 DIF into target core (nab + mkp) - add support for T10 DIF emulation in FILEIO + RAMDISK backends (Sagi + nab) - add support for T10 DIF -> bio_integrity passthrough in IBLOCK backend (nab) - prep changes to iser-target for >= v3.15 T10 DIF support (Sagi) - add support for qla2xxx N_Port ID Virtualization - NPIV (Saurav + Quinn) - allow percpu_ida_alloc() to receive task state bitmask (Kent) - fix >= v3.12 iscsi-target session reset hung task regression (nab) - fix >= v3.13 percpu_ref se_lun->lun_ref_active race (nab) - fix a long-standing network portal creation race (Andy)" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (51 commits) target: Fix percpu_ref_put race in transport_lun_remove_cmd target/iscsi: Fix network portal creation race target: Report bad sector in sense data for DIF errors iscsi-target: Convert gfp_t parameter to task state bitmask iscsi-target: Fix connection reset hang with percpu_ida_alloc percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask iscsi-target: Pre-allocate more tags to avoid ack starvation qla2xxx: Configure NPIV fc_vport via tcm_qla2xxx_npiv_make_lport qla2xxx: Enhancements to enable NPIV support for QLOGIC ISPs with TCM/LIO. qla2xxx: Fix scsi_host leak on qlt_lport_register callback failure IB/isert: pass scatterlist instead of cmd to fast_reg_mr routine IB/isert: Move fastreg descriptor creation to a function IB/isert: Avoid frwr notation, user fastreg IB/isert: seperate connection protection domains and dma MRs tcm_loop: Enable DIF/DIX modes in SCSI host LLD target/rd: Add DIF protection into rd_execute_rw target/rd: Add support for protection SGL setup + release target/rd: Refactor rd_build_device_space + rd_release_device_space target/file: Add DIF protection support to fd_execute_rw target/file: Add DIF protection init/format support ...
| * | | | | | | | | | target: Fix percpu_ref_put race in transport_lun_remove_cmdNicholas Bellinger2014-01-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a percpu_ref_put race for se_lun->lun_ref in transport_lun_remove_cmd() where ->lun_ref could end up being put more than once per command via different target completion and fabric release contexts. It adds a cmpxchg() for se_cmd->lun_ref_active to ensure that percpu_ref_put() is only ever called once per se_cmd. This bug was manifesting itself as a LUN shutdown regression bug in >= v3.13 code, where percpu_ref_kill() would end up hanging indefinately due to the incorrect percpu_ref count. (Change se_cmd->lun_ref_active from bool -> int to force at least a 4-byte cmpxchg with MIPS ll/sc ins. - Fengguang) Reported-by: Tommy Apel <tommyapeldk@gmail.com> Cc: Tommy Apel <tommyapeldk@gmail.com> Cc: <stable@vger.kernel.org> #3.13+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| * | | | | | | | | | target: Report bad sector in sense data for DIF errorsSagi Grimberg2014-01-251-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SPC-4 states that data-integrity errors shall also report the failed sector in CHECK_CONDITION response sense data information field. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>