summaryrefslogtreecommitdiffstats
path: root/include/net/sock.h
Commit message (Collapse)AuthorAgeFilesLines
* net: Make locking in sock_bindtoindex optionalFerenc Fejes2020-06-011-1/+1
| | | | | | | | | | | | | The sock_bindtoindex intended for kernel wide usage however it will lock the socket regardless of the context. This modification relax this behavior optionally: locking the socket will be optional by calling the sock_bindtoindex with lock_sk = true. The modification applied to all users of the sock_bindtoindex. Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu
* net: add a new bind_add methodChristoph Hellwig2020-05-291-1/+5
| | | | | | | | | | | | The SCTP protocol allows to bind multiple address to a socket. That feature is currently only exposed as a socket option. Add a bind_add method struct proto that allows to bind additional addresses, and switch the dlm code to use the method instead of going through the socket option from kernel space. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_set_reuseportChristoph Hellwig2020-05-281-0/+1
| | | | | | | | Add a helper to directly set the SO_REUSEPORT sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_set_rcvbufChristoph Hellwig2020-05-281-0/+1
| | | | | | | | Add a helper to directly set the SO_RCVBUFFORCE sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_set_keepaliveChristoph Hellwig2020-05-281-0/+1
| | | | | | | | Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_enable_timestampsChristoph Hellwig2020-05-281-0/+1
| | | | | | | | | Add a helper to directly enable timestamps instead of setting the SO_TIMESTAMP* sockopts from kernel space and going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_bindtoindexChristoph Hellwig2020-05-281-0/+1
| | | | | | | | Add a helper to directly set the SO_BINDTOIFINDEX sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_set_sndtimeoChristoph Hellwig2020-05-281-0/+1
| | | | | | | | | | Add a helper to directly set the SO_SNDTIMEO_NEW sockopt from kernel space without going through a fake uaccess. The interface is simplified to only pass the seconds value, as that is the only thing needed at the moment. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_set_priorityChristoph Hellwig2020-05-281-0/+1
| | | | | | | | | Add a helper to directly set the SO_PRIORITY sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_no_lingerChristoph Hellwig2020-05-281-0/+1
| | | | | | | | | | Add a helper to directly set the SO_LINGER sockopt from kernel space with onoff set to true and a linger time of 0 without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add sock_set_reuseaddrChristoph Hellwig2020-05-281-0/+2
| | | | | | | | | | | | | Add a helper to directly set the SO_REUSEADDR sockopt from kernel space without going through a fake uaccess. For this the iscsi target now has to formally depend on inet to avoid a mostly theoretical compile failure. For actual operation it already did depend on having ipv4 or ipv6 support. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sock.h: fix skb_steal_sock() kernel-docLothar Rubusch2020-04-071-3/+3
| | | | | | | | | | Fix warnings related to kernel-doc notation, and wording in function description. Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* bpf: Don't refcount LISTEN sockets in sk_assign()Joe Stringer2020-03-301-8/+17
| | | | | | | | | | | | | Avoid taking a reference on listen sockets by checking the socket type in the sk_assign and in the corresponding skb_steal_sock() code in the the transport layer, and by ensuring that the prefetch free (sock_pfree) function uses the same logic to check whether the socket is refcounted. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
* net: Track socket refcounts in skb_steal_sock()Joe Stringer2020-03-301-1/+9
| | | | | | | | | | | Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make the determination of whether the socket is reference counted in the case where it is prefetched by earlier logic such as early_demux. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
* bpf: Add socket assign supportJoe Stringer2020-03-301-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for TPROXY via a new bpf helper, bpf_sk_assign(). This helper requires the BPF program to discover the socket via a call to bpf_sk*_lookup_*(), then pass this socket to the new helper. The helper takes its own reference to the socket in addition to any existing reference that may or may not currently be obtained for the duration of BPF processing. For the destination socket to receive the traffic, the traffic must be routed towards that socket via local route. The simplest example route is below, but in practice you may want to route traffic more narrowly (eg by CIDR): $ ip route add local default dev lo This patch avoids trying to introduce an extra bit into the skb->sk, as that would require more invasive changes to all code interacting with the socket to ensure that the bit is handled correctly, such as all error-handling cases along the path from the helper in BPF through to the orphan path in the input. Instead, we opt to use the destructor variable to switch on the prefetch of the socket. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2020-02-211-2/+35
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2020-02-21 The following pull-request contains BPF updates for your *net-next* tree. We've added 25 non-merge commits during the last 4 day(s) which contain a total of 33 files changed, 2433 insertions(+), 161 deletions(-). The main changes are: 1) Allow for adding TCP listen sockets into sock_map/hash so they can be used with reuseport BPF programs, from Jakub Sitnicki. 2) Add a new bpf_program__set_attach_target() helper for adding libbpf support to specify the tracepoint/function dynamically, from Eelco Chaudron. 3) Add bpf_read_branch_records() BPF helper which helps use cases like profile guided optimizations, from Daniel Xu. 4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu. 5) Relax BTF mandatory check if only used for libbpf itself e.g. to process BTF defined maps, from Andrii Nakryiko. 6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has been observed that former fails in envs with low memlock, from Yonghong Song. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net, sk_msg: Clear sk_user_data pointer on clone if taggedJakub Sitnicki2020-02-211-2/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk_user_data can hold a pointer to an object that is not intended to be shared between the parent socket and the child that gets a pointer copy on clone. This is the case when sk_user_data points at reference-counted object, like struct sk_psock. One way to resolve it is to tag the pointer with a no-copy flag by repurposing its lowest bit. Based on the bit-flag value we clear the child sk_user_data pointer after cloning the parent socket. The no-copy flag is stored in the pointer itself as opposed to externally, say in socket flags, to guarantee that the pointer and the flag are copied from parent to child socket in an atomic fashion. Parent socket state is subject to change while copying, we don't hold any locks at that time. This approach relies on an assumption that sk_user_data holds a pointer to an object aligned at least 2 bytes. A manual audit of existing users of rcu_dereference_sk_user_data helper confirms our assumption. Also, an RCU-protected sk_user_data is not likely to hold a pointer to a char value or a pathological case of "struct { char c; }". To be safe, warn when the flag-bit is set when setting sk_user_data to catch any future misuses. It is worth considering why clearing sk_user_data unconditionally is not an option. There exist users, DRBD, NVMe, and Xen drivers being among them, that rely on the pointer being copied when cloning the listening socket. Potentially we could distinguish these users by checking if the listening socket has been created in kernel-space via sock_create_kern, and hence has sk_kern_sock flag set. However, this is not the case for NVMe and Xen drivers, which create sockets without marking them as belonging to the kernel. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
* | net/sock.h: fix all kernel-doc warningsRandy Dunlap2020-02-161-5/+33
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix all kernel-doc warnings for <net/sock.h>. Fixes these warnings: ../include/net/sock.h:232: warning: Function parameter or member 'skc_addrpair' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_portpair' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_ipv6only' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_net_refcnt' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_v6_daddr' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_v6_rcv_saddr' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_cookie' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_listener' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_tw_dr' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_rcv_wnd' not described in 'sock_common' ../include/net/sock.h:232: warning: Function parameter or member 'skc_tw_rcv_nxt' not described in 'sock_common' ../include/net/sock.h:498: warning: Function parameter or member 'sk_rx_skb_cache' not described in 'sock' ../include/net/sock.h:498: warning: Function parameter or member 'sk_wq_raw' not described in 'sock' ../include/net/sock.h:498: warning: Function parameter or member 'tcp_rtx_queue' not described in 'sock' ../include/net/sock.h:498: warning: Function parameter or member 'sk_tx_skb_cache' not described in 'sock' ../include/net/sock.h:498: warning: Function parameter or member 'sk_route_forced_caps' not described in 'sock' ../include/net/sock.h:498: warning: Function parameter or member 'sk_txtime_report_errors' not described in 'sock' ../include/net/sock.h:498: warning: Function parameter or member 'sk_validate_xmit_skb' not described in 'sock' ../include/net/sock.h:498: warning: Function parameter or member 'sk_bpf_storage' not described in 'sock' ../include/net/sock.h:2024: warning: No description found for return value of 'sk_wmem_alloc_get' ../include/net/sock.h:2035: warning: No description found for return value of 'sk_rmem_alloc_get' ../include/net/sock.h:2046: warning: No description found for return value of 'sk_has_allocations' ../include/net/sock.h:2082: warning: No description found for return value of 'skwq_has_sleeper' ../include/net/sock.h:2244: warning: No description found for return value of 'sk_page_frag' ../include/net/sock.h:2444: warning: Function parameter or member 'tcp_rx_skb_cache_key' not described in 'DECLARE_STATIC_KEY_FALSE' ../include/net/sock.h:2444: warning: Excess function parameter 'sk' description in 'DECLARE_STATIC_KEY_FALSE' ../include/net/sock.h:2444: warning: Excess function parameter 'skb' description in 'DECLARE_STATIC_KEY_FALSE' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2020-01-231-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-01-22 The following pull-request contains BPF updates for your *net-next* tree. We've added 92 non-merge commits during the last 16 day(s) which contain a total of 320 files changed, 7532 insertions(+), 1448 deletions(-). The main changes are: 1) function by function verification and program extensions from Alexei. 2) massive cleanup of selftests/bpf from Toke and Andrii. 3) batched bpf map operations from Brian and Yonghong. 4) tcp congestion control in bpf from Martin. 5) bulking for non-map xdp_redirect form Toke. 6) bpf_send_signal_thread helper from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * xsk, net: Make sock_def_readable() have external linkageBjörn Töpel2020-01-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | XDP sockets use the default implementation of struct sock's sk_data_ready callback, which is sock_def_readable(). This function is called in the XDP socket fast-path, and involves a retpoline. By letting sock_def_readable() have external linkage, and being called directly, the retpoline can be avoided. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200120092917.13949-1-bjorn.topel@gmail.com
* | tcp: clean ext on tx recyclePaolo Abeni2020-01-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise we will find stray/unexpected/old extensions value on next iteration. On tcp_write_xmit() we can end-up splitting an already queued skb in two parts, via tso_fragment(). The newly created skb can be allocated via the tx cache and an upper layer will not be aware of it, so that upper layer cannot set the ext properly. Resetting the ext on recycle ensures that stale data is not propagated in to packet headers or elsewhere. An alternative would be add an additional hook in tso_fragment() or in sk_stream_alloc_skb() to init the ext for upper layers that need it. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sock: Make sk_protocol a 16-bit valueMat Martineau2020-01-091-20/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Match the 16-bit width of skbuff->protocol. Fills an 8-bit hole so sizeof(struct sock) does not change. Also take care of BPF field access for sk_type/sk_protocol. Both of them are now outside the bitfield, so we can use load instructions without further shifting/masking. v5 -> v6: - update eBPF accessors, too (Intel's kbuild test robot) v2 -> v3: - keep 'sk_type' 2 bytes aligned (Eric) v1 -> v2: - preserve sk_pacing_shift as bit field (Eric) Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: bpf@vger.kernel.org Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Make sock protocol value checks more specificMat Martineau2020-01-091-1/+0
|/ | | | | | | | | | | SK_PROTOCOL_MAX is only used in two places, for DECNet and AX.25. The limits have more to do with the those protocol definitions than they do with the data type of sk_protocol, so remove SK_PROTOCOL_MAX and use U8_MAX directly. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2019-12-221-2/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso, including adding a missing ipv6 match description. 2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi Bhat. 3) Fix uninit value in bond_neigh_init(), from Eric Dumazet. 4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold. 5) Fix use after free in tipc_disc_rcv(), from Tuong Lien. 6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul Chaignon. 7) Multicast MAC limit test is off by one in qede, from Manish Chopra. 8) Fix established socket lookup race when socket goes from TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening RCU grace period. From Eric Dumazet. 9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet. 10) Fix active backup transition after link failure in bonding, from Mahesh Bandewar. 11) Avoid zero sized hash table in gtp driver, from Taehee Yoo. 12) Fix wrong interface passed to ->mac_link_up(), from Russell King. 13) Fix DSA egress flooding settings in b53, from Florian Fainelli. 14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost. 15) Fix double free in dpaa2-ptp code, from Ioana Ciornei. 16) Reject invalid MTU values in stmmac, from Jose Abreu. 17) Fix refcount leak in error path of u32 classifier, from Davide Caratti. 18) Fix regression causing iwlwifi firmware crashes on boot, from Anders Kaseorg. 19) Fix inverted return value logic in llc2 code, from Chan Shu Tak. 20) Disable hardware GRO when XDP is attached to qede, frm Manish Chopra. 21) Since we encode state in the low pointer bits, dst metrics must be at least 4 byte aligned, which is not necessarily true on m68k. Add annotations to fix this, from Geert Uytterhoeven. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits) sfc: Include XDP packet headroom in buffer step size. sfc: fix channel allocation with brute force net: dst: Force 4-byte alignment of dst_metrics selftests: pmtu: fix init mtu value in description hv_netvsc: Fix unwanted rx_table reset net: phy: ensure that phy IDs are correctly typed mod_devicetable: fix PHY module format qede: Disable hardware gro when xdp prog is installed net: ena: fix issues in setting interrupt moderation params in ethtool net: ena: fix default tx interrupt moderation interval net/smc: unregister ib devices in reboot_event net: stmmac: platform: Fix MDIO init for platforms without PHY llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c) net: hisilicon: Fix a BUG trigered by wrong bytes_compl net: dsa: ksz: use common define for tag len s390/qeth: don't return -ENOTSUPP to userspace s390/qeth: fix promiscuous mode after reset s390/qeth: handle error due to unsupported transport mode cxgb4: fix refcount init for TC-MQPRIO offload tc-testing: initial tdc selftests for cls_u32 ...
| * net: annotate lockless accesses to sk->sk_pacing_shiftEric Dumazet2019-12-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_pacing_shift can be read and written without lock synchronization. This patch adds annotations to document this fact and avoid future syzbot complains. This might also avoid unexpected false sharing in sk_pacing_shift_update(), as the compiler could remove the conditional check and always write over sk->sk_pacing_shift : if (sk->sk_pacing_shift != val) sk->sk_pacing_shift = val; Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp/dccp: fix possible race __inet_lookup_established()Eric Dumazet2019-12-131-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Michal Kubecek and Firo Yang did a very nice analysis of crashes happening in __inet_lookup_established(). Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN (via a close()/socket()/listen() cycle) without a RCU grace period, I should not have changed listeners linkage in their hash table. They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt), so that a lookup can detect a socket in a hash list was moved in another one. Since we added code in commit d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix"), we have to add hlist_nulls_add_tail_rcu() helper. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Michal Kubecek <mkubecek@suse.cz> Reported-by: Firo Yang <firo.yang@suse.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/ Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
* | treewide: Use sizeof_field() macroPankaj Bharadiya2019-12-091-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except at places where these are defined. Later patches will remove the unused definition of FIELD_SIZEOF(). This patch is generated using following script: EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h" git grep -l -e "\bFIELD_SIZEOF\b" | while read file; do if [[ "$file" =~ $EXCLUDE_FILES ]]; then continue fi sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file; done Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com> Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: David Miller <davem@davemloft.net> # for net
* Merge branch 'locking-core-for-linus' of ↵Linus Torvalds2019-11-261-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - A comprehensive rewrite of the robust/PI futex code's exit handling to fix various exit races. (Thomas Gleixner et al) - Rework the generic REFCOUNT_FULL implementation using atomic_fetch_* operations so that the performance impact of the cmpxchg() loops is mitigated for common refcount operations. With these performance improvements the generic implementation of refcount_t should be good enough for everybody - and this got confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and REFCOUNT_FULL entirely, leaving the generic implementation enabled unconditionally. (Will Deacon) - Other misc changes, fixes, cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) lkdtm: Remove references to CONFIG_REFCOUNT_FULL locking/refcount: Remove unused 'refcount_error_report()' function locking/refcount: Consolidate implementations of refcount_t locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions locking/refcount: Move saturation warnings out of line locking/refcount: Improve performance of generic REFCOUNT_FULL code locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header locking/refcount: Remove unused refcount_*_checked() variants locking/refcount: Ensure integer operands are treated as signed locking/refcount: Define constants for saturation and max refcount values futex: Prevent exit livelock futex: Provide distinct return value when owner is exiting futex: Add mutex around futex exit futex: Provide state handling for exec() as well futex: Sanitize exit state handling futex: Mark the begin of futex exit explicitly futex: Set task::futex_state to DEAD right after handling futex exit futex: Split futex_mm_release() for exit/exec exit/exec: Seperate mm_release() futex: Replace PF_EXITPIDONE with a state ...
| * locking/lockdep: Remove unused @nested argument from lock_release()Qian Cai2019-10-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the following commit: b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release") @nested is no longer used in lock_release(), so remove it from all lock_release() calls and friends. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: airlied@linux.ie Cc: akpm@linux-foundation.org Cc: alexander.levin@microsoft.com Cc: daniel@iogearbox.net Cc: davem@davemloft.net Cc: dri-devel@lists.freedesktop.org Cc: duyuyang@gmail.com Cc: gregkh@linuxfoundation.org Cc: hannes@cmpxchg.org Cc: intel-gfx@lists.freedesktop.org Cc: jack@suse.com Cc: jlbec@evilplan.or Cc: joonas.lahtinen@linux.intel.com Cc: joseph.qi@linux.alibaba.com Cc: jslaby@suse.com Cc: juri.lelli@redhat.com Cc: maarten.lankhorst@linux.intel.com Cc: mark@fasheh.com Cc: mhocko@kernel.org Cc: mripard@kernel.org Cc: ocfs2-devel@oss.oracle.com Cc: rodrigo.vivi@intel.com Cc: sean@poorly.run Cc: st@kernel.org Cc: tj@kernel.org Cc: tytso@mit.edu Cc: vdavydov.dev@gmail.com Cc: vincent.guittot@linaro.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2019-11-091-2/+2
|\ \ | | | | | | | | | | | | | | | | | | One conflict in the BPF samples Makefile, some fixes in 'net' whilst we were converting over to Makefile.target rules in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: prevent load/store tearing on sk->sk_stampEric Dumazet2019-11-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a couple of READ_ONCE() and WRITE_ONCE() to prevent load-tearing and store-tearing in sock_read_timestamp() and sock_write_timestamp() This might prevent another KCSAN report. Fixes: 3a0ed3e96197 ("sock: Make sock->sk_stamp thread-safe") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: silence data-races on sk_backlog.tailEric Dumazet2019-11-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_backlog.tail might be read without holding the socket spinlock, we need to add proper READ_ONCE()/WRITE_ONCE() to silence the warnings. KCSAN reported : BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg write to 0xffff8881265109f8 of 8 bytes by interrupt on cpu 1: __sk_add_backlog include/net/sock.h:907 [inline] sk_add_backlog include/net/sock.h:938 [inline] tcp_add_backlog+0x476/0xce0 net/ipv4/tcp_ipv4.c:1759 tcp_v4_rcv+0x1a70/0x1bd0 net/ipv4/tcp_ipv4.c:1947 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:4929 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5043 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5133 napi_skb_finish net/core/dev.c:5596 [inline] napi_gro_receive+0x28f/0x330 net/core/dev.c:5629 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061 virtnet_receive drivers/net/virtio_net.c:1323 [inline] virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428 napi_poll net/core/dev.c:6311 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6379 __do_softirq+0x115/0x33f kernel/softirq.c:292 invoke_softirq kernel/softirq.c:373 [inline] irq_exit+0xbb/0xe0 kernel/softirq.c:413 exiting_irq arch/x86/include/asm/apic.h:536 [inline] do_IRQ+0xa6/0x180 arch/x86/kernel/irq.c:263 ret_from_intr+0x0/0x19 native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71 arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94 cpuidle_idle_call kernel/sched/idle.c:154 [inline] do_idle+0x1af/0x280 kernel/sched/idle.c:263 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355 start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241 read to 0xffff8881265109f8 of 8 bytes by task 8057 on cpu 0: tcp_recvmsg+0x46e/0x1b40 net/ipv4/tcp.c:2050 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838 sock_recvmsg_nosec net/socket.c:871 [inline] sock_recvmsg net/socket.c:889 [inline] sock_recvmsg+0x92/0xb0 net/socket.c:885 sock_read_iter+0x15f/0x1e0 net/socket.c:967 call_read_iter include/linux/fs.h:1889 [inline] new_sync_read+0x389/0x4f0 fs/read_write.c:414 __vfs_read+0xb1/0xc0 fs/read_write.c:427 vfs_read fs/read_write.c:461 [inline] vfs_read+0x143/0x2c0 fs/read_write.c:446 ksys_read+0xd5/0x1b0 fs/read_write.c:587 __do_sys_read fs/read_write.c:597 [inline] __se_sys_read fs/read_write.c:595 [inline] __x64_sys_read+0x4c/0x60 fs/read_write.c:595 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 8057 Comm: syz-fuzzer Not tainted 5.4.0-rc6+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: annotate lockless accesses to sk->sk_max_ack_backlogEric Dumazet2019-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_max_ack_backlog can be read without any lock being held at least in TCP/DCCP cases. We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing and/or potential KCSAN warnings. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: annotate lockless accesses to sk->sk_ack_backlogEric Dumazet2019-11-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_ack_backlog can be read without any lock being held. We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing and/or potential KCSAN warnings. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: avoid potential false sharing in neighbor related codeEric Dumazet2019-11-061-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are common instances of the following construct : if (n->confirmed != now) n->confirmed = now; A C compiler could legally remove the conditional. Use READ_ONCE()/WRITE_ONCE() to avoid this problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2019-11-021-5/+10
|\| | | | | | | | | | | | | | | | | | | | | | | | | | The only slightly tricky merge conflict was the netdevsim because the mutex locking fix overlapped a lot of driver reload reorganization. The rest were (relatively) trivial in nature. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: annotate accesses to sk->sk_incoming_cpuEric Dumazet2019-10-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This socket field can be read and written by concurrent cpus. Use READ_ONCE() and WRITE_ONCE() annotations to document this, and avoid some compiler 'optimizations'. KCSAN reported : BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0: sk_incoming_cpu_update include/net/sock.h:953 [inline] tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337 do_softirq kernel/softirq.c:329 [inline] __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189 read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1: sk_incoming_cpu_update include/net/sock.h:952 [inline] tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: fix sk_page_frag() recursion from memory reclaimTejun Heo2019-10-281-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk_page_frag() optimizes skb_frag allocations by using per-task skb_frag cache when it knows it's the only user. The condition is determined by seeing whether the socket allocation mask allows blocking - if the allocation may block, it obviously owns the task's context and ergo exclusively owns current->task_frag. Unfortunately, this misses recursion through memory reclaim path. Please take a look at the following backtrace. [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10 ... tcp_sendmsg+0x27/0x40 sock_sendmsg+0x30/0x40 sock_xmit.isra.24+0xa1/0x170 [nbd] nbd_send_cmd+0x1d2/0x690 [nbd] nbd_queue_rq+0x1b5/0x3b0 [nbd] __blk_mq_try_issue_directly+0x108/0x1b0 blk_mq_request_issue_directly+0xbd/0xe0 blk_mq_try_issue_list_directly+0x41/0xb0 blk_mq_sched_insert_requests+0xa2/0xe0 blk_mq_flush_plug_list+0x205/0x2a0 blk_flush_plug_list+0xc3/0xf0 [1] blk_finish_plug+0x21/0x2e _xfs_buf_ioapply+0x313/0x460 __xfs_buf_submit+0x67/0x220 xfs_buf_read_map+0x113/0x1a0 xfs_trans_read_buf_map+0xbf/0x330 xfs_btree_read_buf_block.constprop.42+0x95/0xd0 xfs_btree_lookup_get_block+0x95/0x170 xfs_btree_lookup+0xcc/0x470 xfs_bmap_del_extent_real+0x254/0x9a0 __xfs_bunmapi+0x45c/0xab0 xfs_bunmapi+0x15/0x30 xfs_itruncate_extents_flags+0xca/0x250 xfs_free_eofblocks+0x181/0x1e0 xfs_fs_destroy_inode+0xa8/0x1b0 destroy_inode+0x38/0x70 dispose_list+0x35/0x50 prune_icache_sb+0x52/0x70 super_cache_scan+0x120/0x1a0 do_shrink_slab+0x120/0x290 shrink_slab+0x216/0x2b0 shrink_node+0x1b6/0x4a0 do_try_to_free_pages+0xc6/0x370 try_to_free_mem_cgroup_pages+0xe3/0x1e0 try_charge+0x29e/0x790 mem_cgroup_charge_skmem+0x6a/0x100 __sk_mem_raise_allocated+0x18e/0x390 __sk_mem_schedule+0x2a/0x40 [0] tcp_sendmsg_locked+0x8eb/0xe10 tcp_sendmsg+0x27/0x40 sock_sendmsg+0x30/0x40 ___sys_sendmsg+0x26d/0x2b0 __sys_sendmsg+0x57/0xa0 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In [0], tcp_send_msg_locked() was using current->page_frag when it called sk_wmem_schedule(). It already calculated how many bytes can be fit into current->page_frag. Due to memory pressure, sk_wmem_schedule() called into memory reclaim path which called into xfs and then IO issue path. Because the filesystem in question is backed by nbd, the control goes back into the tcp layer - back into tcp_sendmsg_locked(). nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes sense - it's in the process of freeing memory and wants to be able to, e.g., drop clean pages to make forward progress. However, this confused sk_page_frag() called from [2]. Because it only tests whether the allocation allows blocking which it does, it now thinks current->page_frag can be used again although it already was being used in [0]. After [2] used current->page_frag, the offset would be increased by the used amount. When the control returns to [0], current->page_frag's offset is increased and the previously calculated number of bytes now may overrun the end of allocated memory leading to silent memory corruptions. Fix it by adding gfpflags_normal_context() which tests sleepable && !reclaim and use it to determine whether to use current->task_frag. v2: Eric didn't like gfp flags being tested twice. Introduce a new helper gfpflags_normal_context() and combine the two tests. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net/smc: remove unneeded include for smc.hUrsula Braun2019-10-291-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only smc-related reference in net/sock.h is struct smc_hashinfo. But just its address is refered to. Thus there is no need for the include of net/smc.h. Remove it. Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2019-10-201-11/+22
|\| | | | | | | | | | | | | | | | | | | | Several cases of overlapping changes which were for the most part trivially resolvable. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: annotate sk->sk_wmem_queued lockless readsEric Dumazet2019-10-131-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the sake of tcp_poll(), there are few places where we fetch sk->sk_wmem_queued while this field can change from IRQ or other cpu. We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing. sk_wmem_queued_add() helper is added so that we can in the future convert to ADD_ONCE() or equivalent if/when available. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: annotate sk->sk_sndbuf lockless readsEric Dumazet2019-10-131-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the sake of tcp_poll(), there are few places where we fetch sk->sk_sndbuf while this field can change from IRQ or other cpu. We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing. Note that other transports probably need similar fixes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: annotate sk->sk_rcvlowat lockless readsEric Dumazet2019-10-091-1/+3
| |/ | | | | | | | | | | | | | | | | | | | | | | | | sock_rcvlowat() or int_sk_rcvlowat() might be called without the socket lock for example from tcp_poll(). Use READ_ONCE() to document the fact that other cpus might change sk->sk_rcvlowat under us and avoid KCSAN splats. Use WRITE_ONCE() on write sides too. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
* / net: spread "enum sock_flags"Alexey Dobriyan2019-10-041-1/+1
|/ | | | | | | Some ints are "enum sock_flags" in fact. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: prevent skb_orphan() from leaking TLS plain text with offloadJakub Kicinski2019-08-081-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk_validate_xmit_skb() and drivers depend on the sk member of struct sk_buff to identify segments requiring encryption. Any operation which removes or does not preserve the original TLS socket such as skb_orphan() or skb_clone() will cause clear text leaks. Make the TCP socket underlying an offloaded TLS connection mark all skbs as decrypted, if TLS TX is in offload mode. Then in sk_validate_xmit_skb() catch skbs which have no socket (or a socket with no validation) and decrypted flag set. Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and sk->sk_validate_xmit_skb are slightly interchangeable right now, they all imply TLS offload. The new checks are guarded by CONFIG_TLS_DEVICE because that's the option guarding the sk_buff->decrypted member. Second, smaller issue with orphaning is that it breaks the guarantee that packets will be delivered to device queues in-order. All TLS offload drivers depend on that scheduling property. This means skb_orphan_partial()'s trick of preserving partial socket references will cause issues in the drivers. We need a full orphan, and as a result netem delay/throttling will cause all TLS offload skbs to be dropped. Reusing the sk_buff->decrypted flag also protects from leaking clear text when incoming, decrypted skb is redirected (e.g. by TC). See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP") for justification why the internal flag is safe. The only location which could leak the flag in is tcp_bpf_sendmsg(), which is taken care of by clearing the previously unused bit. v2: - remove superfluous decrypted mark copy (Willem); - remove the stale doc entry (Boris); - rely entirely on EOR marking to prevent coalescing (Boris); - use an internal sendpages flag instead of marking the socket (Boris). v3 (Willem): - reorganize the can_skb_orphan_partial() condition; - fix the flag leak-in through tcp_bpf_sendmsg. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* coallocate socket_wq with socket itselfAl Viro2019-07-081-2/+2
| | | | | | | | | | | | | | | | | | | socket->wq is assign-once, set when we are initializing both struct socket it's in and struct socket_wq it points to. As the matter of fact, the only reason for separate allocation was the ability to RCU-delay freeing of socket_wq. RCU-delaying the freeing of socket itself gets rid of that need, so we can just fold struct socket_wq into the end of struct socket and simplify the life both for sock_alloc_inode() (one allocation instead of two) and for tun/tap oddballs, where we used to embed struct socket and struct socket_wq into the same structure (now - embedding just the struct socket). Note that reference to struct socket_wq in struct sock does remain a reference - that's unchanged. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add high_order_alloc_disable sysctl/static keyEric Dumazet2019-06-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | >From linux-3.7, (commit 5640f7685831 "net: use a per task frag allocator") TCP sendmsg() has preferred using order-3 allocations. While it gives good results for most cases, we had reports that heavy uses of TCP over loopback were hitting a spinlock contention in page allocations/freeing. This commits adds a sysctl so that admins can opt-in for order-0 allocations. Hopefully mm layer might optimize order-3 allocations in the future since it could give us a nice boost (see 8 lines of following benchmark) The following benchmark shows a win when more than 8 TCP_STREAM threads are running (56 x86 cores server in my tests) for thr in {1..30} do sysctl -wq net.core.high_order_alloc_disable=0 T0=`./super_netperf $thr -H 127.0.0.1 -l 15` sysctl -wq net.core.high_order_alloc_disable=1 T1=`./super_netperf $thr -H 127.0.0.1 -l 15` echo $thr:$T0:$T1 done 1: 49979: 37267 2: 98745: 76286 3: 141088: 110051 4: 177414: 144772 5: 197587: 173563 6: 215377: 208448 7: 241061: 234087 8: 267155: 263373 9: 295069: 297402 10: 312393: 335213 11: 340462: 368778 12: 371366: 403954 13: 412344: 443713 14: 426617: 473580 15: 474418: 507861 16: 503261: 538539 17: 522331: 563096 18: 532409: 567084 19: 550824: 605240 20: 525493: 641988 21: 564574: 665843 22: 567349: 690868 23: 583846: 710917 24: 588715: 736306 25: 603212: 763494 26: 604083: 792654 27: 602241: 796450 28: 604291: 797993 29: 611610: 833249 30: 577356: 841062 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add tcp_tx_skb_cache sysctlEric Dumazet2019-06-141-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Feng Tang reported a performance regression after introduction of per TCP socket tx/rx caches, for TCP over loopback (netperf) There is high chance the regression is caused by a change on how well the 32 KB per-thread page (current->task_frag) can be recycled, and lack of pcp caches for order-3 pages. I could not reproduce the regression myself, cpus all being spinning on the mm spinlocks for page allocs/freeing, regardless of enabling or disabling the per tcp socket caches. It seems best to disable the feature by default, and let admins enabling it. MM layer either needs to provide scalable order-3 pages allocations, or could attempt a trylock on zone->lock if the caller only attempts to get a high-order page and is able to fallback to order-0 ones in case of pressure. Tests run on a 56 cores host (112 hyper threads) - 35.49% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath - 35.49% queued_spin_lock_slowpath - 18.18% get_page_from_freelist - __alloc_pages_nodemask - 18.18% alloc_pages_current skb_page_frag_refill sk_page_frag_refill tcp_sendmsg_locked tcp_sendmsg inet_sendmsg sock_sendmsg __sys_sendto __x64_sys_sendto do_syscall_64 entry_SYSCALL_64_after_hwframe __libc_send + 17.31% __free_pages_ok + 31.43% swapper [kernel.vmlinux] [k] intel_idle + 9.12% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string + 6.53% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string + 0.69% netserver [kernel.vmlinux] [k] queued_spin_lock_slowpath + 0.68% netperf [kernel.vmlinux] [k] skb_release_data + 0.52% netperf [kernel.vmlinux] [k] tcp_sendmsg_locked 0.46% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Feng Tang <feng.tang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add tcp_rx_skb_cache sysctlEric Dumazet2019-06-141-4/+2
| | | | | | | | | | Instead of relying on rps_needed, it is safer to use a separate static key, since we do not want to enable TCP rx_skb_cache by default. This feature can cause huge increase of memory usage on hosts with millions of sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner2019-05-301-6/+1
| | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>