| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As Shakeel explains the commit under Fixes had the unintended
side-effect of no longer pre-loading the cached memory allowance.
Even tho we previously dropped the first packet received when
over memory limit - the consecutive ones would get thru by using
the cache. The charging was happening in batches of 128kB, so
we'd let in 128kB (truesize) worth of packets per one drop.
After the change we no longer force charge, there will be no
cache filling side effects. This causes significant drops and
connection stalls for workloads which use a lot of page cache,
since we can't reclaim page cache under GFP_NOWAIT.
Some of the latency can be recovered by improving SACK reneg
handling but nowhere near enough to get back to the pre-5.15
performance (the application I'm experimenting with still
sees 5-10x worst latency).
Apply the suggested workaround of using GFP_ATOMIC. We will now
be more permissive than previously as we'll drop _no_ packets
in softirq when under pressure. But I can't think of any good
and simple way to address that within networking.
Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
Suggested-by: Shakeel Butt <shakeelb@google.com>
Fixes: 4b1327be9fe5 ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
|
|
|
|
|
|
|
| |
All uses of sk_nulls_node_init() have
been removed since commit dbca1596bbb0 ("ping: convert to RCU
lookups, get rid of rwlock"), so remove it.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-09-05
The following pull-request contains BPF updates for your *net-next* tree.
We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).
There are two small merge conflicts, resolve them as follows:
1) tools/testing/selftests/bpf/DENYLIST.s390x
Commit 27e23836ce22 ("selftests/bpf: Add lru_bug to s390x deny list") in
bpf tree was needed to get BPF CI green on s390x, but it conflicted with
newly added tests on bpf-next. Resolve by adding both hunks, result:
[...]
lru_bug # prog 'printk': failed to auto-attach: -524
setget_sockopt # attach unexpected error: -524 (trampoline)
cb_refs # expected error message unexpected error: -524 (trampoline)
cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc)
htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline)
[...]
2) net/core/filter.c
Commit 1227c1771dd2 ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
from net tree conflicts with commit 29003875bd5b ("bpf: Change bpf_setsockopt(SOL_SOCKET)
to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
bpf-next tree, result:
[...]
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
return sk_getsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval),
KERNEL_SOCKPTR(optlen));
}
return sk_setsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval), *optlen);
[...]
The main changes are:
1) Add any-context BPF specific memory allocator which is useful in particular for BPF
tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov.
2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
to reuse the existing core socket code as much as possible, from Martin KaFai Lau.
3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector
with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani.
4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo.
5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
integration with the rstat framework, from Yosry Ahmed.
6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev.
7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao.
8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
backend, from James Hilliard.
9) Fix verifier helper permissions and reference state management for synchronous
callbacks, from Kumar Kartikeya Dwivedi.
10) Add support for BPF selftest's xskxceiver to also be used against real devices that
support MAC loopback, from Maciej Fijalkowski.
11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.
12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.
13) Various minor misc improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bpf: Remove usage of kmem_cache from bpf_mem_cache.
bpf: Remove prealloc-only restriction for sleepable bpf programs.
bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
bpf: Remove tracing program restriction on map types
bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
bpf: Add percpu allocation support to bpf_mem_alloc.
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
bpf: Adjust low/high watermarks in bpf_mem_cache
bpf: Optimize call_rcu in non-preallocated hash map.
bpf: Optimize element count in non-preallocated hash map.
bpf: Relax the requirement to use preallocated hash maps in tracing progs.
samples/bpf: Reduce syscall overhead in map_perf_test.
selftests/bpf: Improve test coverage of test_maps
bpf: Convert hash map to bpf_mem_alloc.
bpf: Introduce any context BPF specific memory allocator.
selftest/bpf: Add test for bpf_getsockopt()
bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
...
====================
Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch changes bpf_getsockopt(SOL_SOCKET) to reuse
sk_getsockopt(). It removes all duplicated code from
bpf_getsockopt(SOL_SOCKET).
Before this patch, there were some optnames available to
bpf_setsockopt(SOL_SOCKET) but missing in bpf_getsockopt(SOL_SOCKET).
It surprises users from time to time. For example, SO_REUSEADDR,
SO_KEEPALIVE, SO_RCVLOWAT, and SO_MAX_PACING_RATE. This patch
automatically closes this gap without duplicating more code.
The only exception is SO_BINDTODEVICE because it needs to acquire a
blocking lock. Thus, SO_BINDTODEVICE is not supported.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002912.2894040-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
After the prep work in the previous patches,
this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET)
and reuses them from sk_setsockopt().
The sock ptr test is added to the SO_RCVLOWAT because
the sk->sk_socket could be NULL in some of the bpf hooks.
The existing optname white-list is refactored into a new
function sol_socket_setsockopt().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061804.4178920-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
sk_setsockopt()
When bpf program calling bpf_setsockopt(SOL_SOCKET),
it could be run in softirq and doesn't make sense to do the capable
check. There was a similar situation in bpf_setsockopt(TCP_CONGESTION).
In commit 8d650cdedaab ("tcp: fix tcp_set_congestion_control() use from bpf hook"),
tcp_set_congestion_control(..., cap_net_admin) was added to skip
the cap check for bpf prog.
This patch adds sockopt_ns_capable() and sockopt_capable() for
the sk_setsockopt() to use. They will consider the
has_current_bpf_ctx() before doing the ns_capable() and capable() test.
They are in EXPORT_SYMBOL for the ipv6 module to use in a latter patch.
Suggested-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061723.4175820-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Most of the code in bpf_setsockopt(SOL_SOCKET) are duplicated from
the sk_setsockopt(). The number of supported optnames are
increasing ever and so as the duplicated code.
One issue in reusing sk_setsockopt() is that the bpf prog
has already acquired the sk lock. This patch adds a
has_current_bpf_ctx() to tell if the sk_setsockopt() is called from
a bpf prog. The bpf prog calling bpf_setsockopt() is either running
in_task() or in_serving_softirq(). Both cases have the current->bpf_ctx
initialized. Thus, the has_current_bpf_ctx() only needs to
test !!current->bpf_ctx.
This patch also adds sockopt_{lock,release}_sock() helpers
for sk_setsockopt() to use. These helpers will test
has_current_bpf_ctx() before acquiring/releasing the lock. They are
in EXPORT_SYMBOL for the ipv6 module to use in a latter patch.
Note on the change in sock_setbindtodevice(). sockopt_lock_sock()
is done in sock_setbindtodevice() instead of doing the lock_sock
in sock_bindtoindex(..., lock_sk = true).
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.
This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.
Please note a few things:
* There can be the case where the a socket's address changes after it
has been bound. There are two cases where this happens:
1) The case where there is a bind() call on INADDR_ANY (ipv4) or
IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
assign the socket an address when it handles the connect()
2) In inet_sk_reselect_saddr(), which is called when rebuilding the
sk header and a few pre-conditions are met (eg rerouting fails).
In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.
* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.
This brings up a few stipulations:
1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
will always be acquired after the bhash lock and released before the
bhash lock is released.
2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
acquired+released before another bhash2 lock is acquired+released.
* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags()
to obtain the value of sk->sk_user_data, but that function is only usable
if the RCU read lock is held, and neither that function nor any of its
callers hold it.
Fix this by adding a new helper, __locked_read_sk_user_data_with_flags()
that checks to see if sk->sk_callback_lock() is held and use that here
instead.
Alternatively, making __rcu_dereference_sk_user_data_with_flags() use
rcu_dereference_checked() might suffice.
Without this, the following warning can be occasionally observed:
=============================
WARNING: suspicious RCU usage
6.0.0-rc1-build2+ #563 Not tainted
-----------------------------
include/net/sock.h:592 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
5 locks held by locktest/29873:
#0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121
#1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70
#2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0
#3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd
#4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4
stack backtrace:
CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4c/0x5f
bpf_sk_reuseport_detach+0x6d/0xa4
reuseport_detach_sock+0x75/0xdd
inet_unhash+0xa5/0x1c0
tcp_set_state+0x169/0x20f
? lockdep_sock_is_held+0x3a/0x3a
? __lock_release.isra.0+0x13e/0x220
? reacquire_held_locks+0x1bb/0x1bb
? hlock_class+0x31/0x96
? mark_lock+0x9e/0x1af
__tcp_close+0x50/0x4b6
tcp_close+0x28/0x70
inet_release+0x8e/0xa7
__sock_release+0x95/0x121
sock_close+0x14/0x17
__fput+0x20f/0x36a
task_work_run+0xa3/0xcc
exit_to_user_mode_prepare+0x9c/0x14d
syscall_exit_to_user_mode+0x18/0x44
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: cf8c1e967224 ("net: refactor bpf_sk_reuseport_detach()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hawkins Jiawei <yin31149@gmail.com>
Link: https://lore.kernel.org/r/166064248071.3502205.10036394558814861778.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Syzkaller reports refcount bug as follows:
------------[ cut here ]------------
refcount_t: saturated; leaking memory.
WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
Modules linked in:
CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0
<TASK>
__refcount_add_not_zero include/linux/refcount.h:163 [inline]
__refcount_inc_not_zero include/linux/refcount.h:227 [inline]
refcount_inc_not_zero include/linux/refcount.h:245 [inline]
sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
sk_backlog_rcv include/net/sock.h:1061 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2849
release_sock+0x54/0x1b0 net/core/sock.c:3404
inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
__sys_shutdown_sock net/socket.c:2331 [inline]
__sys_shutdown_sock net/socket.c:2325 [inline]
__sys_shutdown+0xf1/0x1b0 net/socket.c:2343
__do_sys_shutdown net/socket.c:2351 [inline]
__se_sys_shutdown net/socket.c:2349 [inline]
__x64_sys_shutdown+0x50/0x70 net/socket.c:2349
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
During SMC fallback process in connect syscall, kernel will
replaces TCP with SMC. In order to forward wakeup
smc socket waitqueue after fallback, kernel will sets
clcsk->sk_user_data to origin smc socket in
smc_fback_replace_callbacks().
Later, in shutdown syscall, kernel will calls
sk_psock_get(), which treats the clcsk->sk_user_data
as psock type, triggering the refcnt warning.
So, the root cause is that smc and psock, both will use
sk_user_data field. So they will mismatch this field
easily.
This patch solves it by using another bit(defined as
SK_USER_DATA_PSOCK) in PTRMASK, to mark whether
sk_user_data points to a psock object or not.
This patch depends on a PTRMASK introduced in commit f1ff5ce2cd5e
("net, sk_msg: Clear sk_user_data pointer on clone if tagged").
For there will possibly be more flags in the sk_user_data field,
this patch also refactor sk_user_data flags code to be more generic
to improve its maintainability.
Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\
| |
| |
| |
| |
| | |
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
While reading these sysctl variables, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.
- .sysctl_rmem
- .sysctl_rwmem
- .sysctl_rmem_offset
- .sysctl_wmem_offset
- sysctl_tcp_rmem[1, 2]
- sysctl_tcp_wmem[1, 2]
- sysctl_decnet_rmem[1]
- sysctl_decnet_wmem[1]
- sysctl_tipc_rmem[1]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
include/net/sock.h
310731e2f161 ("net: Fix data-races around sysctl_mem.")
e70f3c701276 ("Revert "net: set SK_MEM_QUANTUM to 4096"")
https://lore.kernel.org/all/20220711120211.7c8b7cba@canb.auug.org.au/
net/ipv4/fib_semantics.c
747c14307214 ("ip: fix dflt addr selection for connected nexthop")
d62607c3fe45 ("net: rename reference+tracking helpers")
net/tls/tls.h
include/net/tls.h
3d8c51b25a23 ("net/tls: Check for errors in tls_device_init")
587903142308 ("tls: create an internal header")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
While reading .sysctl_mem, it can be changed concurrently.
So, we need to add READ_ONCE() to avoid data-races.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There are no more users for the mentioned macros, just
drop them.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Move the len fields manipulation in the skbs to a helper function.
There is a comment specifically requesting this and there are several
other areas in the code displaying the same pattern which can be
refactored.
This improves code readability.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Link: https://lore.kernel.org/r/20220622160853.GA6478@debian
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\|
| |
| |
| |
| |
| | |
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts:
commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry")
Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/
There are a few things that need to be fixed here:
* Updating bhash2 in cases where the socket's rcv saddr changes
* Adding bhash2 hashbucket locks
Links to syzbot reports:
https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/
https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com
Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
sk_memory_allocated_add() has three callers, and returns
to them @memory_allocated.
sk_forced_mem_schedule() is one of them, and ignores
the returned value.
Change sk_memory_allocated_add() to return void.
Change sock_reserve_memory() and __sk_mem_raise_allocated()
to call sk_memory_allocated().
This removes one cache line miss [1] for RPC workloads,
as first skbs in TCP write queue and receive queue go through
sk_forced_mem_schedule().
[1] Cache line holding tcp_memory_allocated.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast.
Each TCP socket can forward allocate up to 2 MB of memory, even after
flow became less active.
10,000 sockets can have reserved 20 GB of memory,
and we have no shrinker in place to reclaim that.
Instead of trying to reclaim the extra allocations in some places,
just keep sk->sk_forward_alloc values as small as possible.
This should not impact performance too much now we have per-cpu
reserves: Changes to tcp_memory_allocated should not be too frequent.
For sockets not using SO_RESERVE_MEM:
- idle sockets (no packets in tx/rx queues) have zero forward alloc.
- non idle sockets have a forward alloc smaller than one page.
Note:
- Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD
is left to MPTCP maintainers as a follow up.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
we want to allocate 1 byte more (rounded up to one page),
instead of 150001 :/
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We plan keeping sk->sk_forward_alloc as small as possible
in future patches.
This means we are going to call sk_memory_allocated_add()
and sk_memory_allocated_sub() more often.
Implement a per-cpu cache of +1/-1 MB, to reduce number
of changes to sk->sk_prot->memory_allocated, which
would otherwise be cause of false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Each protocol having a ->memory_allocated pointer gets a corresponding
per-cpu reserve, that following patches will use.
Instead of having reserved bytes per socket,
we want to have per-cpu reserves.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.
This might change in the future, but it seems better to avoid the
confusion.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit bd68a2a854ad5a85f0c8d0a9c8048ca3f6391efb.
This change broke memcg on arches with PAGE_SIZE != 4096
Later, commit 2bb2f5fb21b04 ("net: add new socket option SO_RESERVE_MEM")
also assumed PAGE_SIZE==SK_MEM_QUANTUM
Following patches in the series will greatly reduce the over allocations
problem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|/
|
|
|
|
|
|
|
|
|
| |
Despite these inline functions having full visibility to the compiler
at compile time, they still strip const from passed pointers.
This change allows for functions in various network drivers to be marked as
const that could not be marked const before.
Signed-off-by: Peter Lafreniere <pjlafren@mtu.edu>
Link: https://lore.kernel.org/r/20220606113458.35953-1-pjlafren@mtu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We currently have one tcp bind table (bhash) which hashes by port
number only. In the socket bind path, we check for bind conflicts by
traversing the specified port's inet_bind2_bucket while holding the
bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
In instances where there are tons of sockets hashed to the same port
at different addresses, checking for a bind conflict is time-intensive
and can cause softirq cpu lockups, as well as stops new tcp connections
since __inet_inherit_port() also contests for the spinlock.
This patch proposes adding a second bind table, bhash2, that hashes by
port and ip address. Searching the bhash2 table leads to significantly
faster conflict resolution and less time holding the spinlock.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while
this field can be changed by another thread.
Adds minimal annotations to avoid KCSAN splats for UDP.
Following patches will add more annotations to potential lockless readers.
BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg
write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0:
__ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221
ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
__sys_connect_file net/socket.c:1900 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1917
__do_sys_connect net/socket.c:1927 [inline]
__se_sys_connect net/socket.c:1924 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1924
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1:
udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0xffffff9b
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
I chose to not add Fixes: tag because race has minor consequences
and stable teams busy enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
INET_MATCH() runs without holding a lock on the socket.
We probably need to annotate most reads.
This patch makes INET_MATCH() an inline function
to ease our changes.
v2:
We remove the 32bit version of it, as modern compilers
should generate the same code really, no need to
try to be smarter.
Also make 'struct net *net' the first argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
| |
sock_alloc_send_skb() is simple and just proxying to another function,
so we can inline it and cut associated overhead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adding a new socket option, SO_RCVMARK, to indicate that SO_MARK
should be included in the ancillary data returned by recvmsg().
Renamed the sock_recv_ts_and_drops() function to sock_recv_cmsgs().
Signed-off-by: Erin MacNeil <lnx.erin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20220427200259.2564-1-lnx.erin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.
But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.
For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.
For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.
Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.
This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.
This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.
In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.
Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.
Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)
Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.
10 runs of one TCP_STREAM flow
Before:
Average throughput: 49685 Mbit.
Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)
57.81% [kernel] [k] copy_user_enhanced_fast_string
(*) 12.87% [kernel] [k] skb_release_data
(*) 4.25% [kernel] [k] __free_one_page
(*) 3.57% [kernel] [k] __list_del_entry_valid
1.85% [kernel] [k] __netif_receive_skb_core
1.60% [kernel] [k] __skb_datagram_iter
(*) 1.59% [kernel] [k] free_unref_page_commit
(*) 1.16% [kernel] [k] __slab_free
1.16% [kernel] [k] _copy_to_iter
(*) 1.01% [kernel] [k] kfree
(*) 0.88% [kernel] [k] free_unref_page
0.57% [kernel] [k] ip6_rcv_core
0.55% [kernel] [k] ip6t_do_table
0.54% [kernel] [k] flush_smp_call_function_queue
(*) 0.54% [kernel] [k] free_pcppages_bulk
0.51% [kernel] [k] llist_reverse_order
0.38% [kernel] [k] process_backlog
(*) 0.38% [kernel] [k] free_pcp_prepare
0.37% [kernel] [k] tcp_recvmsg_locked
(*) 0.37% [kernel] [k] __list_add_valid
0.34% [kernel] [k] sock_rfree
0.34% [kernel] [k] _raw_spin_lock_irq
(*) 0.33% [kernel] [k] __page_cache_release
0.33% [kernel] [k] tcp_v6_rcv
(*) 0.33% [kernel] [k] __put_page
(*) 0.29% [kernel] [k] __mod_zone_page_state
0.27% [kernel] [k] _raw_spin_lock
After patch:
Average throughput: 73076 Mbit.
Kernel profiles on cpu running user thread recvmsg() looks better:
81.35% [kernel] [k] copy_user_enhanced_fast_string
1.95% [kernel] [k] _copy_to_iter
1.95% [kernel] [k] __skb_datagram_iter
1.27% [kernel] [k] __netif_receive_skb_core
1.03% [kernel] [k] ip6t_do_table
0.60% [kernel] [k] sock_rfree
0.50% [kernel] [k] tcp_v6_rcv
0.47% [kernel] [k] ip6_rcv_core
0.45% [kernel] [k] read_tsc
0.44% [kernel] [k] _raw_spin_lock_irqsave
0.37% [kernel] [k] _raw_spin_lock
0.37% [kernel] [k] native_irq_return_iret
0.33% [kernel] [k] __inet6_lookup_established
0.31% [kernel] [k] ip6_protocol_deliver_rcu
0.29% [kernel] [k] tcp_rcv_established
0.29% [kernel] [k] llist_reverse_order
v2: kdoc issue (kernel bots)
do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
replace the sk_buff_head with a single-linked list (Jakub)
add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().
Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
or in
err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
instead of simply using only flags all the time and check for MSG_DONTWAIT
where needed (to preserve for the formerly separated no(n)block condition).
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In order to report the reasons of skb drops in 'sock_queue_rcv_skb()',
introduce the function 'sock_queue_rcv_skb_reason()'.
As the return value of 'sock_queue_rcv_skb()' is used as the error code,
we can't make it as drop reason and have to pass extra output argument.
'sock_queue_rcv_skb()' is used in many places, so we can't change it
directly.
Introduce the new function 'sock_queue_rcv_skb_reason()' and make
'sock_queue_rcv_skb()' an inline call to it.
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
tools/testing/selftests/net/mptcp/mptcp_join.sh
34aa6e3bccd8 ("selftests: mptcp: add ip mptcp wrappers")
857898eb4b28 ("selftests: mptcp: add missing join check")
6ef84b1517e0 ("selftests: mptcp: more robust signal race test")
https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/
drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h
drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c
fb7e76ea3f3b6 ("net/mlx5e: TC, Skip redundant ct clear actions")
c63741b426e11 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information")
09bf97923224f ("net/mlx5e: TC, Move pedit_headers_action to parse_attr")
84ba8062e383 ("net/mlx5e: Test CT and SAMPLE on flow attr")
efe6f961cd2e ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow")
3b49a7edec1d ("net/mlx5e: TC, Reject rules with multiple CT actions")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
UDP sendmsg() can be lockless, this is causing all kinds
of data races.
This patch converts sk->sk_tskey to remove one of these races.
BUG: KCSAN: data-race in __ip_append_data / __ip_append_data
read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
__ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
__ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000054d -> 0x0000054e
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add the SO_TXREHASH socket option to control hash rethink behavior per socket.
When default mode is set, sockets disable rehash at initialization and use
sysctl option when entering listen state. setsockopt() overrides default
behavior.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/
|
|
|
|
|
|
|
|
| |
Add a per ns sysctl that controls the txhash rethink behavior:
net.core.txrehash. When enabled, the same behavior is retained,
when disabled, rethink is not performed. Sysctl is enabled by default.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-12-30
The following pull-request contains BPF updates for your *net-next* tree.
We've added 72 non-merge commits during the last 20 day(s) which contain
a total of 223 files changed, 3510 insertions(+), 1591 deletions(-).
The main changes are:
1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii.
2) Beautify and de-verbose verifier logs, from Christy.
3) Composable verifier types, from Hao.
4) bpf_strncmp helper, from Hou.
5) bpf.h header dependency cleanup, from Jakub.
6) get_func_[arg|ret|arg_cnt] helpers, from Jiri.
7) Sleepable local storage, from KP.
8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.
There's a lot of missing includes this was masking. Primarily
in networking tho, this time.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
|
|\ \
| |/
|/|
| |
| |
| |
| |
| |
| | |
include/net/sock.h
commit 8f905c0e7354 ("inet: fully convert sk->sk_rx_dst to RCU rules")
commit 43f51df41729 ("net: move early demux fields close to sk_refcnt")
https://lore.kernel.org/all/20211222141641.0caa0ab3@canb.auug.org.au/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
syzbot reported various issues around early demux,
one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly
documenting it.
And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
are not following standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation of RCU protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.
[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
dst_check include/net/dst.h:470 [inline]
tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
</TASK>
Allocated by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
ip_route_input_rcu net/ipv4/route.c:2470 [inline]
ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2506 [inline]
rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
__kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
__call_rcu kernel/rcu/tree.c:2985 [inline]
call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
dst_release net/core/dst.c:177 [inline]
dst_release+0x79/0xe0 net/core/dst.c:167
tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2768
release_sock+0x54/0x1b0 net/core/sock.c:3300
tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_write_iter+0x289/0x3c0 net/socket.c:1057
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
__mkroute_output net/ipv4/route.c:2564 [inline]
ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
__ip_route_output_key include/net/route.h:126 [inline]
ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
ip_route_output_key include/net/route.h:142 [inline]
geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
geneve_xmit_skb drivers/net/geneve.c:899 [inline]
geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
__netdev_start_xmit include/linux/netdevice.h:4994 [inline]
netdev_start_xmit include/linux/netdevice.h:5008 [inline]
xmit_one net/core/dev.c:3590 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
__dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1338 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
free_unref_page_prepare mm/page_alloc.c:3309 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3388
qlink_free mm/kasan/quarantine.c:146 [inline]
qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
__alloc_skb+0x215/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1126 [inline]
alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch moves sock_release_ownership() down in include/net/sock.h and
replaces some sk_lock.owned tests with sock_owned_by_user_nocheck().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20211208062158.54132-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| | |
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\|
| |
| |
| | |
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
A recent change triggers a KMSAN warning, because request
sockets do not initialize @sk_rx_queue_mapping field.
Add sk_rx_queue_update() helper to make our intent clear.
BUG: KMSAN: uninit-value in sk_rx_queue_set include/net/sock.h:1922 [inline]
BUG: KMSAN: uninit-value in tcp_conn_request+0x3bcc/0x4dc0 net/ipv4/tcp_input.c:6922
sk_rx_queue_set include/net/sock.h:1922 [inline]
tcp_conn_request+0x3bcc/0x4dc0 net/ipv4/tcp_input.c:6922
tcp_v4_conn_request+0x218/0x2a0 net/ipv4/tcp_ipv4.c:1528
tcp_rcv_state_process+0x2c5/0x3290 net/ipv4/tcp_input.c:6406
tcp_v4_do_rcv+0xb4e/0x1330 net/ipv4/tcp_ipv4.c:1738
tcp_v4_rcv+0x468d/0x4ed0 net/ipv4/tcp_ipv4.c:2100
ip_protocol_deliver_rcu+0x760/0x10b0 net/ipv4/ip_input.c:204
ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
NF_HOOK include/linux/netfilter.h:307 [inline]
ip_local_deliver+0x584/0x8c0 net/ipv4/ip_input.c:252
dst_input include/net/dst.h:460 [inline]
ip_sublist_rcv_finish net/ipv4/ip_input.c:551 [inline]
ip_list_rcv_finish net/ipv4/ip_input.c:601 [inline]
ip_sublist_rcv+0x11fd/0x1520 net/ipv4/ip_input.c:609
ip_list_rcv+0x95f/0x9a0 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5505 [inline]
__netif_receive_skb_list_core+0xe34/0x1240 net/core/dev.c:5553
__netif_receive_skb_list+0x7fc/0x960 net/core/dev.c:5605
netif_receive_skb_list_internal+0x868/0xde0 net/core/dev.c:5696
gro_normal_list net/core/dev.c:5850 [inline]
napi_complete_done+0x579/0xdd0 net/core/dev.c:6587
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0x17b6/0x2350 drivers/net/virtio_net.c:1557
__napi_poll+0x14e/0xbc0 net/core/dev.c:7020
napi_poll net/core/dev.c:7087 [inline]
net_rx_action+0x824/0x1880 net/core/dev.c:7174
__do_softirq+0x1fe/0x7eb kernel/softirq.c:558
invoke_softirq+0xa4/0x130 kernel/softirq.c:432
__irq_exit_rcu kernel/softirq.c:636 [inline]
irq_exit_rcu+0x76/0x130 kernel/softirq.c:648
common_interrupt+0xb6/0xd0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40
smap_restore arch/x86/include/asm/smap.h:67 [inline]
get_shadow_origin_ptr mm/kmsan/instrumentation.c:31 [inline]
__msan_metadata_ptr_for_load_1+0x28/0x30 mm/kmsan/instrumentation.c:63
tomoyo_check_acl+0x1b0/0x630 security/tomoyo/domain.c:173
tomoyo_path_permission security/tomoyo/file.c:586 [inline]
tomoyo_check_open_permission+0x61f/0xe10 security/tomoyo/file.c:777
tomoyo_file_open+0x24f/0x2d0 security/tomoyo/tomoyo.c:311
security_file_open+0xb1/0x1f0 security/security.c:1635
do_dentry_open+0x4e4/0x1bf0 fs/open.c:809
vfs_open+0xaf/0xe0 fs/open.c:957
do_open fs/namei.c:3426 [inline]
path_openat+0x52f1/0x5dd0 fs/namei.c:3559
do_filp_open+0x306/0x760 fs/namei.c:3586
do_sys_openat2+0x263/0x8f0 fs/open.c:1212
do_sys_open fs/open.c:1228 [inline]
__do_sys_open fs/open.c:1236 [inline]
__se_sys_open fs/open.c:1232 [inline]
__x64_sys_open+0x314/0x380 fs/open.c:1232
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x44/0xae
Uninit was created at:
__alloc_pages+0xbc7/0x10a0 mm/page_alloc.c:5409
alloc_pages+0x8a5/0xb80
alloc_slab_page mm/slub.c:1810 [inline]
allocate_slab+0x287/0x1c20 mm/slub.c:1947
new_slab mm/slub.c:2010 [inline]
___slab_alloc+0xbdf/0x1e90 mm/slub.c:3039
__slab_alloc mm/slub.c:3126 [inline]
slab_alloc_node mm/slub.c:3217 [inline]
slab_alloc mm/slub.c:3259 [inline]
kmem_cache_alloc+0xbb3/0x11c0 mm/slub.c:3264
reqsk_alloc include/net/request_sock.h:91 [inline]
inet_reqsk_alloc+0xaf/0x8b0 net/ipv4/tcp_input.c:6712
tcp_conn_request+0x910/0x4dc0 net/ipv4/tcp_input.c:6852
tcp_v4_conn_request+0x218/0x2a0 net/ipv4/tcp_ipv4.c:1528
tcp_rcv_state_process+0x2c5/0x3290 net/ipv4/tcp_input.c:6406
tcp_v4_do_rcv+0xb4e/0x1330 net/ipv4/tcp_ipv4.c:1738
tcp_v4_rcv+0x468d/0x4ed0 net/ipv4/tcp_ipv4.c:2100
ip_protocol_deliver_rcu+0x760/0x10b0 net/ipv4/ip_input.c:204
ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
NF_HOOK include/linux/netfilter.h:307 [inline]
ip_local_deliver+0x584/0x8c0 net/ipv4/ip_input.c:252
dst_input include/net/dst.h:460 [inline]
ip_sublist_rcv_finish net/ipv4/ip_input.c:551 [inline]
ip_list_rcv_finish net/ipv4/ip_input.c:601 [inline]
ip_sublist_rcv+0x11fd/0x1520 net/ipv4/ip_input.c:609
ip_list_rcv+0x95f/0x9a0 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5505 [inline]
__netif_receive_skb_list_core+0xe34/0x1240 net/core/dev.c:5553
__netif_receive_skb_list+0x7fc/0x960 net/core/dev.c:5605
netif_receive_skb_list_internal+0x868/0xde0 net/core/dev.c:5696
gro_normal_list net/core/dev.c:5850 [inline]
napi_complete_done+0x579/0xdd0 net/core/dev.c:6587
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0x17b6/0x2350 drivers/net/virtio_net.c:1557
__napi_poll+0x14e/0xbc0 net/core/dev.c:7020
napi_poll net/core/dev.c:7087 [inline]
net_rx_action+0x824/0x1880 net/core/dev.c:7174
__do_softirq+0x1fe/0x7eb kernel/softirq.c:558
Fixes: 342159ee394d ("net: avoid dirtying sk->sk_rx_queue_mapping")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20211130182939.2584764-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Steffen reported a TCP stream corruption for HTTP requests
served by the apache web-server using a cifs mount-point
and memory mapping the relevant file.
The root cause is quite similar to the one addressed by
commit 20eb4f29b602 ("net: fix sk_page_frag() recursion from
memory reclaim"). Here the nested access to the task page frag
is caused by a page fault on the (mmapped) user-space memory
buffer coming from the cifs file.
The page fault handler performs an smb transaction on a different
socket, inside the same process context. Since sk->sk_allaction
for such socket does not prevent the usage for the task_frag,
the nested allocation modify "under the hood" the page frag
in use by the outer sendmsg call, corrupting the stream.
The overall relevant stack trace looks like the following:
httpd 78268 [001] 3461630.850950: probe:tcp_sendmsg_locked:
ffffffff91461d91 tcp_sendmsg_locked+0x1
ffffffff91462b57 tcp_sendmsg+0x27
ffffffff9139814e sock_sendmsg+0x3e
ffffffffc06dfe1d smb_send_kvec+0x28
[...]
ffffffffc06cfaf8 cifs_readpages+0x213
ffffffff90e83c4b read_pages+0x6b
ffffffff90e83f31 __do_page_cache_readahead+0x1c1
ffffffff90e79e98 filemap_fault+0x788
ffffffff90eb0458 __do_fault+0x38
ffffffff90eb5280 do_fault+0x1a0
ffffffff90eb7c84 __handle_mm_fault+0x4d4
ffffffff90eb8093 handle_mm_fault+0xc3
ffffffff90c74f6d __do_page_fault+0x1ed
ffffffff90c75277 do_page_fault+0x37
ffffffff9160111e page_fault+0x1e
ffffffff9109e7b5 copyin+0x25
ffffffff9109eb40 _copy_from_iter_full+0xe0
ffffffff91462370 tcp_sendmsg_locked+0x5e0
ffffffff91462370 tcp_sendmsg_locked+0x5e0
ffffffff91462b57 tcp_sendmsg+0x27
ffffffff9139815c sock_sendmsg+0x4c
ffffffff913981f7 sock_write_iter+0x97
ffffffff90f2cc56 do_iter_readv_writev+0x156
ffffffff90f2dff0 do_iter_write+0x80
ffffffff90f2e1c3 vfs_writev+0xa3
ffffffff90f2e27c do_writev+0x5c
ffffffff90c042bb do_syscall_64+0x5b
ffffffff916000ad entry_SYSCALL_64_after_hwframe+0x65
The cifs filesystem rightfully sets sk_allocations to GFP_NOFS,
we can avoid the nesting using the sk page frag for allocation
lacking the __GFP_FS flag. Do not define an additional mm-helper
for that, as this is strictly tied to the sk page frag usage.
v1 -> v2:
- use a stricted sk_page_frag() check instead of reordering the
code (Eric)
Reported-by: Steffen Froemer <sfroemer@redhat.com>
Fixes: 5640f7685831 ("net: use a per task frag allocator")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add neigh_confirm() for the confirmed member in struct neighbour,
it can be called as an independent unit by other functions.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|