| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the driver needs to send new packets to the device, it always
queues the new sk_buffs into an intermediate queue (send_pkt_queue)
and schedules a worker (send_pkt_work) to then queue them into the
virtqueue exposed to the device.
This increases the chance of batching, but also introduces a lot of
latency into the communication. So we can optimize this path by
adding a fast path to be taken when there is no element in the
intermediate queue, there is space available in the virtqueue,
and no other process that is sending packets (tx_lock held).
The following benchmarks were run to check improvements in latency and
throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz
and L1 guest running on QEMU/KVM with vhost process and all vCPUs
pinned individually to pCPUs.
- Latency
Tool: Fio version 3.37-56
Mode: pingpong (h-g-h)
Test runs: 50
Runtime-per-test: 50s
Type: SOCK_STREAM
In the following fio benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.
fio process pinned both inside the host and the guest system.
Before: Linux 6.9.8
Payload 64B:
1st perc. overall 99th perc.
Before 12.91 16.78 42.24 us
After 9.77 13.57 39.17 us
Payload 512B:
1st perc. overall 99th perc.
Before 13.35 17.35 41.52 us
After 10.25 14.11 39.58 us
Payload 4K:
1st perc. overall 99th perc.
Before 14.71 19.87 41.52 us
After 10.51 14.96 40.81 us
- Throughput
Tool: iperf-vsock
The size represents the buffer length (-l) to read/write
P represents the number of parallel streams
P=1
4K 64K 128K
Before 6.87 29.3 29.5 Gb/s
After 10.5 39.4 39.9 Gb/s
P=2
4K 64K 128K
Before 10.5 32.8 33.2 Gb/s
After 17.8 47.7 48.5 Gb/s
P=4
4K 64K 128K
Before 12.7 33.6 34.2 Gb/s
After 16.9 48.1 50.5 Gb/s
The performance improvement is related to this optimization,
I used a ebpf kretprobe on virtio_transport_send_skb to check
that each packet was sent directly to the virtqueue
Co-developed-by: Marco Pinna <marco.pinn95@gmail.com>
Signed-off-by: Marco Pinna <marco.pinn95@gmail.com>
Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Message-Id: <20240730-pinna-v4-2-5c9179164db5@outlook.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Preliminary patch to introduce an optimization to the
enqueue system.
All the code used to enqueue a packet into the virtqueue
is removed from virtio_transport_send_pkt_work()
and moved to the new virtio_transport_send_skb() function.
Co-developed-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Signed-off-by: Marco Pinna <marco.pinn95@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20240730-pinna-v4-1-5c9179164db5@outlook.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth, wireless and netfilter.
No known outstanding regressions.
Current release - regressions:
- wifi: iwlwifi: fix hibernation
- eth: ionic: prevent tx_timeout due to frequent doorbell ringing
Previous releases - regressions:
- sched: fix sch_fq incorrect behavior for small weights
- wifi:
- iwlwifi: take the mutex before running link selection
- wfx: repair open network AP mode
- netfilter: restore IP sanity checks for netdev/egress
- tcp: fix forever orphan socket caused by tcp_abort
- mptcp: close subflow when receiving TCP+FIN
- bluetooth: fix random crash seen while removing btnxpuart driver
Previous releases - always broken:
- mptcp: more fixes for the in-kernel PM
- eth: bonding: change ipsec_lock from spin lock to mutex
- eth: mana: fix race of mana_hwc_post_rx_wqe and new hwc response
Misc:
- documentation: drop special comment style for net code"
* tag 'net-6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
nfc: pn533: Add poll mod list filling check
mailmap: update entry for Sriram Yagnaraman
selftests: mptcp: join: check re-re-adding ID 0 signal
mptcp: pm: ADD_ADDR 0 is not a new address
selftests: mptcp: join: validate event numbers
mptcp: avoid duplicated SUB_CLOSED events
selftests: mptcp: join: check re-re-adding ID 0 endp
mptcp: pm: fix ID 0 endp usage after multiple re-creations
mptcp: pm: do not remove already closed subflows
selftests: mptcp: join: no extra msg if no counter
selftests: mptcp: join: check re-adding init endp with != id
mptcp: pm: reset MPC endp ID when re-added
mptcp: pm: skip connecting to already established sf
mptcp: pm: send ACK on an active subflow
selftests: mptcp: join: check removing ID 0 endpoint
mptcp: pm: fix RM_ADDR ID for the initial subflow
mptcp: pm: reuse ID 0 after delete and re-add
net: busy-poll: use ktime_get_ns() instead of local_clock()
sctp: fix association labeling in the duplicate COOKIE-ECHO case
mptcp: pr_debug: add missing \n at the end
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The ADD_ADDR 0 with the address from the initial subflow should not be
considered as a new address: this is not something new. If the host
receives it, it simply means that the address is available again.
When receiving an ADD_ADDR for the ID 0, the PM already doesn't consider
it as new by not incrementing the 'add_addr_accepted' counter. But the
'accept_addr' might not be set if the limit has already been reached:
this can be bypassed in this case. But before, it is important to check
that this ADD_ADDR for the ID 0 is for the same address as the initial
subflow. If not, it is not something that should happen, and the
ADD_ADDR can be ignored.
Note that if an ADD_ADDR is received while there is already a subflow
opened using the same address, this ADD_ADDR is ignored as well. It
means that if multiple ADD_ADDR for ID 0 are received, there will not be
any duplicated subflows created by the client.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The initial subflow might have already been closed, but still in the
connection list. When the worker is instructed to close the subflows
that have been marked as closed, it might then try to close the initial
subflow again.
A consequence of that is that the SUB_CLOSED event can be seen twice:
# ip mptcp endpoint
1.1.1.1 id 1 subflow dev eth0
2.2.2.2 id 2 subflow dev eth1
# ip mptcp monitor &
[ CREATED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ ESTABLISHED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ SF_ESTABLISHED] remid=0 locid=2 saddr4=2.2.2.2 daddr4=9.9.9.9
# ip mptcp endpoint delete id 1
[ SF_CLOSED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ SF_CLOSED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
The first one is coming from mptcp_pm_nl_rm_subflow_received(), and the
second one from __mptcp_close_subflow().
To avoid doing the post-closed processing twice, the subflow is now
marked as closed the first time.
Note that it is not enough to check if we are dealing with the first
subflow and check its sk_state: the subflow might have been reset or
closed before calling mptcp_close_ssk().
Fixes: b911c97c7dc7 ("mptcp: add netlink event support")
Cc: stable@vger.kernel.org
Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
'local_addr_used' and 'add_addr_accepted' are decremented for addresses
not related to the initial subflow (ID0), because the source and
destination addresses of the initial subflows are known from the
beginning: they don't count as "additional local address being used" or
"ADD_ADDR being accepted".
It is then required not to increment them when the entrypoint used by
the initial subflow is removed and re-added during a connection. Without
this modification, this entrypoint cannot be removed and re-added more
than once.
Reported-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/512
Fixes: 3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
Reported-by: syzbot+455d38ecd5f655fc45cf@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/00000000000049861306209237f4@google.com
Cc: stable@vger.kernel.org
Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
It is possible to have in the list already closed subflows, e.g. the
initial subflow has been already closed, but still in the list. No need
to try to close it again, and increments the related counters again.
Fixes: 0ee4261a3681 ("mptcp: implement mptcp_pm_remove_subflow")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The initial subflow has a special local ID: 0. It is specific per
connection.
When a global endpoint is deleted and re-added later, it can have a
different ID -- most services managing the endpoints automatically don't
force the ID to be the same as before. It is then important to track
these modifications to be consistent with the ID being used for the
address used by the initial subflow, not to confuse the other peer or to
send the ID 0 for the wrong address.
Now when removing an endpoint, msk->mpc_endpoint_id is reset if it
corresponds to this endpoint. When adding a new endpoint, the same
variable is updated if the address match the one of the initial subflow.
Fixes: 3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The lookup_subflow_by_daddr() helper checks if there is already a
subflow connected to this address. But there could be a subflow that is
closing, but taking time due to some reasons: latency, losses, data to
process, etc.
If an ADD_ADDR is received while the endpoint is being closed, it is
better to try connecting to it, instead of rejecting it: the peer which
has sent the ADD_ADDR will not be notified that the ADD_ADDR has been
rejected for this reason, and the expected subflow will not be created
at the end.
This helper should then only look for subflows that are established, or
going to be, but not the ones being closed.
Fixes: d84ad04941c3 ("mptcp: skip connecting the connected address")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Taking the first one on the list doesn't work in some cases, e.g. if the
initial subflow is being removed. Pick another one instead of not
sending anything.
Fixes: 84dfe3677a6f ("mptcp: send out dedicated ADD_ADDR packet")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The initial subflow has a special local ID: 0. When an endpoint is being
deleted, it is then important to check if its address is not linked to
the initial subflow to send the right ID.
If there was an endpoint linked to the initial subflow, msk's
mpc_endpoint_id field will be set. We can then use this info when an
endpoint is being removed to see if it is linked to the initial subflow.
So now, the correct IDs are passed to mptcp_pm_nl_rm_addr_or_subflow(),
it is no longer needed to use mptcp_local_id_match().
Fixes: 3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When the endpoint used by the initial subflow is removed and re-added
later, the PM has to force the ID 0, it is a special case imposed by the
MPTCP specs.
Note that the endpoint should then need to be re-added reusing the same
ID.
Fixes: 3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Regressions:
* wfx: fix for open network connection
* iwlwifi: fix for hibernate (due to fast resume feature)
* iwlwifi: fix for a few warnings that were recently added
(had previously been messages not warnings)
Previously broken:
* mwifiex: fix static structures used for per-device data
* iwlwifi: some harmless FW related messages were tagged
too high priority
* iwlwifi: scan buffers weren't checked correctly
* mac80211: SKB leak on beacon error path
* iwlwifi: fix ACPI table interop with certain BIOSes
* iwlwifi: fix locking for link selection
* mac80211: fix SSID comparison in beacon validation
* tag 'wireless-2024-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: iwlwifi: clear trans->state earlier upon error
wifi: wfx: repair open network AP mode
wifi: mac80211: free skb on error path in ieee80211_beacon_get_ap()
wifi: iwlwifi: mvm: don't wait for tx queues if firmware is dead
wifi: iwlwifi: mvm: allow 6 GHz channels in MLO scan
wifi: iwlwifi: mvm: pause TCM when the firmware is stopped
wifi: iwlwifi: fw: fix wgds rev 3 exact size
wifi: iwlwifi: mvm: take the mutex before running link selection
wifi: iwlwifi: mvm: fix iwl_mvm_max_scan_ie_fw_cmd_room()
wifi: iwlwifi: mvm: fix iwl_mvm_scan_fits() calculation
wifi: iwlwifi: lower message level for FW buffer destination
wifi: iwlwifi: mvm: fix hibernation
wifi: mac80211: fix beacon SSID mismatch handling
wifi: mwifiex: duplicate static structs used in driver instances
====================
Link: https://patch.msgid.link/20240828100151.23662-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In 'ieee80211_beacon_get_ap()', free allocated skb in case of error
returned by 'ieee80211_beacon_protect()'. Compile tested only.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://patch.msgid.link/20240805142035.227847-1-dmantipov@yandex.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Return false when memcmp with zero_ssid returns 0 to correctly
handle hidden SSIDs case.
Fixes: 9cc88678db5b ("wifi: mac80211: check SSID in beacon")
Reviewed-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Daniel Gabay <daniel.gabay@intel.com>
Link: https://patch.msgid.link/20240823105546.7ab29ae287a6.I7f98e57e1ab6597614703fdd138cc88ad253d986@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
sctp_sf_do_5_2_4_dupcook() currently calls security_sctp_assoc_request()
on new_asoc, but as it turns out, this association is always discarded
and the LSM labels never get into the final association (asoc).
This can be reproduced by having two SCTP endpoints try to initiate an
association with each other at approximately the same time and then peel
off the association into a new socket, which exposes the unitialized
labels and triggers SELinux denials.
Fix it by calling security_sctp_assoc_request() on asoc instead of
new_asoc. Xin Long also suggested limit calling the hook only to cases
A, B, and D, since in cases C and E the COOKIE ECHO chunk is discarded
and the association doesn't enter the ESTABLISHED state, so rectify that
as well.
One related caveat with SELinux and peer labeling: When an SCTP
connection is set up simultaneously in this way, we will end up with an
association that is initialized with security_sctp_assoc_request() on
both sides, so the MLS component of the security context of the
association will get swapped between the peers, instead of just one side
setting it to the other's MLS component. However, at that point
security_sctp_assoc_request() had already been called on both sides in
sctp_sf_do_unexpected_init() (on a temporary association) and thus if
the exchange didn't fail before due to MLS, it won't fail now either
(most likely both endpoints have the same MLS range).
Tested by:
- reproducer from https://src.fedoraproject.org/tests/selinux/pull-request/530
- selinux-testsuite (https://github.com/SELinuxProject/selinux-testsuite/)
- sctp-tests (https://github.com/sctp/sctp-tests) - no tests failed
that wouldn't fail also without the patch applied
Fixes: c081d53f97a1 ("security: pass asoc to sctp_assoc_request and sctp_sk_clone")
Suggested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com> (LSM/SELinux)
Link: https://patch.msgid.link/20240826130711.141271-1-omosnace@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
pr_debug() have been added in various places in MPTCP code to help
developers to debug some situations. With the dynamic debug feature, it
is easy to enable all or some of them, and asks users to reproduce
issues with extra debug.
Many of these pr_debug() don't end with a new line, while no 'pr_cont()'
are used in MPTCP code. So the goal was not to display multiple debug
messages on one line: they were then not missing the '\n' on purpose.
Not having the new line at the end causes these messages to be printed
with a delay, when something else needs to be printed. This issue is not
visible when many messages need to be printed, but it is annoying and
confusing when only specific messages are expected, e.g.
# echo "func mptcp_pm_add_addr_echoed +fmp" \
> /sys/kernel/debug/dynamic_debug/control
# ./mptcp_join.sh "signal address"; \
echo "$(awk '{print $1}' /proc/uptime) - end"; \
sleep 5s; \
echo "$(awk '{print $1}' /proc/uptime) - restart"; \
./mptcp_join.sh "signal address"
013 signal address
(...)
10.75 - end
15.76 - restart
013 signal address
[ 10.367935] mptcp:mptcp_pm_add_addr_echoed: MPTCP: msk=(...)
(...)
=> a delay of 5 seconds: printed with a 10.36 ts, but after 'restart'
which was printed at the 15.76 ts.
The 'Fixes' tag here below points to the first pr_debug() used without
'\n' in net/mptcp. This patch could be split in many small ones, with
different Fixes tag, but it doesn't seem worth it, because it is easy to
re-generate this patch with this simple 'sed' command:
git grep -l pr_debug -- net/mptcp |
xargs sed -i "s/\(pr_debug(\".*[^n]\)\(\"[,)]\)/\1\\\n\2/g"
So in case of conflicts, simply drop the modifications, and launch this
command.
Fixes: f870fa0b5768 ("mptcp: Add MPTCP socket stubs")
Cc: stable@vger.kernel.org
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-4-905199fe1172@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The 'mptcp_subflow_context' structure has two items related to the
backup flags:
- 'backup': the subflow has been marked as backup by the other peer
- 'request_bkup': the backup flag has been set by the host
Looking only at the 'backup' flag can make sense in some cases, but it
is not the behaviour of the default packet scheduler when selecting
paths.
As explained in the commit b6a66e521a20 ("mptcp: sched: check both
directions for backup"), the packet scheduler should look at both flags,
because that was the behaviour from the beginning: the 'backup' flag was
set by accident instead of the 'request_bkup' one. Now that the latter
has been fixed, get_retrans() needs to be adapted as well.
Fixes: b6a66e521a20 ("mptcp: sched: check both directions for backup")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-3-905199fe1172@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When a peer decides to close one subflow in the middle of a connection
having multiple subflows, the receiver of the first FIN should accept
that, and close the subflow on its side as well. If not, the subflow
will stay half closed, and would even continue to be used until the end
of the MPTCP connection or a reset from the network.
The issue has not been seen before, probably because the in-kernel
path-manager always sends a RM_ADDR before closing the subflow. Upon the
reception of this RM_ADDR, the other peer will initiate the closure on
its side as well. On the other hand, if the RM_ADDR is lost, or if the
path-manager of the other peer only closes the subflow without sending a
RM_ADDR, the subflow would switch to TCP_CLOSE_WAIT, but that's it,
leaving the subflow half-closed.
So now, when the subflow switches to the TCP_CLOSE_WAIT state, and if
the MPTCP connection has not been closed before with a DATA_FIN, the
kernel owning the subflow schedules its worker to initiate the closure
on its side as well.
This issue can be easily reproduced with packetdrill, as visible in [1],
by creating an additional subflow, injecting a FIN+ACK before sending
the DATA_FIN, and expecting a FIN+ACK in return.
Fixes: 40947e13997a ("mptcp: schedule worker when subflow is closed")
Cc: stable@vger.kernel.org
Link: https://github.com/multipath-tcp/packetdrill/pull/154 [1]
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-1-905199fe1172@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We have some problem closing zero-window fin-wait-1 tcp sockets in our
environment. This patch come from the investigation.
Previously tcp_abort only sends out reset and calls tcp_done when the
socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
purging the write queue, but not close the socket and left it to the
timer.
While purging the write queue, tp->packets_out and sk->sk_write_queue
is cleared along the way. However tcp_retransmit_timer have early
return based on !tp->packets_out and tcp_probe_timer have early
return based on !sk->sk_write_queue.
This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
and socket not being killed by the timers, converting a zero-windowed
orphan into a forever orphan.
This patch removes the SOCK_DEAD check in tcp_abort, making it send
reset to peer and close the socket accordingly. Preventing the
timer-less orphan from happening.
According to Lorenzo's email in the v1 thread, the check was there to
prevent force-closing the same socket twice. That situation is handled
by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
already closed.
The -ENOENT code comes from the associate patch Lorenzo made for
iproute2-ss; link attached below, which also conform to RFC 9293.
At the end of the patch, tcp_write_queue_purge(sk) is removed because it
was already called in tcp_done_with_error().
p.s. This is the same patch with v2. Resent due to mis-labeled "changes
requested" on patchwork.kernel.org.
Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Xueming Feng <kuro@kuroa.me>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
fq_dequeue() has a complex logic to find packets in one of the 3 bands.
As Neal found out, it is possible that one band has a deficit smaller
than its weight. fq_dequeue() can return NULL while some packets are
elligible for immediate transmit.
In this case, more than one iteration is needed to refill pband->credit.
With default parameters (weights 589824 196608 65536) bug can trigger
if large BIG TCP packets are sent to the lowest priority band.
Bisected-by: John Sperbeck <jsperbeck@google.com>
Diagnosed-by: Neal Cardwell <ncardwell@google.com>
Fixes: 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20240824181901.953776-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
A sysfs reader can race with a device reset or removal, attempting to
read device state when the device is not actually present. eg:
[exception RIP: qed_get_current_link+17]
#8 [ffffb9e4f2907c48] qede_get_link_ksettings at ffffffffc07a994a [qede]
#9 [ffffb9e4f2907cd8] __rh_call_get_link_ksettings at ffffffff992b01a3
#10 [ffffb9e4f2907d38] __ethtool_get_link_ksettings at ffffffff992b04e4
#11 [ffffb9e4f2907d90] duplex_show at ffffffff99260300
#12 [ffffb9e4f2907e38] dev_attr_show at ffffffff9905a01c
#13 [ffffb9e4f2907e50] sysfs_kf_seq_show at ffffffff98e0145b
#14 [ffffb9e4f2907e68] seq_read at ffffffff98d902e3
#15 [ffffb9e4f2907ec8] vfs_read at ffffffff98d657d1
#16 [ffffb9e4f2907f00] ksys_read at ffffffff98d65c3f
#17 [ffffb9e4f2907f38] do_syscall_64 at ffffffff98a052fb
crash> struct net_device.state ffff9a9d21336000
state = 5,
state 5 is __LINK_STATE_START (0b1) and __LINK_STATE_NOCARRIER (0b100).
The device is not present, note lack of __LINK_STATE_PRESENT (0b10).
This is the same sort of panic as observed in commit 4224cfd7fb65
("net-sysfs: add check for netdevice being present to speed_show").
There are many other callers of __ethtool_get_link_ksettings() which
don't have a device presence check.
Move this check into ethtool to protect all callers.
Fixes: d519e17e2d01 ("net: export device speed and duplex via sysfs")
Fixes: 4224cfd7fb65 ("net-sysfs: add check for netdevice being present to speed_show")
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Link: https://patch.msgid.link/8bae218864beaa44ed01628140475b9bf641c5b0.1724393671.git.jamie.bainbridge@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This fixes not handling hibernation actions on suspend notifier so they
are treated in the same way as regular suspend actions.
Fixes: 9952d90ea288 ("Bluetooth: Handle PM_SUSPEND_PREPARE and PM_POST_SUSPEND")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
I have seen the WARN_ON(smp_processor_id() != cpu) firing
in pktgen_thread_worker() during tests.
We must use cpus_read_lock()/cpus_read_unlock()
around the for_each_online_cpu(cpu) loop.
While we are at it use WARN_ON_ONCE() to avoid a possible syslog flood.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240821175339.1191779-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\ \ \
| |/ /
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull NFS client fixes from Anna Schumaker:
- Fix rpcrdma refcounting in xa_alloc
- Fix rpcrdma usage of XA_FLAGS_ALLOC
- Fix requesting FATTR4_WORD2_OPEN_ARGUMENTS
- Fix attribute bitmap decoder to handle a 3rd word
- Add reschedule points when returning delegations to avoid soft lockups
- Fix clearing layout segments in layoutreturn
- Avoid unnecessary rescanning of the per-server delegation list
* tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Avoid unnecessary rescanning of the per-server delegation list
NFSv4: Fix clearing of layout segments in layoutreturn
NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations
nfs: fix bitmap decoder to handle a 3rd word
nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS
rpcrdma: Trace connection registration and unregistration
rpcrdma: Use XA_FLAGS_ALLOC instead of XA_FLAGS_ALLOC1
rpcrdma: Device kref is over-incremented on error from xa_alloc
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
These new trace points record xarray indices and the time of
endpoint registration and unregistration, to co-ordinate with
device removal events.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Nit: The built-in xa_limit_32b range starts at 0, but
XA_FLAGS_ALLOC1 configures the xarray's allocator to start at 1.
Adopt the more conventional XA_FLAGS_ALLOC because there's no
mechanical reason to skip 0.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If the device's reference count is too high, the device completion
callback never fires.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
iucv_alloc_device() gets a format string and a varying number of
arguments. This is incorrectly forwarded by calling dev_set_name() with
the format string and a va_list, while dev_set_name() expects also a
varying number of arguments.
Symptoms:
Corrupted iucv device names, which can result in log messages like:
sysfs: cannot create duplicate filename '/devices/iucv/hvc_iucv1827699952'
Fixes: 4452e8ef8c36 ("s390/iucv: Provide iucv_alloc_device() / iucv_release_device()")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1228425
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Thorsten Winkler <twinkler@linux.ibm.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20240821091337.3627068-1-wintera@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There is something wrong with ovs_drop_reasons. ovs_drop_reasons[0] is
"OVS_DROP_LAST_ACTION", but OVS_DROP_LAST_ACTION == __OVS_DROP_REASON + 1,
which means that ovs_drop_reasons[1] should be "OVS_DROP_LAST_ACTION".
And as Adrian tested, without the patch, adding flow to drop packets
results in:
drop at: do_execute_actions+0x197/0xb20 [openvsw (0xffffffffc0db6f97)
origin: software
input port ifindex: 8
timestamp: Tue Aug 20 10:19:17 2024 859853461 nsec
protocol: 0x800
length: 98
original length: 98
drop reason: OVS_DROP_ACTION_ERROR
With the patch, the same results in:
drop at: do_execute_actions+0x197/0xb20 [openvsw (0xffffffffc0db6f97)
origin: software
input port ifindex: 8
timestamp: Tue Aug 20 10:16:13 2024 475856608 nsec
protocol: 0x800
length: 98
original length: 98
drop reason: OVS_DROP_LAST_ACTION
Fix this by initializing ovs_drop_reasons with index.
Fixes: 9d802da40b7c ("net: openvswitch: add last-action drop reason")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Tested-by: Adrian Moreno <amorenoz@redhat.com>
Reviewed-by: Adrian Moreno <amorenoz@redhat.com>
Link: https://patch.msgid.link/20240821123252.186305-1-dongml2@chinatelecom.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
Patch #1 disable BH when collecting stats via hardware offload to ensure
concurrent updates from packet path do not result in losing stats.
From Sebastian Andrzej Siewior.
Patch #2 uses write seqcount to reset counters serialize against reader.
Also from Sebastian Andrzej Siewior.
Patch #3 ensures vlan header is in place before accessing its fields,
according to KMSAN splat triggered by syzbot.
* tag 'nf-24-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: flowtable: validate vlan header
netfilter: nft_counter: Synchronize nft_counter_reset() against reader.
netfilter: nft_counter: Disable BH in nft_counter_offload_stats().
====================
Link: https://patch.msgid.link/20240822101842.4234-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Ensure there is sufficient room to access the protocol field of the
VLAN header, validate it once before the flowtable lookup.
=====================================================
BUG: KMSAN: uninit-value in nf_flow_offload_inet_hook+0x45a/0x5f0 net/netfilter/nf_flow_table_inet.c:32
nf_flow_offload_inet_hook+0x45a/0x5f0 net/netfilter/nf_flow_table_inet.c:32
nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626
nf_hook_ingress include/linux/netfilter_netdev.h:34 [inline]
nf_ingress net/core/dev.c:5440 [inline]
Fixes: 4cd91f7c290f ("netfilter: flowtable: add vlan support")
Reported-by: syzbot+8407d9bb88cd4c6bf61a@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
nft_counter_reset() resets the counter by subtracting the previously
retrieved value from the counter. This is a write operation on the
counter and as such it requires to be performed with a write sequence of
nft_counter_seq to serialize against its possible reader.
Update the packets/ bytes within write-sequence of nft_counter_seq.
Fixes: d84701ecbcd6a ("netfilter: nft_counter: rework atomic dump and reset")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The sequence counter nft_counter_seq is a per-CPU counter. There is no
lock associated with it. nft_counter_do_eval() is using the same counter
and disables BH which suggest that it can be invoked from a softirq.
This in turn means that nft_counter_offload_stats(), which disables only
preemption, can be interrupted by nft_counter_do_eval() leading to two
writer for one seqcount_t.
This can lead to loosing stats or reading statistics while they are
updated.
Disable BH during stats update in nft_counter_offload_stats() to ensure
one writer at a time.
Fixes: b72920f6e4a9d ("netfilter: nftables: counter hardware offload support")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If skb_expand_head() returns NULL, skb has been freed
and the associated dst/idev could also have been freed.
We must use rcu_read_lock() to prevent a possible UAF.
Fixes: 0c9f227bee11 ("ipv6: use skb_expand_head in ip6_xmit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vasily Averin <vasily.averin@linux.dev>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240820160859.3786976-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If skb_expand_head() returns NULL, skb has been freed
and associated dst/idev could also have been freed.
We need to hold rcu_read_lock() to make sure the dst and
associated idev are alive.
Fixes: 5796015fa968 ("ipv6: allocate enough headroom in ip6_finish_output2()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vasily Averin <vasily.averin@linux.dev>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240820160859.3786976-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
syzbot reported an UAF in ip6_send_skb() [1]
After ip6_local_out() has returned, we no longer can safely
dereference rt, unless we hold rcu_read_lock().
A similar issue has been fixed in commit
a688caa34beb ("ipv6: take rcu lock in rawv6_send_hdrinc()")
Another potential issue in ip6_finish_output2() is handled in a
separate patch.
[1]
BUG: KASAN: slab-use-after-free in ip6_send_skb+0x18d/0x230 net/ipv6/ip6_output.c:1964
Read of size 8 at addr ffff88806dde4858 by task syz.1.380/6530
CPU: 1 UID: 0 PID: 6530 Comm: syz.1.380 Not tainted 6.11.0-rc3-syzkaller-00306-gdf6cbc62cc9b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:93 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
ip6_send_skb+0x18d/0x230 net/ipv6/ip6_output.c:1964
rawv6_push_pending_frames+0x75c/0x9e0 net/ipv6/raw.c:588
rawv6_sendmsg+0x19c7/0x23c0 net/ipv6/raw.c:926
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x1a6/0x270 net/socket.c:745
sock_write_iter+0x2dd/0x400 net/socket.c:1160
do_iter_readv_writev+0x60a/0x890
vfs_writev+0x37c/0xbb0 fs/read_write.c:971
do_writev+0x1b1/0x350 fs/read_write.c:1018
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f936bf79e79
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f936cd7f038 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 00007f936c115f80 RCX: 00007f936bf79e79
RDX: 0000000000000001 RSI: 0000000020000040 RDI: 0000000000000004
RBP: 00007f936bfe7916 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f936c115f80 R15: 00007fff2860a7a8
</TASK>
Allocated by task 6530:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
unpoison_slab_object mm/kasan/common.c:312 [inline]
__kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:338
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook mm/slub.c:3988 [inline]
slab_alloc_node mm/slub.c:4037 [inline]
kmem_cache_alloc_noprof+0x135/0x2a0 mm/slub.c:4044
dst_alloc+0x12b/0x190 net/core/dst.c:89
ip6_blackhole_route+0x59/0x340 net/ipv6/route.c:2670
make_blackhole net/xfrm/xfrm_policy.c:3120 [inline]
xfrm_lookup_route+0xd1/0x1c0 net/xfrm/xfrm_policy.c:3313
ip6_dst_lookup_flow+0x13e/0x180 net/ipv6/ip6_output.c:1257
rawv6_sendmsg+0x1283/0x23c0 net/ipv6/raw.c:898
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x1a6/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
___sys_sendmsg net/socket.c:2651 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2680
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 45:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
poison_slab_object+0xe0/0x150 mm/kasan/common.c:240
__kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
kasan_slab_free include/linux/kasan.h:184 [inline]
slab_free_hook mm/slub.c:2252 [inline]
slab_free mm/slub.c:4473 [inline]
kmem_cache_free+0x145/0x350 mm/slub.c:4548
dst_destroy+0x2ac/0x460 net/core/dst.c:124
rcu_do_batch kernel/rcu/tree.c:2569 [inline]
rcu_core+0xafd/0x1830 kernel/rcu/tree.c:2843
handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
__do_softirq kernel/softirq.c:588 [inline]
invoke_softirq kernel/softirq.c:428 [inline]
__irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
Last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
__kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:541
__call_rcu_common kernel/rcu/tree.c:3106 [inline]
call_rcu+0x167/0xa70 kernel/rcu/tree.c:3210
refdst_drop include/net/dst.h:263 [inline]
skb_dst_drop include/net/dst.h:275 [inline]
nf_ct_frag6_queue net/ipv6/netfilter/nf_conntrack_reasm.c:306 [inline]
nf_ct_frag6_gather+0xb9a/0x2080 net/ipv6/netfilter/nf_conntrack_reasm.c:485
ipv6_defrag+0x2c8/0x3c0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:67
nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
nf_hook_slow+0xc3/0x220 net/netfilter/core.c:626
nf_hook include/linux/netfilter.h:269 [inline]
__ip6_local_out+0x6fa/0x800 net/ipv6/output_core.c:143
ip6_local_out+0x26/0x70 net/ipv6/output_core.c:153
ip6_send_skb+0x112/0x230 net/ipv6/ip6_output.c:1959
rawv6_push_pending_frames+0x75c/0x9e0 net/ipv6/raw.c:588
rawv6_sendmsg+0x19c7/0x23c0 net/ipv6/raw.c:926
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x1a6/0x270 net/socket.c:745
sock_write_iter+0x2dd/0x400 net/socket.c:1160
do_iter_readv_writev+0x60a/0x890
Fixes: 0625491493d9 ("ipv6: ip6_push_pending_frames() should increment IPSTATS_MIB_OUTDISCARDS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240820160859.3786976-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
netpoll_poll_disable() and netpoll_poll_enable() are only used
from core networking code, there is no need to export them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240820162053.3870927-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When assembling fraglist GSO packets, udp4_gro_complete does not set
skb->csum_start, which makes the extra validation in __udp_gso_segment fail.
Fixes: 89add40066f9 ("net: drop bad gso csum_start and offset in virtio_net_hdr")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240819150621.59833-1-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
select_local_address() and select_signal_address() both select an
endpoint entry from the list inside an RCU protected section, but return
a reference to it, to be read later on. If the entry is dereferenced
after the RCU unlock, reading info could cause a Use-after-Free.
A simple solution is to copy the required info while inside the RCU
protected section to avoid any risk of UaF later. The address ID might
need to be modified later to handle the ID0 case later, so a copy seems
OK to deal with.
Reported-by: Paolo Abeni <pabeni@redhat.com>
Closes: https://lore.kernel.org/45cd30d3-7710-491c-ae4d-a1368c00beb1@redhat.com
Fixes: 01cacb00b35c ("mptcp: add netlink-based PM")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-14-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When reacting upon the reception of an ADD_ADDR, the in-kernel PM first
looks for fullmesh endpoints. If there are some, it will pick them,
using their entry ID.
It should set the ID 0 when using the endpoint corresponding to the
initial subflow, it is a special case imposed by the MPTCP specs.
Note that msk->mpc_endpoint_id might not be set when receiving the first
ADD_ADDR from the server. So better to compare the addresses.
Fixes: 1a0d6136c5f0 ("mptcp: local addresses fullmesh")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-12-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The ID 0 is specific per MPTCP connections. The per netns entries cannot
have this special ID 0 then.
But that's different for the userspace PM where the entries are per
connection, they can then use this special ID 0.
Fixes: f40be0db0b76 ("mptcp: unify pm get_flags_and_ifindex_by_id")
Cc: stable@vger.kernel.org
Acked-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-11-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The limits might have changed in between, it is best to check them
before accepting new ADD_ADDR.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-10-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Adding the following warning ...
WARN_ON_ONCE(msk->pm.add_addr_accepted == 0)
... before decrementing the add_addr_accepted counter helped to find a
bug when running the "remove single subflow" subtest from the
mptcp_join.sh selftest.
Removing a 'subflow' endpoint will first trigger a RM_ADDR, then the
subflow closure. Before this patch, and upon the reception of the
RM_ADDR, the other peer will then try to decrement this
add_addr_accepted. That's not correct because the attached subflows have
not been created upon the reception of an ADD_ADDR.
A way to solve that is to decrement the counter only if the attached
subflow was an MP_JOIN to a remote id that was not 0, and initiated by
the host receiving the RM_ADDR.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-9-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Adding the following warning ...
WARN_ON_ONCE(msk->pm.local_addr_used == 0)
... before decrementing the local_addr_used counter helped to find a bug
when running the "remove single address" subtest from the mptcp_join.sh
selftests.
Removing a 'signal' endpoint will trigger the removal of all subflows
linked to this endpoint via mptcp_pm_nl_rm_addr_or_subflow() with
rm_type == MPTCP_MIB_RMSUBFLOW. This will decrement the local_addr_used
counter, which is wrong in this case because this counter is linked to
'subflow' endpoints, and here it is a 'signal' endpoint that is being
removed.
Now, the counter is decremented, only if the ID is being used outside
of mptcp_pm_nl_rm_addr_or_subflow(), only for 'subflow' endpoints, and
if the ID is not 0 -- local_addr_used is not taking into account these
ones. This marking of the ID as being available, and the decrement is
done no matter if a subflow using this ID is currently available,
because the subflow could have been closed before.
Fixes: 06faa2271034 ("mptcp: remove multi addresses and subflows in PM")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-8-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This helper is confusing. It is in pm.c, but it is specific to the
in-kernel PM and it cannot be used by the userspace one. Also, it simply
calls one in-kernel specific function with the PM lock, while the
similar mptcp_pm_remove_addr() helper requires the PM lock.
What's left is the pr_debug(), which is not that useful, because a
similar one is present in the only function called by this helper:
mptcp_pm_nl_rm_subflow_received()
After these modifications, this helper can be marked as 'static', and
the lock can be taken only once in mptcp_pm_flush_addrs_and_subflows().
Note that it is not a bug fix, but it will help backporting the
following commits.
Fixes: 0ee4261a3681 ("mptcp: implement mptcp_pm_remove_subflow")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-7-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If no subflows are attached to the 'subflow' endpoints that are being
flushed, the corresponding addr IDs will not be marked as available
again.
Mark all ID as being available when flushing all the 'subflow'
endpoints, and reset local_addr_used counter to cover these cases.
Note that mptcp_pm_remove_addrs_and_subflows() helper is only called for
flushing operations, not to remove a specific set of addresses and
subflows.
Fixes: 06faa2271034 ("mptcp: remove multi addresses and subflows in PM")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-5-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If no subflow is attached to the 'subflow' endpoint that is being
removed, the addr ID will not be marked as available again.
Mark the linked ID as available when removing the 'subflow' endpoint if
no subflow is attached to it.
While at it, the local_addr_used counter is decremented if the ID was
marked as being used to reflect the reality, but also to allow adding
new endpoints after that.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-3-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If no subflow is attached to the 'signal' endpoint that is being
removed, the addr ID will not be marked as available again.
Mark the linked ID as available when removing the address entry from the
list to cover this case.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-1-38035d40de5b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
There is a bug in netem_enqueue() introduced by
commit 5845f706388a ("net: netem: fix skb length BUG_ON in __skb_to_sgvec")
that can lead to a use-after-free.
This commit made netem_enqueue() always return NET_XMIT_SUCCESS
when a packet is duplicated, which can cause the parent qdisc's q.qlen
to be mistakenly incremented. When this happens qlen_notify() may be
skipped on the parent during destruction, leaving a dangling pointer
for some classful qdiscs like DRR.
There are two ways for the bug happen:
- If the duplicated packet is dropped by rootq->enqueue() and then
the original packet is also dropped.
- If rootq->enqueue() sends the duplicated packet to a different qdisc
and the original packet is dropped.
In both cases NET_XMIT_SUCCESS is returned even though no packets
are enqueued at the netem qdisc.
The fix is to defer the enqueue of the duplicate packet until after
the original packet has been guaranteed to return NET_XMIT_SUCCESS.
Fixes: 5845f706388a ("net: netem: fix skb length BUG_ON in __skb_to_sgvec")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240819175753.5151-1-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|