Diffstat (limited to 'Documentation/networking')
-rw-r--r-- Documentation/networking/af_xdp.rst | 211
-rw-r--r-- Documentation/networking/bonding.rst | 2
-rw-r--r-- Documentation/networking/device_drivers/ethernet/google/gve.rst | 9
-rw-r--r-- Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst | 8
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst | 29
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst | 313
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst | 1
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst | 14
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst | 20
-rw-r--r-- Documentation/networking/devlink/devlink-port.rst | 61
-rw-r--r-- Documentation/networking/devlink/mlx5.rst | 182
-rw-r--r-- Documentation/networking/ip-sysctl.rst | 9
-rw-r--r-- Documentation/networking/mptcp-sysctl.rst | 8
-rw-r--r-- Documentation/networking/napi.rst | 13
-rw-r--r-- Documentation/networking/netconsole.rst | 11
-rw-r--r-- Documentation/networking/nf_conntrack-sysctl.rst | 4
-rw-r--r-- Documentation/networking/packet_mmap.rst | 2
-rw-r--r-- Documentation/networking/page_pool.rst | 149
-rw-r--r-- Documentation/networking/phy.rst | 4
-rw-r--r-- Documentation/networking/xfrm_device.rst | 1
20 files changed, 619 insertions, 432 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 1cc35de336a4..dceeb0d763aa 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -462,8 +462,92 @@ XDP_OPTIONS getsockopt
Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+ frame. In the case the packet consists of a single frame, the
+ descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why the reverse logic of the end-of-packet (eop) flag found
+in many NICs? To preserve compatibility with non-multi-buffer
+applications, which see this bit set to false for all packets on Rx
+and which set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets consisting of multiple
+frames onto the AF_XDP Tx ring:
+
+* When an invalid descriptor is found, all the other
+ descriptors/frames of this packet are marked as invalid and not
+ completed. The next descriptor is treated as the start of a new
+ packet, even if this was not the intent (because we cannot guess
+ the intent). As before, if your program is producing invalid
+ descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+ equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+ descriptors accumulated so far are dropped and treated as
+ invalid. To produce an application that will work on any system
+ regardless of this config setting, limit the number of frags to 18,
+ as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is whatever the NIC HW supports,
+ usually at least five on the NICs we have checked. We
+ consciously chose not to enforce a rigid limit (such as
+ CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+ resulted in copy actions under the hood to fit within the limit the
+ NIC supports, which would defeat the purpose of zero-copy mode. How to
+ probe for this limit is explained in the "Probing for Multi-Buffer
+ Support" section.
+
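The Tx rules above can be sketched as a user-space pre-check. This is only an illustration, not the kernel's validation code: ``struct sketch_desc``, ``SKETCH_PKT_CONTD``, ``SKETCH_MAX_FRAMES`` and ``tx_batch_looks_valid()`` are hypothetical stand-ins so that the block compiles without kernel headers (the real descriptor and flag come from linux/if_xdp.h). Rejecting a batch that ends in the middle of a packet is a choice made by this sketch, not something the text above requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local stand-ins; the real ones live in linux/if_xdp.h. */
#define SKETCH_PKT_CONTD  (1u << 0) /* mirrors XDP_PKT_CONTD */
#define SKETCH_MAX_FRAMES 18u       /* portable copy-mode limit from the text */

struct sketch_desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/*
 * Returns true if every descriptor has a non-zero length, no packet
 * uses more than SKETCH_MAX_FRAMES frames, and the batch ends on a
 * packet boundary (last descriptor has the continuation bit clear).
 */
static bool tx_batch_looks_valid(const struct sketch_desc *descs, size_t n)
{
	size_t frames = 0; /* frames seen so far in the current packet */

	assert(descs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (descs[i].len == 0)
			return false; /* zero-length descriptors are invalid */
		if (++frames > SKETCH_MAX_FRAMES)
			return false; /* too many frames in one packet */
		if (!(descs[i].options & SKETCH_PKT_CONTD))
			frames = 0;   /* last frame of this packet */
	}
	return frames == 0;
}
```

An application could run such a check in debug builds before submitting descriptors, since an invalid descriptor invalidates the rest of its packet.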
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If an application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
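To make the batch behaviour concrete, here is a minimal sketch of how an application might count the complete packets in a batch and detect a packet that is still open at the end of it. The type ``struct sketch_desc``, the flag ``SKETCH_PKT_CONTD`` and the function ``count_complete_packets()`` are hypothetical names introduced for illustration; a real application would use ``struct xdp_desc`` and ``XDP_PKT_CONTD`` from linux/if_xdp.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local stand-ins; the real ones live in linux/if_xdp.h. */
#define SKETCH_PKT_CONTD (1u << 0) /* mirrors XDP_PKT_CONTD */

struct sketch_desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/*
 * Counts the packets that are fully contained in descs[0..n).
 * If trailing is non-NULL, it receives the number of descriptors at
 * the end of the batch that belong to a packet whose last frame has
 * not been seen yet (0 if the batch ends on a packet boundary).
 */
static size_t count_complete_packets(const struct sketch_desc *descs,
				     size_t n, size_t *trailing)
{
	size_t pkts = 0;
	size_t open = 0; /* descriptors of the packet in progress */

	assert(descs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		open++;
		if (!(descs[i].options & SKETCH_PKT_CONTD)) {
			pkts++;   /* continuation bit clear: packet ends here */
			open = 0;
		}
	}
	if (trailing)
		*trailing = open;
	return pkts;
}
```

A non-zero trailing count means the application has to keep its per-packet state across calls, which is why the Rx example later in this document uses static variables.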
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
Usage
-=====
+-----
In order to use AF_XDP sockets two parts are needed. The
user-space application and the XDP program. For a complete setup and
@@ -541,6 +625,131 @@ like this:
But please use the libbpf functions as they are optimized and ready to
use. Will make your life easier.
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+    void rx_packets(struct xsk_socket_info *xsk)
+    {
+        static bool new_packet = true;
+        u32 idx_rx = 0, idx_fq = 0;
+        static char *pkt;
+
+        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+        for (int i = 0; i < rcvd; i++) {
+            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+            bool eop = !(desc->options & XDP_PKT_CONTD);
+
+            if (new_packet)
+                pkt = frag;
+            else
+                add_frag_to_pkt(pkt, frag);
+
+            if (eop)
+                process_pkt(pkt);
+
+            new_packet = eop;
+
+            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+        }
+
+        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+    }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. Also assumes pkts.addr
+points to a valid location in the umem.
+
+.. code-block:: c
+
+    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                    int batch_size)
+    {
+        u32 idx, i, pkt_nb = 0;
+
+        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+        for (i = 0; i < batch_size;) {
+            u64 addr = pkts[pkt_nb].addr;
+            u32 len = pkts[pkt_nb].size;
+
+            do {
+                struct xdp_desc *tx_desc;
+
+                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                tx_desc->addr = addr;
+
+                if (len > xsk_frame_size) {
+                    tx_desc->len = xsk_frame_size;
+                    tx_desc->options = XDP_PKT_CONTD;
+                } else {
+                    tx_desc->len = len;
+                    tx_desc->options = 0;
+                    pkt_nb++;
+                }
+                len -= tx_desc->len;
+                addr += xsk_frame_size;
+
+                if (i == batch_size) {
+                    /* Remember len, addr, pkt_nb for next iteration.
+                     * Skipped for simplicity.
+                     */
+                    break;
+                }
+            } while (len);
+        }
+
+        xsk_ring_prod__submit(&xsk->tx, i);
+    }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+ one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+ returned number signifies the max number of frags supported.
+
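As a rough illustration of how the probed value might be used, the sketch below computes how many frames a packet needs and compares that with the maximum number of segments reported for the device. ``frames_needed()`` and ``zc_can_send()`` are hypothetical helpers, not libbpf or kernel functions, and the sketch assumes every frame can be filled completely (headroom is ignored). Treating a reported value of 0 as "cannot send" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of frames of frame_size bytes needed to hold pkt_len bytes. */
static uint32_t frames_needed(uint32_t pkt_len, uint32_t frame_size)
{
	assert(frame_size > 0);

	/* Round up without risking overflow in pkt_len + frame_size. */
	return pkt_len / frame_size + (pkt_len % frame_size != 0);
}

/*
 * Returns true if a packet of pkt_len bytes fits within max_segs
 * frames, where max_segs is the value probed for the device.
 * With max_segs == 1 only single-frame packets pass, matching the
 * "multi-buffer not supported" case described above.
 */
static bool zc_can_send(uint32_t pkt_len, uint32_t frame_size,
			uint32_t max_segs)
{
	uint32_t frames = frames_needed(pkt_len, frame_size);

	return frames >= 1 && frames <= max_segs;
}
```

For the 9K jumbo frame example from the introduction, ``frames_needed(9000, 4096)`` evaluates to 3, so such a packet would need a device reporting at least three segments.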
+For an example on how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with a full packet. This is
+to facilitate extending a zero-copy driver with multi-buffer support.
+
Sample application
==================
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index 28925e19622d..f7a73421eb76 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -1636,7 +1636,7 @@ your init script::
-----------------------------------------
This section applies to distros which use /etc/network/interfaces file
-to describe network interface configuration, most notably Debian and it's
+to describe network interface configuration, most notably Debian and its
derivatives.
The ifup and ifdown commands on Debian don't support bonding out of
diff --git a/Documentation/networking/device_drivers/ethernet/google/gve.rst b/Documentation/networking/device_drivers/ethernet/google/gve.rst
index 6d73ee78f3d7..31d621bca82e 100644
--- a/Documentation/networking/device_drivers/ethernet/google/gve.rst
+++ b/Documentation/networking/device_drivers/ethernet/google/gve.rst
@@ -52,6 +52,15 @@ Descriptor Formats
GVE supports two descriptor formats: GQI and DQO. These two formats have
entirely different descriptors, which will be described below.
+Addressing Mode
+------------------
+GVE supports two addressing modes: QPL and RDA.
+QPL ("queue-page-list") mode communicates data through a set of
+pre-registered pages.
+
+For RDA ("raw DMA addressing") mode, the set of pages is dynamic.
+Therefore, the packet buffers can be anywhere in guest memory.
+
Registers
---------
All registers are MMIO.
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
index bfd233cfac35..1e196cb9ce25 100644
--- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -332,3 +332,11 @@ Setup HTB offload
# tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1
# tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7
+
+4. Create tc classes with the same priority and different quantum values::
+
+ # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 2 quantum 409600
+
+ # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416
+
+ # tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
index a395df9c2751..f69ee1ebee01 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -346,6 +346,24 @@ the software port.
- The number of receive packets with CQE compression on ring i [#accel]_.
- Acceleration
+ * - `rx[i]_arfs_add`
+ - The number of aRFS flow rules added to the device for direct RQ steering
+ on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_request_in`
+ - Number of flow rules that have been requested to move into ring i for
+ direct RQ steering [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_request_out`
+ - Number of flow rules that have been requested to move out of ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_expired`
+ - Number of flow rules that have expired and been removed [#accel]_.
+ - Acceleration
+
* - `rx[i]_arfs_err`
- Number of flow rules that failed to be added to the flow table.
- Error
@@ -445,11 +463,6 @@ the software port.
context.
- Error
- * - `rx[i]_xsk_arfs_err`
- - aRFS (accelerated Receive Flow Steering) does not occur in the XSK RQ
- context, so this counter should never increment.
- - Error
-
* - `rx[i]_xdp_tx_xmit`
- The number of packets forwarded back to the port due to XDP program
`XDP_TX` action (bouncing). these packets are not counted by other
@@ -683,6 +696,12 @@ the software port.
time protocol.
- Error
+ * - `ptp_cq[i]_late_cqe`
+ - Number of times a CQE has been delivered on the PTP timestamping CQ after
+ it was no longer expected, because the time window within which the
+ device typically posts the CQE had already elapsed.
+ - Error
+
.. [#ring_global] The corresponding ring and global counters do not share the
same name (i.e. do not follow the common naming scheme).
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
deleted file mode 100644
index a4edf908b707..000000000000
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
+++ /dev/null
@@ -1,313 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
-.. include:: <isonum.txt>
-
-=======
-Devlink
-=======
-
-:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-
-Contents
-========
-
-- `Info`_
-- `Parameters`_
-- `Health reporters`_
-
-Info
-====
-
-The devlink info reports the running and stored firmware versions on device.
-It also prints the device PSID which represents the HCA board type ID.
-
-User command example::
-
- $ devlink dev info pci/0000:00:06.0
- pci/0000:00:06.0:
- driver mlx5_core
- versions:
- fixed:
- fw.psid MT_0000000009
- running:
- fw.version 16.26.0100
- stored:
- fw.version 16.26.0100
-
-Parameters
-==========
-
-flow_steering_mode: Device flow steering mode
----------------------------------------------
-The flow steering mode parameter controls the flow steering mode of the driver.
-Two modes are supported:
-
-1. 'dmfs' - Device managed flow steering.
-2. 'smfs' - Software/Driver managed flow steering.
-
-In DMFS mode, the HW steering entities are created and managed through the
-Firmware.
-In SMFS mode, the HW steering entities are created and managed though by
-the driver directly into hardware without firmware intervention.
-
-SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode.
-
-User command examples:
-
-- Set SMFS flow steering mode::
-
- $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
-
-- Read device flow steering mode::
-
- $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
- pci/0000:06:00.0:
- name flow_steering_mode type driver-specific
- values:
- cmode runtime value smfs
-
-enable_roce: RoCE enablement state
-----------------------------------
-If the device supports RoCE disablement, RoCE enablement state controls device
-support for RoCE capability. Otherwise, the control occurs in the driver stack.
-When RoCE is disabled at the driver level, only raw ethernet QPs are supported.
-
-To change RoCE enablement state, a user must change the driverinit cmode value
-and run devlink reload.
-
-User command examples:
-
-- Disable RoCE::
-
- $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
- $ devlink dev reload pci/0000:06:00.0
-
-- Read RoCE enablement state::
-
- $ devlink dev param show pci/0000:06:00.0 name enable_roce
- pci/0000:06:00.0:
- name enable_roce type generic
- values:
- cmode driverinit value true
-
-esw_port_metadata: Eswitch port metadata state
-----------------------------------------------
-When applicable, disabling eswitch metadata can increase packet rate
-up to 20% depending on the use case and packet sizes.
-
-Eswitch port metadata state controls whether to internally tag packets with
-metadata. Metadata tagging must be enabled for multi-port RoCE, failover
-between representors and stacked devices.
-By default metadata is enabled on the supported devices in E-switch.
-Metadata is applicable only for E-switch in switchdev mode and
-users may disable it when NONE of the below use cases will be in use:
-
-1. HCA is in Dual/multi-port RoCE mode.
-2. VF/SF representor bonding (Usually used for Live migration)
-3. Stacked devices
-
-When metadata is disabled, the above use cases will fail to initialize if
-users try to enable them.
-
-- Show eswitch port metadata::
-
- $ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
- pci/0000:06:00.0:
- name esw_port_metadata type driver-specific
- values:
- cmode runtime value true
-
-- Disable eswitch port metadata::
-
- $ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime
-
-- Change eswitch mode to switchdev mode where after choosing the metadata value::
-
- $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
-
-hairpin_num_queues: Number of hairpin queues
---------------------------------------------
-We refer to a TC NIC rule that involves forwarding as "hairpin".
-
-Hairpin queues are mlx5 hardware specific implementation for hardware
-forwarding of such packets.
-
-- Show the number of hairpin queues::
-
- $ devlink dev param show pci/0000:06:00.0 name hairpin_num_queues
- pci/0000:06:00.0:
- name hairpin_num_queues type driver-specific
- values:
- cmode driverinit value 2
-
-- Change the number of hairpin queues::
-
- $ devlink dev param set pci/0000:06:00.0 name hairpin_num_queues value 4 cmode driverinit
-
-hairpin_queue_size: Size of the hairpin queues
-----------------------------------------------
-Control the size of the hairpin queues.
-
-- Show the size of the hairpin queues::
-
- $ devlink dev param show pci/0000:06:00.0 name hairpin_queue_size
- pci/0000:06:00.0:
- name hairpin_queue_size type driver-specific
- values:
- cmode driverinit value 1024
-
-- Change the size (in packets) of the hairpin queues::
-
- $ devlink dev param set pci/0000:06:00.0 name hairpin_queue_size value 512 cmode driverinit
-
-Health reporters
-================
-
-tx reporter
------------
-The tx reporter is responsible for reporting and recovering of the following two error scenarios:
-
-- tx timeout
- Report on kernel tx timeout detection.
- Recover by searching lost interrupts.
-- tx error completion
- Report on error tx completion.
- Recover by flushing the tx queue and reset it.
-
-tx reporter also support on demand diagnose callback, on which it provides
-real time information of its send queues status.
-
-User commands examples:
-
-- Diagnose send queues status::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter tx
-
-.. note::
- This command has valid output only when interface is up, otherwise the command has empty output.
-
-- Show number of tx errors indicated, number of recover flows ended successfully,
- is autorecover enabled and graceful period from last recover::
-
- $ devlink health show pci/0000:82:00.0 reporter tx
-
-rx reporter
------------
-The rx reporter is responsible for reporting and recovering of the following two error scenarios:
-
-- rx queues' initialization (population) timeout
- Population of rx queues' descriptors on ring initialization is done
- in napi context via triggering an irq. In case of a failure to get
- the minimum amount of descriptors, a timeout would occur, and
- descriptors could be recovered by polling the EQ (Event Queue).
-- rx completions with errors (reported by HW on interrupt context)
- Report on rx completion error.
- Recover (if needed) by flushing the related queue and reset it.
-
-rx reporter also supports on demand diagnose callback, on which it
-provides real time information of its receive queues' status.
-
-- Diagnose rx queues' status and corresponding completion queue::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter rx
-
-NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output.
-
-- Show number of rx errors indicated, number of recover flows ended successfully,
- is autorecover enabled, and graceful period from last recover::
-
- $ devlink health show pci/0000:82:00.0 reporter rx
-
-fw reporter
------------
-The fw reporter implements `diagnose` and `dump` callbacks.
-It follows symptoms of fw error such as fw syndrome by triggering
-fw core dump and storing it into the dump buffer.
-The fw reporter diagnose command can be triggered any time by the user to check
-current fw status.
-
-User commands examples:
-
-- Check fw heath status::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter fw
-
-- Read FW core dump if already stored or trigger new one::
-
- $ devlink health dump show pci/0000:82:00.0 reporter fw
-
-.. note::
- This command can run only on the PF which has fw tracer ownership,
- running it on other PF or any VF will return "Operation not permitted".
-
-fw fatal reporter
------------------
-The fw fatal reporter implements `dump` and `recover` callbacks.
-It follows fatal errors indications by CR-space dump and recover flow.
-The CR-space dump uses vsc interface which is valid even if the FW command
-interface is not functional, which is the case in most FW fatal errors.
-The recover function runs recover flow which reloads the driver and triggers fw
-reset if needed.
-On firmware error, the health buffer is dumped into the dmesg. The log
-level is derived from the error's severity (given in health buffer).
-
-User commands examples:
-
-- Run fw recover flow manually::
-
- $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
-
-- Read FW CR-space dump if already stored or trigger new one::
-
- $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
-
-.. note::
- This command can run only on PF.
-
-vnic reporter
--------------
-The vnic reporter implements only the `diagnose` callback.
-It is responsible for querying the vnic diagnostic counters from fw and displaying
-them in realtime.
-
-Description of the vnic counters:
-
-- total_q_under_processor_handle
- number of queues in an error state due to
- an async error or errored command.
-- send_queue_priority_update_flow
- number of QP/SQ priority/SL update events.
-- cq_overrun
- number of times CQ entered an error state due to an overflow.
-- async_eq_overrun
- number of times an EQ mapped to async events was overrun.
- comp_eq_overrun number of times an EQ mapped to completion events was
- overrun.
-- quota_exceeded_command
- number of commands issued and failed due to quota exceeded.
-- invalid_command
- number of commands issued and failed dues to any reason other than quota
- exceeded.
-- nic_receive_steering_discard
- number of packets that completed RX flow
- steering but were discarded due to a mismatch in flow table.
-- generated_pkt_steering_fail
- number of packets generated by the VNIC experiencing unexpected steering
- failure (at any point in steering flow).
-- handled_pkt_steering_fail
- number of packets handled by the VNIC experiencing unexpected steering
- failure (at any point in steering flow owned by the VNIC, including the FDB
- for the eswitch owner).
-
-User commands examples:
-
-- Diagnose PF/VF vnic counters::
-
- $ devlink health diagnose pci/0000:82:00.1 reporter vnic
-
-- Diagnose representor vnic counters (performed by supplying devlink port of the
- representor, which can be obtained via devlink port command)::
-
- $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic
-
-.. note::
- This command can run over all interfaces such as PF/VF and representor ports.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
index 3fdcd6b61ccf..581a91caa579 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
@@ -13,7 +13,6 @@ Contents:
:maxdepth: 2
kconfig
- devlink
switchdev
tracepoints
counters
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
index 43b1f7e87ec4..0a42c3395ffa 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
@@ -36,7 +36,7 @@ Enabling the driver and kconfig options
**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
-| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
+| Enables `Data Center Bridging (DCB) Support <https://enterprise-support.nvidia.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
**CONFIG_MLX5_CORE_IPOIB=(y/n)**
@@ -59,12 +59,12 @@ Enabling the driver and kconfig options
**CONFIG_MLX5_EN_ARFS=(y/n)**
| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
-| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
+| https://enterprise-support.nvidia.com/s/article/howto-configure-arfs-on-connectx-4
**CONFIG_MLX5_EN_IPSEC=(y/n)**
-| Enables `IPSec XFRM cryptography-offload acceleration <https://support.mellanox.com/s/article/ConnectX-6DX-Bluefield-2-IPsec-HW-Full-Offload-Configuration-Guide>`_.
+| Enables :ref:`IPSec XFRM cryptography-offload acceleration <xfrm_device>`.
**CONFIG_MLX5_EN_MACSEC=(y/n)**
@@ -87,8 +87,8 @@ Enabling the driver and kconfig options
| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
| and switching for the enabled VFs and PF in two available modes:
-| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
-| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
+| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://enterprise-support.nvidia.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet>`_.
+| 2) :ref:`Switchdev mode (eswitch offloads) <switchdev>`.
**CONFIG_MLX5_FPGA=(y/n)**
@@ -101,13 +101,13 @@ Enabling the driver and kconfig options
**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
-| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
+| Provides low-level InfiniBand/RDMA and `RoCE <https://enterprise-support.nvidia.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
**CONFIG_MLX5_MPFS=(y/n)**
| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
-| MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
+| MPFs is required for when `Multi-Host <https://www.nvidia.com/en-us/networking/multi-host/>`_ configuration is enabled to allow passing
| user configured unicast MAC addresses to the requesting PF.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
index 6e3f5ee8b0d0..b617e93d7c2c 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
@@ -190,6 +190,26 @@ explicitly enable the VF migratable capability.
mlx5 driver support devlink port function attr mechanism to setup migratable
capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+IPsec crypto capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Users who want mlx5 PCI VFs to be able to perform IPsec crypto offloading need
+to explicitly enable the VF ipsec_crypto capability. Enabling IPsec capability
+for VFs is supported on ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+The mlx5 driver supports the devlink port function attr mechanism to set up the
+ipsec_crypto capability. (Refer to Documentation/networking/devlink/devlink-port.rst)
+
+IPsec packet capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Users who want mlx5 PCI VFs to be able to perform IPsec packet offloading need
+to explicitly enable the VF ipsec_packet capability. Enabling IPsec capability
+for VFs is supported on ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+The mlx5 driver supports the devlink port function attr mechanism to set up the
+ipsec_packet capability. (Refer to Documentation/networking/devlink/devlink-port.rst)
+
SF state setup
--------------
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 3da590953ce8..e33ad2401ad7 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -128,6 +128,12 @@ Users may also set the RoCE capability of the function using
Users may also set the function as migratable using
'devlink port function set migratable' command.
+Users may also set the IPsec crypto capability of the function using
+`devlink port function set ipsec_crypto` command.
+
+Users may also set the IPsec packet capability of the function using
+`devlink port function set ipsec_packet` command.
+
Function attributes
===================
@@ -240,6 +246,55 @@ Attach VF to the VM.
Start the VM.
Perform live migration.
+IPsec crypto capability setup
+-----------------------------
+When a user enables the IPsec crypto capability for a VF, a user application can
+offload XFRM state crypto operations (Encrypt/Decrypt) to this VF.
+
+When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
+processed in software by the kernel.
+
+- Get IPsec crypto capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_crypto disabled
+
+- Set IPsec crypto capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_crypto enabled
+
+IPsec packet capability setup
+-----------------------------
+When the user enables the IPsec packet capability for a VF, user applications can
+offload XFRM state and policy crypto operations (encrypt/decrypt) to this VF, as
+well as IPsec encapsulation.
+
+When IPsec packet capability is disabled (default) for a VF, the XFRM state and
+policy are processed in software by the kernel.
+
+- Get IPsec packet capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet disabled
+
+- Set IPsec packet capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet enabled
+
Subfunction
============
@@ -321,9 +376,9 @@ API allows to configure following rate object's parameters:
Allows for usage of Weighted Fair Queuing arbitration scheme among
siblings. This arbitration scheme can be used simultaneously with the
strict priority. As a node is configured with a higher rate it gets more
- BW relative to it's siblings. Values are relative like a percentage
+ BW relative to its siblings. Values are relative like a percentage
points, they basically tell how much BW should node take relative to
- it's siblings.
+ its siblings.
``parent``
Parent node name. Parent node rate limits are considered as additional limits
@@ -343,7 +398,7 @@ Arbitration flow from the high level:
#. If group of nodes have the same priority perform WFQ arbitration on
that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
-#. Select the winner node, and continue arbitration flow among it's children,
+#. Select the winner node, and continue arbitration flow among its children,
until leaf node is reached, and the winner is established.
#. If all the nodes from the highest priority sub-group are satisfied, or
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 202798d6501e..702f204a3dbd 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -18,6 +18,11 @@ Parameters
* - ``enable_roce``
- driverinit
- Type: Boolean
+
+ If the device supports RoCE disablement, RoCE enablement state controls
+ device support for RoCE capability. Otherwise, the control occurs in the
+ driver stack. When RoCE is disabled at the driver level, only raw
+ ethernet QPs are supported.
* - ``io_eq_size``
- driverinit
- The range is between 64 and 4096.
@@ -48,6 +53,9 @@ parameters.
* ``smfs`` Software managed flow steering. In SMFS mode, the HW
steering entities are created and manage through the driver without
firmware intervention.
+
+ SMFS mode is faster and provides a better rule insertion rate than the
+ default DMFS mode.
* - ``fdb_large_groups``
- u32
- driverinit
@@ -71,7 +79,24 @@ parameters.
deprecated.
Default: disabled
+ * - ``esw_port_metadata``
+ - Boolean
+ - runtime
+ - When applicable, disabling eswitch metadata can increase packet rate up
+ to 20% depending on the use case and packet sizes.
+
+ Eswitch port metadata state controls whether to internally tag packets
+ with metadata. Metadata tagging must be enabled for multi-port RoCE,
+ failover between representors and stacked devices. By default metadata is
+ enabled on the supported devices in E-switch. Metadata is applicable only
+ for E-switch in switchdev mode and users may disable it when NONE of the
+ below use cases will be in use:
+ 1. HCA is in Dual/multi-port RoCE mode.
+ 2. VF/SF representor bonding (Usually used for Live migration)
+ 3. Stacked devices
+ When metadata is disabled, the above use cases will fail to initialize if
+ users try to enable them.
* - ``hairpin_num_queues``
- u32
- driverinit
@@ -104,3 +129,160 @@ The ``mlx5`` driver reports the following versions
* - ``fw.version``
- stored, running
- Three digit major.minor.subminor firmware version number.
+
+Health reporters
+================
+
+tx reporter
+-----------
+The tx reporter is responsible for reporting and recovering from the following three error scenarios:
+
+- tx timeout
+ Report on kernel tx timeout detection.
+ Recover by searching for lost interrupts.
+- tx error completion
+ Report on error tx completion.
+ Recover by flushing the tx queue and resetting it.
+- tx PTP port timestamping CQ unhealthy
+ Report too many CQEs never delivered on port ts CQ.
+ Recover by flushing and re-creating all PTP channels.
+
+The tx reporter also supports an on-demand diagnose callback, through which it
+provides real-time information about the status of its send queues.
+
+User commands examples:
+
+- Diagnose send queues status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter tx
+
+.. note::
+ This command has valid output only when interface is up, otherwise the command has empty output.
+
+- Show the number of tx errors indicated, the number of recover flows that ended
+ successfully, whether autorecover is enabled, and the grace period since the last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter tx
+
+rx reporter
+-----------
+The rx reporter is responsible for reporting and recovering from the following two error scenarios:
+
+- rx queues' initialization (population) timeout
+ Population of rx queues' descriptors on ring initialization is done
+ in napi context via triggering an irq. In case of a failure to get
+ the minimum amount of descriptors, a timeout would occur, and
+ descriptors could be recovered by polling the EQ (Event Queue).
+- rx completions with errors (reported by HW on interrupt context)
+ Report on rx completion error.
+ Recover (if needed) by flushing the related queue and resetting it.
+
+The rx reporter also supports an on-demand diagnose callback, through which it
+provides real-time information about the status of its receive queues.
+
+- Diagnose rx queues' status and corresponding completion queue::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter rx
+
+.. note::
+ This command has valid output only when interface is up. Otherwise, the command has empty output.
+
+- Show the number of rx errors indicated, the number of recover flows that ended
+ successfully, whether autorecover is enabled, and the grace period since the last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter rx
+
+fw reporter
+-----------
+The fw reporter implements `diagnose` and `dump` callbacks.
+It follows symptoms of fw error such as fw syndrome by triggering
+fw core dump and storing it into the dump buffer.
+The fw reporter diagnose command can be triggered any time by the user to check
+current fw status.
+
+User commands examples:
+
+- Check fw health status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter fw
+
+- Read FW core dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.0 reporter fw
+
+.. note::
+ This command can run only on the PF which has fw tracer ownership.
+ Running it on another PF or any VF will return "Operation not permitted".
+
+fw fatal reporter
+-----------------
+The fw fatal reporter implements `dump` and `recover` callbacks.
+It follows fatal error indications with a CR-space dump and a recover flow.
+The CR-space dump uses vsc interface which is valid even if the FW command
+interface is not functional, which is the case in most FW fatal errors.
+The recover function runs recover flow which reloads the driver and triggers fw
+reset if needed.
+On firmware error, the health buffer is dumped into the dmesg. The log
+level is derived from the error's severity (given in health buffer).
+
+User commands examples:
+
+- Run fw recover flow manually::
+
+ $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
+
+- Read FW CR-space dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
+
+.. note::
+ This command can run only on PF.
+
+vnic reporter
+-------------
+The vnic reporter implements only the `diagnose` callback.
+It is responsible for querying the vnic diagnostic counters from fw and displaying
+them in realtime.
+
+Description of the vnic counters:
+
+- total_q_under_processor_handle
+ number of queues in an error state due to
+ an async error or errored command.
+- send_queue_priority_update_flow
+ number of QP/SQ priority/SL update events.
+- cq_overrun
+ number of times CQ entered an error state due to an overflow.
+- async_eq_overrun
+ number of times an EQ mapped to async events was overrun.
+- comp_eq_overrun
+ number of times an EQ mapped to completion events was
+ overrun.
+- quota_exceeded_command
+ number of commands issued and failed due to quota exceeded.
+- invalid_command
+ number of commands issued and failed due to any reason other than quota
+ exceeded.
+- nic_receive_steering_discard
+ number of packets that completed RX flow
+ steering but were discarded due to a mismatch in flow table.
+- generated_pkt_steering_fail
+ number of packets generated by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow).
+- handled_pkt_steering_fail
+ number of packets handled by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow owned by the VNIC, including the FDB
+ for the eswitch owner).
+
+User commands examples:
+
+- Diagnose PF/VF vnic counters::
+
+ $ devlink health diagnose pci/0000:82:00.1 reporter vnic
+
+- Diagnose representor vnic counters (performed by supplying devlink port of the
+ representor, which can be obtained via devlink port command)::
+
+ $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic
+
+.. note::
+ This command can run over all interfaces such as PF/VF and representor ports.
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 4a010a7cde7f..a66054d0763a 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -321,6 +321,7 @@ tcp_abort_on_overflow - BOOLEAN
option can harm clients of your server.
tcp_adv_win_scale - INTEGER
+ Obsolete since linux-6.6
Count buffering overhead as bytes/2^tcp_adv_win_scale
(if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
if it is <= 0.
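As an illustration only (this is not part of the patch or of the kernel sources), the overhead formula above can be sketched in Python. The helper name ``buffering_overhead`` is hypothetical, and the use of integer shifts for the power-of-two division is an assumption.

```python
def buffering_overhead(nbytes, tcp_adv_win_scale):
    """Return the bytes counted as buffering overhead for a buffer of nbytes.

    Hypothetical helper: follows the documented formula, using integer
    shifts for the division by a power of two.
    """
    if tcp_adv_win_scale > 0:
        # bytes / 2^tcp_adv_win_scale
        return nbytes >> tcp_adv_win_scale
    # bytes - bytes / 2^(-tcp_adv_win_scale), used when the scale is <= 0
    return nbytes - (nbytes >> -tcp_adv_win_scale)


for scale in (2, 1, 0, -2):
    print(scale, buffering_overhead(65536, scale))
```

With a positive scale the overhead shrinks as the scale grows; with a scale of 0 the formula yields no overhead at all.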
@@ -2287,6 +2288,14 @@ accept_ra_min_hop_limit - INTEGER
Default: 1
+accept_ra_min_lft - INTEGER
+ Minimum acceptable lifetime value in Router Advertisement.
+
+ RA sections with a lifetime less than this value shall be
+ ignored. Zero lifetimes stay unaffected.
+
+ Default: 0
+
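A minimal sketch of the rule described for ``accept_ra_min_lft``, written in Python for illustration and not taken from the kernel. The function name ``ra_section_ignored`` is hypothetical, and both arguments are assumed to be expressed in the same unit.

```python
def ra_section_ignored(lifetime, accept_ra_min_lft=0):
    """Return True if an RA section with this lifetime would be ignored.

    Hypothetical helper: a zero lifetime is never filtered, any other
    lifetime below the configured minimum is.
    """
    return lifetime != 0 and lifetime < accept_ra_min_lft


print(ra_section_ignored(600, accept_ra_min_lft=1800))
print(ra_section_ignored(0, accept_ra_min_lft=1800))
```

The zero-lifetime exception matters because a zero lifetime is how a router withdraws previously advertised information; with the default of 0 nothing is filtered.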
accept_ra_pinfo - BOOLEAN
Learn Prefix Information in Router Advertisement.
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
index 213510698014..15f1919d640c 100644
--- a/Documentation/networking/mptcp-sysctl.rst
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -74,3 +74,11 @@ stale_loss_cnt - INTEGER
This is a per-namespace sysctl.
Default: 4
+
+scheduler - STRING
+ Select the scheduler of your choice.
+
+ Support for selection of different schedulers. This is a per-namespace
+ sysctl.
+
+ Default: "default"
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index a7a047742e93..7bf7b95c4f7a 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -65,15 +65,16 @@ argument - drivers can process completions for any number of Tx
packets but should only process up to ``budget`` number of
Rx packets. Rx processing is usually much more expensive.
-In other words, it is recommended to ignore the budget argument when
-performing TX buffer reclamation to ensure that the reclamation is not
-arbitrarily bounded; however, it is required to honor the budget argument
-for RX processing.
+In other words, for Rx processing the ``budget`` argument limits how many
+packets the driver can process in a single poll. Rx specific APIs like page
+pool or XDP cannot be used at all when ``budget`` is 0.
+skb Tx processing should happen regardless of the ``budget``, but if
+the argument is 0 the driver cannot call any XDP (or page pool) APIs.
.. warning::
- The ``budget`` argument may be 0 if core tries to only process Tx completions
- and no Rx packets.
+ The ``budget`` argument may be 0 if core tries to only process
+ skb Tx completions and no Rx or XDP packets.
The poll method returns the amount of work done. If the driver still
has outstanding work to do (e.g. ``budget`` was exhausted)
diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst
index dd0518e002f6..7a9de0568e84 100644
--- a/Documentation/networking/netconsole.rst
+++ b/Documentation/networking/netconsole.rst
@@ -13,6 +13,8 @@ IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013
Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015
+Release prepend support by Breno Leitao <leitao@debian.org>, Jul 7 2023
+
Please send bug reports to Matt Mackall <mpm@selenic.com>
Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com>
@@ -34,10 +36,11 @@ Sender and receiver configuration:
It takes a string configuration parameter "netconsole" in the
following format::
- netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
+ netconsole=[+][r][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
where
+ if present, enable extended console support
+ r if present, prepend kernel version (release) to the message
src-port source for UDP packets (defaults to 6665)
src-ip source IP to use (interface address)
dev network interface (eth0)
@@ -125,6 +128,7 @@ The interface exposes these parameters of a netconsole target to userspace:
============== ================================= ============
enabled Is this target currently enabled? (read-write)
extended Extended mode enabled (read-write)
+ release Prepend kernel release to message (read-write)
dev_name Local network interface name (read-write)
local_port Source UDP port to use (read-write)
remote_port Remote agent's UDP port (read-write)
@@ -165,6 +169,11 @@ following format which is the same as /dev/kmsg::
<level>,<sequnum>,<timestamp>,<contflag>;<message text>
+If the 'r' (release) feature is enabled, the kernel release version is
+prepended to the start of the message. Example::
+
+ 6.4.0,6,444,501151268,-;netconsole: network logging started
+
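A minimal receiver-side sketch, in Python and not part of the kernel tree, of how such a message could be split into its fields. The helper name ``parse_extended`` is hypothetical. It assumes the header carries exactly four comma-separated fields in extended mode (five when the release is prepended) and that no fragmentation metadata or dictionary lines are present.

```python
def parse_extended(line, has_release=False):
    """Split one extended netconsole message into header fields and text.

    Hypothetical helper: assumes a plain header with no extra fields and
    a message without dictionary lines.
    """
    # Everything before the first ';' is the header, the rest is the text.
    header, _, text = line.partition(";")
    fields = header.split(",")
    # With the 'r' feature the kernel release is the first header field.
    release = fields.pop(0) if has_release else None
    level, seqnum, timestamp, contflag = fields
    return {
        "release": release,
        "level": int(level),
        "seqnum": int(seqnum),
        "timestamp": int(timestamp),
        "contflag": contflag,
        "text": text,
    }


msg = parse_extended(
    "6.4.0,6,444,501151268,-;netconsole: network logging started",
    has_release=True,
)
print(msg["release"], msg["level"], msg["text"])
```

The receiver has to know out of band whether the release prefix is enabled, which is why the sketch takes it as an explicit argument instead of guessing from the field count.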
Non printable characters in <message text> are escaped using "\xff"
notation. If the message contains optional dictionary, verbatim
newline is used as the delimiter.
diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst
index 8b1045c3b59e..c383a394c665 100644
--- a/Documentation/networking/nf_conntrack-sysctl.rst
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -178,10 +178,10 @@ nf_conntrack_sctp_timeout_established - INTEGER (seconds)
Default is set to (hb_interval * path_max_retrans + rto_max)
nf_conntrack_sctp_timeout_shutdown_sent - INTEGER (seconds)
- default 0.3
+ default 3
nf_conntrack_sctp_timeout_shutdown_recd - INTEGER (seconds)
- default 0.3
+ default 3
nf_conntrack_sctp_timeout_shutdown_ack_sent - INTEGER (seconds)
default 3
diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst
index c5da1a5d93de..30a3be3c48f3 100644
--- a/Documentation/networking/packet_mmap.rst
+++ b/Documentation/networking/packet_mmap.rst
@@ -755,7 +755,7 @@ AF_PACKET TPACKET_V3 example
============================
AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
-sizes by doing it's own memory management. It is based on blocks where polling
+sizes by doing its own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
It is said that TPACKET_V3 brings the following benefits:
diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst
index 873efd97f822..215ebc92752c 100644
--- a/Documentation/networking/page_pool.rst
+++ b/Documentation/networking/page_pool.rst
@@ -4,22 +4,8 @@
Page Pool API
=============
-The page_pool allocator is optimized for the XDP mode that uses one frame
-per-page, but it can fallback on the regular page allocator APIs.
-
-Basic use involves replacing alloc_pages() calls with the
-page_pool_alloc_pages() call. Drivers should use page_pool_dev_alloc_pages()
-replacing dev_alloc_pages().
-
-API keeps track of in-flight pages, in order to let API user know
-when it is safe to free a page_pool object. Thus, API users
-must run page_pool_release_page() when a page is leaving the page_pool or
-call page_pool_put_page() where appropriate in order to maintain correct
-accounting.
-
-API user must call page_pool_put_page() once on a page, as it
-will either recycle the page, or in case of refcnt > 1, it will
-release the DMA mapping and in-flight state accounting.
+.. kernel-doc:: include/net/page_pool/helpers.h
+ :doc: page_pool allocator
Architecture overview
=====================
@@ -64,87 +50,68 @@ This lockless guarantee naturally comes from running under a NAPI softirq.
The protection doesn't strictly have to be NAPI, any guarantee that allocating
a page will cause no race conditions is enough.
-* page_pool_create(): Create a pool.
- * flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV
- * order: 2^order pages on allocation
- * pool_size: size of the ptr_ring
- * nid: preferred NUMA node for allocation
- * dev: struct device. Used on DMA operations
- * dma_dir: DMA direction
- * max_len: max DMA sync memory size
- * offset: DMA address offset
-
-* page_pool_put_page(): The outcome of this depends on the page refcnt. If the
- driver bumps the refcnt > 1 this will unmap the page. If the page refcnt is 1
- the allocator owns the page and will try to recycle it in one of the pool
- caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device
- using dma_sync_single_range_for_device().
-
-* page_pool_put_full_page(): Similar to page_pool_put_page(), but will DMA sync
- for the entire memory area configured in area pool->max_len.
-
-* page_pool_recycle_direct(): Similar to page_pool_put_full_page() but caller
- must guarantee safe context (e.g NAPI), since it will recycle the page
- directly into the pool fast cache.
-
-* page_pool_release_page(): Unmap the page (if mapped) and account for it on
- in-flight counters.
-
-* page_pool_dev_alloc_pages(): Get a page from the page allocator or page_pool
- caches.
-
-* page_pool_get_dma_addr(): Retrieve the stored DMA address.
-
-* page_pool_get_dma_dir(): Retrieve the stored DMA direction.
-
-* page_pool_put_page_bulk(): Tries to refill a number of pages into the
- ptr_ring cache holding ptr_ring producer lock. If the ptr_ring is full,
- page_pool_put_page_bulk() will release leftover pages to the page allocator.
- page_pool_put_page_bulk() is suitable to be run inside the driver NAPI tx
- completion loop for the XDP_REDIRECT use case.
- Please note the caller must not use data area after running
- page_pool_put_page_bulk(), as this function overwrites it.
-
-* page_pool_get_stats(): Retrieve statistics about the page_pool. This API
- is only available if the kernel has been configured with
- ``CONFIG_PAGE_POOL_STATS=y``. A pointer to a caller allocated ``struct
- page_pool_stats`` structure is passed to this API which is filled in. The
- caller can then report those stats to the user (perhaps via ethtool,
- debugfs, etc.). See below for an example usage of this API.
+.. kernel-doc:: net/core/page_pool.c
+ :identifiers: page_pool_create
+
+.. kernel-doc:: include/net/page_pool/types.h
+ :identifiers: struct page_pool_params
+
+.. kernel-doc:: include/net/page_pool/helpers.h
+ :identifiers: page_pool_put_page page_pool_put_full_page
+ page_pool_recycle_direct page_pool_dev_alloc_pages
+ page_pool_get_dma_addr page_pool_get_dma_dir
+
+.. kernel-doc:: net/core/page_pool.c
+ :identifiers: page_pool_put_page_bulk page_pool_get_stats
+
+DMA sync
+--------
+The driver is always responsible for syncing the pages for the CPU.
+Drivers may choose to take care of syncing for the device as well
+or set the ``PP_FLAG_DMA_SYNC_DEV`` flag to request that pages
+allocated from the page pool are already synced for the device.
+
+If ``PP_FLAG_DMA_SYNC_DEV`` is set, the driver must inform the core what portion
+of the buffer has to be synced. This allows the core to avoid syncing the entire
+page when the driver knows that the device only accessed a portion of the page.
+
+Most drivers will reserve headroom in front of the frame. This part
+of the buffer is not touched by the device, so to avoid syncing
+it drivers can set the ``offset`` field in struct page_pool_params
+appropriately.
+
+For pages recycled on the XDP xmit and skb paths the page pool will
+use the ``max_len`` member of struct page_pool_params to decide how
+much of the page needs to be synced (starting at ``offset``).
+When directly freeing pages in the driver (page_pool_put_page())
+the ``dma_sync_size`` argument specifies how much of the buffer needs
+to be synced.
+
+If in doubt set ``offset`` to 0, ``max_len`` to ``PAGE_SIZE`` and
+pass -1 as ``dma_sync_size``. That combination of arguments is always
+correct.
+
+Note that the syncing parameters are for the entire page.
+This is important to remember when using fragments (``PP_FLAG_PAGE_FRAG``),
+where allocated buffers may be smaller than a full page.
+Unless the driver author really understands page pool internals
+it's recommended to always use ``offset = 0``, ``max_len = PAGE_SIZE``
+with fragmented page pools.
Stats API and structures
------------------------
If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API
-``page_pool_get_stats()`` and structures described below are available. It
-takes a pointer to a ``struct page_pool`` and a pointer to a ``struct
-page_pool_stats`` allocated by the caller.
+page_pool_get_stats() and structures described below are available.
+It takes a pointer to a ``struct page_pool`` and a pointer to a struct
+page_pool_stats allocated by the caller.
-The API will fill in the provided ``struct page_pool_stats`` with
+The API will fill in the provided struct page_pool_stats with
statistics about the page_pool.
-The stats structure has the following fields::
-
- struct page_pool_stats {
- struct page_pool_alloc_stats alloc_stats;
- struct page_pool_recycle_stats recycle_stats;
- };
-
-
-The ``struct page_pool_alloc_stats`` has the following fields:
- * ``fast``: successful fast path allocations
- * ``slow``: slow path order-0 allocations
- * ``slow_high_order``: slow path high order allocations
- * ``empty``: ptr ring is empty, so a slow path allocation was forced.
- * ``refill``: an allocation which triggered a refill of the cache
- * ``waive``: pages obtained from the ptr ring that cannot be added to
- the cache due to a NUMA mismatch.
-
-The ``struct page_pool_recycle_stats`` has the following fields:
- * ``cached``: recycling placed page in the page pool cache
- * ``cache_full``: page pool cache was full
- * ``ring``: page placed into the ptr ring
- * ``ring_full``: page released from page pool because the ptr ring was full
- * ``released_refcnt``: page released (and not recycled) because refcnt > 1
+.. kernel-doc:: include/net/page_pool/types.h
+ :identifiers: struct page_pool_recycle_stats
+ struct page_pool_alloc_stats
+ struct page_pool_stats
Coding examples
===============
@@ -194,7 +161,7 @@ NAPI poller
if XDP_DROP:
page_pool_recycle_direct(page_pool, page);
} else (packet_is_skb) {
- page_pool_release_page(page_pool, page);
+ skb_mark_for_recycle(skb);
new_page = page_pool_dev_alloc_pages(page_pool);
}
}
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index b7ac4c64cf67..1283240d7620 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -323,6 +323,10 @@ Some of the interface modes are described below:
contrast with the 1000BASE-X phy mode used for Clause 38 and 39 PMDs, this
interface mode has different autonegotiation and only supports full duplex.
+``PHY_INTERFACE_MODE_PSGMII``
+ This is the Penta SGMII mode. It is similar to QSGMII, but it combines 5
+ SGMII lines into a single link, compared to 4 on QSGMII.
+
Pause frames / flow control
===========================
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
index 83abdfef4ec3..535077cbeb07 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm_device.rst
@@ -1,4 +1,5 @@
.. SPDX-License-Identifier: GPL-2.0
+.. _xfrm_device:
===============================================
XFRM device - offloading the IPsec computations