linux-stable.git - Linux kernel stable tree

	Commit message (Collapse)	Author	Age	Files	Lines
...
\| *	af_unix: Save listener for embryo socket.	Kuniyuki Iwashima	2024-03-29	2	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a prep patch for the following change, where we need to fetch the listening socket from the successor embryo socket during GC. We add a new field to struct unix_sock to save a pointer to a listening socket. We set it when connect() creates a new socket, and clear it when accept() is called. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| *	af_unix: Detect Strongly Connected Components.	Kuniyuki Iwashima	2024-03-29	2	-2/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the new GC, we use a simple graph algorithm, Tarjan's Strongly Connected Components (SCC) algorithm, to find cyclic references. The algorithm visits every vertex exactly once using depth-first search (DFS). DFS starts by pushing an input vertex to a stack and assigning it a unique number. Two fields, index and lowlink, are initialised with the number, but lowlink could be updated later during DFS. If a vertex has an edge to an unvisited inflight vertex, we visit it and do the same processing. So, we will have vertices in the stack in the order they appear and number them consecutively in the same order. If a vertex has a back-edge to a visited vertex in the stack, we update the predecessor's lowlink with the successor's index. After iterating edges from the vertex, we check if its index equals its lowlink. If the lowlink is different from the index, it shows there was a back-edge. Then, we go backtracking and propagate the lowlink to its predecessor and resume the previous edge iteration from the next edge. If the lowlink is the same as the index, we pop vertices before and including the vertex from the stack. Then, the set of vertices is SCC, possibly forming a cycle. At the same time, we move the vertices to unix_visited_vertices. When we finish the algorithm, all vertices in each SCC will be linked via unix_vertex.scc_entry. Let's take an example. We have a graph including five inflight vertices (F is not inflight): A -> B -> C -> D -> E (-> F) ^ \| `---------' Suppose that we start DFS from C. We will visit C, D, and B first and initialise their index and lowlink. Then, the stack looks like this: > B = (3, 3) (index, lowlink) D = (2, 2) C = (1, 1) When checking B's edge to C, we update B's lowlink with C's index and propagate it to D. B = (3, 1) (index, lowlink) > D = (2, 1) C = (1, 1) Next, we visit E, which has no edge to an inflight vertex. > E = (4, 4) (index, lowlink) B = (3, 1) D = (2, 1) C = (1, 1) When we leave from E, its index and lowlink are the same, so we pop E from the stack as single-vertex SCC. Next, we leave from B and D but do nothing because their lowlink are different from their index. B = (3, 1) (index, lowlink) D = (2, 1) > C = (1, 1) Then, we leave from C, whose index and lowlink are the same, so we pop B, D and C as SCC. Last, we do DFS for the rest of vertices, A, which is also a single-vertex SCC. Finally, each unix_vertex.scc_entry is linked as follows: A -. B -> C -> D E -. ^ \| ^ \| ^ \| `--' `---------' `--' We use SCC later to decide whether we can garbage-collect the sockets. Note that we still cannot detect SCC properly if an edge points to an embryo socket. The following two patches will sort it out. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| *	af_unix: Iterate all vertices by DFS.	Kuniyuki Iwashima	2024-03-29	2	-0/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The new GC will use a depth first search graph algorithm to find cyclic references. The algorithm visits every vertex exactly once. Here, we implement the DFS part without recursion so that no one can abuse it. unix_walk_scc() marks every vertex unvisited by initialising index as UNIX_VERTEX_INDEX_UNVISITED and iterates inflight vertices in unix_unvisited_vertices and call __unix_walk_scc() to start DFS from an arbitrary vertex. __unix_walk_scc() iterates all edges starting from the vertex and explores the neighbour vertices with DFS using edge_stack. After visiting all neighbours, __unix_walk_scc() moves the visited vertex to unix_visited_vertices so that unix_walk_scc() will not restart DFS from the visited vertex. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| *	af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb.	Kuniyuki Iwashima	2024-03-29	1	-11/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we track the number of inflight sockets in two variables. unix_tot_inflight is the total number of inflight AF_UNIX sockets on the host, and user->unix_inflight is the number of inflight fds per user. We update them one by one in unix_inflight(), which can be done once in batch. Also, sendmsg() could fail even after unix_inflight(), then we need to acquire unix_gc_lock only to decrement the counters. Let's bulk update the counters in unix_add_edges() and unix_del_edges(), which is called only for successfully passed fds. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| *	af_unix: Link struct unix_edge when queuing skb.	Kuniyuki Iwashima	2024-03-29	5	-3/+100
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Just before queuing skb with inflight fds, we call scm_stat_add(), which is a good place to set up the preallocated struct unix_vertex and struct unix_edge in UNIXCB(skb).fp. Then, we call unix_add_edges() and construct the directed graph as follows: 1. Set the inflight socket's unix_sock to unix_edge.predecessor. 2. Set the receiver's unix_sock to unix_edge.successor. 3. Set the preallocated vertex to inflight socket's unix_sock.vertex. 4. Link inflight socket's unix_vertex.entry to unix_unvisited_vertices. 5. Link unix_edge.vertex_entry to the inflight socket's unix_vertex.edges. Let's say we pass the fd of AF_UNIX socket A to B and the fd of B to C. The graph looks like this: +-------------------------+ \| unix_unvisited_vertices \| <-------------------------. +-------------------------+ \| + \| \| +--------------+ +--------------+ \| +--------------+ \| \| unix_sock A \| <---. .---> \| unix_sock B \| <-\|-. .---> \| unix_sock C \| \| +--------------+ \| \| +--------------+ \| \| \| +--------------+ \| .-+ \| vertex \| \| \| .-+ \| vertex \| \| \| \| \| vertex \| \| \| +--------------+ \| \| \| +--------------+ \| \| \| +--------------+ \| \| \| \| \| \| \| \| \| \| +--------------+ \| \| \| +--------------+ \| \| \| \| '-> \| unix_vertex \| \| \| '-> \| unix_vertex \| \| \| \| \| +--------------+ \| \| +--------------+ \| \| \| `---> \| entry \| +---------> \| entry \| +-' \| \| \|--------------\| \| \| \|--------------\| \| \| \| edges \| <-. \| \| \| edges \| <-. \| \| +--------------+ \| \| \| +--------------+ \| \| \| \| \| \| \| \| \| .----------------------' \| \| .----------------------' \| \| \| \| \| \| \| \| \| +--------------+ \| \| \| +--------------+ \| \| \| \| unix_edge \| \| \| \| \| unix_edge \| \| \| \| +--------------+ \| \| \| +--------------+ \| \| `-> \| vertex_entry \| \| \| `-> \| vertex_entry \| \| \| \|--------------\| \| \| \|--------------\| \| \| \| predecessor \| +---' \| \| predecessor \| +---' \| \|--------------\| \| \|--------------\| \| \| successor \| +-----' \| successor \| +-----' +--------------+ +--------------+ Henceforth, we denote such a graph as A -> B (-> C). Now, we can express all inflight fd graphs that do not contain embryo sockets. We will support the particular case later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| *	af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd.	Kuniyuki Iwashima	2024-03-29	4	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As with the previous patch, we preallocate to skb's scm_fp_list an array of struct unix_edge in the number of inflight AF_UNIX fds. There we just preallocate memory and do not use immediately because sendmsg() could fail after this point. The actual use will be in the next patch. When we queue skb with inflight edges, we will set the inflight socket's unix_sock as unix_edge->predecessor and the receiver's unix_sock as successor, and then we will link the edge to the inflight socket's unix_vertex.edges. Note that we set NULL to cloned scm_fp_list.edges in scm_fp_dup() so that MSG_PEEK does not change the shape of the directed graph. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| *	af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd.	Kuniyuki Iwashima	2024-03-29	5	-0/+63
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We will replace the garbage collection algorithm for AF_UNIX, where we will consider each inflight AF_UNIX socket as a vertex and its file descriptor as an edge in a directed graph. This patch introduces a new struct unix_vertex representing a vertex in the graph and adds its pointer to struct unix_sock. When we send a fd using the SCM_RIGHTS message, we allocate struct scm_fp_list to struct scm_cookie in scm_fp_copy(). Then, we bump each refcount of the inflight fds' struct file and save them in scm_fp_list.fp. After that, unix_attach_fds() inexplicably clones scm_fp_list of scm_cookie and sets it to skb. (We will remove this part after replacing GC.) Here, we add a new function call in unix_attach_fds() to preallocate struct unix_vertex per inflight AF_UNIX fd and link each vertex to skb's scm_fp_list.vertices. When sendmsg() succeeds later, if the socket of the inflight fd is still not inflight yet, we will set the preallocated vertex to struct unix_sock.vertex and link it to a global list unix_unvisited_vertices under spin_lock(&unix_gc_lock). If the socket is already inflight, we free the preallocated vertex. This is to avoid taking the lock unnecessarily when sendmsg() could fail later. In the following patch, we will similarly allocate another struct per edge, which will finally be linked to the inflight socket's unix_vertex.edges. And then, we will count the number of edges as unix_vertex.out_degree. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
*	tcp/dccp: bypass empty buckets in inet_twsk_purge()	Eric Dumazet	2024-03-29	1	-2/+7
\| \| \| \| \| \| \| \| \| \| \| \| \|	TCP ehash table is often sparsely populated. inet_twsk_purge() spends too much time calling cond_resched(). This patch can reduce time spent in inet_twsk_purge() by 20x. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240327191206.508114-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
*	net: dsa: hellcreek: Convert to gettimex64()	Kurt Kanzenbach	2024-03-29	1	-10/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	As of commit 916444df305e ("ptp: deprecate gettime64() in favor of gettimex64()") (new) PTP drivers should rather implement gettimex64(). In addition, this variant provides timestamps from the system clock. The readings have to be recorded right before and after reading the lowest bits of the PHC timestamp. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net/smc: make smc_hash_sk/smc_unhash_sk static	Zhengchao Shao	2024-03-29	2	-7/+2
\| \| \| \| \| \| \| \| \| \| \|	smc_hash_sk and smc_unhash_sk are only used in af_smc.c, so make them static and remove the output symbol. They can be called under the path .prot->hash()/unhash(). Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'net-sched-skip_sw'	David S. Miller	2024-03-29	4	-0/+64
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Asbjørn Sloth Tønnesen says: ==================== make skip_sw actually skip software During development of flower-route[1], which I recently presented at FOSDEM[2], I noticed that CPU usage, would increase the more rules I installed into the hardware for IP forwarding offloading. Since we use TC flower offload for the hottest prefixes, and leave the long tail to the normal (non-TC) Linux network stack for slow-path IP forwarding. We therefore need both the hardware and software datapath to perform well. I found that skip_sw rules, are quite expensive in the kernel datapath, since they must be evaluated and matched upon, before the kernel checks the skip_sw flag. This patchset optimizes the case where all rules are skip_sw, by implementing a TC bypass for these cases, where TC is only used as a control plane for the hardware path. v4: - Rebased onto net-next, now that net-next is open again v3: https://lore.kernel.org/netdev/20240306165813.656931-1-ast@fiberby.net/ - Patch 3: - Fix source_inline - Fix build failure, when CONFIG_NET_CLS without CONFIG_NET_CLS_ACT. v2: https://lore.kernel.org/netdev/20240305144404.569632-1-ast@fiberby.net/ - Patch 1: - Add Reviewed-By from Jiri Pirko - Patch 2: - Move code, to avoid forward declaration (Jiri). - Patch 3 - Refactor to use a static key. - Add performance data for trapping, or sending a packet to a non-existent chain (as suggested by Marcelo). v1: https://lore.kernel.org/netdev/20240215160458.1727237-1-ast@fiberby.net/ [1] flower-route https://github.com/fiberby-dk/flower-route [2] FOSDEM talk https://fosdem.org/2024/schedule/event/fosdem-2024-3337-flying-higher-hardware-offloading-with-bird/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net: sched: make skip_sw actually skip software	Asbjørn Sloth Tønnesen	2024-03-29	4	-0/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TC filters come in 3 variants: - no flag (try to process in hardware, but fallback to software)) - skip_hw (do not process filter by hardware) - skip_sw (do not process filter by software) However skip_sw is implemented so that the skip_sw flag can first be checked, after it has been matched. IMHO it's common when using skip_sw, to use it on all rules. So if all filters in a block is skip_sw filters, then we can bail early, we can thus avoid having to match the filters, just to check for the skip_sw flag. This patch adds a bypass, for when only TC skip_sw rules are used. The bypass is guarded by a static key, to avoid harming other workloads. There are 3 ways that a packet from a skip_sw ruleset, can end up in the kernel path. Although the send packets to a non-existent chain way is only improved a few percents, then I believe it's worth optimizing the trap and fall-though use-cases. +----------------------------+--------+--------+--------+ \| Test description \| Pre- \| Post- \| Rel. \| \| \| kpps \| kpps \| chg. \| +----------------------------+--------+--------+--------+ \| basic forwarding + notrack \| 3589.3 \| 3587.9 \| 1.00x \| \| switch to eswitch mode \| 3081.8 \| 3094.7 \| 1.00x \| \| add ingress qdisc \| 3042.9 \| 3063.6 \| 1.01x \| \| tc forward in hw / skip_sw \|37024.7 \|37028.4 \| 1.00x \| \| tc forward in sw / skip_hw \| 3245.0 \| 3245.3 \| 1.00x \| +----------------------------+--------+--------+--------+ \| tests with only skip_sw rules below: \| +----------------------------+--------+--------+--------+ \| 1 non-matching rule \| 2694.7 \| 3058.7 \| 1.14x \| \| 1 n-m rule, match trap \| 2611.2 \| 3323.1 \| 1.27x \| \| 1 n-m rule, goto non-chain \| 2886.8 \| 2945.9 \| 1.02x \| \| 5 non-matching rules \| 1958.2 \| 3061.3 \| 1.56x \| \| 5 n-m rules, match trap \| 1911.9 \| 3327.0 \| 1.74x \| \| 5 n-m rules, goto non-chain\| 2883.1 \| 2947.5 \| 1.02x \| \| 10 non-matching rules \| 1466.3 \| 3062.8 \| 2.09x \| \| 10 n-m rules, match trap \| 1444.3 \| 3317.9 \| 2.30x \| \| 10 n-m rules,goto non-chain\| 2883.1 \| 2939.5 \| 1.02x \| \| 25 non-matching rules \| 838.5 \| 3058.9 \| 3.65x \| \| 25 n-m rules, match trap \| 824.5 \| 3323.0 \| 4.03x \| \| 25 n-m rules,goto non-chain\| 2875.8 \| 2944.7 \| 1.02x \| \| 50 non-matching rules \| 488.1 \| 3054.7 \| 6.26x \| \| 50 n-m rules, match trap \| 484.9 \| 3318.5 \| 6.84x \| \| 50 n-m rules,goto non-chain\| 2884.1 \| 2939.7 \| 1.02x \| +----------------------------+--------+--------+--------+ perf top (25 n-m skip_sw rules - pre patch): 20.39% [kernel] [k] __skb_flow_dissect 16.43% [kernel] [k] rhashtable_jhash2 10.58% [kernel] [k] fl_classify 10.23% [kernel] [k] fl_mask_lookup 4.79% [kernel] [k] memset_orig 2.58% [kernel] [k] tcf_classify 1.47% [kernel] [k] __x86_indirect_thunk_rax 1.42% [kernel] [k] __dev_queue_xmit 1.36% [kernel] [k] nft_do_chain 1.21% [kernel] [k] __rcu_read_lock perf top (25 n-m skip_sw rules - post patch): 5.12% [kernel] [k] __dev_queue_xmit 4.77% [kernel] [k] nft_do_chain 3.65% [kernel] [k] dev_gro_receive 3.41% [kernel] [k] check_preemption_disabled 3.14% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear 2.88% [kernel] [k] __netif_receive_skb_core.constprop.0 2.49% [kernel] [k] mlx5e_xmit 2.15% [kernel] [k] ip_forward 1.95% [kernel] [k] mlx5e_tc_restore_tunnel 1.92% [kernel] [k] vlan_gro_receive Test setup: DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G Data rate measured on switch (Extreme X690), and DUT connected as a router on a stick, with pktgen and pktsink as VLANs. Pktgen-dpdk was in range 36.6-37.7 Mpps 64B packets across all tests. Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/ Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net: sched: cls_api: add filter counter	Asbjørn Sloth Tønnesen	2024-03-29	2	-0/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Maintain a count of filters per block. Counter updates are protected by cb_lock, which is also used to protect the offload counters. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net: sched: cls_api: add skip_sw counter	Asbjørn Sloth Tønnesen	2024-03-29	2	-0/+5
\|/ \| \| \| \| \| \| \| \| \| \| \| \|	Maintain a count of skip_sw filters. This counter is protected by the cb_lock, and is updated at the same time as offloadcnt. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch '100GbE' of ↵	Jakub Kicinski	2024-03-28	18	-489/+232
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice: use less resources in switchdev Michal Swiatkowski says: Switchdev is using one queue per created port representor. This can quickly lead to Rx queue shortage, as with subfunction support user can create high number of PRs. Save one MSI-X and 'number of PRs' * 1 queues. Refactor switchdev slow-path to use less resources (even no additional resources). Do this by removing control plane VSI and move its functionality to PF VSI. Even with current solution PF is acting like uplink and can't be used simultaneously for other use cases (adding filters can break slow-path). In short, do Tx via PF VSI and Rx packets using PF resources. Rx needs additional code in interrupt handler to choose correct PR netdev. Previous solution had to queue filters, it was way more elegant but needed one queue per PRs. Beside that this refactor mostly simplifies switchdev configuration. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: count representor stats ice: do switchdev slow-path Rx using PF VSI ice: change repr::id values ice: remove switchdev control plane VSI ice: control default Tx rule in lag ice: default Tx rule instead of to queue ice: do Tx through PF netdev in slow-path ice: remove eswitch changing queues algorithm ==================== Link: https://lore.kernel.org/r/20240325202623.1012287-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| *	ice: count representor stats	Michal Swiatkowski	2024-03-25	4	-30/+98
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Removing control plane VSI result in no information about slow-path statistic. In current solution statistics need to be counted in driver. Patch is based on similar implementation done by Simon Horman in nfp: commit eadfa4c3be99 ("nfp: add stats and xmit helpers for representors") Add const modifier to netdev parameter in ice_netdev_to_repr(). It isn't (and shouldn't be) modified in the function. Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| *	ice: do switchdev slow-path Rx using PF VSI	Michal Swiatkowski	2024-03-25	5	-1/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add an ICE_RX_FLAG_MULTIDEV flag to Rx ring. If it is set try to find correct port representor. Do it based on src_vsi value stored in flex descriptor. Ids of representor pointers stored in xarray are equal to corresponding src_vsi value. Thanks to that we can directly get correct representor if we have src_vsi value. Set multidev flag during ring configuration. If the mode is switchdev, change the ring descriptor to the one that contains src_vsi value. PF netdev should be reconfigured, do it by calling ice_down() and ice_up() if the netdev was up before configuring switchdev. Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| *	ice: change repr::id values	Michal Swiatkowski	2024-03-25	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of getting repr::id from xa_alloc() value, set it to the src_vsi::num_vsi value. It is unique for each PR. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| *	ice: remove switchdev control plane VSI	Michal Swiatkowski	2024-03-25	12	-277/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For slow-path Rx and Tx PF VSI is used. There is no need to have control plane VSI. Remove all code related to it. Eswitch rebuild can't fail without rebuilding control plane VSI. Return void from ice_eswitch_rebuild(). Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| *	ice: control default Tx rule in lag	Michal Swiatkowski	2024-03-25	2	-10/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Tx rule in switchdev was changed to use PF instead of additional control plane VSI. Because of that during lag we should control it. Control means to add and remove the default Tx rule during lag active/inactive switching. It can be done the same way as default Rx rule. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| *	ice: default Tx rule instead of to queue	Michal Swiatkowski	2024-03-25	4	-97/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Steer all packets that miss other rules to PF VSI. Previously in switchdev mode, PF VSI received missed packets, but only ones marked as Rx. Now it is receiving all missed packets. To queue rule per PR isn't needed, because we use PF VSI instead of control VSI now, and it's already correctly configured. Add flag to correctly set LAN_EN bit in default Tx rule. It shouldn't allow packet to go outside when there is a match. Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| *	ice: do Tx through PF netdev in slow-path	Michal Swiatkowski	2024-03-25	3	-34/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Tx can be done using PF netdev. Checks before Tx are unnecessary. Checking if switchdev mode is set seems too defensive (there is no PR netdev in legacy mode). If corresponding VF is disabled or during reset, PR netdev also should be down. Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| *	ice: remove eswitch changing queues algorithm	Michal Swiatkowski	2024-03-25	4	-47/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Changing queues used by eswitch will be done through PF netdev. There is no need to reserve queues if the number of used queues is known. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
* \|	Merge branch 'bnxt_en-ptp-and-rss-updates'	Jakub Kicinski	2024-03-28	6	-136/+509
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Michael Chan says: ==================== bnxt_en: PTP and RSS updates The first 2 patches are v2 of the PTP patches posted about 3 weeks ago: https://lore.kernel.org/netdev/20240229070202.107488-1-michael.chan@broadcom.com/ The devlink parameter is dropped and v2 is just to increase the timeout accuracy and to use a default timeout of 1 second. Patches 3 to 12 implement additional RSS contexts and ntuple filters for the RSS contexts. ==================== Link: https://lore.kernel.org/r/20240325222902.220712-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Support adding ntuple rules on RSS contexts	Pavan Chebbi	2024-03-28	3	-9/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the user wants to add an ntuple filter to an RSS context, select the appropriate VNIC belonging to the selected RSS context and add the VNIC destination rule. Make the necessary changes to bnxt_add_ntuple_cls_rule(). Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-13-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Refactor bnxt_cfg_rfs_ring_tbl_idx()	Pavan Chebbi	2024-03-28	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Refactor bnxt_cfg_rfs_ring_tbl_idx() to pass in the filter structure pointer instead of the RX ring number. This will allow an ntuple filter to be set up for the non-default RSS contexts in the next patch. Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-12-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Support RSS contexts in ethtool .{get\|set}_rxfh()	Pavan Chebbi	2024-03-28	3	-8/+196
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Support up to 32 RSS contexts per device if supported by the device. Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-11-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Refactor bnxt_set_rxfh()	Michael Chan	2024-03-28	1	-13/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a new bnxt_modify_rss() function to modify the RSS key and RSS indirection table. The new function can modify the parameters for the default context or additional contexts. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-10-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Add a new_rss_ctx parameter to bnxt_rfs_capable()	Pavan Chebbi	2024-03-28	2	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Modify bnxt_rfs_capable() to check that there are enough resources to support aRFS/ntuple filters for a new RSS context requested by the user. Existing use cases in the driver will always set the new parameter to false. Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-9-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Simplify bnxt_rfs_capable()	Michael Chan	2024-03-28	1	-15/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bnxt_rfs_capable() determines the number of VNICs and RSS_CTXs required to support aRFS and then reserves the resources. We already have functions bnxt_get_total_vnics() and bnxt_get_total_rss_ctxs() to do that. Simplify the code by calling these functions. It is also more correct to do the resource reservation after bnxt_can_reserve_rings() returns true. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-8-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Refactor RSS indir alloc/set functions	Pavan Chebbi	2024-03-28	2	-10/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We will need to dynamically allocate and change indirection tables for additional RSS contexts. Add the rss_ctx pointer parameter to bnxt_alloc_rss_indir_tbl() and bnxt_set_dflt_rss_indir_tbl(). Existing usage will always pass rss_ctx as NULL which means the default RSS context. When supporting additional RSS contexts in subsequent patches, we'll pass the valid rss_ctx to these 2 functions. Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-7-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Introduce rss ctx structure, alloc/free functions	Pavan Chebbi	2024-03-28	3	-0/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add struct bnxt_rss_ctx, related storage lists, required defines, and its alloc/free functions. Later patches will use them in order to support multiple RSS contexts. Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-6-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Refactor VNIC alloc and cfg functions	Pavan Chebbi	2024-03-28	3	-72/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current VNIC structures are stored in an array bp->vnic_info[]. The index of the array (vnic_id) is passed to all the functions that need to reference the VNIC. This patch changes the scheme to pass the VNIC pointer instead of the vnic index. Subsequent patches will create additional VNICs that will not be stored in the bp->vnic_info[] array. Using the VNIC pointer will work for all the VNICs. Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-5-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Add helper function bnxt_hwrm_vnic_rss_cfg_p5()	Pavan Chebbi	2024-03-28	2	-11/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a pure refactoring patch. The new function bnxt_hwrm_vnic_set_rss_p5() will set up the P5_PLUS specific RSS ring table and then call bnxt_hwrm_vnic_cfg() to setup the vnic for proper RSS operations. This new function will be used later for additional RSS contexts. Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240325222902.220712-4-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Retry PTP TX timestamp from FW for 1 second	Pavan Chebbi	2024-03-28	2	-1/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use a new default 1 second timeout value instead of the existing 1 msec value. The driver will keep track of the remaining time before timeout and will pass this value to bnxt_hwrm_port_ts_query(). The firmware supports timeout values up to 65535 usecs. If the timeout value passed to bnxt_hwrm_port_ts_query() is less than the FW max value, we will use that value to precisely control the specified timeout. If it is larger than the FW max value, we will use the FW max value and any additional retry to reach the desired timeout will be done in the context of bnxt_ptp_ts_aux_eork(). Link: https://lore.kernel.org/netdev/20240229070202.107488-2-michael.chan@broadcom.com/ Cc: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20240325222902.220712-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	bnxt_en: Add a timeout parameter to bnxt_hwrm_port_ts_query()	Michael Chan	2024-03-28	2	-4/+13
\|/ / \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The caller can pass this new timeout parameter to the function to specify the firmware timeout value when requesting the TX timestamp from the firmware. This will allow the caller to precisely control the timeout and will be used in the next patch. In this patch, the parameter is 0 which means to use the current default value. Cc: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20240325222902.220712-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	Merge branch 'fix-missing-phy-to-mac-rx-clock'	Jakub Kicinski	2024-03-28	8	-13/+111
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Romain Gantois says: ==================== Fix missing PHY-to-MAC RX clock There is an issue with some stmmac/PHY combinations that has been reported some time ago in a couple of different series: Clark Wang's report: https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/ Clément Léger's report: https://lore.kernel.org/linux-arm-kernel/20230116103926.276869-4-clement.leger@bootlin.com/ Stmmac controllers require an RX clock signal from the MII bus to perform their hardware initialization successfully. This causes issues with some PHY/PCS devices. If these devices do not bring the clock signal up before the MAC driver initializes its hardware, then said initialization will fail. This can happen at probe time or when the system wakes up from a suspended state. This series introduces new flags for phy_device and phylink_pcs. These flags allow MAC drivers to signal to PHY/PCS drivers that the RX clock signal should be enabled as soon as possible, and that it should always stay enabled. I have included specific uses of these flags that fix the RZN1 GMAC1 stmmac driver that I am currently working on and that is not yet upstream. I have also included changes to the at803x PHY driver that should fix the issue that Clark Wang was having. ==================== Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-0-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	net: pcs: rzn1-miic: Init RX clock early if MAC requires it	Romain Gantois	2024-03-28	1	-0/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The GMAC1 controller in the RZN1 IP requires the RX MII clock signal to be started before it initializes its own hardware, thus before it calls phylink_start. Implement the pcs_pre_init() callback so that the RX clock signal can be enabled early if necessary. Reported-by: Clément Léger <clement.leger@bootlin.com> Link: https://lore.kernel.org/linux-arm-kernel/20230116103926.276869-4-clement.leger@bootlin.com/ Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-7-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	net: phy: qcom: at803x: Avoid hibernating if MAC requires RX clock	Russell King (Oracle)	2024-03-28	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Stmmac controllers connected to an at803x PHY cannot resume properly after suspend when WoL is enabled. This happens because the MAC requires an RX clock generated by the PHY to initialize its hardware properly. But the RX clock is cut when the PHY suspends and isn't brought up until the MAC driver resumes the phylink. Prevent the at803x PHY driver from going into suspend if the attached MAC driver always requires an RX clock signal. Reported-by: Clark Wang <xiaoning.wang@nxp.com> Link: https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/ Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [rgantois: commit log] Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-6-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	net: stmmac: Signal to PHY/PCS drivers to keep RX clock on	Romain Gantois	2024-03-28	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a reocurring issue with stmmac controllers where the MAC fails to initialize its hardware if an RX clock signal isn't provided on the MAC/PHY link. This causes issues when PHY or PCS devices either go into suspend while cutting the RX clock or do not bring the clock signal up early enough for the MAC to initialize successfully. Set the mac_requires_rxc flag in the stmmac phylink config so that PHY/PCS drivers know to keep the RX clock up at all times. Reported-by: Clark Wang <xiaoning.wang@nxp.com> Link: https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/ Reported-by: Clément Léger <clement.leger@bootlin.com> Link: https://lore.kernel.org/linux-arm-kernel/20230116103926.276869-4-clement.leger@bootlin.com/ Co-developed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-5-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	net: stmmac: Support a generic PCS field in mac_device_info	Romain Gantois	2024-03-28	3	-9/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Global stmmac support for early initialization of PCS devices requires a generic PCS reference that can be passed to phylink_pcs_pre_init(). Currently, the mac_device_info struct contains only one PCS field, which is specific to the Lynx model. As PCS models are hardware-specific, it is more appropriate to have a generic PCS field in the mac_device_info struct. Refactor the lynx_pcs field into a generic phylink_pcs field. Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-4-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	net: stmmac: don't rely on lynx_pcs presence to check for a PHY	Maxime Chevallier	2024-03-28	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When initializing attached PHYs, there are some cases where we don't expect any PHY to be connected. The logic uses conditions based on various local PCS configuration, but also calls-in phylink_expects_phy() via stmmac_init_phy(), which is enough to ensure we don't try to initialize a PHY when using a Lynx PCS, as long as we have the phy_interface set to a 802.3z mode and are using inband negociation. Drop the lynx check, making the stmmac generic code more pcs_lynx-agnostic. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> [rgantois: commit log] Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-3-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	net: phylink: add rxc_always_on flag to phylink_pcs	Romain Gantois	2024-03-28	2	-0/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some MAC drivers (e.g. stmmac) require a continuous receive clock signal to be generated by a PCS that is handled by a standalone PCS driver. Such a PCS driver does not have access to a PHY device, thus cannot check the PHY_F_RXC_ALWAYS_ON flag. They cannot check max_requires_rxc in the phylink config either, since it is a private member. Therefore, a new flag is needed to signal to the PCS that it should keep the RX clock signal up at all times. Co-developed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-2-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	net: phylink: add PHY_F_RXC_ALWAYS_ON to PHY dev flags	Russell King (Oracle)	2024-03-28	3	-1/+14
\|/ / \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some MAC controllers (e.g. stmmac) require their connected PHY to continuously provide a receive clock signal. This can cause issues in two cases: 1. The clock signal hasn't been started yet by the time the MAC driver initializes its hardware. This can make the initialization fail, as in the case of the rzn1 GMAC1 driver. 2. The clock signal is cut during a power saving event. By the time the MAC is brought back up, the clock signal is still not active since phylink_start hasn't been called yet. This brings us back to case 1. If a PHY driver reads this flag, it should ensure that the receive clock signal is started as soon as possible, and that it isn't brought down when the PHY goes into suspend. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [rgantois: commit log] Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-1-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	Merge branch 'compiler_types-add-endianness-dependent-__counted_by_-le-be'	Jakub Kicinski	2024-03-28	5	-12/+28
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Alexander Lobakin says: ==================== compiler_types: add Endianness-dependent __counted_by_{le,be} Some structures contain flexible arrays at the end and the counter for them, but the counter has explicit Endianness and thus __counted_by() can't be used directly. To increase test coverage for potential problems without breaking anything, introduce __counted_by_{le,be} defined depending on platform's Endianness to either __counted_by() when applicable or noop otherwise. The first user will be virtchnl2.h from idpf just as example with 9 flex structures having Little Endian counters. Maybe it would be a good idea to introduce such attributes on compiler level if possible, but for now let's stop on what we have. ==================== Link: https://lore.kernel.org/r/20240327142241.1745989-1-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	idpf: sprinkle __counted_by{,_le}() in the virtchnl2 header	Alexander Lobakin	2024-03-28	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both virtchnl2.h and its consumer idpf_virtchnl.c are very error-prone. There are 10 structures with flexible arrays at the end, but 9 of them has flex member counter in Little Endian. Make the code a bit more robust by applying __counted_by_le() to those 9. LE platforms is the main target for this driver, so they would receive additional protection. While we're here, add __counted_by() to virtchnl2_ptype::proto_id, as its counter is `u8` regardless of the Endianness. Compile test on x86_64 (LE) didn't reveal any new issues after applying the attributes. Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/20240327142241.1745989-4-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	idpf: make virtchnl2.h self-contained	Alexander Lobakin	2024-03-28	2	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To ease maintaining of virtchnl2.h, which already is messy enough, make it self-contained by adding missing if_ether.h include due to %ETH_ALEN usage. At the same time, virtchnl2_lan_desc.h is not used anywhere in the file, so move this include to idpf_txrx.h to speed up C preprocessing. Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/20240327142241.1745989-3-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	compiler_types: add Endianness-dependent __counted_by_{le,be}	Alexander Lobakin	2024-03-28	3	-0/+14
\|/ / \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some structures contain flexible arrays at the end and the counter for them, but the counter has explicit Endianness and thus __counted_by() can't be used directly. To increase test coverage for potential problems without breaking anything, introduce __counted_by_{le,be}() defined depending on platform's Endianness to either __counted_by() when applicable or noop otherwise. Maybe it would be a good idea to introduce such attributes on compiler level if possible, but for now let's stop on what we have. Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/20240327142241.1745989-2-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: remove gfp_mask from napi_alloc_skb()	Jakub Kicinski	2024-03-28	13	-36/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	__napi_alloc_skb() is napi_alloc_skb() with the added flexibility of choosing gfp_mask. This is a NAPI function, so GFP_ATOMIC is implied. The only practical choice the caller has is whether to set __GFP_NOWARN. But that's a false choice, too, allocation failures in atomic context will happen, and printing warnings in logs, effectively for a packet drop, is both too much and very likely non-actionable. This leads me to a conclusion that most uses of napi_alloc_skb() are simply misguided, and should use __GFP_NOWARN in the first place. We also have a "standard" way of reporting allocation failures via the queue stat API (qstats::rx-alloc-fail). The direct motivation for this patch is that one of the drivers used at Meta calls napi_alloc_skb() (so prior to this patch without __GFP_NOWARN), and the resulting OOM warning is the top networking warning in our fleet. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240327040213.3153864-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	qed: Drop useless pci_params.pm_cap	Bjorn Helgaas	2024-03-28	2	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	qed_init_pci() used pci_params.pm_cap only to cache the pci_dev.pm_cap. Drop the cache and use pci_dev.pm_cap directly. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240325224931.1462051-1-helgaas@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>