summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* bpf: Zeroing allocated object from slab in bpf memory allocatorHou Tao2023-02-151-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the freed element in bpf memory allocator may be immediately reused, for htab map the reuse will reinitialize special fields in map value (e.g., bpf_spin_lock), but lookup procedure may still access these special fields, and it may lead to hard-lockup as shown below: NMI backtrace for cpu 16 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0 ...... Call Trace: <TASK> copy_map_value_locked+0xb7/0x170 bpf_map_copy_value+0x113/0x3c0 __sys_bpf+0x1c67/0x2780 __x64_sys_bpf+0x1c/0x20 do_syscall_64+0x30/0x60 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ...... </TASK> For htab map, just like the preallocated case, these is no need to initialize these special fields in map value again once these fields have been initialized. For preallocated htab map, these fields are initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the similar thing for non-preallocated htab in bpf memory allocator. And there is no need to use __GFP_ZERO for per-cpu bpf memory allocator, because __alloc_percpu_gfp() does it implicitly. Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Add basic bpf_rb_{root,node} supportDave Marchevsky2023-02-132-1/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new types, and adds bpf_rb_root_free function for freeing bpf_rb_root in map_values. structs bpf_rb_root and bpf_rb_node are opaque types meant to obscure structs rb_root_cached rb_node, respectively. btf_struct_access will prevent BPF programs from touching these special fields automatically now that they're recognized. btf_check_and_fixup_fields now groups list_head and rb_root together as "graph root" fields and {list,rb}_node as "graph node", and does same ownership cycle checking as before. Note that this function does _not_ prevent ownership type mixups (e.g. rb_root owning list_node) - that's handled by btf_parse_graph_root. After this patch, a bpf program can have a struct bpf_rb_root in a map_value, but not add anything to nor do anything useful with it. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Migrate release_on_unlock logic to non-owning ref semanticsDave Marchevsky2023-02-132-20/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces non-owning reference semantics to the verifier, specifically linked_list API kfunc handling. release_on_unlock logic for refs is refactored - with small functional changes - to implement these semantics, and bpf_list_push_{front,back} are migrated to use them. When a list node is pushed to a list, the program still has a pointer to the node: n = bpf_obj_new(typeof(*n)); bpf_spin_lock(&l); bpf_list_push_back(&l, n); /* n still points to the just-added node */ bpf_spin_unlock(&l); What the verifier considers n to be after the push, and thus what can be done with n, are changed by this patch. Common properties both before/after this patch: * After push, n is only a valid reference to the node until end of critical section * After push, n cannot be pushed to any list * After push, the program can read the node's fields using n Before: * After push, n retains the ref_obj_id which it received on bpf_obj_new, but the associated bpf_reference_state's release_on_unlock field is set to true * release_on_unlock field and associated logic is used to implement "n is only a valid ref until end of critical section" * After push, n cannot be written to, the node must be removed from the list before writing to its fields * After push, n is marked PTR_UNTRUSTED After: * After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF type flag is added to reg's type, indicating that it's a non-owning reference. * NON_OWN_REF flag and logic is used to implement "n is only a valid ref until end of critical section" * n can be written to (except for special fields e.g. bpf_list_node, timer, ...) Summary of specific implementation changes to achieve the above: * release_on_unlock field, ref_set_release_on_unlock helper, and logic to "release on unlock" based on that field are removed * The anonymous active_lock struct used by bpf_verifier_state is pulled out into a named struct bpf_active_lock. * NON_OWN_REF type flag is introduced along with verifier logic changes to handle non-owning refs * Helpers are added to use NON_OWN_REF flag to implement non-owning ref semantics as described above * invalidate_non_owning_refs - helper to clobber all non-owning refs matching a particular bpf_active_lock identity. Replaces release_on_unlock logic in process_spin_lock. * ref_set_non_owning - set NON_OWN_REF type flag after doing some sanity checking * ref_convert_owning_non_owning - convert owning reference w/ specified ref_obj_id to non-owning references. Set NON_OWN_REF flag for each reg with that ref_obj_id and 0-out its ref_obj_id * Update linked_list selftests to account for minor semantic differences introduced by this patch * Writes to a release_on_unlock node ref are not allowed, while writes to non-owning reference pointees are. As a result the linked_list "write after push" failure tests are no longer scenarios that should fail. * The test##missing_lock##op and test##incorrect_lock##op macro-generated failure tests need to have a valid node argument in order to have the same error output as before. Otherwise verification will fail early and the expected error output won't be seen. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: allow to disable bpf map memory accountingYafang Shao2023-02-101-0/+8
| | | | | | | | | | | | | We can simply set root memcg as the map's memcg to disable bpf memory accounting. bpf_map_area_alloc is a little special as it gets the memcg from current rather than from the map, so we need to disable GFP_ACCOUNT specifically for it. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lore.kernel.org/r/20230210154734.4416-4-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: use bpf_map_kvcalloc in bpf_local_storageYafang Shao2023-02-101-0/+8
| | | | | | | | | | | | | | Introduce new helper bpf_map_kvcalloc() for the memory allocation in bpf_local_storage(). Then the allocation will charge the memory from the map instead of from current, though currently they are the same thing as it is only used in map creation path now. By charging map's memory into the memcg from the map, it will be more clear. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lore.kernel.org/r/20230210154734.4416-3-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* mm: memcontrol: add new kernel parameter cgroup.memory=nobpfYafang Shao2023-02-101-0/+11
| | | | | | | | | | | Add new kernel parameter cgroup.memory=nobpf to allow user disable bpf memory accounting. This is a preparation for the followup patch. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lore.kernel.org/r/20230210154734.4416-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Daniel Borkmann says:Jakub Kicinski2023-02-106-8/+109
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ==================== pull-request: bpf-next 2023-02-11 We've added 96 non-merge commits during the last 14 day(s) which contain a total of 152 files changed, 4884 insertions(+), 962 deletions(-). There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c between commit 5b246e533d01 ("ice: split probe into smaller functions") from the net-next tree and commit 66c0e13ad236 ("drivers: net: turn on XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev() is otherwise there a 2nd time, and add XDP features to the existing ice_cfg_netdev() one: [...] ice_set_netdev_features(netdev); netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT | NETDEV_XDP_ACT_XSK_ZEROCOPY; ice_set_ops(netdev); [...] Stephen's merge conflict mail: https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/ The main changes are: 1) Add support for BPF trampoline on s390x which finally allows to remove many test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich. 2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski. 3) Add capability to export the XDP features supported by the NIC. Along with that, add a XDP compliance test tool, from Lorenzo Bianconi & Marek Majtyka. 4) Add __bpf_kfunc tag for marking kernel functions as kfuncs, from David Vernet. 5) Add a deep dive documentation about the verifier's register liveness tracking algorithm, from Eduard Zingerman. 6) Fix and follow-up cleanups for resolve_btfids to be compiled as a host program to avoid cross compile issues, from Jiri Olsa & Ian Rogers. 7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted when testing on different NICs, from Jesper Dangaard Brouer. 8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang. 9) Extend libbpf to add an option for when the perf buffer should wake up, from Jon Doron. 10) Follow-up fix on xdp_metadata selftest to just consume on TX completion, from Stanislav Fomichev. 11) Extend the kfuncs.rst document with description on kfunc lifecycle & stability expectations, from David Vernet. 12) Fix bpftool prog profile to skip attaching to offline CPUs, from Tonghao Zhang. ==================== Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net, xdp: Add missing xdp_features descriptionLorenzo Bianconi2023-02-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add missing xdp_features field description in the struct net_device documentation. This patch fix the following warning: [...] ./include/linux/netdevice.h:2375: warning: Function parameter or member 'xdp_features' not described in 'net_device' [...] Fixes: d3d854fd6a1d ("netdev-genl: create a simple family for netdev stuff") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/7878544903d855b49e838c9d59f715bde0b5e63b.1675705948.git.lorenzo@kernel.org
| * bpf: fix typo in header for bpf_perf_prog_read_valueFlorian Lehner2023-02-031-1/+1
| | | | | | | | | | | | | | | | | | Fix a simple typo in the documentation for bpf_perf_prog_read_value. Signed-off-by: Florian Lehner <dev@der-flo.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230203121439.25884-1-dev@der-flo.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
| * drivers: net: turn on XDP featuresMarek Majtyka2023-02-021-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A summary of the flags being set for various drivers is given below. Note that XDP_F_REDIRECT_TARGET and XDP_F_FRAG_TARGET are features that can be turned off and on at runtime. This means that these flags may be set and unset under RTNL lock protection by the driver. Hence, READ_ONCE must be used by code loading the flag value. Also, these flags are not used for synchronization against the availability of XDP resources on a device. It is merely a hint, and hence the read may race with the actual teardown of XDP resources on the device. This may change in the future, e.g. operations taking a reference on the XDP resources of the driver, and in turn inhibiting turning off this flag. However, for now, it can only be used as a hint to check whether device supports becoming a redirection target. Turn 'hw-offload' feature flag on for: - netronome (nfp) - netdevsim. Turn 'native' and 'zerocopy' features flags on for: - intel (i40e, ice, ixgbe, igc) - mellanox (mlx5). - stmmac - netronome (nfp) Turn 'native' features flags on for: - amazon (ena) - broadcom (bnxt) - freescale (dpaa, dpaa2, enetc) - funeth - intel (igb) - marvell (mvneta, mvpp2, octeontx2) - mellanox (mlx4) - mtk_eth_soc - qlogic (qede) - sfc - socionext (netsec) - ti (cpsw) - tap - tsnep - veth - xen - virtio_net. Turn 'basic' (tx, pass, aborted and drop) features flags on for: - netronome (nfp) - cavium (thunder) - hyperv. Turn 'redirect_target' feature flag on for: - amanzon (ena) - broadcom (bnxt) - freescale (dpaa, dpaa2) - intel (i40e, ice, igb, ixgbe) - ti (cpsw) - marvell (mvneta, mvpp2) - sfc - socionext (netsec) - qlogic (qede) - mellanox (mlx5) - tap - veth - virtio_net - xen Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Marek Majtyka <alardam@gmail.com> Link: https://lore.kernel.org/r/3eca9fafb308462f7edb1f58e451d59209aa07eb.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * netdev-genl: create a simple family for netdev stuffJakub Kicinski2023-02-023-0/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a Netlink spec-compatible family for netdevs. This is a very simple implementation without much thought going into it. It allows us to reap all the benefits of Netlink specs, one can use the generic client to issue the commands: $ ./cli.py --spec netdev.yaml --dump dev_get [{'ifindex': 1, 'xdp-features': set()}, {'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}}, {'ifindex': 3, 'xdp-features': {'rx-sg'}}] the generic python library does not have flags-by-name support, yet, but we also don't have to carry strings in the messages, as user space can get the names from the spec. Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Co-developed-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Drop always true do_idr_lock parameter to bpf_map_free_idTobias Klauser2023-02-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The do_idr_lock parameter to bpf_map_free_id was introduced by commit bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID"). However, all callers set do_idr_lock = true since commit 1e0bd5a091e5 ("bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails"). While at it also inline __bpf_map_put into its only caller bpf_map_put now that do_idr_lock can be dropped from its signature. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Link: https://lore.kernel.org/r/20230202141921.4424-1-tklauser@distanz.ch Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add __bpf_kfunc tag for marking kernel functions as kfuncsDavid Vernet2023-02-021-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kfuncs are functions defined in the kernel, which may be invoked by BPF programs. They may or may not also be used as regular kernel functions, implying that they may be static (in which case the compiler could e.g. inline it away, or elide one or more arguments), or it could have external linkage, but potentially be elided in an LTO build if a function is observed to never be used, and is stripped from the final kernel binary. This has already resulted in some issues, such as those discussed in [0] wherein changes in DWARF that identify when a parameter has been optimized out can break BTF encodings (and in general break the kfunc). [0]: https://lore.kernel.org/all/1675088985-20300-2-git-send-email-alan.maguire@oracle.com/ We therefore require some convenience macro that kfunc developers can use just add to their kfuncs, and which will prevent all of the above issues from happening. This is in contrast with what we have today, where some kfunc definitions have "noinline", some have "__used", and others are static and have neither. Note that longer term, this mechanism may be replaced by a macro that more closely resembles EXPORT_SYMBOL_GPL(), as described in [1]. For now, we're going with this shorter-term approach to fix existing issues in kfuncs. [1]: https://lore.kernel.org/lkml/Y9AFT4pTydKh+PD3@maniforge.lan/ Note as well that checkpatch complains about this patch with the following: ERROR: Macros with complex values should be enclosed in parentheses +#define __bpf_kfunc __used noinline There seems to be a precedent for using this pattern in other places such as compiler_types.h (see e.g. __randomize_layout and noinstr), so it seems appropriate. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230201173016.342758-2-void@manifault.com
| * s390/bpf: Implement arch_prepare_bpf_trampoline()Ilya Leoshkevich2023-01-291-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | arch_prepare_bpf_trampoline() is used for direct attachment of eBPF programs to various places, bypassing kprobes. It's responsible for calling a number of eBPF programs before, instead and/or after whatever they are attached to. Add a s390x implementation, paying attention to the following: - Reuse the existing JIT infrastructure, where possible. - Like the existing JIT, prefer making multiple passes instead of backpatching. Currently 2 passes is enough. If literal pool is introduced, this needs to be raised to 3. However, at the moment adding literal pool only makes the code larger. If branch shortening is introduced, the number of passes needs to be increased even further. - Support both regular and ftrace calling conventions, depending on the trampoline flags. - Use expolines for indirect calls. - Handle the mismatch between the eBPF and the s390x ABIs. - Sign-extend fmod_ret return values. invoke_bpf_prog() produces about 120 bytes; it might be possible to slightly optimize this, but reaching 50 bytes, like on x86_64, looks unrealistic: just loading cookie, __bpf_prog_enter, bpf_func, insnsi and __bpf_prog_exit as literals already takes at least 5 * 12 = 60 bytes, and we can't use relative addressing for most of them. Therefore, lower BPF_MAX_TRAMP_LINKS on s390x. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230129190501.1624747-5-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: btf: Add BTF_FMODEL_SIGNED_ARG flagIlya Leoshkevich2023-01-282-5/+14
| | | | | | | | | | | | | | | | | | | | s390x eBPF JIT needs to know whether a function return value is signed and which function arguments are signed, in order to generate code compliant with the s390x ABI. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230128000650.1516334-26-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Change BPF_MAX_TRAMP_LINKS to enumIlya Leoshkevich2023-01-281-1/+3
| | | | | | | | | | | | | | | | This way it's possible to query its value from testcases using BTF. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230128000650.1516334-3-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | net: extract nf_ct_handle_fragments to nf_conntrack_ovsXin Long2023-02-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now handle_fragments() in OVS and TC have the similar code, and this patch removes the duplicate code by moving the function to nf_conntrack_ovs. Note that skb_clear_hash(skb) or skb->ignore_df = 1 should be done only when defrag returns 0, as it does in other places in kernel. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: extract nf_ct_skb_network_trim function to nf_conntrack_ovsXin Long2023-02-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | There are almost the same code in ovs_skb_network_trim() and tcf_ct_skb_network_trim(), this patch extracts them into a function nf_ct_skb_network_trim() and moves the function to nf_conntrack_ovs. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: skbuff: drop the word head from skb cacheJakub Kicinski2023-02-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skbuff_head_cache is misnamed (perhaps for historical reasons?) because it does not hold heads. Head is the buffer which skb->data points to, and also where shinfo lives. struct sk_buff is a metadata structure, not the head. Eric recently added skb_small_head_cache (which allocates actual head buffers), let that serve as an excuse to finally clean this up :) Leave the user-space visible name intact, it could possibly be uAPI. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'rxrpc-next-20230208' of ↵David S. Miller2023-02-101-4/+7
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc development Here are some miscellaneous changes for rxrpc: (1) Use consume_skb() rather than kfree_skb_reason(). (2) Fix unnecessary waking when poking and already-poked call. (3) Add ack.rwind to the rxrpc_tx_ack tracepoint as this indicates how many incoming DATA packets we're telling the peer that we are currently willing to accept on this call. (4) Reduce duplicate ACK transmission. We send ACKs to let the peer know that we're increasing the receive window (ack.rwind) as we consume packets locally. Normal ACK transmission is triggered in three places and that leads to duplicates being sent. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | rxrpc: Trace ack.rwindDavid Howells2023-02-071-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | Log ack.rwind in the rxrpc_tx_ack tracepoint. This value is useful to see as it represents flow-control information to the peer. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* | | string_helpers: Move string_is_valid() to the headerAndy Shevchenko2023-02-091-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move string_is_valid() to the header for wider use. While at it, rename to string_is_terminated() to be precise about its semantics. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230208133153.22528-1-andriy.shevchenko@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | net: introduce default_rps_mask netns attributePaolo Abeni2023-02-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If RPS is enabled, this allows configuring a default rps mask, which is effective since receive queue creation time. A default RPS mask allows the system admin to ensure proper isolation, avoiding races at network namespace or device creation time. The default RPS mask is initially empty, and can be modified via a newly added sysctl entry. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2023-02-0912-22/+44
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/devlink/leftover.c / net/core/devlink.c: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") f05bd8ebeb69 ("devlink: move code to a dedicated directory") 687125b5799c ("devlink: split out core code") https://lore.kernel.org/all/20230208094657.379f2b1a@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * \ \ Merge tag 'net-6.2-rc8' of ↵Linus Torvalds2023-02-093-3/+12
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from can and ipsec subtrees. Current release - regressions: - sched: fix off by one in htb_activate_prios() - eth: mana: fix accessing freed irq affinity_hint - eth: ice: fix out-of-bounds KASAN warning in virtchnl Current release - new code bugs: - eth: mtk_eth_soc: enable special tag when any MAC uses DSA Previous releases - always broken: - core: fix sk->sk_txrehash default - neigh: make sure used and confirmed times are valid - mptcp: be careful on subflow status propagation on errors - xfrm: prevent potential spectre v1 gadget in xfrm_xlate32_attr() - phylink: move phy_device_free() to correctly release phy device - eth: mlx5: - fix crash unsetting rx-vlan-filter in switchdev mode - fix hang on firmware reset - serialize module cleanup with reload and remove" * tag 'net-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits) selftests: forwarding: lib: quote the sysctl values net: mscc: ocelot: fix all IPv6 getting trapped to CPU when PTP timestamping is used rds: rds_rm_zerocopy_callback() use list_first_entry() net: txgbe: Update support email address selftests: Fix failing VXLAN VNI filtering test selftests: mptcp: stop tests earlier selftests: mptcp: allow more slack for slow test-case mptcp: be careful on subflow status propagation on errors mptcp: fix locking for in-kernel listener creation mptcp: fix locking for setsockopt corner-case mptcp: do not wait for bare sockets' timeout net: ethernet: mtk_eth_soc: fix DSA TX tag hwaccel for switch port 0 nfp: ethtool: fix the bug of setting unsupported port speed txhash: fix sk->sk_txrehash default net: ethernet: mtk_eth_soc: fix wrong parameters order in __xdp_rxq_info_reg() net: ethernet: mtk_eth_soc: enable special tag when any MAC uses DSA net: sched: sch: Fix off by one in htb_activate_prios() igc: Add ndo_tx_timeout support net: mana: Fix accessing freed irq affinity_hint hv_netvsc: Allocate memory in netvsc_dma_map() with GFP_ATOMIC ...
| | * | | net/mlx5: Expose SF firmware pages counterMaher Sanalla2023-02-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, each core device has VF pages counter which stores number of fw pages used by its VFs and SFs. The current design led to a hang when performing firmware reset on DPU, where the DPU PFs stalled in sriov unload flow due to waiting on release of SFs pages instead of waiting on only VFs pages. Thus, Add a separate counter for SF firmware pages, which will prevent the stall scenario described above. Fixes: 1958fc2f0712 ("net/mlx5: SF, Add auxiliary device driver") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| | * | | net/mlx5: Store page counters in a single arrayMaher Sanalla2023-02-071-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, an independent page counter is used for tracking memory usage for each function type such as VF, PF and host PF (DPU). For better code-readibilty, use a single array that stores the number of allocated memory pages for each function type. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| | * | | uapi: add missing ip/ipv6 header dependencies for linux/stddef.hHerton R. Krzesinski2023-02-062-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 58e0be1ef6118 ("net: use struct_group to copy ip/ipv6 header addresses"), ip and ipv6 headers started to use the __struct_group definition, which is defined at include/uapi/linux/stddef.h. However, linux/stddef.h isn't explicitly included in include/uapi/linux/{ip,ipv6}.h, which breaks build of xskxceiver bpf selftest if you install the uapi headers in the system: $ make V=1 xskxceiver -C tools/testing/selftests/bpf ... make: Entering directory '(...)/tools/testing/selftests/bpf' gcc -g -O0 -rdynamic -Wall -Werror (...) In file included from xskxceiver.c:79: /usr/include/linux/ip.h:103:9: error: expected specifier-qualifier-list before ‘__struct_group’ 103 | __struct_group(/* no tag */, addrs, /* no attrs */, | ^~~~~~~~~~~~~~ ... Include the missing <linux/stddef.h> dependency in ip.h and do the same for the ipv6.h header. Fixes: 58e0be1ef611 ("net: use struct_group to copy ip/ipv6 header addresses") Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge tag 'char-misc-6.2-rc7' of ↵Linus Torvalds2023-02-051-2/+0
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are a number of small char/misc/whatever driver fixes. They include: - IIO driver fixes for some reported problems - nvmem driver fixes - fpga driver fixes - debugfs memory leak fix in the hv_balloon and irqdomain code (irqdomain change was acked by the maintainer) All have been in linux-next with no reported problems" * tag 'char-misc-6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (33 commits) kernel/irq/irqdomain.c: fix memory leak with using debugfs_lookup() HV: hv_balloon: fix memory leak with using debugfs_lookup() nvmem: qcom-spmi-sdam: fix module autoloading nvmem: core: fix return value nvmem: core: fix cell removal on error nvmem: core: fix device node refcounting nvmem: core: fix registration vs use race nvmem: core: fix cleanup after dev_set_name() nvmem: core: remove nvmem_config wp_gpio nvmem: core: initialise nvmem->id early nvmem: sunxi_sid: Always use 32-bit MMIO reads nvmem: brcm_nvram: Add check for kzalloc iio: imu: fxos8700: fix MAGN sensor scale and unit iio: imu: fxos8700: remove definition FXOS8700_CTRL_ODR_MIN iio: imu: fxos8700: fix failed initialization ODR mode assignment iio: imu: fxos8700: fix incorrect ODR mode readback iio: light: cm32181: Fix PM support on system with 2 I2C resources iio: hid: fix the retval in gyro_3d_capture_sample iio: hid: fix the retval in accel_3d_capture_sample iio: imu: st_lsm6dsx: fix build when CONFIG_IIO_TRIGGERED_BUFFER=m ...
| | * | | | nvmem: core: remove nvmem_config wp_gpioRussell King (Oracle)2023-01-281-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No one provides wp_gpio, so let's remove it to avoid issues with the nvmem core putting this gpio. Cc: stable@vger.kernel.org Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Link: https://lore.kernel.org/r/20230127104015.23839-5-srinivas.kandagatla@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | Merge tag 'perf_urgent_for_v6.2_rc7' of ↵Linus Torvalds2023-02-051-0/+9
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Lock the proper critical section when dealing with perf event context * tag 'perf_urgent_for_v6.2_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix perf_event_pmu_context serialization
| | * | | | | perf: Fix perf_event_pmu_context serializationJames Clark2023-01-311-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzkaller triggered a WARN in put_pmu_ctx(). WARNING: CPU: 1 PID: 2245 at kernel/events/core.c:4925 put_pmu_ctx+0x1f0/0x278 This is because there is no locking around the access of "if (!epc->ctx)" in find_get_pmu_context() and when it is set to NULL in put_pmu_ctx(). The decrement of the reference count in put_pmu_ctx() also happens outside of the spinlock, leading to the possibility of this order of events, and the context being cleared in put_pmu_ctx(), after its refcount is non zero: CPU0 CPU1 find_get_pmu_context() if (!epc->ctx) == false put_pmu_ctx() atomic_dec_and_test(&epc->refcount) == true epc->refcount == 0 atomic_inc(&epc->refcount); epc->refcount == 1 list_del_init(&epc->pmu_ctx_entry); epc->ctx = NULL; Another issue is that WARN_ON for no active PMU events in put_pmu_ctx() is outside of the lock. If the perf_event_pmu_context is an embedded one, even after clearing it, it won't be deleted and can be re-used. So the warning can trigger. For this reason it also needs to be moved inside the lock. The above warning is very quick to trigger on Arm by running these two commands at the same time: while true; do perf record -- ls; done while true; do perf record -- ls; done [peterz: atomic_dec_and_raw_lock*()] Fixes: bd2756811766 ("perf: Rewrite core context handling") Reported-by: syzbot+697196bc0265049822bd@syzkaller.appspotmail.com Signed-off-by: James Clark <james.clark@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20230127143141.1782804-2-james.clark@arm.com
| * | | | | | Merge tag 'rtc-6.2-fixes' of ↵Linus Torvalds2023-02-041-1/+2
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC fixes from Alexandre Belloni: "Here are a few fixes for 6.2. The EFI one is the most important as it allows some RTCs to actually work. The other two are warnings that are worth fixing. - efi: make WAKEUP services optional - sunplus: fix format string warning" * tag 'rtc-6.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: rtc: sunplus: fix format string for printing resource dt-bindings: rtc: qcom-pm8xxx: allow 'wakeup-source' property rtc: efi: Enable SET/GET WAKEUP services as optional
| | * | | | | | rtc: efi: Enable SET/GET WAKEUP services as optionalShanker Donthineni2023-01-091-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current implementation of rtc-efi is expecting all the 4 time services GET{SET}_TIME{WAKEUP} must be supported by UEFI firmware. As per the EFI_RT_PROPERTIES_TABLE, the platform specific implementations can choose to enable selective time services based on the RTC device capabilities. This patch does the following changes to provide GET/SET RTC services on platforms that do not support the WAKEUP feature. 1) Relax time services cap check when creating a platform device. 2) Clear RTC_FEATURE_ALARM bit in the absence of WAKEUP services. 3) Conditional alarm entries in '/proc/driver/rtc'. Cc: <stable@vger.kernel.org> # v6.0+ Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Link: https://lore.kernel.org/r/20230102230630.192911-1-sdonthineni@nvidia.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
| * | | | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2023-02-041-1/+1
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "ARM64: - Yet another fix for non-CPU accesses to the memory backing the VGICv3 subsystem - A set of fixes for the setlftest checking for the S1PTW behaviour after the fix that went in ealier in the cycle" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: selftests: aarch64: Test read-only PT memory regions KVM: selftests: aarch64: Fix check of dirty log PT write KVM: selftests: aarch64: Do not default to dirty PTE pages on all S1PTWs KVM: selftests: aarch64: Relax userfaultfd read vs. write checks KVM: arm64: Allow no running vcpu on saving vgic3 pending table KVM: arm64: Allow no running vcpu on restoring vgic3 LPI pending status KVM: arm64: Add helper vgic_write_guest_lock()
| | * \ \ \ \ \ \ Merge tag 'kvmarm-fixes-6.2-3' of ↵Paolo Bonzini2023-02-041-1/+1
| | |\ \ \ \ \ \ \ | | | |_|_|/ / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.2, take #3 - Yet another fix for non-CPU accesses to the memory backing the VGICv3 subsystem - A set of fixes for the setlftest checking for the S1PTW behaviour after the fix that went in ealier in the cycle
| | | * | | | | | KVM: arm64: Add helper vgic_write_guest_lock()Gavin Shan2023-01-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the unknown no-running-vcpu sites are reported when a dirty page is tracked by mark_page_dirty_in_slot(). Until now, the only known no-running-vcpu site is saving vgic/its tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on KVM device "kvm-arm-vgic-its". Unfortunately, there are more unknown sites to be handled and no-running-vcpu context will be allowed in these sites: (1) KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} command on KVM device "kvm-arm-vgic-its" to restore vgic/its tables. The vgic3 LPI pending status could be restored. (2) Save vgic3 pending table through KVM_DEV_ARM_{VGIC_GRP_CTRL, VGIC_SAVE_PENDING_TABLES} command on KVM device "kvm-arm-vgic-v3". In order to handle those unknown cases, we need a unified helper vgic_write_guest_lock(). struct vgic_dist::save_its_tables_in_progress is also renamed to struct vgic_dist::save_tables_in_progress. No functional change intended. Suggested-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230126235451.469087-3-gshan@redhat.com
| * | | | | | | | Merge tag 'ceph-for-6.2-rc7' of https://github.com/ceph/ceph-clientLinus Torvalds2023-02-031-10/+0
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph fix from Ilya Dryomov: "A safeguard to prevent the kernel client from further damaging the filesystem after running into a case of an invalid snap trace. The root cause of this metadata corruption is still being investigated but it appears to be stemming from the MDS. As such, this is the best we can do for now" * tag 'ceph-for-6.2-rc7' of https://github.com/ceph/ceph-client: ceph: blocklist the kclient when receiving corrupted snap trace ceph: move mount state enum to super.h
| | * | | | | | | | ceph: move mount state enum to super.hXiubo Li2023-02-021-10/+0
| | | |_|_|/ / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These flags are only used in ceph filesystem in fs/ceph, so just move it to the place it should be. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Venky Shankar <vshankar@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | | | | Merge tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of ↵Linus Torvalds2023-02-034-5/+20
| |\ \ \ \ \ \ \ \ | | |_|_|_|_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "25 hotfixes, mainly for MM. 13 are cc:stable" * tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (26 commits) mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath() Kconfig.debug: fix the help description in SCHED_DEBUG mm/swapfile: add cond_resched() in get_swap_pages() mm: use stack_depot_early_init for kmemleak Squashfs: fix handling and sanity checking of xattr_ids count sh: define RUNTIME_DISCARD_EXIT highmem: round down the address passed to kunmap_flush_on_unmap() migrate: hugetlb: check for hugetlb shared PMD in node migration mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups Revert "mm: kmemleak: alloc gray object for reserved region with direct map" freevxfs: Kconfig: fix spelling maple_tree: should get pivots boundary by type .mailmap: update e-mail address for Eugen Hristev mm, mremap: fix mremap() expanding for vma's with vm_ops->close() squashfs: harden sanity check in squashfs_read_xattr_id_table ia64: fix build error due to switch case label appearing next to declaration mm: multi-gen LRU: fix crash during cgroup migration Revert "mm: add nodes= arg to memory.reclaim" zsmalloc: fix a race with deferred_handles storing ...
| | * | | | | | | mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath()Kefeng Wang2023-01-311-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As commit 18365225f044 ("hwpoison, memcg: forcibly uncharge LRU pages"), hwpoison will forcibly uncharg a LRU hwpoisoned page, the folio_memcg could be NULl, then, mem_cgroup_track_foreign_dirty_slowpath() could occurs a NULL pointer dereference, let's do not record the foreign writebacks for folio memcg is null in mem_cgroup_track_foreign_dirty() to fix it. Link: https://lkml.kernel.org/r/20230129040945.180629-1-wangkefeng.wang@huawei.com Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Ma Wupeng <mawupeng1@huawei.com> Tested-by: Miko Larsson <mikoxyzzz@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | * | | | | | | highmem: round down the address passed to kunmap_flush_on_unmap()Matthew Wilcox (Oracle)2023-01-311-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already round down the address in kunmap_local_indexed() which is the other implementation of __kunmap_local(). The only implementation of kunmap_flush_on_unmap() is PA-RISC which is expecting a page-aligned address. This may be causing PA-RISC to be flushing the wrong addresses currently. Link: https://lkml.kernel.org/r/20230126200727.1680362-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Fixes: 298fa1ad5571 ("highmem: Provide generic variant of kmap_atomic*") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Helge Deller <deller@gmx.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: David Sterba <dsterba@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | * | | | | | | mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smapsMike Kravetz2023-01-311-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs". This issue of mapcount in hugetlb pages referenced by shared PMDs was discussed in [1]. The following two patches address user visible behavior caused by this issue. [1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/ This patch (of 2): A hugetlb page will have a mapcount of 1 if mapped by multiple processes via a shared PMD. This is because only the first process increases the map count, and subsequent processes just add the shared PMD page to their page table. page_mapcount is being used to decide if a hugetlb page is shared or private in /proc/PID/smaps. Pages referenced via a shared PMD were incorrectly being counted as private. To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found count the hugetlb page as shared. A new helper to check for a shared PMD is added. [akpm@linux-foundation.org: simplification, per David] [akpm@linux-foundation.org: hugetlb.h: include page_ref.h for page_count()] Link: https://lkml.kernel.org/r/20230126222721.222195-2-mike.kravetz@oracle.com Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | * | | | | | | Revert "mm: add nodes= arg to memory.reclaim"Michal Hocko2023-01-311-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5. Although it is recognized that a finer grained pro-active reclaim is something we need and want the semantic of this implementation is really ambiguous. In a follow up discussion it became clear that there are two essential usecases here. One is to use memory.reclaim to pro-actively reclaim memory and expectation is that the requested and reported amount of memory is uncharged from the memcg. Another usecase focuses on pro-active demotion when the memory is merely shuffled around to demotion targets while the overall charged memory stays unchanged. The current implementation considers demoted pages as reclaimed and that break both usecases. [1] has tried to address the reporting part but there are more issues with that summarized in [2] and follow up emails. Let's revert the nodemask based extension of the memcg pro-active reclaim for now until we settle with a more robust semantic. [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz Link: https://lkml.kernel.org/r/Y5xASNe1x8cusiTx@dhcp22.suse.cz Fixes: 12a5d3955227b0d ("mm: add nodes= arg to memory.reclaim") Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | * | | | | | | Sync with v6.2-rc4Andrew Morton2023-01-1811-50/+149
| | |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge branch 'master' into mm-hotfixes-stable
* | | \ \ \ \ \ \ \ Merge tag 'linux-can-next-for-6.3-20230208' of ↵Jakub Kicinski2023-02-081-1/+1
|\ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== can-next 2023-02-08 The 1st patch is by Oliver Hartkopp and cleans up the CAN_RAW's raw_setsockopt() for CAN_RAW_FD_FRAMES. The 2nd patch is by me and fixes the compilation if CONFIG_CAN_CALC_BITTIMING is disabled. (Problem introduced in last pull request to next-next.) * tag 'linux-can-next-for-6.3-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: bittiming: can_calc_bittiming(): add missing parameter to no-op function can: raw: use temp variable instead of rolling back config ==================== Link: https://lore.kernel.org/r/20230208210014.3169347-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | | | | | can: bittiming: can_calc_bittiming(): add missing parameter to no-op functionMarc Kleine-Budde2023-02-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 286c0e09e8e0 ("can: bittiming: can_changelink() pass extack down callstack") a new parameter was added to can_calc_bittiming(), however the static inline no-op (which is used if CONFIG_CAN_CALC_BITTIMING is disabled) wasn't converted. Add the new parameter to the static inline no-op of can_calc_bittiming(). Fixes: 286c0e09e8e0 ("can: bittiming: can_changelink() pass extack down callstack") Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/20230207201734.2905618-1-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
* | | | | | | | | | | Merge tag 'mlx5-next-netdev-deadlock' of ↵Jakub Kicinski2023-02-083-5/+48
|\ \ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5-next-netdev-deadlock This series from Jiri solves a deadlock when removing a network namespace with mlx5 devlink instance being in it. The deadlock is between: 1) mlx5_ib->unregister_netdevice_notifier() AND 2) mlx5_core->devlink_reload->cleanup_net() To slove this introduced mlx5 netdev added/removed events to track uplink netdev to be used for register_netdevice_notifier_dev_net() purposes. * tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister net/mlx5e: Propagate an internal event in case uplink netdev changes net/mlx5e: Fix trap event handling net/mlx5: Introduce CQE error syndrome ==================== Link: https://lore.kernel.org/r/20230208005626.72930-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | | | | | | net/mlx5e: Propagate an internal event in case uplink netdev changesJiri Pirko2023-02-082-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever uplink netdev is set/cleared, propagate newly introduced event to inform notifier blocks netdev was added/removed. Move the set() helper to core.c from header, introduce clear() and netdev_added_event_replay() helpers. The last one is going to be called from rdma driver, so export it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * | | | | | | | | | | net/mlx5: Introduce CQE error syndromePatrisious Haddad2023-01-151-5/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduces CQE error syndrome bits which are inside qp_context_extension and are used to report the reason the QP was moved to error state. Useful for cases in which a CQE isn't generated, such as remote write rkey violation. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://lore.kernel.org/r/f8359315f8130f6d2abe4b94409ac7802f54bce3.1672821186.git.leonro@nvidia.com Reviewed-by: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>