summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* net: ieee802154: return -EINVAL for unknown addr typeAlexander Aring2022-10-261-3/+9
| | | | | | | | | | | | | commit 30393181fdbc1608cc683b4ee99dcce05ffcc8c7 upstream. This patch adds handling to return -EINVAL for an unknown addr type. The current behaviour is to return 0 as successful but the size of an unknown addr type is not defined and should return an error like -EINVAL. Fixes: 94160108a70c ("net/ieee802154: fix uninit value bug in dgram_sendmsg") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* io_uring/af_unix: defer registered files gc to io_uring releasePavel Begunkov2022-10-261-0/+2
| | | | | | | | | | | | | | | | | | | | [ upstream commit 0091bfc81741b8d3aeb3b7ab8636f911b2de6e80 ] Instead of putting io_uring's registered files in unix_gc() we want it to be done by io_uring itself. The trick here is to consider io_uring registered files for cycle detection but not actually putting them down. Because io_uring can't register other ring instances, this will remove all refs to the ring file triggering the ->release path and clean up with io_ring_ctx_free(). Cc: stable@vger.kernel.org Fixes: 6b06314c47e1 ("io_uring: add file set registration") Reported-and-tested-by: David Bouman <dbouman03@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> [axboe: add kerneldoc comment to skb, fold in skb leak fix] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* scsi: tracing: Fix compile error in trace_array calls when TRACING is disabledArun Easi2022-10-261-2/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 1a77dd1c2bb5d4a58c16d198cf593720787c02e4 ] Fix this compilation error seen when CONFIG_TRACING is not enabled: drivers/scsi/qla2xxx/qla_os.c: In function 'qla_trace_init': drivers/scsi/qla2xxx/qla_os.c:2854:25: error: implicit declaration of function 'trace_array_get_by_name'; did you mean 'trace_array_set_clr_event'? [-Werror=implicit-function-declaration] 2854 | qla_trc_array = trace_array_get_by_name("qla2xxx"); | ^~~~~~~~~~~~~~~~~~~~~~~ | trace_array_set_clr_event drivers/scsi/qla2xxx/qla_os.c: In function 'qla_trace_uninit': drivers/scsi/qla2xxx/qla_os.c:2869:9: error: implicit declaration of function 'trace_array_put' [-Werror=implicit-function-declaration] 2869 | trace_array_put(qla_trc_array); | ^~~~~~~~~~~~~~~ Link: https://lore.kernel.org/r/20220907233308.4153-2-aeasi@marvell.com Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Arun Easi <aeasi@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* eventfd: guard wake_up in eventfd fs calls as wellDylan Yudaken2022-10-262-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9f0deaa12d832f488500a5afe9b912e9b3cfc432 ] Guard wakeups that the user can trigger, and that may end up triggering a call back into eventfd_signal. This is in addition to the current approach that only guards in eventfd_signal. Rename in_eventfd_signal -> in_eventfd at the same time to reflect this. Without this there would be a deadlock in the following code using libaio: int main() { struct io_context *ctx = NULL; struct iocb iocb; struct iocb *iocbs[] = { &iocb }; int evfd; uint64_t val = 1; evfd = eventfd(0, EFD_CLOEXEC); assert(!io_setup(2, &ctx)); io_prep_poll(&iocb, evfd, POLLIN); io_set_eventfd(&iocb, evfd); assert(1 == io_submit(ctx, 1, iocbs)); write(evfd, &val, 8); } Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220816135959.1490641-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* iommu/iova: Fix module config properlyRobin Murphy2022-10-261-1/+1
| | | | | | | | | | | | | | | | | | | [ Upstream commit 4f58330fcc8482aa90674e1f40f601e82f18ed4a ] IOMMU_IOVA is intended to be an optional library for users to select as and when they desire. Since it can be a module now, this means that built-in code which has chosen not to select it should not fail to link if it happens to have selected as a module by someone else. Replace IS_ENABLED() with IS_REACHABLE() to do the right thing. CC: Thierry Reding <thierry.reding@gmail.com> Reported-by: John Garry <john.garry@huawei.com> Fixes: 15bbdec3931e ("iommu: Make the iova library a module") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Thierry Reding <treding@nvidia.com> Link: https://lore.kernel.org/r/548c2f683ca379aface59639a8f0cccc3a1ac050.1663069227.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* scsi: iscsi: Add recv workqueue helpersMike Christie2022-10-261-0/+4
| | | | | | | | | | | | | | [ Upstream commit 8af809966c0b34cfacd8da9a412689b8e9910354 ] Add helpers to allow the drivers to run their recv paths from libiscsi's workqueue. Link: https://lore.kernel.org/r/20220616224557.115234-3-michael.christie@oracle.com Reviewed-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Stable-dep-of: 57569c37f0ad ("scsi: iscsi: iscsi_tcp: Fix null-ptr-deref while calling getpeername()") Signed-off-by: Sasha Levin <sashal@kernel.org>
* scsi: iscsi: Rename iscsi_conn_queue_work()Mike Christie2022-10-261-1/+1
| | | | | | | | | | | | | | | [ Upstream commit 4b9f8ce4d5e823e42944c5a0a4842b0f936365ad ] Rename iscsi_conn_queue_work() to iscsi_conn_queue_xmit() to reflect that it handles queueing of xmits only. Link: https://lore.kernel.org/r/20220616224557.115234-2-michael.christie@oracle.com Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Stable-dep-of: 57569c37f0ad ("scsi: iscsi: iscsi_tcp: Fix null-ptr-deref while calling getpeername()") Signed-off-by: Sasha Levin <sashal@kernel.org>
* serial: 8250: Toggle IER bits on only after irq has been set upIlpo Järvinen2022-10-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 039d4926379b1d1c17b51cf21c500a5eed86899e ] Invoking TIOCVHANGUP on 8250_mid port on Ice Lake-D and then reopening the port triggers these faults during serial8250_do_startup(): DMAR: DRHD: handling fault status reg 3 DMAR: [DMA Write NO_PASID] Request device [00:1a.0] fault addr 0x0 [fault reason 0x05] PTE Write access is not set If the IRQ hasn't been set up yet, the UART will have zeroes in its MSI address/data registers. Disabling the IRQ at the interrupt controller won't stop the UART from performing a DMA write to the address programmed in its MSI address register (zero) when it wants to signal an interrupt. The UARTs (in Ice Lake-D) implement PCI 2.1 style MSI without masking capability, so there is no way to mask the interrupt at the source PCI function level, except disabling the MSI capability entirely, but that would cause it to fall back to INTx# assertion, and the PCI specification prohibits disabling the MSI capability as a way to mask a function's interrupt service request. The MSI address register is zeroed by the hangup as the irq is freed. The interrupt is signalled during serial8250_do_startup() performing a THRE test that temporarily toggles THRI in IER. The THRE test currently occurs before UART's irq (and MSI address) is properly set up. Refactor serial8250_do_startup() such that irq is set up before the THRE test. The current irq setup code is intermixed with the timer setup code. As THRE test must be performed prior to the timer setup, extract it into own function and call it only after the THRE test. The ->setup_timer() needs to be part of the struct uart_8250_ops in order to not create circular dependency between 8250 and 8250_base modules. Fixes: 40b36daad0ac ("[PATCH] 8250 UART backup timer") Reported-by: Lennert Buytenhek <buytenh@arista.com> Tested-by: Lennert Buytenhek <buytenh@arista.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220922070005.2965-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ata: fix ata_id_has_dipm()Niklas Cassel2022-10-261-11/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 630624cb1b5826d753ac8e01a0e42de43d66dedf ] ACS-5 section 7.13.6.36 Word 78: Serial ATA features supported states that: If word 76 is not 0000h or FFFFh, word 78 reports the features supported by the device. If this word is not supported, the word shall be cleared to zero. (This text also exists in really old ACS standards, e.g. ACS-3.) The problem with ata_id_has_dipm() is that the while it performs a check against 0 and 0xffff, it performs the check against ATA_ID_FEATURE_SUPP (word 78), the same word where the feature bit is stored. Fix this by performing the check against ATA_ID_SATA_CAPABILITY (word 76), like required by the spec. The feature bit check itself is of course still performed against ATA_ID_FEATURE_SUPP (word 78). Additionally, move the macro to the other ATA_ID_FEATURE_SUPP macros (which already have this check), thus making it more likely that the next ATA_ID_FEATURE_SUPP macro that is added will include this check. Fixes: ca77329fb713 ("[libata] Link power management infrastructure") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ata: fix ata_id_has_ncq_autosense()Niklas Cassel2022-10-261-2/+4
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a5fb6bf853148974dbde092ec1bde553bea5e49f ] ACS-5 section 7.13.6.36 Word 78: Serial ATA features supported states that: If word 76 is not 0000h or FFFFh, word 78 reports the features supported by the device. If this word is not supported, the word shall be cleared to zero. (This text also exists in really old ACS standards, e.g. ACS-3.) Additionally, move the macro to the other ATA_ID_FEATURE_SUPP macros (which already have this check), thus making it more likely that the next ATA_ID_FEATURE_SUPP macro that is added will include this check. Fixes: 5b01e4b9efa0 ("libata: Implement NCQ autosense") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ata: fix ata_id_has_devslp()Niklas Cassel2022-10-261-1/+4
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9c6e09a434e1317e09b78b3b69cd384022ec9a03 ] ACS-5 section 7.13.6.36 Word 78: Serial ATA features supported states that: If word 76 is not 0000h or FFFFh, word 78 reports the features supported by the device. If this word is not supported, the word shall be cleared to zero. (This text also exists in really old ACS standards, e.g. ACS-3.) Additionally, move the macro to the other ATA_ID_FEATURE_SUPP macros (which already have this check), thus making it more likely that the next ATA_ID_FEATURE_SUPP macro that is added will include this check. Fixes: 65fe1f0f66a5 ("ahci: implement aggressive SATA device sleep support") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ata: fix ata_id_sense_reporting_enabled() and ata_id_has_sense_reporting()Niklas Cassel2022-10-261-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 690aa8c3ae308bc696ec8b1b357b995193927083 ] ACS-5 section 7.13.6.41 Words 85..87, 120: Commands and feature sets supported or enabled states that: If bit 15 of word 86 is set to one, bit 14 of word 119 is set to one, and bit 15 of word 119 is cleared to zero, then word 119 is valid. If bit 15 of word 86 is set to one, bit 14 of word 120 is set to one, and bit 15 of word 120 is cleared to zero, then word 120 is valid. (This text also exists in really old ACS standards, e.g. ACS-3.) Currently, ata_id_sense_reporting_enabled() and ata_id_has_sense_reporting() both check bit 15 of word 86, but neither of them check that bit 14 of word 119 is set to one, or that bit 15 of word 119 is cleared to zero. Additionally, make ata_id_sense_reporting_enabled() return false if !ata_id_has_sense_reporting(), similar to how e.g. ata_id_flush_ext_enabled() returns false if !ata_id_has_flush_ext(). Fixes: e87fd28cf9a2 ("libata: Implement support for sense data reporting") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* dyndbg: drop EXPORTed dynamic_debug_exec_queriesJim Cromie2022-10-261-9/+0
| | | | | | | | | | | | | | | | | | | | [ Upstream commit e26ef3af964acfea311403126acee8c56c89e26b ] This exported fn is unused, and will not be needed. Lets dump it. The export was added to let drm control pr_debugs, as part of using them to avoid drm_debug_enabled overheads. But its better to just implement the drm.debug bitmap interface, then its available for everyone. Fixes: a2d375eda771 ("dyndbg: refine export, rename to dynamic_debug_exec_queries()") Fixes: 4c0d77828d4f ("dyndbg: export ddebug_exec_queries") Acked-by: Jason Baron <jbaron@akamai.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Link: https://lore.kernel.org/r/20220904214134.408619-10-jim.cromie@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* dyndbg: fix module.dyndbg handlingJim Cromie2022-10-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 85d6b66d31c35158364058ee98fb69ab5bb6a6b1 ] For CONFIG_DYNAMIC_DEBUG=N, the ddebug_dyndbg_module_param_cb() stub-fn is too permissive: bash-5.1# modprobe drm JUNKdyndbg bash-5.1# modprobe drm dyndbgJUNK [ 42.933220] dyndbg param is supported only in CONFIG_DYNAMIC_DEBUG builds [ 42.937484] ACPI: bus type drm_connector registered This caused no ill effects, because unknown parameters are either ignored by default with an "unknown parameter" warning, or ignored because dyndbg allows its no-effect use on non-dyndbg builds. But since the code has an explicit feedback message, it should be issued accurately. Fix with strcmp for exact param-name match. Fixes: b48420c1d301 dynamic_debug: make dynamic-debug work for module initialization Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Jason Baron <jbaron@akamai.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Link: https://lore.kernel.org/r/20220904214134.408619-3-jim.cromie@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* RDMA/mlx5: Don't compare mkey tags in DEVX indirect mkeyAharon Landau2022-10-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 13ad1125b941a5f257d9d3ae70485773abd34792 ] According to the ib spec: If the CI supports the Base Memory Management Extensions defined in this specification, the L_Key format must consist of: 24 bit index in the most significant bits of the R_Key, and 8 bit key in the least significant bits of the R_Key Through a successful Allocate L_Key verb invocation, the CI must let the consumer own the key portion of the returned R_Key Therefore, when creating a mkey using DEVX, the consumer is allowed to change the key part. The kernel should compare only the index part of a R_Key to determine equality with another R_Key. Adding capability in order not to break backward compatibility. Fixes: 534fd7aac56a ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP") Link: https://lore.kernel.org/r/3d669aacea85a3a15c3b3b953b3eaba3f80ef9be.1659255945.git.leonro@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* once: add DO_ONCE_SLOW() for sleepable contextsEric Dumazet2022-10-261-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 62c07983bef9d3e78e71189441e1a470f0d1e653 ] Christophe Leroy reported a ~80ms latency spike happening at first TCP connect() time. This is because __inet_hash_connect() uses get_random_once() to populate a perturbation table which became quite big after commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") get_random_once() uses DO_ONCE(), which block hard irqs for the duration of the operation. This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock for operations where we prefer to stay in process context. Then __inet_hash_connect() can use get_random_slow_once() to populate its perturbation table. Fixes: 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") Fixes: 190cc82489f4 ("tcp: change source port randomizarion at connect() time") Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#t Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limitedNeal Cardwell2022-10-262-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14 ] This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be. The following event sequence is an example that triggers the bug: (a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number. (b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false. This commit fixes that bug. One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number. Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized: (1) If the connection is cwnd-limited then we remember that fact for the current window. (2) If the connection not cwnd-limited then we track the maximum number of outstanding packets in the current window. In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false. This first showed up in a testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC. Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: Fix reference state management for synchronous callbacksKumar Kartikeya Dwivedi2022-10-261-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9d9d00ac29d0ef7ce426964de46fa6b380357d0a ] Currently, verifier verifies callback functions (sync and async) as if they will be executed once, (i.e. it explores execution state as if the function was being called once). The next insn to explore is set to start of subprog and the exit from nested frame is handled using curframe > 0 and prepare_func_exit. In case of async callback it uses a customized variant of push_stack simulating a kind of branch to set up custom state and execution context for the async callback. While this approach is simple and works when callback really will be executed only once, it is unsafe for all of our current helpers which are for_each style, i.e. they execute the callback multiple times. A callback releasing acquired references of the caller may do so multiple times, but currently verifier sees it as one call inside the frame, which then returns to caller. Hence, it thinks it released some reference that the cb e.g. got access through callback_ctx (register filled inside cb from spilled typed register on stack). Similarly, it may see that an acquire call is unpaired inside the callback, so the caller will copy the reference state of callback and then will have to release the register with new ref_obj_ids. But again, the callback may execute multiple times, but the verifier will only account for acquired references for a single symbolic execution of the callback, which will cause leaks. Note that for async callback case, things are different. While currently we have bpf_timer_set_callback which only executes it once, even for multiple executions it would be safe, as reference state is NULL and check_reference_leak would force program to release state before BPF_EXIT. The state is also unaffected by analysis for the caller frame. Hence async callback is safe. Since we want the reference state to be accessible, e.g. for pointers loaded from stack through callback_ctx's PTR_TO_STACK, we still have to copy caller's reference_state to callback's bpf_func_state, but we enforce that whatever references it adds to that reference_state has been released before it hits BPF_EXIT. This requires introducing a new callback_ref member in the reference state to distinguish between caller vs callee references. Hence, check_reference_leak now errors out if it sees we are in callback_fn and we have not released callback_ref refs. Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2 etc. we need to also distinguish between whether this particular ref belongs to this callback frame or parent, and only error for our own, so we store state->frameno (which is always non-zero for callbacks). In short, callbacks can read parent reference_state, but cannot mutate it, to be able to use pointers acquired by the caller. They must only undo their changes (by releasing their own acquired_refs before BPF_EXIT) on top of caller reference_state before returning (at which point the caller and callback state will match anyway, so no need to copy it back to caller). Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220823013125.24938-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* SUNRPC: Fix svcxdr_init_encode's buflen calculationChuck Lever2022-10-261-1/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit 1242a87da0d8cd2a428e96ca68e7ea899b0f4624 ] Commit 2825a7f90753 ("nfsd4: allow encoding across page boundaries") added an explicit computation of the remaining length in the rq_res XDR buffer. The computation appears to suffer from an "off-by-one" bug. Because buflen is too large by one page, XDR encoding can run off the end of the send buffer by eventually trying to use the struct page address in rq_page_end, which always contains NULL. Fixes: bddfdbcddbe2 ("NFSD: Extract the svcxdr_init_encode() helper") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* SUNRPC: Fix svcxdr_init_decode's end-of-buffer calculationChuck Lever2022-10-261-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 90bfc37b5ab91c1a6165e3e5cfc49bf04571b762 ] Ensure that stream-based argument decoding can't go past the actual end of the receive buffer. xdr_init_decode's calculation of the value of xdr->end over-estimates the end of the buffer because the Linux kernel RPC server code does not remove the size of the RPC header from rqstp->rq_arg before calling the upper layer's dispatcher. The server-side still uses the svc_getnl() macros to decode the RPC call header. These macros reduce the length of the head iov but do not update the total length of the message in the buffer (buf->len). A proper fix for this would be to replace the use of svc_getnl() and friends in the RPC header decoder, but that would be a large and invasive change that would be difficult to backport. Fixes: 5191955d6fc6 ("SUNRPC: Prepare for xdr_stream-style decoding on the server-side") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* tracing: Wake up ring buffer waiters on closing of the fileSteven Rostedt (Google)2022-10-261-0/+1
| | | | | | | | | | | | | | | | | | commit f3ddb74ad0790030c9592229fb14d8c451f4e9a8 upstream. When the file that represents the ring buffer is closed, there may be waiters waiting on more input from the ring buffer. Call ring_buffer_wake_waiters() to wake up any waiters when the file is closed. Link: https://lkml.kernel.org/r/20220927231825.182416969@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ring-buffer: Add ring_buffer_wake_waiters()Steven Rostedt (Google)2022-10-261-1/+1
| | | | | | | | | | | | | | | | | | | | commit 7e9fbbb1b776d8d7969551565bc246f74ec53b27 upstream. On closing of a file that represents a ring buffer or flushing the file, there may be waiters on the ring buffer that needs to be woken up and exit the ring_buffer_wait() function. Add ring_buffer_wake_waiters() to wake up the waiters on the ring buffer and allow them to exit the wait loop. Link: https://lkml.kernel.org/r/20220928133938.28dc2c27@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODELukas Czerner2022-10-261-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cbfecb927f429a6fa613d74b998496bd71e4438a upstream. Currently the I_DIRTY_TIME will never get set if the inode already has I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's true, however ext4 will only update the on-disk inode in ->dirty_inode(), not on actual writeback. As a result if the inode already has I_DIRTY_INODE state by the time we get to __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled into on-disk inode and will not get updated until the next I_DIRTY_INODE update, which might never come if we crash or get a power failure. The problem can be reproduced on ext4 by running xfstest generic/622 with -o iversion mount option. Fix it by allowing I_DIRTY_TIME to be set even if the inode already has I_DIRTY_INODE. Also make sure that the case is properly handled in writeback_single_inode() as well. Additionally changes in xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag. Thanks Jan Kara for suggestions on how to make this work properly. Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* serial: 8250: Let drivers request full 16550A feature probingMaciej W. Rozycki2022-10-261-1/+2
| | | | | | | | | | | | | | | | | | | | | | commit 9906890c89e4dbd900ed87ad3040080339a7f411 upstream. A SERIAL_8250_16550A_VARIANTS configuration option has been recently defined that lets one request the 8250 driver not to probe for 16550A device features so as to reduce the driver's device startup time in virtual machines. Some actual hardware devices require these features to have been fully determined however for their driver to work correctly, so define a flag to let drivers request full 16550A feature probing on a device-by-device basis if required regardless of the SERIAL_8250_16550A_VARIANTS option setting chosen. Fixes: dc56ecb81a0a ("serial: 8250: Support disabling mdelay-filled probes of 16550A variants") Cc: stable@vger.kernel.org # v5.6+ Reported-by: Anders Blomdell <anders.blomdell@control.lth.se> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Link: https://lore.kernel.org/r/alpine.DEB.2.21.2209202357520.41633@angie.orcam.me.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* scsi: stex: Properly zero out the passthrough command structureLinus Torvalds2022-10-151-1/+1
| | | | | | | | | | | | | | | | | | | | commit 6022f210461fef67e6e676fd8544ca02d1bcfa7a upstream. The passthrough structure is declared off of the stack, so it needs to be set to zero before copied back to userspace to prevent any unintentional data leakage. Switch things to be statically allocated which will fill the unused fields with 0 automatically. Link: https://lore.kernel.org/r/YxrjN3OOw2HHl9tx@kroah.com Cc: stable@kernel.org Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: hdthky <hdthky0@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net/ieee802154: fix uninit value bug in dgram_sendmsgHaimin Zhang2022-10-121-0/+37
| | | | | | | | | | | | | | | | | | [ Upstream commit 94160108a70c8af17fa1484a37e05181c0e094af ] There is uninit value bug in dgram_sendmsg function in net/ieee802154/socket.c when the length of valid data pointed by the msg->msg_name isn't verified. We introducing a helper function ieee802154_sockaddr_check_size to check namelen. First we check there is addr_type in ieee802154_addr_sa. Then, we check namelen according to addr_type. Also fixed in raw_bind, dgram_bind, dgram_connect. Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* firmware: arm_scmi: Improve checks in the info_get operationsCristian Marussi2022-10-121-2/+2
| | | | | | | | | | | | | | | | [ Upstream commit 1ecb7d27b1af6705e9a4e94415b4d8cc8cf2fbfb ] SCMI protocols abstract and expose a number of protocol specific resources like clocks, sensors and so on. Information about such specific domain resources are generally exposed via an `info_get` protocol operation. Improve the sanity check on these operations where needed. Link: https://lore.kernel.org/r/20220817172731.1185305-3-cristian.marussi@arm.com Signed-off-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xsk: Inherit need_wakeup flag for shared socketsJalal Mostafa2022-10-121-1/+1
| | | | | | | | | | | | | | | | commit 60240bc26114543fcbfcd8a28466e67e77b20388 upstream. The flag for need_wakeup is not set for xsks with `XDP_SHARED_UMEM` flag and of different queue ids and/or devices. They should inherit the flag from the first socket buffer pool since no flags can be specified once `XDP_SHARED_UMEM` is specified. Fixes: b5aea28dca134 ("xsk: Add shared umem support between queue ids") Signed-off-by: Jalal Mostafa <jalal.a.mostapha@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220921135701.10199-1-jalal.a.mostapha@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* serial: Create uart_xmit_advance()Ilpo Järvinen2022-09-281-0/+17
| | | | | | | | | | | | | | commit e77cab77f2cb3a1ca2ba8df4af45bb35617ac16d upstream. A very common pattern in the drivers is to advance xmit tail index and do bookkeeping of Tx'ed characters. Create uart_xmit_advance() to handle it. Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: stable <stable@kernel.org> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220901143934.8850-2-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: bonding: Share lacpdu_mcast_addr definitionBenjamin Poirier2022-09-282-2/+3
| | | | | | | | | | | | | | [ Upstream commit 1d9a143ee3408349700f44a9197b7ae0e4faae5d ] There are already a few definitions of arrays containing MULTICAST_LACPDU_ADDR and the next patch will add one more use. These all contain the same constant data so define one common instance for all bonding code. Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 86247aba599e ("net: bonding: Unsync device addresses on ndo_stop") Signed-off-by: Sasha Levin <sashal@kernel.org>
* vmlinux.lds.h: CFI: Reduce alignment of jump-table to function alignmentWill Deacon2022-09-281-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 13b0566962914e167cb3238fbe29ced618f07a27 upstream. Due to undocumented, hysterical raisins on x86, the CFI jump-table sections in .text are needlessly aligned to PMD_SIZE in the vmlinux linker script. When compiling a CFI-enabled arm64 kernel with a 64KiB page-size, a PMD maps 512MiB of virtual memory and so the .text section increases to a whopping 940MiB and blows the final Image up to 960MiB. Others report a link failure. Since the CFI jump-table requires only instruction alignment, reduce the alignment directives to function alignment for parity with other parts of the .text section. This reduces the size of the .text section for the aforementioned 64KiB page size arm64 kernel to 19MiB for a much more reasonable total Image size of 39MiB. Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Mohan Rao .vanimina" <mailtoc.mohanrao@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/all/CAL_GTzigiNOMYkOPX1KDnagPhJtFNqSK=1USNbS0wUL4PW6-Uw@mail.gmail.com/ Fixes: cf68fffb66d6 ("add support for Clang CFI") Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220922215715.13345-1-will@kernel.org Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTESPhil Auld2022-09-281-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d7f06bdd6ee87fbefa05af5f57361d85e7715b11 upstream. As PAGE_SIZE is unsigned long, -1 > PAGE_SIZE when NR_CPUS <= 3. This leads to very large file sizes: topology$ ls -l total 0 -r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 core_cpus -r--r--r-- 1 root root 4096 Sep 5 11:59 core_cpus_list -r--r--r-- 1 root root 4096 Sep 5 10:58 core_id -r--r--r-- 1 root root 18446744073709551615 Sep 5 10:10 core_siblings -r--r--r-- 1 root root 4096 Sep 5 11:59 core_siblings_list -r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 die_cpus -r--r--r-- 1 root root 4096 Sep 5 11:59 die_cpus_list -r--r--r-- 1 root root 4096 Sep 5 11:59 die_id -r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 package_cpus -r--r--r-- 1 root root 4096 Sep 5 11:59 package_cpus_list -r--r--r-- 1 root root 4096 Sep 5 10:58 physical_package_id -r--r--r-- 1 root root 18446744073709551615 Sep 5 10:10 thread_siblings -r--r--r-- 1 root root 4096 Sep 5 11:59 thread_siblings_list Adjust the inequality to catch the case when NR_CPUS is configured to a small value. Fixes: 7ee951acd31a ("drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist") Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Yury Norov <yury.norov@gmail.com> Cc: stable@vger.kernel.org Cc: feng xiangjun <fengxj325@gmail.com> Reported-by: feng xiangjun <fengxj325@gmail.com> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Link: https://lore.kernel.org/r/20220906203542.1796629-1-pauld@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: SEV: add cache flush to solve SEV cache incoherency issuesMingwei Zhang2022-09-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 683412ccf61294d727ead4a73d97397396e69a6b upstream. Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines, generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines was released by VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remain pinned until the end of VM lifetime. Therefore, KVM does not flush cache at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicous userspace can crash the host kernel: creating a malicious VM and continuously allocates/releases unpinned confidential memory pages when the VM is running. Add cache flush operations to mmu_notifier operations to ensure that any physical memory leaving the guest VM get flushed. In particular, hook mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush cache accordingly. The hook after releasing the mmu lock to avoid contention with other vCPUs. Cc: stable@vger.kernel.org Suggested-by: Sean Christpherson <seanjc@google.com> Reported-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-4-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [OP: adjusted KVM_X86_OP_OPTIONAL() -> KVM_X86_OP_NULL, applied kvm_arch_guest_memory_reclaimed() call in kvm_set_memslot()] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: Find dst with sk's xfrm policy not ctl_sksewookseo2022-09-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e22aa14866684f77b4f6b6cae98539e520ddb731 upstream. If we set XFRM security policy by calling setsockopt with option IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock' struct. However tcp_v6_send_response doesn't look up dst_entry with the actual socket but looks up with tcp control socket. This may cause a problem that a RST packet is sent without ESP encryption & peer's TCP socket can't receive it. This patch will make the function look up dest_entry with actual socket, if the socket has XFRM policy(sock_policy), so that the TCP response packet via this function can be encrypted, & aligned on the encrypted TCP socket. Tested: We encountered this problem when a TCP socket which is encrypted in ESP transport mode encryption, receives challenge ACK at SYN_SENT state. After receiving challenge ACK, TCP needs to send RST to establish the socket at next SYN try. But the RST was not encrypted & peer TCP socket still remains on ESTABLISHED state. So we verified this with test step as below. [Test step] 1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED). 2. Client tries a new connection on the same TCP ports(src & dst). 3. Server will return challenge ACK instead of SYN,ACK. 4. Client will send RST to server to clear the SOCKET. 5. Client will retransmit SYN to server on the same TCP ports. [Expected result] The TCP connection should be established. Cc: Maciej Żenczykowski <maze@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Sehee Lee <seheele@google.com> Signed-off-by: Sewook Seo <sewookseo@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* of/device: Fix up of_dma_configure_id() stubThierry Reding2022-09-231-2/+3
| | | | | | | | | | | | | | | | | | commit 40bfe7a86d84cf08ac6a8fe2f0c8bf7a43edd110 upstream. Since the stub version of of_dma_configure_id() was added in commit a081bd4af4ce ("of/device: Add input id to of_dma_configure()"), it has not matched the signature of the full function, leading to build failure reports when code using this function is built on !OF configurations. Fixes: a081bd4af4ce ("of/device: Add input id to of_dma_configure()") Cc: stable@vger.kernel.org Signed-off-by: Thierry Reding <treding@nvidia.com> Reviewed-by: Frank Rowand <frank.rowand@sony.com> Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Link: https://lore.kernel.org/r/20220824153256.1437483-1-thierry.reding@gmail.com Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* iommu/vt-d: Fix kdump kernels boot failure with scalable modeLu Baolu2022-09-201-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 0c5f6c0d8201a809a6585b07b6263e9db2c874a3 ] The translation table copying code for kdump kernels is currently based on the extended root/context entry formats of ECS mode defined in older VT-d v2.5, and doesn't handle the scalable mode formats. This causes the kexec capture kernel boot failure with DMAR faults if the IOMMU was enabled in scalable mode by the previous kernel. The ECS mode has already been deprecated by the VT-d spec since v3.0 and Intel IOMMU driver doesn't support this mode as there's no real hardware implementation. Hence this converts ECS checking in copying table code into scalable mode. The existing copying code consumes a bit in the context entry as a mark of copied entry. It needs to work for the old format as well as for the extended context entries. As it's hard to find such a common bit for both legacy and scalable mode context entries. This replaces it with a per- IOMMU bitmap. Fixes: 7373a8cc38197 ("iommu/vt-d: Setup context and enable RID2PASID support") Cc: stable@vger.kernel.org Reported-by: Jerry Snitselaar <jsnitsel@redhat.com> Tested-by: Wen Jin <wen.jin@intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20220817011035.3250131-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* task_stack, x86/cea: Force-inline stack helpersBorislav Petkov2022-09-201-1/+1
| | | | | | | | | | | | | | | [ Upstream commit e87f4152e542610d0b4c6c8548964a68a59d2040 ] Force-inline two stack helpers to fix the following objtool warnings: vmlinux.o: warning: objtool: in_task_stack()+0xc: call to task_stack_page() leaves .noinstr.text section vmlinux.o: warning: objtool: in_entry_stack()+0x10: call to cpu_entry_stack() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220324183607.31717-2-bp@alien8.de Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip") Signed-off-by: Sasha Levin <sashal@kernel.org>
* lockdep: Fix -Wunused-parameter for _THIS_IP_Nick Desaulniers2022-09-202-3/+3
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 8b023accc8df70e72f7704d29fead7ca914d6837 ] While looking into a bug related to the compiler's handling of addresses of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep. Drive by cleanup. -Wunused-parameter: kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip' kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip' kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip' Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip") Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFS: Fix WARN_ON due to unionization of nfs_inode.nrequestsDave Wysochanski2022-09-201-1/+3
| | | | | | | | | | | | | | | | | | | | commit 0ebeebcf59601bcfa0284f4bb7abdec051eb856d upstream. Fixes the following WARN_ON WARNING: CPU: 2 PID: 18678 at fs/nfs/inode.c:123 nfs_clear_inode+0x3b/0x50 [nfs] ... Call Trace: nfs4_evict_inode+0x57/0x70 [nfsv4] evict+0xd1/0x180 dispose_list+0x48/0x60 evict_inodes+0x156/0x190 generic_shutdown_super+0x37/0x110 nfs_kill_super+0x1d/0x40 [nfs] deactivate_locked_super+0x36/0xa0 Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ARM: at91: ddr: remove CONFIG_SOC_SAMA7 dependencyClaudiu Beznea2022-09-151-4/+0
| | | | | | | | | | | | | [ Upstream commit dc3005703f8cd893d325081c20b400e08377d9bb ] Remove CONFIG_SOC_SAMA7 dependency to avoid having #ifdef preprocessor directives in driver code (arch/arm/mach-at91/pm.c). This prepares the code for next commits. Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: Nicolas Ferre <nicolas.ferre@microchip.com> Link: https://lore.kernel.org/r/20220113144900.906370-2-claudiu.beznea@microchip.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* tcp: TX zerocopy should not sense pfmemalloc statusEric Dumazet2022-09-151-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3261400639463a853ba2b3be8bd009c2a8089775 ] We got a recent syzbot report [1] showing a possible misuse of pfmemalloc page status in TCP zerocopy paths. Indeed, for pages coming from user space or other layers, using page_is_pfmemalloc() is moot, and possibly could give false positives. There has been attempts to make page_is_pfmemalloc() more robust, but not using it in the first place in this context is probably better, removing cpu cycles. Note to stable teams : You need to backport 84ce071e38a6 ("net: introduce __skb_fill_page_desc_noacc") as a prereq. Race is more probable after commit c07aea3ef4d4 ("mm: add a signature in struct page") because page_is_pfmemalloc() is now using low order bit from page->lru.next, which can change more often than page->index. Low order bit should never be set for lru.next (when used as an anchor in LRU list), so KCSAN report is mostly a false positive. Backporting to older kernel versions seems not necessary. [1] BUG: KCSAN: data-race in lru_add_fn / tcp_build_frag write to 0xffffea0004a1d2c8 of 8 bytes by task 18600 on cpu 0: __list_add include/linux/list.h:73 [inline] list_add include/linux/list.h:88 [inline] lruvec_add_folio include/linux/mm_inline.h:105 [inline] lru_add_fn+0x440/0x520 mm/swap.c:228 folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246 folio_batch_add_and_move mm/swap.c:263 [inline] folio_add_lru+0xf1/0x140 mm/swap.c:490 filemap_add_folio+0xf8/0x150 mm/filemap.c:948 __filemap_get_folio+0x510/0x6d0 mm/filemap.c:1981 pagecache_get_page+0x26/0x190 mm/folio-compat.c:104 grab_cache_page_write_begin+0x2a/0x30 mm/folio-compat.c:116 ext4_da_write_begin+0x2dd/0x5f0 fs/ext4/inode.c:2988 generic_perform_write+0x1d4/0x3f0 mm/filemap.c:3738 ext4_buffered_write_iter+0x235/0x3e0 fs/ext4/file.c:270 ext4_file_write_iter+0x2e3/0x1210 call_write_iter include/linux/fs.h:2187 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x468/0x760 fs/read_write.c:578 ksys_write+0xe8/0x1a0 fs/read_write.c:631 __do_sys_write fs/read_write.c:643 [inline] __se_sys_write fs/read_write.c:640 [inline] __x64_sys_write+0x3e/0x50 fs/read_write.c:640 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffffea0004a1d2c8 of 8 bytes by task 18611 on cpu 1: page_is_pfmemalloc include/linux/mm.h:1740 [inline] __skb_fill_page_desc include/linux/skbuff.h:2422 [inline] skb_fill_page_desc include/linux/skbuff.h:2443 [inline] tcp_build_frag+0x613/0xb20 net/ipv4/tcp.c:1018 do_tcp_sendpages+0x3e8/0xaf0 net/ipv4/tcp.c:1075 tcp_sendpage_locked net/ipv4/tcp.c:1140 [inline] tcp_sendpage+0x89/0xb0 net/ipv4/tcp.c:1150 inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833 kernel_sendpage+0x184/0x300 net/socket.c:3561 sock_sendpage+0x5a/0x70 net/socket.c:1054 pipe_to_sendpage+0x128/0x160 fs/splice.c:361 splice_from_pipe_feed fs/splice.c:415 [inline] __splice_from_pipe+0x222/0x4d0 fs/splice.c:559 splice_from_pipe fs/splice.c:594 [inline] generic_splice_sendpage+0x89/0xc0 fs/splice.c:743 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:931 splice_direct_to_actor+0x305/0x620 fs/splice.c:886 do_splice_direct+0xfb/0x180 fs/splice.c:974 do_sendfile+0x3bf/0x910 fs/read_write.c:1249 __do_sys_sendfile64 fs/read_write.c:1317 [inline] __se_sys_sendfile64 fs/read_write.c:1303 [inline] __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1303 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0xffffea0004a1d288 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 18611 Comm: syz-executor.4 Not tainted 6.0.0-rc2-syzkaller-00248-ge022620b5d05-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022 Fixes: c07aea3ef4d4 ("mm: add a signature in struct page") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* net: introduce __skb_fill_page_desc_noaccPavel Begunkov2022-09-151-11/+17
| | | | | | | | | | | | | [ Upstream commit 84ce071e38a6e25ea3ea91188e5482ac1f17b3af ] Managed pages contain pinned userspace pages and controlled by upper layers, there is no need in tracking skb->pfmemalloc for them. Introduce a helper for filling frags but ignoring page tracking, it'll be needed later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* rxrpc: Fix ICMP/ICMP6 error handlingDavid Howells2022-09-152-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ac56a0b48da86fd1b4389632fb7c4c8a5d86eefa ] Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing it to siphon off UDP packets early in the handling of received UDP packets thereby avoiding the packet going through the UDP receive queue, it doesn't get ICMP packets through the UDP ->sk_error_report() callback. In fact, it doesn't appear that there's any usable option for getting hold of ICMP packets. Fix this by adding a new UDP encap hook to distribute error messages for UDP tunnels. If the hook is set, then the tunnel driver will be able to see ICMP packets. The hook provides the offset into the packet of the UDP header of the original packet that caused the notification. An alternative would be to call the ->error_handler() hook - but that requires that the skbuff be cloned (as ip_icmp_error() or ipv6_cmp_error() do, though isn't really necessary or desirable in rxrpc's case is we want to parse them there and then, not queue them). Changes ======= ver #3) - Fixed an uninitialised variable. ver #2) - Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals. Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: at91: pm: fix DDR recalibration when resuming from backup and self-refreshClaudiu Beznea2022-09-151-0/+4
| | | | | | | | | | | | | | | | | | | [ Upstream commit 7a94b83a7dc551607b6c4400df29151e6a951f07 ] On SAMA7G5, when resuming from backup and self-refresh, the bootloader performs DDR PHY recalibration by restoring the value of ZQ0SR0 (stored in RAM by Linux before going to backup and self-refresh). It has been discovered that the current procedure doesn't work for all possible values that might go to ZQ0SR0 due to hardware bug. The workaround to this is to avoid storing some values in ZQ0SR0. Thus Linux will read the ZQ0SR0 register and cache its value in RAM after processing it (using modified_gray_code array). The bootloader will restore the processed value. Fixes: d2d4716d8384 ("ARM: at91: pm: save ddr phy calibration data to securam") Suggested-by: Frederic Schumacher <frederic.schumacher@microchip.com> Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com> Link: https://lore.kernel.org/r/20220826083927.3107272-4-claudiu.beznea@microchip.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: at91: pm: fix self-refresh for sama7g5Claudiu Beznea2022-09-151-0/+4
| | | | | | | | | | | | | | | | | | | | [ Upstream commit a02875c4cbd6f3d2f33d70cc158a19ef02d4b84f ] It has been discovered that on some parts, from time to time, self-refresh procedure doesn't work as expected. Debugging and investigating it proved that disabling AC DLL introduce glitches in RAM controllers which leads to unexpected behavior. This is confirmed as a hardware bug. DLL bypass disables 3 DLLs: 2 DX DLLs and AC DLL. Thus, keep only DX DLLs disabled. This introduce 6mA extra current consumption on VDDCORE when switching to any ULP mode or standby mode but the self-refresh procedure still works. Fixes: f0bbf17958e8 ("ARM: at91: pm: add self-refresh support for sama7g5") Suggested-by: Frederic Schumacher <frederic.schumacher@microchip.com> Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com> Tested-by: Cristian Birsan <cristian.birsan@microchip.com> Link: https://lore.kernel.org/r/20220826083927.3107272-3-claudiu.beznea@microchip.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFS: Fix another fsync() issue after a server rebootTrond Myklebust2022-09-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 67f4b5dc49913abcdb5cc736e73674e2f352f81d ] Currently, when the writeback code detects a server reboot, it redirties any pages that were not committed to disk, and it sets the flag NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor that dirtied the file. While this allows the file descriptor in question to redrive its own writes, it violates the fsync() requirement that we should be synchronising all writes to disk. While the problem is infrequent, we do see corner cases where an untimely server reboot causes the fsync() call to abandon its attempt to sync data to disk and causing data corruption issues due to missed error conditions or similar. In order to tighted up the client's ability to deal with this situation without introducing livelocks, add a counter that records the number of times pages are redirtied due to a server reboot-like condition, and use that in fsync() to redrive the sync to disk. Fixes: 2197e9b06c22 ("NFS: Fix up fsync() when the server rebooted") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFS: Save some space in the inodeTrond Myklebust2022-09-151-18/+24
| | | | | | | | | | [ Upstream commit e591b298d7ecb851e200f65946e3d53fe78a3c4f ] Save some space in the nfs_inode by setting up an anonymous union with the fields that are peculiar to a specific type of filesystem object. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFS: Further optimisations for 'ls -l'Trond Myklebust2022-09-151-3/+2
| | | | | | | | | | | | | | | | | | [ Upstream commit ff81dfb5d721fff87bd516c558847f6effb70031 ] If a user is doing 'ls -l', we have a heuristic in GETATTR that tells the readdir code to try to use READDIRPLUS in order to refresh the inode attributes. In certain cirumstances, we also try to invalidate the remaining directory entries in order to ensure this refresh. If there are multiple readers of the directory, we probably should avoid invalidating the page cache, since the heuristic breaks down in that situation anyway. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* debugfs: add debugfs_lookup_and_remove()Greg Kroah-Hartman2022-09-151-0/+6
| | | | | | | | | | | | | | | | | commit dec9b2f1e0455a151a7293c367da22ab973f713e upstream. There is a very common pattern of using debugfs_remove(debufs_lookup(..)) which results in a dentry leak of the dentry that was looked up. Instead of having to open-code the correct pattern of calling dput() on the dentry, create debugfs_lookup_and_remove() to handle this pattern automatically and properly without any memory leaks. Cc: stable <stable@kernel.org> Reported-by: Kuyo Chang <kuyo.chang@mediatek.com> Tested-by: Kuyo Chang <kuyo.chang@mediatek.com> Link: https://lore.kernel.org/r/YxIaQ8cSinDR881k@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs: only do a memory barrier for the first set_buffer_uptodate()Linus Torvalds2022-09-151-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2f79cdfe58c13949bbbb65ba5926abfe9561d0ec upstream. Commit d4252071b97d ("add barriers to buffer_uptodate and set_buffer_uptodate") added proper memory barriers to the buffer head BH_Uptodate bit, so that anybody who tests a buffer for being up-to-date will be guaranteed to actually see initialized state. However, that commit didn't _just_ add the memory barrier, it also ended up dropping the "was it already set" logic that the BUFFER_FNS() macro had. That's conceptually the right thing for a generic "this is a memory barrier" operation, but in the case of the buffer contents, we really only care about the memory barrier for the _first_ time we set the bit, in that the only memory ordering protection we need is to avoid anybody seeing uninitialized memory contents. Any other access ordering wouldn't be about the BH_Uptodate bit anyway, and would require some other proper lock (typically BH_Lock or the folio lock). A reader that races with somebody invalidating the buffer head isn't an issue wrt the memory ordering, it's a serialization issue. Now, you'd think that the buffer head operations don't matter in this day and age (and I certainly thought so), but apparently some loads still end up being heavy users of buffer heads. In particular, the kernel test robot reported that not having this bit access optimization in place caused a noticeable direct IO performance regression on ext4: fxmark.ssd_ext4_no_jnl_DWTL_54_directio.works/sec -26.5% regression although you presumably need a fast disk and a lot of cores to actually notice. Link: https://lore.kernel.org/all/Yw8L7HTZ%2FdE2%2Fo9C@xsang-OptiPlex-9020/ Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Fengwei Yin <fengwei.yin@intel.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>