| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 90ad525c2d9a8a6591ab822234a94b82871ef8e0 ]
Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.
Currently prb_next_seq() is used as the base for the 32bit seq
macros __u64seq_to_ulseq() and __ulseq_to_u64seq(). However, in
a follow-up commit, prb_next_seq() will need to make use of the
32bit seq macros.
Use prb_first_seq() as the base for the 32bit seq macros instead
because it is guaranteed to return 64bit sequence numbers without
relying on any 32bit seq macros.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-4-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 418ec1961c07d84293cc3cd54d67b90bbeba7feb ]
Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.
__ulseq_to_u64seq() computes the upper 32 bits of the passed
argument value (@ulseq). The upper bits are derived from a base
value (@rb_next_seq) in a way that assumes @ulseq represents a
64bit number that is less than or equal to @rb_next_seq.
Until now this mapping has been correct for all call sites. However,
in a follow-up commit, values of @ulseq will be passed in that are
higher than the base value. This requires a change to how the 32bit
value is mapped to a 64bit sequence number.
Rather than mapping @ulseq such that the base value is the end of a
32bit block, map @ulseq such that the base value is in the middle of
a 32bit block. This allows supporting 31 bits before and after the
base value, which is deemed acceptable for the console sequence
number during runtime.
Here is an example to illustrate the previous and new mappings.
For a base value (@rb_next_seq) of 2 2000 0000...
Before this change the range of possible return values was:
1 2000 0001 to 2 2000 0000
__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 1 2000 0001
__ulseq_to_u64seq(9fff ffff) => 1 9fff ffff
__ulseq_to_u64seq(a000 0000) => 1 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
After this change the range of possible return values are:
1 a000 0001 to 2 a000 0000
__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 2 2000 0001
__ulseq_to_u64seq(9fff ffff) => 2 9fff ffff
__ulseq_to_u64seq(a000 0000) => 2 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
[ john.ogness: Rewrite commit message. ]
Reported-by: Francesco Dolcini <francesco@dolcini.it>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 00bf63122459e87193ee7f1bc6161c83a525569f ]
When there are heavy load, cpumap kernel threads can be busy polling
packets from redirect queues and block out RCU tasks from reaching
quiescent states. It is insufficient to just call cond_resched() in such
context. Periodically raise a consolidated RCU QS before cond_resched
fixes the problem.
Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/c17b9f1517e19d813da3ede5ed33ee18496bb5d8.1710877680.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 8f8cd6c0a43ed637e620bbe45a8d0e0c2f4d5130 ]
The synchronization here is to ensure the ordering of freeing of a module
init so that it happens before W+X checking. It is worth noting it is not
that the freeing was not happening, it is just that our sanity checkers
raced against the permission checkers which assume init memory is already
gone.
Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling
do_free_init() into a global workqueue instead of relying on it being
called through call_rcu(..., do_free_init), which used to allowed us call
do_free_init() asynchronously after the end of a subsequent grace period.
The move to a global workqueue broke the gaurantees for code which needed
to be sure the do_free_init() would complete with rcu_barrier(). To fix
this callers which used to rely on rcu_barrier() must now instead use
flush_work(&init_free_wq).
Without this fix, we still could encounter false positive reports in W+X
checking since the rcu_barrier() here can not ensure the ordering now.
Even worse, the rcu_barrier() can introduce significant delay. Eric
Chanudet reported that the rcu_barrier introduces ~0.1s delay on a
PREEMPT_RT kernel.
[ 0.291444] Freeing unused kernel memory: 5568K
[ 0.402442] Run /sbin/init as init process
With this fix, the above delay can be eliminated.
Link: https://lkml.kernel.org/r/20240227023546.2490667-1-changbin.du@huawei.com
Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Xiaoyi Su <suxiaoyi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit d6170e4aaf86424c24ce06e355b4573daa891b17 ]
On some architectures like ARM64, PMD_SIZE can be really large in some
configurations. Like with CONFIG_ARM64_64K_PAGES=y the PMD_SIZE is
512MB.
Use 2MB * num_possible_nodes() as the size for allocations done through
the prog pack allocator. On most architectures, PMD_SIZE will be equal
to 2MB in case of 4KB pages and will be greater than 2MB for bigger page
sizes.
Fixes: ea2babac63d4 ("bpf: Simplify bpf_prog_pack_[size|mask]")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Closes: https://lore.kernel.org/all/7e216c88-77ee-47b8-becc-a0f780868d3c@sirena.org.uk/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202403092219.dhgcuz2G-lkp@intel.com/
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Message-ID: <20240311122722.86232-1-puranjay12@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 7a4b21250bf79eef26543d35bd390448646c536b ]
The stackmap code relies on roundup_pow_of_two() to compute the number
of hash buckets, and contains an overflow check by checking if the
resulting value is 0. However, on 32-bit arches, the roundup code itself
can overflow by doing a 32-bit left-shift of an unsigned long value,
which is undefined behaviour, so it is not guaranteed to truncate
neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
contains the same check, copied from the hashtab code.
The commit in the fixes tag actually attempted to fix this, but the fix
did not account for the UB, so the fix only works on CPUs where an
overflow does result in a neat truncation to zero, which is not
guaranteed. Checking the value before rounding does not have this
problem.
Fixes: 6183f4d3a0a2 ("bpf: Check for integer overflow when using roundup_pow_of_two()")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>
Message-ID: <20240307120340.99577-4-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 6787d916c2cf9850c97a0a3f73e08c43e7d973b1 ]
The hashtab code relies on roundup_pow_of_two() to compute the number of
hash buckets, and contains an overflow check by checking if the
resulting value is 0. However, on 32-bit arches, the roundup code itself
can overflow by doing a 32-bit left-shift of an unsigned long value,
which is undefined behaviour, so it is not guaranteed to truncate
neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
contains the same check, copied from the hashtab code. So apply the same
fix to hashtab, by moving the overflow check to before the roundup.
Fixes: daaf427c6ab3 ("bpf: fix arraymap NULL deref and missing overflow and zero size checks")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Message-ID: <20240307120340.99577-3-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 281d464a34f540de166cee74b723e97ac2515ec3 ]
The devmap code allocates a number hash buckets equal to the next power
of two of the max_entries value provided when creating the map. When
rounding up to the next power of two, the 32-bit variable storing the
number of buckets can overflow, and the code checks for overflow by
checking if the truncated 32-bit value is equal to 0. However, on 32-bit
arches the rounding up itself can overflow mid-way through, because it
ends up doing a left-shift of 32 bits on an unsigned long value. If the
size of an unsigned long is four bytes, this is undefined behaviour, so
there is no guarantee that we'll end up with a nice and tidy 0-value at
the end.
Syzbot managed to turn this into a crash on arm32 by creating a
DEVMAP_HASH with max_entries > 0x80000000 and then trying to update it.
Fix this by moving the overflow check to before the rounding up
operation.
Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Link: https://lore.kernel.org/r/000000000000ed666a0611af6818@google.com
Reported-and-tested-by: syzbot+8cd36f6b65f3cafd400a@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Message-ID: <20240307120340.99577-2-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 178c54666f9c4d2f49f2ea661d0c11b52f0ed190 ]
Currently tracing is supposed not to allow for bpf_spin_{lock,unlock}()
helper calls. This is to prevent deadlock for the following cases:
- there is a prog (prog-A) calling bpf_spin_{lock,unlock}().
- there is a tracing program (prog-B), e.g., fentry, attached
to bpf_spin_lock() and/or bpf_spin_unlock().
- prog-B calls bpf_spin_{lock,unlock}().
For such a case, when prog-A calls bpf_spin_{lock,unlock}(),
a deadlock will happen.
The related source codes are below in kernel/bpf/helpers.c:
notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
notrace is supposed to prevent fentry prog from attaching to
bpf_spin_{lock,unlock}().
But actually this is not the case and fentry prog can successfully
attached to bpf_spin_lock(). Siddharth Chintamaneni reported
the issue in [1]. The following is the macro definition for
above BPF_CALL_1:
#define BPF_CALL_x(x, name, ...) \
static __always_inline \
u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \
typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \
u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \
u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \
{ \
return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\
} \
static __always_inline \
u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__))
#define BPF_CALL_1(name, ...) BPF_CALL_x(1, name, __VA_ARGS__)
The notrace attribute is actually applied to the static always_inline function
____bpf_spin_{lock,unlock}(). The actual callback function
bpf_spin_{lock,unlock}() is not marked with notrace, hence
allowing fentry prog to attach to two helpers, and this
may cause the above mentioned deadlock. Siddharth Chintamaneni
actually has a reproducer in [2].
To fix the issue, a new macro NOTRACE_BPF_CALL_1 is introduced which
will add notrace attribute to the original function instead of
the hidden always_inline function and this fixed the problem.
[1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@mail.gmail.com/
[2] https://lore.kernel.org/bpf/CAE5sdEg6yUc_Jz50AnUXEEUh6O73yQ1Z6NV2srJnef0ZrQkZew@mail.gmail.com/
Fixes: d83525ca62cf ("bpf: introduce bpf_spin_lock")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240207070102.335167-1-yonghong.song@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit d04d5882cd678b898a9d7c5aee6afbe9e6e77fcd ]
The commit d51507098ff91 ("printk: disable optimistic spin
during panic") added checks to avoid becoming a console waiter
if a panic is in progress.
However, the transition to panic can occur while there is
already a waiter. The current owner should not pass the lock to
the waiter because it might get stopped or blocked anytime.
Also the panic context might pass the console lock owner to an
already stopped waiter by mistake. It might happen when
console_flush_on_panic() ignores the current lock owner, for
example:
CPU0 CPU1
---- ----
console_lock_spinning_enable()
console_trylock_spinning()
[CPU1 now console waiter]
NMI: panic()
panic_other_cpus_shutdown()
[stopped as console waiter]
console_flush_on_panic()
console_lock_spinning_enable()
[print 1 record]
console_lock_spinning_disable_and_check()
[handover to stopped CPU1]
This results in panic() not flushing the panic messages.
Fix these problems by disabling all spinning operations
completely during panic().
Another advantage is that it prevents possible deadlocks caused
by "console_owner_lock". The panic() context does not need to
take it any longer. The lockless checks are safe because the
functions become NOPs when they see the panic in progress. All
operations manipulating the state are still synchronized by the
lock even when non-panic CPUs would notice the panic
synchronously.
The current owner might stay spinning. But non-panic() CPUs
would get stopped anyway and the panic context will never start
spinning.
Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20240207134103.1357162-12-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit b1c4c67a5e90db8fbdb5b5504fe16e17b564cca8 ]
Normally a reader will stop once reaching a non-finalized
record. However, when a panic happens, writers from other CPUs
(or an interrupted context on the panic CPU) may have been
writing a record and were unable to finalize it. The panic CPU
will reserve/commit/finalize its panic records, but these will
be located after the non-finalized records. This results in
panic() not flushing the panic messages.
Extend _prb_read_valid() to skip over non-finalized records if
on the panic CPU.
Fixes: 896fbe20b4e2 ("printk: use the lockless ringbuffer")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-11-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 584528d621459d1a5c31da7a591218ad3bb96d6c ]
With the lockless ringbuffer, it is allowed that multiple
CPUs/contexts write simultaneously into the buffer. This creates
an ambiguity as some writers will finalize sooner.
The documentation for the prb_read functions is not clear as it
refers to "not yet written" and "no data available". Clarify the
return values and language to be in terms of the reader: records
available for reading.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-9-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Stable-dep-of: b1c4c67a5e90 ("printk: ringbuffer: Skip non-finalized records in panic")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 36652d0f3bf34899e82d31a5fa9e2bdd02fd6381 ]
There is already panic_in_progress() and other_cpu_in_panic(),
but checking if the current CPU is the panic CPU must still be
open coded.
Add this_cpu_in_panic() to complete the set.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-8-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Stable-dep-of: b1c4c67a5e90 ("printk: ringbuffer: Skip non-finalized records in panic")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit ac7d7844c64d15603daa3e905a311ddcfbb4bc91 ]
Currently pr_flush() will only wait for records that were
available to readers at the time of the call (using
prb_next_seq()). But there may be more records (non-finalized)
that have following finalized records. pr_flush() should wait
for these to print as well. Particularly because any trailing
finalized records may be the messages that the calling context
wants to ensure are printed.
Add a new ringbuffer function prb_next_reserve_seq() to return
the sequence number following the most recently reserved record.
This guarantees that pr_flush() will wait until all current
printk() messages (completed or in progress) have been printed.
Fixes: 3b604ca81202 ("printk: add pr_flush()")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-10-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 5f72e52ba959e50680b8d83599da1368cd7a6ee2 ]
Commit f244b4dc53e5 ("printk: ringbuffer: Improve
prb_next_seq() performance") introduced an optimization for
prb_next_seq() by using best-effort to track recently finalized
records. However, the order of finalization does not
necessarily match the order of the records. The optimization
changed prb_next_seq() to return inconsistent results, possibly
yielding sequence numbers that are not available to readers
because they are preceded by non-finalized records or they are
not yet visible to the reader CPU.
Rather than simply best-effort tracking recently finalized
records, force the committing writer to read records and
increment the last "contiguous block" of finalized records. In
order to do this, the sequence number instead of ID must be
stored because ID's cannot be directly compared.
A new memory barrier pair is introduced to guarantee that a
reader can always read the records up until the sequence number
returned by prb_next_seq() (unless the records have since
been overwritten in the ringbuffer).
This restores the original functionality of prb_next_seq()
while also keeping the optimization.
For 32bit systems, only the lower 32 bits of the sequence
number are stored. When reading the value, it is expanded to
the full 64bit sequence number using the 32bit seq macros,
which fold in the value returned by prb_first_seq().
Fixes: f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq() performance")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-5-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 5b73e706f00f3553e1a4efbb31951ce9fe18f2dc ]
The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are
used to provide support for atomic handling of sequence numbers
on 32bit systems. Until now this was only used by nbcon.c,
which is why they were located in nbcon.c and include nbcon in
the name.
In a follow-up commit this functionality is also needed by
printk_ringbuffer. Rather than duplicating the functionality,
relocate the macros to printk_ringbuffer.h.
Also, since the macros will be no longer nbcon-specific, rename
them to __u64seq_to_ulseq() and __ulseq_to_u64seq().
This does not result in any functional change.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Stable-dep-of: 5f72e52ba959 ("printk: ringbuffer: Do not skip non-finalized records with prb_next_seq()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 1eb986746a67952df86eb2c50a36450ef103d01b ]
When btf_prepare_func_args() was generalized to handle both static and
global subprogs, a few warnings/errors that are meant only for global
subprog cases started to be emitted for static subprogs, where they are
sort of expected and irrelavant.
Stop polutting verifier logs with irrelevant scary-looking messages.
Fixes: e26080d0da87 ("bpf: prepare btf_prepare_func_args() for handling static subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 18810ad3929ff6b5d8e67e3adc40d690bd780fd6 ]
Move scalar arg processing in btf_prepare_func_args() after all pointer
arg processing is done. This makes it easier to do validation. One
example of unintended behavior right now is ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stable-dep-of: 1eb986746a67 ("bpf: don't emit warnings intended for global subprogs for static subprogs")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 23d04d8c6b8ec339057264659b7834027f3e6a63 ]
When picking a CPU on task wakeup, select_idle_core() has to take
into account the scheduling domain where the function looks for the CPU.
This is because the "isolcpus" kernel command line option can remove CPUs
from the domain to isolate them from other SMT siblings.
This change replaces the set of CPUs allowed to run the task from
p->cpus_ptr by the intersection of p->cpus_ptr and sched_domain_span(sd)
which is stored in the 'cpus' argument provided by select_idle_cpu().
Fixes: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240110131707.437301-2-keisuke.nishimura@inria.fr
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 8aeaffef8c6eceab0e1498486fdd4f3dc3b7066c ]
When picking a CPU on task wakeup, select_idle_smt() has to take
into account the scheduling domain of @target. This is because the
"isolcpus" kernel command line option can remove CPUs from the domain to
isolate them from other SMT siblings.
This fix checks if the candidate CPU is in the target scheduling domain.
Commit:
df3cb4ea1fb6 ("sched/fair: Fix wrong cpu selecting from isolated domain")
... originally introduced this fix by adding the check of the scheduling
domain in the loop.
However, commit:
3e6efe87cd5cc ("sched/fair: Remove redundant check in select_idle_smt()")
... accidentally removed the check. Bring it back.
Fixes: 3e6efe87cd5c ("sched/fair: Remove redundant check in select_idle_smt()")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240110131707.437301-1-keisuke.nishimura@inria.fr
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 14274d0bd31b4debf28284604589f596ad2e99f2 ]
So far, get_device_system_crosststamp() unconditionally passes
system_counterval.cycles to timekeeping_cycles_to_ns(). But when
interpolating system time (do_interp == true), system_counterval.cycles is
before tkr_mono.cycle_last, contrary to the timekeeping_cycles_to_ns()
expectations.
On x86, CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE will mitigate on
interpolating, setting delta to 0. With delta == 0, xtstamp->sys_monoraw
and xtstamp->sys_realtime are then set to the last update time, as
implicitly expected by adjust_historical_crosststamp(). On other
architectures, the resulting nonsense xtstamp->sys_monoraw and
xtstamp->sys_realtime corrupt the xtstamp (ts) adjustment in
adjust_historical_crosststamp().
Fix this by deriving xtstamp->sys_monoraw and xtstamp->sys_realtime from
the last update time when interpolating, by using the local variable
"cycles". The local variable already has the right value when
interpolating, unlike system_counterval.cycles.
Fixes: 2c756feb18d9 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20231218073849.35294-4-peter.hilber@opensynergy.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 87a41130881995f82f7adbafbfeddaebfb35f0ef ]
The cycle_between() helper checks if parameter test is in the open interval
(before, after). Colloquially speaking, this also applies to the counter
wrap-around special case before > after. get_device_system_crosststamp()
currently uses cycle_between() at the first call site to decide whether to
interpolate for older counter readings.
get_device_system_crosststamp() has the following problem with
cycle_between() testing against an open interval: Assume that, by chance,
cycles == tk->tkr_mono.cycle_last (in the following, "cycle_last" for
brevity). Then, cycle_between() at the first call site, with effective
argument values cycle_between(cycle_last, cycles, now), returns false,
enabling interpolation. During interpolation,
get_device_system_crosststamp() will then call cycle_between() at the
second call site (if a history_begin was supplied). The effective argument
values are cycle_between(history_begin->cycles, cycles, cycles), since
system_counterval.cycles == interval_start == cycles, per the assumption.
Due to the test against the open interval, cycle_between() returns false
again. This causes get_device_system_crosststamp() to return -EINVAL.
This failure should be avoided, since get_device_system_crosststamp() works
both when cycles follows cycle_last (no interpolation), and when cycles
precedes cycle_last (interpolation). For the case cycles == cycle_last,
interpolation is actually unneeded.
Fix this by changing cycle_between() into timestamp_in_interval(), which
now checks against the closed interval, rather than the open interval.
This changes the get_device_system_crosststamp() behavior for three corner
cases:
1. Bypass interpolation in the case cycles == tk->tkr_mono.cycle_last,
fixing the problem described above.
2. At the first timestamp_in_interval() call site, cycles == now no longer
causes failure.
3. At the second timestamp_in_interval() call site, history_begin->cycles
== system_counterval.cycles no longer causes failure.
adjust_historical_crosststamp() also works for this corner case,
where partial_history_cycles == total_history_cycles.
These behavioral changes should not cause any problems.
Fixes: 2c756feb18d9 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231218073849.35294-3-peter.hilber@opensynergy.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 84dccadd3e2a3f1a373826ad71e5ced5e76b0c00 ]
cycle_between() decides whether get_device_system_crosststamp() will
interpolate for older counter readings.
cycle_between() yields wrong results for a counter wrap-around where after
< before < test, and for the case after < test < before.
Fix the comparison logic.
Fixes: 2c756feb18d9 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20231218073849.35294-2-peter.hilber@opensynergy.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 133e267ef4a26d19c93996a874714e9f3f8c70aa ]
'days' is a s64 (from div_s64), and so should use a %lld specifier.
This was found by extending KUnit's assertion macros to use gcc's
__printf attribute.
Fixes: 276010551664 ("time: Improve performance of time64_to_tm()")
Signed-off-by: David Gow <davidgow@google.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit e7539ffc9a770f36bacedcf0fbfb4bf2f244f4a5 ]
Just like is done for the kworker performing nodes initialization,
gracefully handle the possible allocation failure of the RCU expedited
grace period main kworker.
While at it perform a rename of the related checking functions to better
reflect the expedited specifics.
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
recovery
[ Upstream commit a636c5e6f8fc34be520277e69c7c6ee1d4fc1d17 ]
Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited
grace periods is queued to a kworker. However if the allocation of that
kworker failed, the nodes initialization is performed synchronously by
the caller instead.
Now the check for kworker initialization failure relies on the kworker
pointer to be NULL while its value might actually encapsulate an
allocation failure error.
Make sure to handle this case.
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
wq_update_node_max_active()
[ Upstream commit 15930da42f8981dc42c19038042947b475b19f47 ]
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
going down. The function was incorrectly calling cpumask_test_cpu() with -1
CPU leading to oopses like the following on some archs:
Unable to handle kernel paging request at virtual address ffff0002100296e0
..
pc : wq_update_node_max_active+0x50/0x1fc
lr : wq_update_node_max_active+0x1f0/0x1fc
...
Call trace:
wq_update_node_max_active+0x50/0x1fc
apply_wqattrs_commit+0xf0/0x114
apply_workqueue_attrs_locked+0x58/0xa0
alloc_workqueue+0x5ac/0x774
workqueue_init_early+0x460/0x540
start_kernel+0x258/0x684
__primary_switched+0xb8/0xc0
Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 5797b1c18919cd9c289ded7954383e499f729ce0 ]
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- One max_active enforcement decouples from pool boundaires, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_weight than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 91ccc6e7233bb10a9c176aa4cc70d6f432a441a5 ]
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.
As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structured allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.
Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids which is used to track nr_active for pools which don't have NUMA
node associated such as the default fallback system-wide pool.
This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.
v4: - Fixed out-of-bound access when freeing per-cpu workqueues.
v3: - Use flexible array for wq->node_nr_active as suggested by Lai.
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
- Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
pwq->max_active check. Restored. As the next patch replaces the
max_active enforcement mechanism, this doesn't change the end result.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Stable-dep-of: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 9f66cff212bb3c1cd25996aaa0dfd0c9e9d8baab ]
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is used only accessed to install it into wq->cpu_pwq
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.
- Make wq->dfl_pwq RCU protected.
- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
and ->cpu_pwq. The former returns the double pointer that can be used
access and update the pwqs. The latter performs locking check and
dereferences the double pointer.
- pwq accesses and updates are converted to use unbound_pwq[_slot]().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Stable-dep-of: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit c5404d4e6df6faba1007544b5f4e62c7c14416dd ]
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once activating all
that could be activated. While this makes sense with per-pwq nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.
In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Stable-dep-of: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 1c270b79ce0b8290f146255ea9057243f6dd3c17 ]
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.
- pwq_tryinc_nr_active() is added. It increments nr_active if under
max_active limit and return a boolean indicating whether inc was
successful. Note that the function is structured to accommodate future
changes. __queue_work() is updated to use the new helper.
- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
thus no longer assumes that nr_active is under max_active and returns a
boolean to indicate whether a work item has been activated.
- wq_adjust_max_active() no longer tests directly whether a work item can be
activated. Instead, it's updated to use the return value of
pwq_activate_first_inactive() to tell whether a work item has been
activated.
- nr_active decrement and activating the first inactive work item is
factored into pwq_dec_nr_active().
v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
now we're calling the function unconditionally from
pwq_activate_first_inactive().
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Stable-dep-of: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 4c6380305d21e36581b451f7337a36c93b64e050 ]
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.
pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
testing. This is temporary and will be removed by the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Stable-dep-of: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit afa87ce85379e2d93863fce595afdb5771a84004 ]
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Stable-dep-of: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit a045a272d887575da17ad86d6573e82871b50c27 ]
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes max_active update process
more complicated and gets in the way of the planned max_active semantic
updates for unbound workqueues.
This patches moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.
As wq->max_active is updated while holding wq mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking locking rule WO is
added for it.
v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Stable-dep-of: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 31c89007285d365aa36f71d8fb0701581c770a27 ]
Currently we limit the size of the workqueue name to 24 characters due to
commit ecf6881ff349 ("workqueue: make workqueue->name[] fixed len")
Increase the size to 32 characters and print a warning in the event
the requested name is larger than the limit of 32 characters.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Stable-dep-of: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Do not allow large strings (> 4096) as single write to trace_marker
The size of a string written into trace_marker was determined by the
size of the sub-buffer in the ring buffer. That size is dependent on
the PAGE_SIZE of the architecture as it can be mapped into user
space. But on PowerPC, where PAGE_SIZE is 64K, that made the limit of
the string of writing into trace_marker 64K.
One of the selftests looks at the size of the ring buffer sub-buffers
and writes that plus more into the trace_marker. The write will take
what it can and report back what it consumed so that the user space
application (like echo) will write the rest of the string. The string
is stored in the ring buffer and can be read via the "trace" or
"trace_pipe" files.
The reading of the ring buffer uses vsnprintf(), which uses a
precision "%.*s" to make sure it only reads what is stored in the
buffer, as a bug could cause the string to be non terminated.
With the combination of the precision change and the PAGE_SIZE of 64K
allowing huge strings to be added into the ring buffer, plus the test
that would actually stress that limit, a bug was reported that the
precision used was too big for "%.*s" as the string was close to 64K
in size and the max precision of vsnprintf is 32K.
Linus suggested not to have that precision as it could hide a bug if
the string was again stored without a nul byte.
Another issue that was brought up is that the trace_seq buffer is
also based on PAGE_SIZE even though it is not tied to the
architecture limit like the ring buffer sub-buffer is. Having it be
64K * 2 is simply just too big and wasting memory on systems with 64K
page sizes. It is now hardcoded to 8K which is what all other
architectures with 4K PAGE_SIZE has.
Finally, the write to trace_marker is now limited to 4K as there is
no reason to write larger strings into trace_marker.
- ring_buffer_wait() should not loop.
The ring_buffer_wait() does not have the full context (yet) on if it
should loop or not. Just exit the loop as soon as its woken up and
let the callers decide to loop or not (they already do, so it's a bit
redundant).
- Fix shortest_full field to be the smallest amount in the ring buffer
that a waiter is waiting for. The "shortest_full" field is updated
when a new waiter comes in and wants to wait for a smaller amount of
data in the ring buffer than other waiters. But after all waiters are
woken up, it's not reset, so if another waiter comes in wanting to
wait for more data, it will be woken up when the ring buffer has a
smaller amount from what the previous waiters were waiting for.
- The wake up all waiters on close is incorrectly called frome
.release() and not from .flush() so it will never wake up any waiters
as the .release() will not get called until all .read() calls are
finished. And the wakeup is for the waiters in those .read() calls.
* tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Use .flush() call to wake up readers
ring-buffer: Fix resetting of shortest_full
ring-buffer: Fix waking up ring buffer readers
tracing: Limit trace_marker writes to just 4K
tracing: Limit trace_seq size to just 8K and not depend on architecture PAGE_SIZE
tracing: Remove precision vsnprintf() check from print event
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The .release() function does not get called until all readers of a file
descriptor are finished.
If a thread is blocked on reading a file descriptor in ring_buffer_wait(),
and another thread closes the file descriptor, it will not wake up the
other thread as ring_buffer_wake_waiters() is called by .release(), and
that will not get called until the .read() is finished.
The issue originally showed up in trace-cmd, but the readers are actually
other processes with their own file descriptors. So calling close() would wake
up the other tasks because they are blocked on another descriptor then the
one that was closed(). But there's other wake ups that solve that issue.
When a thread is blocked on a read, it can still hang even when another
thread closed its descriptor.
This is what the .flush() callback is for. Have the .flush() wake up the
readers.
Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The "shortest_full" variable is used to keep track of the waiter that is
waiting for the smallest amount on the ring buffer before being woken up.
When a tasks waits on the ring buffer, it passes in a "full" value that is
a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to
100% full buffer.
As all waiters are on the same wait queue, the wake up happens for the
waiter with the smallest percentage.
The problem is that the smallest_full on the cpu_buffer that stores the
smallest amount doesn't get reset when all the waiters are woken up. It
does get reset when the ring buffer is reset (echo > /sys/kernel/tracing/trace).
This means that tasks may be woken up more often then when they want to
be. Instead, have the shortest_full field get reset just before waking up
all the tasks. If the tasks wait again, they will update the shortest_full
before sleeping.
Also add locking around setting of shortest_full in the poll logic, and
change "work" to "rbwork" to match the variable name for rb_irq_work
structures that are used in other places.
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
A task can wait on a ring buffer for when it fills up to a specific
watermark. The writer will check the minimum watermark that waiters are
waiting for and if the ring buffer is past that, it will wake up all the
waiters.
The waiters are in a wait loop, and will first check if a signal is
pending and then check if the ring buffer is at the desired level where it
should break out of the loop.
If a file that uses a ring buffer closes, and there's threads waiting on
the ring buffer, it needs to wake up those threads. To do this, a
"wait_index" was used.
Before entering the wait loop, the waiter will read the wait_index. On
wakeup, it will check if the wait_index is different than when it entered
the loop, and will exit the loop if it is. The waker will only need to
update the wait_index before waking up the waiters.
This had a couple of bugs. One trivial one and one broken by design.
The trivial bug was that the waiter checked the wait_index after the
schedule() call. It had to be checked between the prepare_to_wait() and
the schedule() which it was not.
The main bug is that the first check to set the default wait_index will
always be outside the prepare_to_wait() and the schedule(). That's because
the ring_buffer_wait() doesn't have enough context to know if it should
break out of the loop.
The loop itself is not needed, because all the callers to the
ring_buffer_wait() also has their own loop, as the callers have a better
sense of what the context is to decide whether to break out of the loop
or not.
Just have the ring_buffer_wait() block once, and if it gets woken up, exit
the function and let the callers decide what to do next.
Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Limit the max print event of trace_marker to just 4K string size. This must
also be less than the amount that can be held by a trace_seq along with
the text that is before the output (like the task name, PID, CPU, state,
etc). As trace_seq is made to handle large events (some greater than 4K).
Make the max size of a trace_marker write event be 4K which is guaranteed
to fit in the trace_seq buffer.
Link: https://lore.kernel.org/linux-trace-kernel/20240304223433.4ba47dff@gandalf.local.home
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts 60be76eeabb3d ("tracing: Add size check when printing
trace_marker output"). The only reason the precision check was added
was because of a bug that miscalculated the write size of the string into
the ring buffer and it truncated it removing the terminating nul byte. On
reading the trace it crashed the kernel. But this was due to the bug in
the code that happened during development and should never happen in
practice. If anything, the precision can hide bugs where the string in the
ring buffer isn't nul terminated and it will not be checked.
Link: https://lore.kernel.org/all/C7E7AF1A-D30F-4D18-B8E5-AF1EF58004F5@linux.ibm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279ac2@gandalf.local.home
Link: https://lore.kernel.org/all/20240302111244.3a1674be@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240304174341.2a561d9f@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf, ipsec and netfilter.
No solution yet for the stmmac issue mentioned in the last PR, but it
proved to be a lockdep false positive, not a blocker.
Current release - regressions:
- dpll: move all dpll<>netdev helpers to dpll code, fix build
regression with old compilers
Current release - new code bugs:
- page_pool: fix netlink dump stop/resume
Previous releases - regressions:
- bpf: fix verifier to check bpf_func_state->callback_depth when
pruning states as otherwise unsafe programs could get accepted
- ipv6: avoid possible UAF in ip6_route_mpath_notify()
- ice: reconfig host after changing MSI-X on VF
- mlx5:
- e-switch, change flow rule destination checking
- add a memory barrier to prevent a possible null-ptr-deref
- switch to using _bh variant of of spinlock where needed
Previous releases - always broken:
- netfilter: nf_conntrack_h323: add protection for bmp length out of
range
- bpf: fix to zero-initialise xdp_rxq_info struct before running XDP
program in CPU map which led to random xdp_md fields
- xfrm: fix UDP encapsulation in TX packet offload
- netrom: fix data-races around sysctls
- ice:
- fix potential NULL pointer dereference in ice_bridge_setlink()
- fix uninitialized dplls mutex usage
- igc: avoid returning frame twice in XDP_REDIRECT
- i40e: disable NAPI right after disabling irqs when handling
xsk_pool
- geneve: make sure to pull inner header in geneve_rx()
- sparx5: fix use after free inside sparx5_del_mact_entry
- dsa: microchip: fix register write order in ksz8_ind_write8()
Misc:
- selftests: mptcp: fixes for diag.sh"
* tag 'net-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
net: pds_core: Fix possible double free in error handling path
netrom: Fix data-races around sysctl_net_busy_read
netrom: Fix a data-race around sysctl_netrom_link_fails_count
netrom: Fix a data-race around sysctl_netrom_routing_control
netrom: Fix a data-race around sysctl_netrom_transport_no_activity_timeout
netrom: Fix a data-race around sysctl_netrom_transport_requested_window_size
netrom: Fix a data-race around sysctl_netrom_transport_busy_delay
netrom: Fix a data-race around sysctl_netrom_transport_acknowledge_delay
netrom: Fix a data-race around sysctl_netrom_transport_maximum_tries
netrom: Fix a data-race around sysctl_netrom_transport_timeout
netrom: Fix data-races around sysctl_netrom_network_ttl_initialiser
netrom: Fix a data-race around sysctl_netrom_obsolescence_count_initialiser
netrom: Fix a data-race around sysctl_netrom_default_path_quality
netfilter: nf_conntrack_h323: Add protection for bmp length out of range
netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout
netfilter: nft_ct: fix l3num expectations with inet pseudo family
netfilter: nf_tables: reject constant set with timeout
netfilter: nf_tables: disallow anonymous set with timeout flag
net/rds: fix WARNING in rds_conn_connect_if_down
net: dsa: microchip: fix register write order in ksz8_ind_write8()
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When running an XDP program that is attached to a cpumap entry, we don't
initialise the xdp_rxq_info data structure being used in the xdp_buff
that backs the XDP program invocation. Tobias noticed that this leads to
random values being returned as the xdp_md->rx_queue_index value for XDP
programs running in a cpumap.
This means we're basically returning the contents of the uninitialised
memory, which is bad. Fix this by zero-initialising the rxq data
structure before running the XDP program.
Fixes: 9216477449f3 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap")
Reported-by: Tobias Böhm <tobias@aibor.de>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240305213132.11955-1-toke@redhat.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When comparing current and cached states verifier should consider
bpf_func_state->callback_depth. Current state cannot be pruned against
cached state, when current states has more iterations left compared to
cached state. Current state has more iterations left when it's
callback_depth is smaller.
Below is an example illustrating this bug, minimized from mailing list
discussion [0] (assume that BPF_F_TEST_STATE_FREQ is set).
The example is not a safe program: if loop_cb point (1) is followed by
loop_cb point (2), then division by zero is possible at point (4).
struct ctx {
__u64 a;
__u64 b;
__u64 c;
};
static void loop_cb(int i, struct ctx *ctx)
{
/* assume that generated code is "fallthrough-first":
* if ... == 1 goto
* if ... == 2 goto
* <default>
*/
switch (bpf_get_prandom_u32()) {
case 1: /* 1 */ ctx->a = 42; return 0; break;
case 2: /* 2 */ ctx->b = 42; return 0; break;
default: /* 3 */ ctx->c = 42; return 0; break;
}
}
SEC("tc")
__failure
__flag(BPF_F_TEST_STATE_FREQ)
int test(struct __sk_buff *skb)
{
struct ctx ctx = { 7, 7, 7 };
bpf_loop(2, loop_cb, &ctx, 0); /* 0 */
/* assume generated checks are in-order: .a first */
if (ctx.a == 42 && ctx.b == 42 && ctx.c == 7)
asm volatile("r0 /= 0;":::"r0"); /* 4 */
return 0;
}
Prior to this commit verifier built the following checkpoint tree for
this example:
.------------------------------------- Checkpoint / State name
| .-------------------------------- Code point number
| | .---------------------------- Stack state {ctx.a,ctx.b,ctx.c}
| | | .------------------- Callback depth in frame #0
v v v v
- (0) {7P,7P,7},depth=0
- (3) {7P,7P,7},depth=1
- (0) {7P,7P,42},depth=1
- (3) {7P,7,42},depth=2
- (0) {7P,7,42},depth=2 loop terminates because of depth limit
- (4) {7P,7,42},depth=0 predicted false, ctx.a marked precise
- (6) exit
(a) - (2) {7P,7,42},depth=2
- (0) {7P,42,42},depth=2 loop terminates because of depth limit
- (4) {7P,42,42},depth=0 predicted false, ctx.a marked precise
- (6) exit
(b) - (1) {7P,7P,42},depth=2
- (0) {42P,7P,42},depth=2 loop terminates because of depth limit
- (4) {42P,7P,42},depth=0 predicted false, ctx.{a,b} marked precise
- (6) exit
- (2) {7P,7,7},depth=1 considered safe, pruned using checkpoint (a)
(c) - (1) {7P,7P,7},depth=1 considered safe, pruned using checkpoint (b)
Here checkpoint (b) has callback_depth of 2, meaning that it would
never reach state {42,42,7}.
While checkpoint (c) has callback_depth of 1, and thus
could yet explore the state {42,42,7} if not pruned prematurely.
This commit makes forbids such premature pruning,
allowing verifier to explore states sub-tree starting at (c):
(c) - (1) {7,7,7P},depth=1
- (0) {42P,7,7P},depth=1
...
- (2) {42,7,7},depth=2
- (0) {42,42,7},depth=2 loop terminates because of depth limit
- (4) {42,42,7},depth=0 predicted true, ctx.{a,b,c} marked precise
- (5) division by zero
[0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/
Fixes: bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations")
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240222154121.6991-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Two cpuset fixes. Both are for bugs in error handling paths and low
risk"
* tag 'cgroup-for-6.8-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Fix retval in update_cpumask()
cgroup/cpuset: Fix a memory leak in update_exclusive_cpumask()
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The update_cpumask(), checks for newly requested cpumask by calling
validate_change(), which returns an error on passing an invalid set
of cpu(s). Independent of the error returned, update_cpumask() always
returns zero, suppressing the error and returning success to the user
on writing an invalid cpu range for a cpuset. Fix it by returning
retval instead, which is returned by validate_change().
Fixes: 99fe36ba6fc1 ("cgroup/cpuset: Improve temporary cpumasks handling")
Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Fix a possible memory leak in update_exclusive_cpumask() by moving the
alloc_cpumasks() down after the validate_change() check which can fail
and still before the temporary cpumasks are needed.
Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Reported-and-tested-by: Mirsad Todorovac <mirsad.todorovac@alu.hr>
Closes: https://lore.kernel.org/lkml/14915689-27a3-4cd8-80d2-9c30d0c768b6@alu.unizg.hr
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v6.7+
|
|\ \ \
| |/ /
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull fprobe fix from Masami Hiramatsu:
- allocate entry_data_size buffer for each rethook instance.
This fixes a buffer overrun bug (which leads a kernel crash)
when fprobe user uses its entry_data in the entry_handler.
* tag 'probes-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fprobe: Fix to allocate entry_data_size buffer with rethook instances
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Fix to allocate fprobe::entry_data_size buffer with rethook instances.
If fprobe doesn't allocate entry_data_size buffer for each rethook instance,
fprobe entry handler can cause a buffer overrun when storing entry data in
entry handler.
Link: https://lore.kernel.org/all/170920576727.107552.638161246679734051.stgit@devnote2/
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Closes: https://lore.kernel.org/all/Zd9eBn2FTQzYyg7L@krava/
Fixes: 4bbd93455659 ("kprobes: kretprobe scalability improvement")
Cc: stable@vger.kernel.org
Tested-by: Jiri Olsa <olsajiri@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|