summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'x86-urgent-2023-09-01' of ↵Linus Torvalds2023-09-015-13/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Dave Hansen: "The most important fix here adds a missing CPU model to the recent Gather Data Sampling (GDS) mitigation list to ensure that mitigations are available on that CPU. There are also a pair of warning fixes, and closure of a covert channel that pops up when protection keys are disabled. Summary: - Mark all Skylake CPUs as vulnerable to GDS - Fix PKRU covert channel - Fix -Wmissing-variable-declarations warning for ia32_xyz_class - Fix kernel-doc annotation warning" * tag 'x86-urgent-2023-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu/xstate: Fix PKRU covert channel x86/irq/i8259: Fix kernel-doc annotation warning x86/speculation: Mark all Skylake CPUs as vulnerable to GDS x86/audit: Fix -Wmissing-variable-declarations warning for ia32_xyz_class
| * x86/fpu/xstate: Fix PKRU covert channelJim Mattson2023-08-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | When XCR0[9] is set, PKRU can be read and written from userspace with XSAVE and XRSTOR, even when CR4.PKE is clear. Clear XCR0[9] when protection keys are disabled. Reported-by: Tavis Ormandy <taviso@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230831043228.1194256-1-jmattson@google.com
| * x86/irq/i8259: Fix kernel-doc annotation warningVincenzo Palazzo2023-08-311-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | Fix this warning: arch/x86/kernel/i8259.c:235: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * ELCR registers (0x4d0, 0x4d1) control edge/level of IRQ CC arch/x86/kernel/irqinit.o Signed-off-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230830131211.88226-1-vincenzopalazzodev@gmail.com
| * x86/speculation: Mark all Skylake CPUs as vulnerable to GDSDave Hansen2023-08-311-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Gather Data Sampling (GDS) vulnerability is common to all Skylake processors. However, the "client" Skylakes* are now in this list: https://www.intel.com/content/www/us/en/support/articles/000022396/processors.html which means they are no longer included for new vulnerabilities here: https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html or in other GDS documentation. Thus, they were not included in the original GDS mitigation patches. Mark SKYLAKE and SKYLAKE_L as vulnerable to GDS to match all the other Skylake CPUs (which include Kaby Lake). Also group the CPUs so that the ones that share the exact same vulnerabilities are next to each other. Last, move SRBDS to the end of each line. This makes it clear at a glance that SKYLAKE_X is unique. Of the five Skylakes, it is the only "server" CPU and has a different implementation from the clients of the "special register" hardware, making it immune to SRBDS. This makes the diff much harder to read, but the resulting table is worth it. I very much appreciate the report from Michael Zhivich about this issue. Despite what level of support a hardware vendor is providing, the kernel very much needs an accurate and up-to-date list of vulnerable CPUs. More reports like this are very welcome. * Client Skylakes are CPUID 406E3/506E3 which is family 6, models 0x4E and 0x5E, aka INTEL_FAM6_SKYLAKE and INTEL_FAM6_SKYLAKE_L. Reported-by: Michael Zhivich <mzhivich@akamai.com> Fixes: 8974eb588283 ("x86/speculation: Add Gather Data Sampling mitigation") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
| * x86/audit: Fix -Wmissing-variable-declarations warning for ia32_xyz_classJustin Stitt2023-08-302-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When building x86 defconfig with Clang-18 I get the following warnings: | arch/x86/ia32/audit.c:6:10: warning: no previous extern declaration for non-static variable 'ia32_dir_class' [-Wmissing-variable-declarations] | 6 | unsigned ia32_dir_class[] = { | arch/x86/ia32/audit.c:11:10: warning: no previous extern declaration for non-static variable 'ia32_chattr_class' [-Wmissing-variable-declarations] | 11 | unsigned ia32_chattr_class[] = { | arch/x86/ia32/audit.c:16:10: warning: no previous extern declaration for non-static variable 'ia32_write_class' [-Wmissing-variable-declarations] | 16 | unsigned ia32_write_class[] = { | arch/x86/ia32/audit.c:21:10: warning: no previous extern declaration for non-static variable 'ia32_read_class' [-Wmissing-variable-declarations] | 21 | unsigned ia32_read_class[] = { | arch/x86/ia32/audit.c:26:10: warning: no previous extern declaration for non-static variable 'ia32_signal_class' [-Wmissing-variable-declarations] | 26 | unsigned ia32_signal_class[] = { These warnings occur due to their respective extern declarations being scoped inside of audit_classes_init as well as only being enabled with `CONFIG_IA32_EMULATION=y`: | static int __init audit_classes_init(void) | { | #ifdef CONFIG_IA32_EMULATION | extern __u32 ia32_dir_class[]; | extern __u32 ia32_write_class[]; | extern __u32 ia32_read_class[]; | extern __u32 ia32_chattr_class[]; | audit_register_class(AUDIT_CLASS_WRITE_32, ia32_write_class); | audit_register_class(AUDIT_CLASS_READ_32, ia32_read_class); | audit_register_class(AUDIT_CLASS_DIR_WRITE_32, ia32_dir_class); | audit_register_class(AUDIT_CLASS_CHATTR_32, ia32_chattr_class); | #endif | audit_register_class(AUDIT_CLASS_WRITE, write_class); | audit_register_class(AUDIT_CLASS_READ, read_class); | audit_register_class(AUDIT_CLASS_DIR_WRITE, dir_class); | audit_register_class(AUDIT_CLASS_CHATTR, chattr_class); | return 0; | } Lift the extern declarations to their own header and resolve scoping issues (and thus fix the warnings). Moreover, change __u32 to unsigned so that we match the definitions: | unsigned ia32_dir_class[] = { | #include <asm-generic/audit_dir_write.h> | ~0U | }; | | unsigned ia32_chattr_class[] = { | #include <asm-generic/audit_change_attr.h> | ~0U | }; | ... This patch is similar to commit: 0e5e3d4461a22d73 ("x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()") [1] Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/all/20200516123816.2680-1-b.thiel@posteo.de/ [1] Link: https://github.com/ClangBuiltLinux/linux/issues/1920 Link: https://lore.kernel.org/r/20230829-missingvardecl-audit-v1-1-34efeb7f3539@google.com
* | Merge tag 'trace-v6.6' of ↵Linus Torvalds2023-09-0118-169/+1424
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "User visible changes: - Added a way to easier filter with cpumasks: # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter - Show actual size of ring buffer after modifying the ring buffer size via buffer_size_kb. Currently it just returns what was written, but the actual size rounds up to the sub buffer size. Show that real size instead. Major changes: - Added "eventfs". This is the code that handles the inodes and dentries of tracefs/events directory. As there are thousands of events, and each event has several inodes and dentries that currently exist even when tracing is never used, they take up precious memory. Instead, eventfs will allocate the inodes and dentries in a JIT way (similar to what procfs does). There is now metadata that handles the events and subdirectories, and will create the inodes and dentries when they are used. Note, I also have patches that remove the subdirectory meta data, but will wait till the next merge window before applying them. It's a little more complex, and I want to make sure the dynamic code works properly before adding more complexity, making it easier to revert if need be. Minor changes: - Optimization to user event list traversal - Remove intermediate permission of tracefs files (note the intermediate permission removes all access to the files so it is not a security concern, but just a clean up) - Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic - Other minor cleanups" * tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits) tracefs: Remove kerneldoc from struct eventfs_file tracefs: Avoid changing i_mode to a temp value tracing/user_events: Optimize safe list traversals ftrace: Remove empty declaration ftrace_enable_daemon() and ftrace_disable_daemon() tracing: Remove unused function declarations tracing/filters: Document cpumask filtering tracing/filters: Further optimise scalar vs cpumask comparison tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU tracing/filters: Enable filtering the CPU common field by a cpumask tracing/filters: Enable filtering a scalar field by a cpumask tracing/filters: Enable filtering a cpumask field by another cpumask tracing/filters: Dynamically allocate filter_pred.regex test: ftrace: Fix kprobe test for eventfs eventfs: Move tracing/events to eventfs eventfs: Implement removal of meta data from eventfs eventfs: Implement functions to create files and dirs when accessed eventfs: Implement eventfs lookup, read, open functions eventfs: Implement eventfs file add functions ...
| * | tracefs: Remove kerneldoc from struct eventfs_fileSteven Rostedt (Google)2023-08-221-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The struct eventfs_file is a local structure and should not be parsed by kernel doc. It also does not fully follow the kerneldoc format and is causing kerneldoc to spit out errors. Replace the /** to /* so that kerneldoc no longer processes this structure. Also format the comments of the delete union of the structure to be a bit better. Link: https://lore.kernel.org/linux-trace-kernel/20230818201414.2729745-1-willy@infradead.org/ Link: https://lore.kernel.org/linux-trace-kernel/20230822053313.77aa3397@rorschach.local.home Cc: Mark Rutland <mark.rutland@arm.com> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracefs: Avoid changing i_mode to a temp valueSishuai Gong2023-08-221-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now inode->i_mode is updated twice to reach the desired value in tracefs_apply_options(). Because there is no lock protecting the two writes, other threads might read the intermediate value of inode->i_mode. Thread-1 Thread-2 // tracefs_apply_options() //e.g., acl_permission_check inode->i_mode &= ~S_IALLUGO; unsigned int mode = inode->i_mode; inode->i_mode |= opts->mode; I think there is no need to introduce a lock but it is better to only update inode->i_mode ONCE, so the readers will either see the old or latest value, rather than an intermediate/temporary value. Note, the race is not a security concern as the intermediate value is more locked down than either the start or end version. This is more just to do the conversion cleanly. Link: https://lore.kernel.org/linux-trace-kernel/AB5B0A1C-75D9-4E82-A7F0-CF7D0715587B@gmail.com Signed-off-by: Sishuai Gong <sishuai.system@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/user_events: Optimize safe list traversalsEric Vaughn2023-08-221-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several of the list traversals in the user_events facility use safe list traversals where they could be using the unsafe versions instead. Replace these safe traversals with their unsafe counterparts in the interest of optimization. Link: https://lore.kernel.org/linux-trace-kernel/20230810194337.695983-1-ervaughn@linux.microsoft.com Suggested-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Eric Vaughn <ervaughn@linux.microsoft.com> Acked-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | ftrace: Remove empty declaration ftrace_enable_daemon() and ↵Zhang Zekun2023-08-221-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ftrace_disable_daemon() The definition of ftrace_enable_daemon() and ftrace_disable_daemon() has been removed since commit cb7be3b2fc2c ("ftrace: remove daemon"), remain the declarations in the header files, so remove it. Link: https://lore.kernel.org/linux-trace-kernel/20230804013636.115940-1-zhangzekun11@huawei.com Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing: Remove unused function declarationsYue Haibing2023-08-221-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9457158bbc0e ("tracing: Fix reset of time stamps during trace_clock changes") left behind tracing_reset_current() declaration. Also commit 6954e415264e ("tracing: Place trace_pid_list logic into abstract functions") removed trace_free_pid_list() implementation but leave declaration. Link: https://lore.kernel.org/linux-trace-kernel/20230803144028.25492-1-yuehaibing@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Document cpumask filteringValentin Schneider2023-08-221-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cpumask, scalar and CPU fields can now be filtered by a user-provided cpumask, document the syntax. Link: https://lkml.kernel.org/r/20230707172155.70873-10-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Further optimise scalar vs cpumask comparisonValentin Schneider2023-08-221-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per the previous commits, we now only enter do_filter_scalar_cpumask() with a mask of weight greater than one. Optimise the equality checks. Link: https://lkml.kernel.org/r/20230707172155.70873-9-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a ↵Valentin Schneider2023-08-221-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | single CPU Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. In this case we can directly re-use filter_pred_cpu(), we just need to transform '&' into '==' before executing it. Link: https://lkml.kernel.org/r/20230707172155.70873-8-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Optimise scalar vs cpumask filtering when the user mask is ↵Valentin Schneider2023-08-221-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | a single CPU Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. When the mask contains a single CPU, directly re-use the unsigned field predicate functions. Transform '&' into '==' beforehand. Link: https://lkml.kernel.org/r/20230707172155.70873-7-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a ↵Valentin Schneider2023-08-221-1/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | single CPU Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. Reuse do_filter_scalar_cpumask() when the input mask has a weight of one. Link: https://lkml.kernel.org/r/20230707172155.70873-6-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Enable filtering the CPU common field by a cpumaskValentin Schneider2023-08-221-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tracing_cpumask lets us specify which CPUs are traced in a buffer instance, but doesn't let us do this on a per-event basis (unless one creates an instance per event). A previous commit added filtering scalar fields by a user-given cpumask, make this work with the CPU common field as well. This enables doing things like $ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \ -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}' Link: https://lkml.kernel.org/r/20230707172155.70873-5-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Enable filtering a scalar field by a cpumaskValentin Schneider2023-08-221-11/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several events use a scalar field to denote a CPU: o sched_wakeup.target_cpu o sched_migrate_task.orig_cpu,dest_cpu o sched_move_numa.src_cpu,dst_cpu o ipi_send_cpu.cpu o ... Filtering these currently requires using arithmetic comparison functions, which can be tedious when dealing with interleaved SMT or NUMA CPU ids. Allow these to be filtered by a user-provided cpumask, which enables e.g.: $ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}' Link: https://lkml.kernel.org/r/20230707172155.70873-4-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Enable filtering a cpumask field by another cpumaskValentin Schneider2023-08-222-2/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recently introduced ipi_send_cpumask trace event contains a cpumask field, but it currently cannot be used in filter expressions. Make event filtering aware of cpumask fields, and allow these to be filtered by a user-provided cpumask. The user-provided cpumask is to be given in cpulist format and wrapped as: "CPUS{$cpulist}". The use of curly braces instead of parentheses is to prevent predicate_parse() from parsing the contents of CPUS{...} as a full-fledged predicate subexpression. This enables e.g.: $ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}' Link: https://lkml.kernel.org/r/20230707172155.70873-3-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing/filters: Dynamically allocate filter_pred.regexValentin Schneider2023-08-221-25/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array in the regex field, even if the predicate function does not use the field. A later commit will introduce a dynamically allocated cpumask to struct filter_pred, which will require a dedicated freeing function. Bite the bullet and make filter_pred.regex dynamically allocated. While at it, reorder the fields of filter_pred to fill in the byte holes. The struct now fits on a single cacheline. No change in behaviour intended. The kfree()'s were patched via Coccinelle: @@ struct filter_pred *pred; @@ -kfree(pred); +free_predicate(pred); Link: https://lkml.kernel.org/r/20230707172155.70873-2-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | test: ftrace: Fix kprobe test for eventfsAjay Kaher2023-07-312-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kprobe_args_char.tc, kprobe_args_string.tc has validation check for tracefs_create_dir, for eventfs it should be eventfs_create_dir. Link: https://lkml.kernel.org/r/1690568452-46553-11-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | eventfs: Move tracing/events to eventfsAjay Kaher2023-07-314-33/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Up until now, /sys/kernel/tracing/events was no different than any other part of tracefs. The files and directories within the events directory was created when the tracefs was mounted, and also created for the instances in /sys/kernel/tracing/instances/<instance>/events. Most of these files and directories will never be referenced. Since there are thousands of these files and directories they spend their time wasting precious memory resources. Move the "events" directory to the new eventfs. The eventfs will take the meta data of the events that they represent and store that. When the files in the events directory are referenced, the dentry and inodes to represent them are then created. When the files are no longer referenced, they are freed. This saves the precious memory resources that were wasted on these seldom referenced dentries and inodes. Running the following: ~# cat /proc/meminfo /proc/slabinfo > before.out ~# mkdir /sys/kernel/tracing/instances/foo ~# cat /proc/meminfo /proc/slabinfo > after.out to test the changes produces the following deltas: Before this change: Before after deltas for meminfo: MemFree: -32260 MemAvailable: -21496 KReclaimable: 21528 Slab: 22440 SReclaimable: 21528 SUnreclaim: 912 VmallocUsed: 16 Before after deltas for slabinfo: <slab>: <objects> [ * <size> = <total>] tracefs_inode_cache: 14472 [* 1184 = 17134848] buffer_head: 24 [* 168 = 4032] hmem_inode_cache: 28 [* 1480 = 41440] dentry: 14450 [* 312 = 4508400] lsm_inode_cache: 14453 [* 32 = 462496] vma_lock: 11 [* 152 = 1672] vm_area_struct: 2 [* 184 = 368] trace_event_file: 1748 [* 88 = 153824] kmalloc-256: 1072 [* 256 = 274432] kmalloc-64: 2842 [* 64 = 181888] Total slab additions in size: 22,763,400 bytes With this change: Before after deltas for meminfo: MemFree: -12600 MemAvailable: -12580 Cached: 24 Active: 12 Inactive: 68 Inactive(anon): 48 Active(file): 12 Inactive(file): 20 Dirty: -4 AnonPages: 68 KReclaimable: 12 Slab: 1856 SReclaimable: 12 SUnreclaim: 1844 KernelStack: 16 PageTables: 36 VmallocUsed: 16 Before after deltas for slabinfo: <slab>: <objects> [ * <size> = <total>] tracefs_inode_cache: 108 [* 1184 = 127872] buffer_head: 24 [* 168 = 4032] hmem_inode_cache: 18 [* 1480 = 26640] dentry: 127 [* 312 = 39624] lsm_inode_cache: 152 [* 32 = 4864] vma_lock: 67 [* 152 = 10184] vm_area_struct: -12 [* 184 = -2208] trace_event_file: 1764 [* 96 = 169344] kmalloc-96: 14322 [* 96 = 1374912] kmalloc-64: 2814 [* 64 = 180096] kmalloc-32: 1103 [* 32 = 35296] kmalloc-16: 2308 [* 16 = 36928] kmalloc-8: 12800 [* 8 = 102400] Total slab additions in size: 2,109,984 bytes Which is a savings of 20,653,416 bytes (20 MB) per tracing instance. Link: https://lkml.kernel.org/r/1690568452-46553-10-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | eventfs: Implement removal of meta data from eventfsAjay Kaher2023-07-302-0/+147
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When events are removed from tracefs, the eventfs must be aware of this. The eventfs_remove() removes the meta data from eventfs so that it will no longer create the files associated with that event. When an instance is removed from tracefs, eventfs_remove_events_dir() will remove and clean up the entire "events" directory. The helper function eventfs_remove_rec() is used to clean up and free the associated data from eventfs for both of the added functions. SRCU is used to protect the lists of meta data stored in the eventfs. The eventfs_mutex is used to protect the content of the items in the list. As lookups may be happening as deletions of events are made, the freeing of dentry/inodes and relative information is done after the SRCU grace period has passed. Link: https://lkml.kernel.org/r/1690568452-46553-9-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202305030611.Kas747Ev-lkp@intel.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | eventfs: Implement functions to create files and dirs when accessedAjay Kaher2023-07-303-2/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add create_file() and create_dir() functions to create the files and directories respectively when they are accessed. The functions will be called from the lookup operation of the inode_operations or from the open function of file_operations. Link: https://lkml.kernel.org/r/1690568452-46553-8-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | eventfs: Implement eventfs lookup, read, open functionsAjay Kaher2023-07-302-0/+306
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the inode_operations, file_operations, and helper functions to eventfs: dcache_dir_open_wrapper() eventfs_root_lookup() eventfs_release() eventfs_set_ef_status_free() eventfs_post_create_dir() The inode_operations and file_operations functions will be called from the VFS layer. create_file() and create_dir() are added as stub functions and will be filled in later. Link: https://lkml.kernel.org/r/1690568452-46553-7-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | eventfs: Implement eventfs file add functionsAjay Kaher2023-07-302-0/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the following functions to add files to evenfs: eventfs_add_events_file() to add the data needed to create a specific file located at the top level events directory. The dentry/inode will be created when the events directory is scanned. eventfs_add_file() to add the data needed for files within the directories below the top level events directory. The dentry/inode of the file will be created when the directory that the file is in is scanned. Link: https://lkml.kernel.org/r/1690568452-46553-6-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305051619.9a469a9a-yujie.liu@intel.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | eventfs: Implement eventfs dir creation functionsAjay Kaher2023-07-304-0/+227
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add eventfs_file structure which will hold the properties of the eventfs files and directories. Add following functions to create the directories in eventfs: eventfs_create_events_dir() will create the top level "events" directory within the tracefs file system. eventfs_add_subsystem_dir() creates an eventfs_file descriptor with the given name of the subsystem. eventfs_add_dir() creates an eventfs_file descriptor with the given name of the directory and attached to a eventfs_file of a subsystem. Add tracefs_inode structure to hold the inodes, flags and pointers to private data used by eventfs. Link: https://lkml.kernel.org/r/1690568452-46553-5-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305051619.9a469a9a-yujie.liu@intel.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracefs: Rename and export some tracefs functionsAjay Kaher2023-07-302-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Export a few tracefs functions that will be needed by the eventfs dynamic file system. Rename them to start with "tracefs_" to keep with the name space. start_creating -> tracefs_start_creating failed_creating -> tracefs_failed_creating end_creating -> tracefs_end_creating Link: https://lkml.kernel.org/r/1690568452-46553-4-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | eventfs: Implement tracefs_inode_cacheAjay Kaher2023-07-302-0/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create a kmem cache of tracefs_inodes. To be more efficient, as there are lots of tracefs inodes, create its own cache. This also allows to see how many tracefs inodes have been created. Add helper functions: tracefs_alloc_inode() tracefs_free_inode() get_tracefs() Link: https://lkml.kernel.org/r/1690568452-46553-3-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher <akaher@vmware.com> Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing: Require all trace events to have a TRACE_SYSTEMSteven Rostedt (Google)2023-07-301-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The creation of the trace event directory requires that a TRACE_SYSTEM is defined that the trace event directory is added within the system it was defined in. The code handled the case where a TRACE_SYSTEM was not added, and would then add the event at the events directory. But nothing should be doing this. This code also prevents the implementation of creating dynamic dentrys for the eventfs system. As this path has never been hit on correct code, remove it. If it does get hit, issues a WARN_ON_ONCE() and return ENODEV. Link: https://lkml.kernel.org/r/1690568452-46553-2-git-send-email-akaher@vmware.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Ajay Kaher <akaher@vmware.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing: Set actual size after ring buffer resizeZheng Yejian2023-07-301-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we can resize trace ringbuffer by writing a value into file 'buffer_size_kb', then by reading the file, we get the value that is usually what we wrote. However, this value may be not actual size of trace ring buffer because of the round up when doing resize in kernel, and the actual size would be more useful. Link: https://lore.kernel.org/linux-trace-kernel/20230705002705.576633-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing: Add free_trace_iter_content() helper functionSteven Rostedt (Google)2023-07-301-11/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the trace iterator is created and used by various interfaces, the clean up of it needs to be consistent. Create a free_trace_iter_content() helper function that frees the content of the iterator and use that to clean it up in all places that it is used. Link: https://lkml.kernel.org/r/20230715141348.341887497@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing: Remove unnecessary copying of tr->current_traceSteven Rostedt (Google)2023-07-301-19/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The iterator allocated a descriptor to copy the current_trace. This was done with the assumption that the function pointers might change. But this was a false assuption, as it does not change. There's no reason to make a copy of the current_trace and just use the pointer it points to. This removes needing to manage freeing the descriptor. Worse yet, there's locations that the iterator is used but does make a copy and just uses the pointer. This could cause the actual pointer to the trace descriptor to be freed and not the allocated copy. This is more of a clean up than a fix. Link: https://lkml.kernel.org/r/20230715141348.135792275@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: d7350c3f45694 ("tracing/core: make the read callbacks reentrants") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | ring_buffer: Use try_cmpxchg instead of cmpxchgUros Bizjak2023-07-301-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in ring_buffer.c. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). No functional change intended. Link: https://lore.kernel.org/linux-trace-kernel/20230714154418.8884-1-ubizjak@gmail.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structureSteven Rostedt (Google)2023-07-304-22/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For backward compatibility, older tooling expects to see the kernel_stack event with a "caller" field that is a fixed size array of 8 addresses. The code now supports more than 8 with an added "size" field that states the real number of entries. But the "caller" field still just looks like a fixed size to user space. Since the tracing macros that create the user space format files also creates the structures that those files represent, the kernel_stack event structure had its "caller" field a fixed size of 8, but in reality, when it is allocated on the ring buffer, it can hold more if the stack trace is bigger that 8 functions. The copying of these entries was simply done with a memcpy(): size = nr_entries * sizeof(unsigned long); memcpy(entry->caller, fstack->calls, size); The FORTIFY_SOURCE logic noticed at runtime that when the nr_entries was larger than 8, that the memcpy() was writing more than what the structure stated it can hold and it complained about it. This is because the FORTIFY_SOURCE code is unaware that the amount allocated is actually enough to hold the size. It does not expect that a fixed size field will hold more than the fixed size. This was originally solved by hiding the caller assignment with some pointer arithmetic. ptr = ring_buffer_data(); entry = ptr; ptr += offsetof(typeof(*entry), caller); memcpy(ptr, fstack->calls, size); But it is considered bad form to hide from kernel hardening. Instead, make it work nicely with FORTIFY_SOURCE by adding a new __stack_array() macro that is specific for this one special use case. The macro will take 4 arguments: type, item, len, field (whereas the __array() macro takes just the first three). This macro will act just like the __array() macro when creating the code to deal with the format file that is exposed to user space. But for the kernel, it will turn the caller field into: type item[] __counted_by(field); or for this instance: unsigned long caller[] __counted_by(size); Now the kernel code can expose the assignment of the caller to the FORTIFY_SOURCE and everyone is happy! Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20230713092605.2ddb9788@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Kees Cook <keescook@chromium.org>
* | | Merge tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2023-09-018-804/+1512
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull workqueue updates from Tejun Heo: - Unbound workqueues now support more flexible affinity scopes. The default behavior is to soft-affine according to last level cache boundaries. A work item queued from a given LLC is executed by a worker running on the same LLC but the worker may be moved across cache boundaries as the scheduler sees fit. On machines which multiple L3 caches, which are becoming more popular along with chiplet designs, this improves cache locality while not harming work conservation too much. Unbound workqueues are now also a lot more flexible in terms of execution affinity. Differeing levels of affinity scopes are supported and both the default and per-workqueue affinity settings can be modified dynamically. This should help working around amny of sub-optimal behaviors observed recently with asymmetric ARM CPUs. This involved signficant restructuring of workqueue code. Nothing was reported yet but there's some risk of subtle regressions. Should keep an eye out. - Rescuer workers now has more identifiable comms. - workqueue.unbound_cpus added so that CPUs which can be used by workqueue can be constrained early during boot. - Now that all the in-tree users have been flushed out, trigger warning if system-wide workqueues are flushed. * tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (31 commits) workqueue: fix data race with the pwq->stats[] increment workqueue: Rename rescuer kworker workqueue: Make default affinity_scope dynamically updatable workqueue: Add "Affinity Scopes and Performance" section to documentation workqueue: Implement non-strict affinity scope for unbound workqueues workqueue: Add workqueue_attrs->__pod_cpumask workqueue: Factor out need_more_worker() check and worker wake-up workqueue: Factor out work to worker assignment and collision handling workqueue: Add multiple affinity scopes and interface to select them workqueue: Modularize wq_pod_type initialization workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration workqueue: Generalize unbound CPU pods workqueue: Factor out clearing of workqueue-only attrs fields workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod() workqueue: Initialize unbound CPU pods later in the boot workqueue: Move wq_pod_init() below workqueue_init() workqueue: Rename NUMA related names to use pod instead workqueue: Rename workqueue_attrs->no_numa to ->ordered workqueue: Make unbound workqueues to use per-cpu pool_workqueues workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug ...
| * | | workqueue: fix data race with the pwq->stats[] incrementMirsad Goran Todorovac2023-08-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KCSAN has discovered a data race in kernel/workqueue.c:2598: [ 1863.554079] ================================================================== [ 1863.554118] BUG: KCSAN: data-race in process_one_work / process_one_work [ 1863.554142] write to 0xffff963d99d79998 of 8 bytes by task 5394 on cpu 27: [ 1863.554154] process_one_work (kernel/workqueue.c:2598) [ 1863.554166] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2752) [ 1863.554177] kthread (kernel/kthread.c:389) [ 1863.554186] ret_from_fork (arch/x86/kernel/process.c:145) [ 1863.554197] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) [ 1863.554213] read to 0xffff963d99d79998 of 8 bytes by task 5450 on cpu 12: [ 1863.554224] process_one_work (kernel/workqueue.c:2598) [ 1863.554235] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2752) [ 1863.554247] kthread (kernel/kthread.c:389) [ 1863.554255] ret_from_fork (arch/x86/kernel/process.c:145) [ 1863.554266] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) [ 1863.554280] value changed: 0x0000000000001766 -> 0x000000000000176a [ 1863.554295] Reported by Kernel Concurrency Sanitizer on: [ 1863.554303] CPU: 12 PID: 5450 Comm: kworker/u64:1 Tainted: G L 6.5.0-rc6+ #44 [ 1863.554314] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023 [ 1863.554322] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs] [ 1863.554941] ================================================================== lockdep_invariant_state(true); → pwq->stats[PWQ_STAT_STARTED]++; trace_workqueue_execute_start(work); worker->current_func(work); Moving pwq->stats[PWQ_STAT_STARTED]++; before the line raw_spin_unlock_irq(&pool->lock); resolves the data race without performance penalty. KCSAN detected at least one additional data race: [ 157.834751] ================================================================== [ 157.834770] BUG: KCSAN: data-race in process_one_work / process_one_work [ 157.834793] write to 0xffff9934453f77a0 of 8 bytes by task 468 on cpu 29: [ 157.834804] process_one_work (/home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2606) [ 157.834815] worker_thread (/home/marvin/linux/kernel/linux_torvalds/./include/linux/list.h:292 /home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2752) [ 157.834826] kthread (/home/marvin/linux/kernel/linux_torvalds/kernel/kthread.c:389) [ 157.834834] ret_from_fork (/home/marvin/linux/kernel/linux_torvalds/arch/x86/kernel/process.c:145) [ 157.834845] ret_from_fork_asm (/home/marvin/linux/kernel/linux_torvalds/arch/x86/entry/entry_64.S:312) [ 157.834859] read to 0xffff9934453f77a0 of 8 bytes by task 214 on cpu 7: [ 157.834868] process_one_work (/home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2606) [ 157.834879] worker_thread (/home/marvin/linux/kernel/linux_torvalds/./include/linux/list.h:292 /home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2752) [ 157.834890] kthread (/home/marvin/linux/kernel/linux_torvalds/kernel/kthread.c:389) [ 157.834897] ret_from_fork (/home/marvin/linux/kernel/linux_torvalds/arch/x86/kernel/process.c:145) [ 157.834907] ret_from_fork_asm (/home/marvin/linux/kernel/linux_torvalds/arch/x86/entry/entry_64.S:312) [ 157.834920] value changed: 0x000000000000052a -> 0x0000000000000532 [ 157.834933] Reported by Kernel Concurrency Sanitizer on: [ 157.834941] CPU: 7 PID: 214 Comm: kworker/u64:2 Tainted: G L 6.5.0-rc7-kcsan-00169-g81eaf55a60fc #4 [ 157.834951] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023 [ 157.834958] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs] [ 157.835567] ================================================================== in code: trace_workqueue_execute_end(work, worker->current_func); → pwq->stats[PWQ_STAT_COMPLETED]++; lock_map_release(&lockdep_map); lock_map_release(&pwq->wq->lockdep_map); which needs to be resolved separately. Fixes: 725e8ec59c56c ("workqueue: Add pwq->stats[] and a monitoring script") Cc: Tejun Heo <tj@kernel.org> Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com> Link: https://lore.kernel.org/lkml/20230818194448.29672-1-mirsad.todorovac@alu.unizg.hr/ Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Rename rescuer kworkerAaron Tomlin2023-08-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each CPU-specific and unbound kworker kthread conforms to a particular naming scheme. However, this does not extend to the rescuer kworker. At present, a rescuer kworker is simply named according to its workqueue's name. This can be cryptic. This patch modifies a rescuer to follow the kworker naming scheme. The "R" is indicative of a rescuer and after "-" is its workqueue's name e.g. "kworker/R-ext4-rsv-conver". tj: Use "R" instead of "r" as the prefix to make it more distinctive and consistent with how highpri pools are marked. Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Make default affinity_scope dynamically updatableTejun Heo2023-08-074-13/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While workqueue.default_affinity_scope is writable, it only affects workqueues which are created afterwards and isn't very useful. Instead, let's introduce explicit "default" scope and update the effective scope dynamically when workqueue.default_affinity_scope is changed. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Add "Affinity Scopes and Performance" section to documentationTejun Heo2023-08-071-5/+179
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With affinity scopes and their strictness setting added, unbound workqueues should now be able to cover wide variety of configurations and use cases. Unfortunately, the performance picture is not entirely straight-forward due to a trade-off between efficiency and work-conservation in some situations necessitating manual configuration. This patch adds "Affinity Scopes and Performance" section to Documentation/core-api/workqueue.rst which illustrates the trade-off with a set of experiments and provides some guidelines. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Implement non-strict affinity scope for unbound workqueuesTejun Heo2023-08-075-20/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An unbound workqueue can be served by multiple worker_pools to improve locality. The segmentation is achieved by grouping CPUs into pods. By default, the cache boundaries according to cpus_share_cache() define the CPUs are grouped. Let's a workqueue is allowed to run on all CPUs and the system has two L3 caches. The workqueue would be mapped to two worker_pools each serving one L3 cache domains. While this improves locality, because the pod boundaries are strict, it limits the total bandwidth a given issuer can consume. For example, let's say there is a thread pinned to a CPU issuing enough work items to saturate the whole machine. With the machine segmented into two pods, no matter how many work items it issues, it can only use half of the CPUs on the system. While this limitation has existed for a very long time, it wasn't very pronounced because the affinity grouping used to be always by NUMA nodes. With cache boundaries as the default and support for even finer grained scopes (smt and cpu), it is now an a lot more pressing problem. This patch implements non-strict affinity scope where the pod boundaries aren't enforced strictly. Going back to the previous example, the workqueue would still be mapped to two worker_pools; however, the affinity enforcement would be soft. The workers in both pools would have their cpus_allowed set to the whole machine thus allowing the scheduler to migrate them anywhere on the machine. However, whenever an idle worker is woken up, the workqueue code asks the scheduler to bring back the task within the pod if the worker is outside. ie. work items start executing within its affinity scope but can be migrated outside as the scheduler sees fit. This removes the hard cap on utilization while maintaining the benefits of affinity scopes. After the earlier ->__pod_cpumask changes, the implementation is pretty simple. When non-strict which is the new default: * pool_allowed_cpus() returns @pool->attrs->cpumask instead of ->__pod_cpumask so that the workers are allowed to run on any CPU that the associated workqueues allow. * If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets the field to a CPU within the pod. This would be the first use of task_struct->wake_cpu outside scheduler proper, so it isn't clear whether this would be acceptable. However, other methods of migrating tasks are significantly more expensive and are likely prohibitively so if we want to do this on every work item. This needs discussion with scheduler folks. There is also a race window where setting ->wake_cpu wouldn't be effective as the target task is still on CPU. However, the window is pretty small and this being a best-effort optimization, it doesn't seem to warrant more complexity at the moment. While the non-strict cache affinity scopes seem to be the best option, the performance picture interacts with the affinity scope and is a bit complicated to fully discuss in this patch, so the behavior is made easily selectable through wqattrs and sysfs and the next patch will add documentation to discuss performance implications. v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>
| * | | workqueue: Add workqueue_attrs->__pod_cpumaskTejun Heo2023-08-072-37/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | workqueue_attrs has two uses: * to specify the required unouned workqueue properties by users * to match worker_pool's properties to workqueues by core code For example, if the user wants to restrict a workqueue to run only CPUs 0 and 2, and the two CPUs are on different affinity scopes, the workqueue's attrs->cpumask would contains CPUs 0 and 2, and the workqueue would be associated with two worker_pools, one with attrs->cpumask containing just CPU 0 and the other CPU 2. Workqueue wants to support non-strict affinity scopes where work items are started in their matching affinity scopes but the scheduler is free to migrate them outside the starting scopes, which can enable utilizing the whole machine while maintaining most of the locality benefits from affinity scopes. To enable that, worker_pools need to distinguish the strict affinity that it has to follow (because that's the restriction coming from the user) and the soft affinity that it wants to apply when dispatching work items. Note that two worker_pools with different soft dispatching requirements have to be separate; otherwise, for example, we'd be ping-ponging worker threads across NUMA boundaries constantly. This patch adds workqueue_attrs->__pod_cpumask. The new field is double underscored as it's only used internally to distinguish worker_pools. A worker_pool's ->cpumask is now always the same as the online subset of allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's subset of that ->cpumask. Going back to the example above, both worker_pools would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask would contain 0 while the other's 2. * pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask that the pool's workers must stay within. This is currently always ->__pod_cpumask as all boundaries are still strict. * As a workqueue_attrs can now track both the associated workqueues' cpumask and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external out-argument. Drop @cpumask and instead store the result in ->__pod_cpumask. * The above also simplifies apply_wqattrs_prepare() as the same workqueue_attrs can be used to create all pods associated with a workqueue. tmp_attrs is dropped. * wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq update is needed instead of only comparing ->cpumask so that ->__pod_cpumask is compared too. It can directly compare ->__pod_cpumaks but the code is easier to understand and more robust this way. The only user-visible behavior change is that two workqueues with different cpumasks no longer can share worker_pools even when their pod subsets coincide. Going back to the example, let's say there's another workqueue with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped to two worker_pools - one with CPU 0, the other with 2 and 3. The former has the same cpumask as the first pod of the earlier example and would have shared the same worker_pool but that's no longer the case after this patch. The worker_pools would have the same ->__pod_cpumask but their ->cpumask's wouldn't match. While this is necessary to support non-strict affinity scopes, there can be further optimizations to maintain sharing among strict affinity scopes. However, non-strict affinity scopes are going to be preferable for most use cases and we don't see very diverse mixture of unbound workqueue cpumasks anyway, so the additional overhead doesn't seem to justify the extra complexity. v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by using wqattrs_equal() for comparison instead. - Per-cpu worker pools weren't initializing ->__pod_cpumask which caused a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Factor out need_more_worker() check and worker wake-upTejun Heo2023-08-071-50/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Checking need_more_worker() and calling wake_up_worker() is a repeated pattern. Let's add kick_pool(), which checks need_more_worker() and open-code wake_up_worker(), and replace wake_up_worker() uses. The following conversions aren't one-to-one: * __queue_work() was using __need_more_work() because it knows that pool->worklist isn't empty. Switching to kick_pool() adds an extra list_empty() test. * create_worker() always needs to wake up the newly minted worker whether there's more work to do or not to avoid triggering hung task check on the new task. Keep the current wake_up_process() and still add kick_pool(). This may lead to an extra wakeup which isn't harmful. * pwq_adjust_max_active() was explicitly checking whether it needs to wake up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up a worker when necessary, this explicit check is no longer necessary and dropped. * unbind_workers() now calls kick_pool() instead of wake_up_worker() adding a need_more_worker() test. This avoids spurious wakeups and shouldn't break anything. wake_up_worker() is dropped as kick_pool() replaces all its users. After this patch, all paths that wakes up a non-rescuer worker to initiate work item execution use kick_pool(). This will enable future changes to improve locality. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Factor out work to worker assignment and collision handlingTejun Heo2023-08-071-28/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The two work execution paths in worker_thread() and rescuer_thread() use move_linked_works() to claim work items from @pool->worklist. Once claimed, process_schedule_works() is called which invokes process_one_work() on each work item. process_one_work() then uses find_worker_executing_work() to detect and handle collisions - situations where the work item to be executed is still running on another worker. This works fine, but, to improve work execution locality, we want to establish work to worker association earlier and know for sure that the worker is going to excute the work once asssigned, which requires performing collision handling earlier while trying to assign the work item to the worker. This patch introduces assign_work() which assigns a work item to a worker using move_linked_works() and then performs collision handling. As collision handling is handled earlier, process_one_work() no longer needs to worry about them. After the this patch, collision checks for linked work items are skipped, which should be fine as they can't be queued multiple times concurrently. For work items running from rescuers, the timing of collision handling may change but the invariant that the work items go through collision handling before starting execution does not. This patch shouldn't cause noticeable behavior changes, especially given that worker_thread() behavior remains the same. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Add multiple affinity scopes and interface to select themTejun Heo2023-08-075-12/+193
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE the default. The code changes to actually add the additional scopes are trivial. Also add module parameter "workqueue.default_affinity_scope" to override the default scope and "affinity_scope" sysfs file to configure it per workqueue. wq_dump.py and documentations are updated accordingly. This enables significant flexibility in configuring how unbound workqueues behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu workqueue. On the other hand, "system" removes all locality boundaries. Many modern machines have multiple L3 caches often while being mostly uniform in terms of memory access. Thus, workqueue's previous behavior of spreading work items in each NUMA node had negative performance implications from unncessarily crossing L3 boundaries between issue and execution. However, picking a finer grained affinity scope also has a downside in that an issuer in one group can't utilize CPUs in other groups. While dependent on the specifics of workload, there's usually a noticeable penalty in crossing L3 boundaries, so let's default to CACHE. This issue will be further addressed and documented with examples in future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Modularize wq_pod_type initializationTejun Heo2023-08-071-34/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While wq_pod_type[] can now group CPUs in any aribitrary way, WQ_AFFN_NUM init is hard coded into workqueue_init_topology(). This patch modularizes the init path by introducing init_pod_type() which takes a callback to determine whether two CPUs should share a pod as an argument. init_pod_type() first scans the CPU combinations testing for sharing to assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once ->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with cpus_share_numa() which tests whether the CPU belongs to the same NUMA node. This patch may change the pod ID assigned to each NUMA node but that shouldn't cause any behavior changes as the NUMA node to use for allocations are tracked separately in pod_type->pod_node[]. This makes adding new affinty types pretty easy. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue ↵Tejun Heo2023-08-072-0/+225
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | configuration Lack of visibility has always been a pain point for workqueues. While the recently added wq_monitor.py improved the situation, it's still difficult to understand what worker pools are active in the system, how workqueues map to them and why. The lack of visibility into how workqueues are configured is going to become more noticeable as workqueue improves locality awareness and provides more mechanisms to customize locality related behaviors. Now that the basic framework for more flexible locality support is in place, this is a good time to improve the situation. This patch adds tools/workqueues/wq_dump.py which prints out the topology configuration, worker pools and how workqueues are mapped to pools. Read the command's help message for more details. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Generalize unbound CPU podsTejun Heo2023-08-072-65/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While renamed to pod, the code still assumes that the pods are defined by NUMA boundaries. Let's generalize it: * workqueue_attrs->affn_scope is added. Each enum represents the type of boundaries that define the pods. There are currently two scopes - WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before - one pod per NUMA node. The latter defines one global pod across the whole system. * struct wq_pod_type is added which describes how pods are configured for each affnity scope. For each pod, it lists the member CPUs and the preferred NUMA node for memory allocations. The reverse mapping from CPU to pod is also available. * wq_pod_enabled is dropped. Pod is now always enabled. The previously disabled behavior is now implemented through WQ_AFFN_SYSTEM. * get_unbound_pool() wants to determine the NUMA node to allocate memory from for the new pool. The variables are renamed from node to pod but the logic still assumes they're one and the same. Clearly distinguish them - walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's NUMA node. * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA node. Take @cpu instead and determine the cpumask to use from the pod_type matching @attrs. * apply_wqattrs_prepare() is update to return ERR_PTR() on error instead of NULL so that it can indicate -EINVAL on invalid affinity scopes. This patch allows CPUs to be grouped into pods however desired per type. While this patch causes some internal behavior changes, nothing material should change for workqueue users. v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is WQ_AFFN_NR_TYPES which indicates that the function is called with a worker_pool's attrs instead of a workqueue's. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Factor out clearing of workqueue-only attrs fieldsTejun Heo2023-08-071-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | workqueue_attrs can be used for both workqueues and worker_pools. However, some fields, currently only ->ordered, only apply to workqueues and should be cleared to the default / invalid values. Currently, an unbound workqueue explicitly clears attrs->ordered in get_unbound_pool() after copying the source workqueue attrs, while per-cpu workqueues rely on the fact that zeroing on allocation gives us the desired default value for pool->attrs->ordered. This is fragile. Let's add wqattrs_clear_for_pool() which clears attrs->ordered and is called from both init_worker_pool() and get_unbound_pool(). This will ease adding more workqueue-only attrs fields. In get_unbound_pool(), pool->node initialization is moved upwards for readability. This shouldn't cause any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: Factor out actual cpumask calculation to reduce subtlety in ↵Tejun Heo2023-08-071-20/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wq_update_pod() For an unbound pool, multiple cpumasks are involved. U: The user-specified cpumask (may be filtered with cpu_possible_mask). A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering leaves no CPU, wq_unbound_cpumask is used. P: Per-pod subsets of #A. wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and wq->cpu_pwq[CPU]->pool->attrs->cpumask #P. wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To calculate the new #P for each workqueue, it needs to call wq_calc_pod_cpumask() with @attrs that contains #A. Currently, wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with wq->dfl_pwq->pool->attrs. This is rather fragile because we're calling wq_calc_pod_cpumask() with @attrs of a worker_pool rather than the workqueue's actual attrs when what we want to calculate is the workqueue's cpumask on the pod. While this works fine currently, future changes will add fields which are used differently between workqueues and worker_pools and this subtlety will bite us. This patch factors out #U -> #A calculation from apply_wqattrs_prepare() into wqattrs_actualize_cpumask and updates wq_update_pod() to copy wq->unbound_attrs and use the new helper to obtain #A freshly instead of abusing wq->dfl_pwq->pool_attrs. This shouldn't cause any behavior changes in the current code. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com