summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'kbuild-v5.17' of ↵Linus Torvalds2022-01-191-1/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add new kconfig target 'make mod2noconfig', which will be useful to speed up the build and test iteration. - Raise the minimum supported version of LLVM to 11.0.0 - Refactor certs/Makefile - Change the format of include/config/auto.conf to stop double-quoting string type CONFIG options. - Fix ARCH=sh builds in dash - Separate compression macros for general purposes (cmd_bzip2 etc.) and the ones for decompressors (cmd_bzip2_with_size etc.) - Misc Makefile cleanups * tag 'kbuild-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits) kbuild: add cmd_file_size arch: decompressor: remove useless vmlinux.bin.all-y kbuild: rename cmd_{bzip2,lzma,lzo,lz4,xzkern,zstd22} kbuild: drop $(size_append) from cmd_zstd sh: rename suffix-y to suffix_y doc: kbuild: fix default in `imply` table microblaze: use built-in function to get CPU_{MAJOR,MINOR,REV} certs: move scripts/extract-cert to certs/ kbuild: do not quote string values in include/config/auto.conf kbuild: do not include include/config/auto.conf from shell scripts certs: simplify $(srctree)/ handling and remove config_filename macro kbuild: stop using config_filename in scripts/Makefile.modsign certs: remove misleading comments about GCC PR certs: refactor file cleaning certs: remove unneeded -I$(srctree) option for system_certificates.o certs: unify duplicated cmd_extract_certs and improve the log certs: use $< and $@ to simplify the key generation rule kbuild: remove headers_check stub kbuild: move headers_check.pl to usr/include/ certs: use if_changed to re-generate the key when the key type is changed ...
| * gcov: Remove compiler version checkNathan Chancellor2021-12-021-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | The minimum supported version of LLVM has been raised to 11.0.0, meaning this check is always true, so it can be dropped. Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
* | Merge branch 'modules-next' of ↵Linus Torvalds2022-01-174-16/+320
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull module updates from Luis Chamberlain: "The biggest change here is in-kernel support for module decompression. This change is being made to help support LSMs like LoadPin as otherwise it loses link between the source of kernel module on the disk and binary blob that is being loaded into the kernel. kmod decompression is still done by userspace even with this is done, both because there are no measurable gains in not doing so and as it adds a secondary extra check for validating the module before loading it into the kernel. The rest of the changes are minor, the only other change worth mentionin there is Jessica Yu is now bowing out of maintenance of modules as she's taking a break from work. While there were other changes posted for modules, those have not yet received much review of testing so I'm not yet comfortable in merging any of those changes yet." * 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module: fix signature check failures when using in-kernel decompression kernel: Fix spelling mistake "compresser" -> "compressor" MAINTAINERS: add mailing lists for kmod and modules module.h: allow #define strings to work with MODULE_IMPORT_NS module: add in-kernel support for decompressing MAINTAINERS: Remove myself as modules maintainer module: Remove outdated comment
| * | module: fix signature check failures when using in-kernel decompressionDmitry Torokhov2022-01-141-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new flag MODULE_INIT_COMPRESSED_FILE unintentionally trips check in module_sig_check(). The check was supposed to catch case when version info or magic was removed from a signed module, making signature invalid, but it was coded too broadly and was catching this new flag as well. Change the check to only test the 2 particular flags affecting signature validity. Fixes: b1ae6dc41eaa ("module: add in-kernel support for decompressing") Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
| * | kernel: Fix spelling mistake "compresser" -> "compressor"Colin Ian King2022-01-131-1/+1
| | | | | | | | | | | | | | | | | | | | | There is a spelling mistake in a pr_err error message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
| * | module: add in-kernel support for decompressingDmitry Torokhov2022-01-114-11/+315
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current scheme of having userspace decompress kernel modules before loading them into the kernel runs afoul of LoadPin security policy, as it loses link between the source of kernel module on the disk and binary blob that is being loaded into the kernel. To solve this issue let's implement decompression in kernel, so that we can pass a file descriptor of compressed module file into finit_module() which will keep LoadPin happy. To let userspace know what compression/decompression scheme kernel supports it will create /sys/module/compression attribute. kmod can read this attribute and decide if it can pass compressed file to finit_module(). New MODULE_INIT_COMPRESSED_DATA flag indicates that the kernel should attempt to decompress the data read from file descriptor prior to trying load the module. To simplify things kernel will only implement single decompression method matching compression method selected when generating modules. This patch implements gzip and xz; more can be added later, Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
| * | module: Remove outdated commentYu Chen2022-01-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | Since commit e513cc1c07e2 ("module: Remove stop_machine from module unloading") this comment is no longer correct. Remove it. Signed-off-by: Yu Chen <chen.yu@easystack.cn> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* | | Merge branch 'signal-for-v5.17' of ↵Linus Torvalds2022-01-1711-202/+172
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull signal/exit/ptrace updates from Eric Biederman: "This set of changes deletes some dead code, makes a lot of cleanups which hopefully make the code easier to follow, and fixes bugs found along the way. The end-game which I have not yet reached yet is for fatal signals that generate coredumps to be short-circuit deliverable from complete_signal, for force_siginfo_to_task not to require changing userspace configured signal delivery state, and for the ptrace stops to always happen in locations where we can guarantee on all architectures that the all of the registers are saved and available on the stack. Removal of profile_task_ext, profile_munmap, and profile_handoff_task are the big successes for dead code removal this round. A bunch of small bug fixes are included, as most of the issues reported were small enough that they would not affect bisection so I simply added the fixes and did not fold the fixes into the changes they were fixing. There was a bug that broke coredumps piped to systemd-coredump. I dropped the change that caused that bug and replaced it entirely with something much more restrained. Unfortunately that required some rebasing. Some successes after this set of changes: There are few enough calls to do_exit to audit in a reasonable amount of time. The lifetime of struct kthread now matches the lifetime of struct task, and the pointer to struct kthread is no longer stored in set_child_tid. The flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is removed. Issues where task->exit_code was examined with signal->group_exit_code should been examined were fixed. There are several loosely related changes included because I am cleaning up and if I don't include them they will probably get lost. The original postings of these changes can be found at: https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.org https://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.org https://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org I trimmed back the last set of changes to only the obviously correct once. Simply because there was less time for review than I had hoped" * 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits) ptrace/m68k: Stop open coding ptrace_report_syscall ptrace: Remove unused regs argument from ptrace_report_syscall ptrace: Remove second setting of PT_SEIZED in ptrace_attach taskstats: Cleanup the use of task->exit_code exit: Use the correct exit_code in /proc/<pid>/stat exit: Fix the exit_code for wait_task_zombie exit: Coredumps reach do_group_exit exit: Remove profile_handoff_task exit: Remove profile_task_exit & profile_munmap signal: clean up kernel-doc comments signal: Remove the helper signal_group_exit signal: Rename group_exit_task group_exec_task coredump: Stop setting signal->group_exit_task signal: Remove SIGNAL_GROUP_COREDUMP signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process signal: Make coredump handling explicit in complete_signal signal: Have prepare_signal detect coredumps using signal->core_state signal: Have the oom killer detect coredumps using signal->core_state exit: Move force_uaccess back into do_exit exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit ...
| * | | ptrace: Remove second setting of PT_SEIZED in ptrace_attachEric W. Biederman2022-01-081-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The code is totally redundant remove it. Link: https://lkml.kernel.org/r/20220103213312.9144-6-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | taskstats: Cleanup the use of task->exit_codeEric W. Biederman2022-01-081-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the function bacct_add_task the code reading task->exit_code was introduced in commit f3cef7a99469 ("[PATCH] csa: basic accounting over taskstats"), and it is not entirely clear what the taskstats interface is trying to return as only returning the exit_code of the first task in a process doesn't make a lot of sense. As best as I can figure the intent is to return task->exit_code after a task exits. The field is returned with per task fields, so the exit_code of the entire process is not wanted. Only the value of the first task is returned so this is not a useful way to get the per task ptrace stop code. The ordinary case of returning this value is returning after a task exits, which also precludes use for getting a ptrace value. It is common to for the first task of a process to also be the last task of a process so this field may have done something reasonable by accident in testing. Make ac_exitcode a reliable per task value by always returning it for every exited task. Setting ac_exitcode in a sensible mannter makes it possible to continue to provide this value going forward. Cc: Balbir Singh <bsingharora@gmail.com> Fixes: f3cef7a99469 ("[PATCH] csa: basic accounting over taskstats") Link: https://lkml.kernel.org/r/20220103213312.9144-5-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Fix the exit_code for wait_task_zombieEric W. Biederman2022-01-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function wait_task_zombie is defined to always returns the process not thread exit status. Unfortunately when process group exit support was added to wait_task_zombie the WNOWAIT case was overlooked. Usually tsk->exit_code and tsk->signal->group_exit_code will be in sync so fixing this is bug probably has no effect in practice. But fix it anyway so that people aren't scratching their heads about why the two code paths are different. History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Fixes: 2c66151cbc2c ("[PATCH] sys_exit() threading improvements, BK-curr") Link: https://lkml.kernel.org/r/20220103213312.9144-3-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Coredumps reach do_group_exitEric W. Biederman2022-01-081-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comment about coredumps not reaching do_group_exit and the corresponding BUG_ON are bogus. What happens and has happened for years is that get_signal calls do_coredump (which sets SIGNAL_GROUP_EXIT and group_exit_code) and then do_group_exit passing the signal number. Then do_group_exit ignores the exit_code it is passed and uses signal->group_exit_code from the coredump. The comment and BUG_ON were correct when they were added during the 2.5 development cycle, but became obsolete and incorrect when get_signal was changed to fall through to do_group_exit after do_coredump in 2.6.10-rc2. So remove the stale comment and BUG_ON Fixes: 63bd6144f191 ("[PATCH] Invalid BUG_ONs in signal.c") History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Link: https://lkml.kernel.org/r/20220103213312.9144-2-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Remove profile_handoff_taskEric W. Biederman2022-01-082-26/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All profile_handoff_task does is notify the task_free_notifier chain. The helpers task_handoff_register and task_handoff_unregister are used to add and delete entries from that chain and are never called. So remove the dead code and make it much easier to read and reason about __put_task_struct. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/87fspyw6m0.fsf@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Remove profile_task_exit & profile_munmapEric W. Biederman2022-01-082-51/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I say remove I mean remove. All profile_task_exit and profile_munmap do is call a blocking notifier chain. The helpers profile_task_register and profile_task_unregister are not called anywhere in the tree. Which means this is all dead code. So remove the dead code and make it easier to read do_exit. Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/20220103213312.9144-1-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | signal: clean up kernel-doc commentsRandy Dunlap2022-01-081-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix kernel-doc warnings in kernel/signal.c: kernel/signal.c:1830: warning: Function parameter or member 'force_coredump' not described in 'force_sig_seccomp' kernel/signal.c:2873: warning: missing initial short description on line: * signal_delivered - Also add a closing parenthesis to the comments in signal_delivered(). Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Richard Weinberger <richard@nod.at> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Marco Elver <elver@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20211222031027.29694-1-rdunlap@infradead.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | | signal: Remove the helper signal_group_exitEric W. Biederman2022-01-082-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This helper is misleading. It tests for an ongoing exec as well as the process having received a fatal signal. Sometimes it is appropriate to treat an on-going exec differently than a process that is shutting down due to a fatal signal. In particular taking the fast path out of exit_signals instead of retargeting signals is not appropriate during exec, and not changing the the exit code in do_group_exit during exec. Removing the helper makes it more obvious what is going on as both cases must be coded for explicitly. While removing the helper fix the two cases where I have observed using signal_group_exit resulted in the wrong result. In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are retargetted during an exec. In do_group_exit use 0 as the exit code during an exec as de_thread does not set group_exit_code. As best as I can determine group_exit_code has been is set to 0 most of the time during de_thread. During a thread group stop group_exit_code is set to the stop signal and when the thread group receives SIGCONT group_exit_code is reset to 0. Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | signal: Rename group_exit_task group_exec_taskEric W. Biederman2022-01-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only remaining user of group_exit_task is exec. Rename the field so that it is clear which part of the code uses it. Update the comment above the definition of group_exec_task to document how it is currently used. Link: https://lkml.kernel.org/r/20211213225350.27481-7-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | signal: Remove SIGNAL_GROUP_COREDUMPEric W. Biederman2022-01-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the previous cleanups "signal->core_state" is set whenever SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested whenver the code wants to know if a coredump is in progress. The remaining tests of SIGNAL_GROUP_COREDUMP also test to see if SIGNAL_GROUP_EXIT is set. Similarly the only place that sets SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT. Which makes SIGNAL_GROUP_COREDUMP unecessary and redundant. So stop setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP, and remove it's definition. With the setting of SIGNAL_GROUP_COREDUMP gone, coredump_finish no longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags by setting SIGNAL_GROUP_EXIT. Link: https://lkml.kernel.org/r/20211213225350.27481-5-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | signal: Make coredump handling explicit in complete_signalEric W. Biederman2022-01-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ever since commit 6cd8f0acae34 ("coredump: ensure that SIGKILL always kills the dumping thread") it has been possible for a SIGKILL received during a coredump to set SIGNAL_GROUP_EXIT and trigger a process shutdown (for a second time). Update the logic to explicitly allow coredumps so that coredumps can set SIGNAL_GROUP_EXIT and shutdown like an ordinary process. Link: https://lkml.kernel.org/r/87zgo6ytyf.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | signal: Have prepare_signal detect coredumps using signal->core_stateEric W. Biederman2022-01-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change prepare_signal to test signal->core_state instead of the flag SIGNAL_GROUP_COREDUMP. Both fields are protected by siglock and both live in signal_struct so there are no real tradeoffs here, just a change to which field is being tested. Link: https://lkml.kernel.org/r/20211213225350.27481-1-ebiederm@xmission.com Link: https://lkml.kernel.org/r/875yqu14co.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Move force_uaccess back into do_exitEric W. Biederman2022-01-081-9/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With kernel threads on architectures that still have set_fs/get_fs running as KERNEL_DS moving force_uaccess_begin does not appear safe. Calling force_uaccess_begin is a noop on anything people care about. Update the comment to explain why this code while looking like an obvious candidate for moving to make_task_dead probably needs to remain in do_exit until set_fs/get_fs are entirely removed from the kernel. Fixes: 05ea0424f0e2 ("exit: Move oops specific logic from do_exit into make_task_dead") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/YdUxGKRcSiDy8jGg@zeniv-ca.linux.org.uk Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Guarantee make_task_dead leaks the tsk when calling do_task_exitEric W. Biederman2022-01-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the task state to EXIT_DEAD and take an extra rcu_refernce to guarantee the task will not be reaped and that it will not be freed. Link: https://lkml.kernel.org/r/YdUzjrLAlRiNLQp2@zeniv-ca.linux.org.uk Pointed-out-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: 7f80a2fd7db9 ("exit: Stop poorly open coding do_task_dead in make_task_dead") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | kthread: Generalize pf_io_worker so it can point to struct kthreadEric W. Biederman2022-01-083-17/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The point of using set_child_tid to hold the kthread pointer was that it already did what is necessary. There are now restrictions on when set_child_tid can be initialized and when set_child_tid can be used in schedule_tail. Which indicates that continuing to use set_child_tid to hold the kthread pointer is a bad idea. Instead of continuing to use the set_child_tid field of task_struct generalize the pf_io_worker field of task_struct and use it to hold the kthread pointer. Rename pf_io_worker (which is a void * pointer) to worker_private so it can be used to store kthreads struct kthread pointer. Update the kthread code to store the kthread pointer in the worker_private field. Remove the places where set_child_tid had to be dealt with carefully because kthreads also used it. Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | kthread: Never put_user the set_child_tid addressEric W. Biederman2021-12-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel threads abuse set_child_tid. Historically that has been fine as set_child_tid was initialized after the kernel thread had been forked. Unfortunately storing struct kthread in set_child_tid after the thread is running makes struct kthread being unusable for storing result codes of the thread. When set_child_tid is set to struct kthread during fork that results in schedule_tail writing the thread id to the beggining of struct kthread (if put_user does not realize it is a kernel address). Solve this by skipping the put_user for all kthreads. Reported-by: Nathan Chancellor <nathan@kernel.org> Link: https://lkml.kernel.org/r/YcNsG0Lp94V13whH@archlinux-ax161 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | kthread: Warn about failed allocations for the init kthreadEric W. Biederman2021-12-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Failed allocates are not expected when setting up the initial task and it is not really possible to handle them either. So I added a warning to report if such an allocation failure ever happens. Correct the sense of the warning so it warns when an allocation failure happens not when the allocation succeeded. Oops. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Link: https://lkml.kernel.org/r/20211221231611.785b74cf@canb.auug.org.au Link: https://lkml.kernel.org/r/CA+G9fYvLaR5CF777CKeWTO+qJFTN6vAvm95gtzN+7fw3Wi5hkA@mail.gmail.com Link: https://lkml.kernel.org/r/20211216102956.GC10708@xsang-OptiPlex-9020 Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacctEric W. Biederman2021-12-201-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I just fixed a bug in copy_process when using the label bad_fork_cleanup_threadgroup_lock. While fixing the bug I looked closer at the label and realized it has been misnamed since 568ac888215c ("cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork"). Fix the name so that fork is easier to understand. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMAEric W. Biederman2021-12-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark Brown <broonie@kernel.org> reported: > This is also causing further build errors including but not limited to: > > /tmp/next/build/kernel/fork.c: In function 'copy_process': > /tmp/next/build/kernel/fork.c:2106:4: error: label 'bad_fork_cleanup_threadgroup_lock' used but not defined > 2106 | goto bad_fork_cleanup_threadgroup_lock; > | ^~~~ It turns out that I messed up and was depending upon a label protected by an ifdef. Move the label out of the ifdef as the ifdef around the label no longer makes sense (if it ever did). Link: https://lkml.kernel.org/r/YbugCP144uxXvRsk@sirena.org.uk Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit/kthread: Fix the kerneldoc comment for kthread_complete_and_exitEric W. Biederman2021-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I misspelled kthread_complete_and_exit in the kernel doc comment fix it. Link: https://lkml.kernel.org/r/202112141329.KBkyJ5ql-lkp@intel.com Link: https://lkml.kernel.org/r/202112141422.Cykr6YUS-lkp@intel.com Reported-by: kernel test robot <lkp@intel.com> Fixes: cead18552660 ("exit: Rename complete_and_exit to kthread_complete_and_exit") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit/kthread: Move the exit code for kernel threads into struct kthreadEric W. Biederman2021-12-131-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The exit code of kernel threads has different semantics than the exit_code of userspace tasks. To avoid confusion and allow the userspace implementation to change as needed move the kernel thread exit code into struct kthread. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | kthread: Ensure struct kthread is present for all kthreadsEric W. Biederman2021-12-133-25/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today the rules are a bit iffy and arbitrary about which kernel threads have struct kthread present. Both idle threads and thread started with create_kthread want struct kthread present so that is effectively all kernel threads. Make the rule that if PF_KTHREAD and the task is running then struct kthread is present. This will allow the kernel thread code to using tsk->exit_code with different semantics from ordinary processes. To make ensure that struct kthread is present for all kernel threads move it's allocation into copy_process. Add a deallocation of struct kthread in exec for processes that were kernel threads. Move the allocation of struct kthread for the initial thread earlier so that it is not repeated for each additional idle thread. Move the initialization of struct kthread into set_kthread_struct so that the structure is always and reliably initailized. Clear set_child_tid in free_kthread_struct to ensure the kthread struct is reliably freed during exec. The function free_kthread_struct does not need to clear vfork_done during exec as exec_mm_release called from exec_mmap has already cleared vfork_done. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Rename complete_and_exit to kthread_complete_and_exitEric W. Biederman2021-12-132-9/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update complete_and_exit to call kthread_exit instead of do_exit. Change the name to reflect this change in functionality. All of the users of complete_and_exit are causing the current kthread to exit so this change makes it clear what is happening. Move the implementation of kthread_complete_and_exit from kernel/exit.c to to kernel/kthread.c. As this function is kthread specific it makes most sense to live with the kthread functions. There are no functional change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Rename module_put_and_exit to module_put_and_kthread_exitEric W. Biederman2021-12-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update module_put_and_exit to call kthread_exit instead of do_exit. Change the name to reflect this change in functionality. All of the users of module_put_and_exit are causing the current kthread to exit so this change makes it clear what is happening. There is no functional change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Implement kthread_exitEric W. Biederman2021-12-131-4/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The way the per task_struct exit_code is used by kernel threads is not quite compatible how it is used by userspace applications. The low byte of the userspace exit_code value encodes the exit signal. While kthreads just use the value as an int holding ordinary kernel function exit status like -EPERM. Add kthread_exit to clearly separate the two kinds of uses. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Stop exporting do_exitEric W. Biederman2021-12-131-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that there are no more modular uses of do_exit remove the EXPORT_SYMBOL. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Stop poorly open coding do_task_dead in make_task_deadEric W. Biederman2021-12-131-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the kernel detects it is oops or otherwise force killing a task while it exits the code poorly attempts to permanently stop the task from scheduling. I say poorly because it is possible for a task in TASK_UINTERRUPTIBLE to be woken up. As it makes no sense for the task to continue call do_task_dead instead which actually does the work and permanently removes the task from the scheduler. Guaranteeing the task will never be woken up again. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Move oops specific logic from do_exit into make_task_deadEric W. Biederman2021-12-133-41/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The beginning of do_exit has become cluttered and difficult to read as it is filled with checks to handle things that can only happen when the kernel is operating improperly. Now that we have a dedicated function for cleaning up a task when the kernel is operating improperly move the checks there. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | exit: Add and use make_task_dead.Eric W. Biederman2021-12-131-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two big uses of do_exit. The first is it's design use to be the guts of the exit(2) system call. The second use is to terminate a task after something catastrophic has happened like a NULL pointer in kernel code. Add a function make_task_dead that is initialy exactly the same as do_exit to cover the cases where do_exit is called to handle catastrophic failure. In time this can probably be reduced to just a light wrapper around do_task_dead. For now keep it exactly the same so that there will be no behavioral differences introducing this new concept. Replace all of the uses of do_exit that use it for catastraphic task cleanup with make_task_dead to make it clear what the code is doing. As part of this rename rewind_stack_do_exit rewind_stack_and_make_dead. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | Merge SA_IMMUTABLE-fixes-for-v5.16-rc2Eric W. Biederman2021-12-032-9/+31
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I completed the first batch of signal changes for v5.17 against v5.16-rc1 before the SA_IMMUTABLE fixes where completed. Which leaves me with two lines of development that I want on my signal development branch both rooted at v5.16-rc1. Especially as I am hoping to reach the point of being able to remove SA_IMMUTABLE. Linus merged my SA_IMUTABLE fixes as: 7af959b5d5c8 ("Merge branch 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace") To avoid rebasing the development changes that are currently complete I am merging the work I sent upstream to Linus to make my life simpler. The SA_IMMUTABLE changes as they are described in Linus's merge commit. Pull exit-vs-signal handling fixes from Eric Biederman: "This is a small set of changes where debuggers were no longer able to intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit cleanups. This is essentially the change you suggested with all of i's dotted and the t's crossed so that ptrace can intercept all of the cases it has been able to intercept the past, and all of the cases that made it to exit without giving ptrace a chance still don't give ptrace a chance" * 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal: Replace force_fatal_sig with force_exit_sig when in doubt signal: Don't always set SA_IMMUTABLE for forced signals Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | signal: Requeue ptrace signalsEric W. Biederman2021-11-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kyle Huey <me@kylehuey.com> writes: > rr, a userspace record and replay debugger[0], uses the recorded register > state at PTRACE_EVENT_EXIT to find the point in time at which to cease > executing the program during replay. > > If a SIGKILL races with processing another signal in get_signal, it is > possible for the kernel to decline to notify the tracer of the original > signal. But if the original signal had a handler, the kernel proceeds > with setting up a signal handler frame as if the tracer had chosen to > deliver the signal unmodified to the tracee. When the kernel goes to > execute the signal handler that it has now modified the stack and registers > for, it will discover the pending SIGKILL, and terminate the tracee > without executing the handler. When PTRACE_EVENT_EXIT is delivered to > the tracer, however, the effects of handler setup will be visible to > the tracer. > > Because rr (the tracer) was never notified of the signal, it is not aware > that a signal handler frame was set up and expects the state of the program > at PTRACE_EVENT_EXIT to be a state that will be reconstructed naturally > by allowing the program to execute from the last event. When that fails > to happen during replay, rr will assert and die. > > The following patches add an explicit check for a newly pending SIGKILL > after the ptracer has been notified and the siglock has been reacquired. > If this happens, we stop processing the current signal and proceed > immediately to handling the SIGKILL. This makes the state reported at > PTRACE_EVENT_EXIT the unmodified state of the program, and also avoids the > work to set up a signal handler frame that will never be used. > > [0] https://rr-project.org/ The problem is that while the traced process makes it into ptrace_stop, the tracee is killed before the tracer manages to wait for the tracee and discover which signal was about to be delivered. More generally the problem is that while siglock was dropped a signal with process wide effect is short cirucit delivered to the entire process killing it, but the process continues to try and deliver another signal. In general it impossible to avoid all cases where work is performed after the process has been killed. In particular if the process is killed after get_signal returns the code will simply not know it has been killed until after delivering the signal frame to userspace. On the other hand when the code has already discovered the process has been killed and taken user space visible action that shows the kernel knows the process has been killed, it is just silly to then write the signal frame to the user space stack. Instead of being silly detect the process has been killed in ptrace_signal and requeue the signal so the code can pretend it was simply never dequeued for delivery. To test the process has been killed I use fatal_signal_pending rather than signal_group_exit to match the test in signal_pending_state which is used in schedule which is where ptrace_stop detects the process has been killed. Requeuing the signal so the code can pretend it was simply never dequeued improves the user space visible behavior that has been present since ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4"). Kyle Huey verified that this change in behavior and makes rr happy. Reported-by: Kyle Huey <khuey@kylehuey.com> Reported-by: Marko Mäkelä <marko.makela@mariadb.com> History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87tugcd5p2.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | signal: Requeue signals in the appropriate queueEric W. Biederman2021-11-171-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the event that a tracer changes which signal needs to be delivered and that signal is currently blocked then the signal needs to be requeued for later delivery. With the advent of CLONE_THREAD the kernel has 2 signal queues per task. The per process queue and the per task queue. Update the code so that if the signal is removed from the per process queue it is requeued on the per process queue. This is necessary to make it appear the signal was never dequeued. The rr debugger reasonably believes that the state of the process from the last ptrace_stop it observed until PTRACE_EVENT_EXIT can be recreated by simply letting a process run. If a SIGKILL interrupts a ptrace_stop this is not true today. So return signals to their original queue in ptrace_signal so that signals that are not delivered appear like they were never dequeued. Fixes: 794aa320b79d ("[PATCH] sigfix-2.5.40-D6") History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87zgq4d5r4.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | signal: In get_signal test for signal_group_exit every time through the loopEric W. Biederman2021-11-171-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently while investigating a problem with rr and signals I noticed that siglock is dropped in ptrace_signal and get_signal does not jump to relock. Looking farther to see if the problem is anywhere else I see that do_signal_stop also returns if signal_group_exit is true. I believe that test can now never be true, but it is a bit hard to trace through and be certain. Testing signal_group_exit is not expensive, so move the test for signal_group_exit into the for loop inside of get_signal to ensure the test is never skipped improperly. This has been a potential problem since I added the test for signal_group_exit was added. Fixes: 35634ffa1751 ("signal: Always notice exiting tasks") Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/875yssekcd.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | | | | Merge tag 'hyperv-next-signed-20220114' of ↵Linus Torvalds2022-01-161-2/+48
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - More patches for Hyper-V isolation VM support (Tianyu Lan) - Bug fixes and clean-up patches from various people * tag 'hyperv-next-signed-20220114' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: scsi: storvsc: Fix storvsc_queuecommand() memory leak x86/hyperv: Properly deal with empty cpumasks in hyperv_flush_tlb_multi() Drivers: hv: vmbus: Initialize request offers message for Isolation VM scsi: storvsc: Fix unsigned comparison to zero swiotlb: Add CONFIG_HAS_IOMEM check around swiotlb_mem_remap() x86/hyperv: Fix definition of hv_ghcb_pg variable Drivers: hv: Fix definition of hypercall input & output arg variables net: netvsc: Add Isolation VM support for netvsc driver scsi: storvsc: Add Isolation VM support for storvsc driver hyper-v: Enable swiotlb bounce buffer for Isolation VM x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has() swiotlb: Add swiotlb bounce buffer remap function for HV IVM
| * | | | | swiotlb: Add CONFIG_HAS_IOMEM check around swiotlb_mem_remap()Wei Liu2022-01-041-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | HAS_IOMEM option may not be selected on some platforms (e.g, s390) and this will cause compilation failure due to missing memremap() implementation. Fix it by stubbing out swiotlb_mem_remap when CONFIG_HAS_IOMEM is not set. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Wei Liu <wei.liu@kernel.org>
| * | | | | swiotlb: Add swiotlb bounce buffer remap function for HV IVMTianyu Lan2021-12-201-2/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In Isolation VM with AMD SEV, bounce buffer needs to be accessed via extra address space which is above shared_gpa_boundary (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access physical address will be original physical address + shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top of memory(vTOM). Memory addresses below vTOM are automatically treated as private while memory above vTOM is treated as shared. Expose swiotlb_unencrypted_base for platforms to set unencrypted memory base offset and platform calls swiotlb_update_mem_attributes() to remap swiotlb mem to unencrypted address space. memremap() can not be called in the early stage and so put remapping code into swiotlb_update_mem_attributes(). Store remap address and use it to copy data from/to swiotlb bounce buffer. Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20211213071407.314309-2-ltykernel@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
* | | | | | Merge tag 'trace-v5.17' of ↵Linus Torvalds2022-01-1617-247/+800
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "New: - The Real Time Linux Analysis (RTLA) tool is added to the tools directory. - Can safely filter on user space pointers with: field.ustring ~ "match-string" - eprobes can now be filtered like any other event. - trace_marker(_raw) now uses stream_open() to allow multiple threads to safely write to it. Note, this could possibly break existing user space, but we will not know until we hear about it, and then can revert the change if need be. - New field in events to display when bottom halfs are disabled. - Sorting of the ftrace functions are now done at compile time instead of at bootup. Infrastructure changes to support future efforts: - Added __rel_loc type for trace events. Similar to __data_loc but the offset to the dynamic data is based off of the location of the descriptor and not the beginning of the event. Needed for user defined events. - Some simplification of event trigger code. - Make synthetic events process its callback better to not hinder other event callbacks that are registered. Needed for user defined events. And other small fixes and cleanups" * tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits) tracing: Add ustring operation to filtering string pointers rtla: Add rtla timerlat hist documentation rtla: Add rtla timerlat top documentation rtla: Add rtla timerlat documentation rtla: Add rtla osnoise hist documentation rtla: Add rtla osnoise top documentation rtla: Add rtla osnoise man page rtla: Add Documentation rtla/timerlat: Add timerlat hist mode rtla: Add timerlat tool and timelart top mode rtla/osnoise: Add the hist mode rtla/osnoise: Add osnoise top mode rtla: Add osnoise tool rtla: Helper functions for rtla rtla: Real-Time Linux Analysis tool tracing/osnoise: Properly unhook events if start_per_cpu_kthreads() fails tracing: Remove duplicate warnings when calling trace_create_file() tracing/kprobes: 'nmissed' not showed correctly for kretprobe tracing: Add test for user space strings when filtering on string pointers tracing: Have syscall trace events use trace_event_buffer_lock_reserve() ...
| * | | | | | tracing: Add ustring operation to filtering string pointersSteven Rostedt2022-01-141-24/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since referencing user space pointers is special, if the user wants to filter on a field that is a pointer to user space, then they need to specify it. Add a ".ustring" attribute to the field name for filters to state that the field is pointing to user space such that the kernel can take the appropriate action to read that pointer. Link: https://lore.kernel.org/all/yt9d8rvmt2jq.fsf@linux.ibm.com/ Fixes: 77360f9bbc7e ("tracing: Add test for user space strings when filtering on string pointers") Tested-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | | | tracing/osnoise: Properly unhook events if start_per_cpu_kthreads() failsNikita Yushchenko2022-01-131-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If start_per_cpu_kthreads() called from osnoise_workload_start() returns error, event hooks are left in broken state: unhook_irq_events() called but unhook_thread_events() and unhook_softirq_events() not called, and trace_osnoise_callback_enabled flag not cleared. On the next tracer enable, hooks get not installed due to trace_osnoise_callback_enabled flag. And on the further tracer disable an attempt to remove non-installed hooks happened, hitting a WARN_ON_ONCE() in tracepoint_remove_func(). Fix the error path by adding the missing part of cleanup. While at this, introduce osnoise_unhook_events() to avoid code duplication between this error path and normal tracer disable. Link: https://lkml.kernel.org/r/20220109153459.3701773-1-nikita.yushchenko@virtuozzo.com Cc: stable@vger.kernel.org Fixes: bce29ac9ce0b ("trace: Add osnoise tracer") Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | | | tracing: Remove duplicate warnings when calling trace_create_file()Yuntao Wang2022-01-131-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the same warning message is already printed in the trace_create_file() function, there is no need to print it again. Link: https://lkml.kernel.org/r/20220109162232.361747-1-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | | | tracing/kprobes: 'nmissed' not showed correctly for kretprobeXiangyang Zhang2022-01-131-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'nmissed' column of the 'kprobe_profile' file for kretprobe is not showed correctly, kretprobe can be skipped by two reasons, shortage of kretprobe_instance which is counted by tk->rp.nmissed, and kprobe itself is missed by some reason, so to show the sum. Link: https://lkml.kernel.org/r/20220107150242.5019-1-xyz.sun.ok@gmail.com Cc: stable@vger.kernel.org Fixes: 4a846b443b4e ("tracing/kprobes: Cleanup kprobe tracer code") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Xiangyang Zhang <xyz.sun.ok@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | | | tracing: Add test for user space strings when filtering on string pointersSteven Rostedt2022-01-131-3/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pingfan reported that the following causes a fault: echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter echo 1 > events/syscalls/sys_enter_at/enable The reason is that trace event filter treats the user space pointer defined by "filename" as a normal pointer to compare against the "cpu" string. The following bug happened: kvm-03-guest16 login: [72198.026181] BUG: unable to handle page fault for address: 00007fffaae8ef60 #PF: supervisor read access in kernel mode #PF: error_code(0x0001) - permissions violation PGD 80000001008b7067 P4D 80000001008b7067 PUD 2393f1067 PMD 2393ec067 PTE 8000000108f47867 Oops: 0001 [#1] PREEMPT SMP PTI CPU: 1 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-32.el9.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:strlen+0x0/0x20 Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31 RSP: 0018:ffffb5b900013e48 EFLAGS: 00010246 RAX: 0000000000000018 RBX: ffff8fc1c49ede00 RCX: 0000000000000000 RDX: 0000000000000020 RSI: ffff8fc1c02d601c RDI: 00007fffaae8ef60 RBP: 00007fffaae8ef60 R08: 0005034f4ddb8ea4 R09: 0000000000000000 R10: ffff8fc1c02d601c R11: 0000000000000000 R12: ffff8fc1c8a6e380 R13: 0000000000000000 R14: ffff8fc1c02d6010 R15: ffff8fc1c00453c0 FS: 00007fa86123db40(0000) GS:ffff8fc2ffd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffaae8ef60 CR3: 0000000102880001 CR4: 00000000007706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: filter_pred_pchar+0x18/0x40 filter_match_preds+0x31/0x70 ftrace_syscall_enter+0x27a/0x2c0 syscall_trace_enter.constprop.0+0x1aa/0x1d0 do_syscall_64+0x16/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fa861d88664 The above happened because the kernel tried to access user space directly and triggered a "supervisor read access in kernel mode" fault. Worse yet, the memory could not even be loaded yet, and a SEGFAULT could happen as well. This could be true for kernel space accessing as well. To be even more robust, test both kernel and user space strings. If the string fails to read, then simply have the filter fail. Note, TASK_SIZE is used to determine if the pointer is user or kernel space and the appropriate strncpy_from_kernel/user_nofault() function is used to copy the memory. For some architectures, the compare to TASK_SIZE may always pick user space or kernel space. If it gets it wrong, the only thing is that the filter will fail to match. In the future, this needs to be fixed to have the event denote which should be used. But failing a filter is much better than panicing the machine, and that can be solved later. Link: https://lore.kernel.org/all/20220107044951.22080-1-kernelfans@gmail.com/ Link: https://lkml.kernel.org/r/20220110115532.536088fd@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Reported-by: Pingfan Liu <kernelfans@gmail.com> Tested-by: Pingfan Liu <kernelfans@gmail.com> Fixes: 87a342f5db69d ("tracing/filters: Support filtering for char * strings") Signed-off-by: Steven Rostedt <rostedt@goodmis.org>