path: root/include/linux/perf_event.h
Commit message [Author, Age, Files, Lines]
* Merge branch 'tip/perf/urgent' [Peter Zijlstra, 2023-11-15, 1 file, -5/+9]

      Avoid conflicts, base on fixes.

      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
| * perf/core: Fix cpuctx refcounting [Peter Zijlstra, 2023-11-15, 1 file, -5/+8]

      An audit of the refcounting turned up that perf_pmu_migrate_context()
      fails to migrate the ctx refcount.

      Fixes: bd2756811766 ("perf: Rewrite core context handling")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20230612093539.085862001@infradead.org
      Cc: <stable@vger.kernel.org>
| * Merge tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip [Linus Torvalds, 2023-10-30, 1 file, -1/+2]

      Pull performance event updates from Ingo Molnar:

       - Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
       - Simplify & clean up the uncore management code
       - Fall back from RDPMC to RDMSR on certain uncore PMUs
       - Improve per-package and cstate event reading
       - Extend the Intel ref-cycles event to GP counters
       - Fix Intel MTL event constraints
       - Improve the Intel hybrid CPU handling code
       - Micro-optimize the RAPL code
       - Optimize perf_cgroup_switch()
       - Improve large AUX area error handling
       - Misc fixes and cleanups

      * tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
        perf/x86/amd/uncore: Pass through error code for initialization failures, instead of -ENODEV
        perf/x86/amd/uncore: Fix uninitialized return value in amd_uncore_init()
        x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations
        perf: Optimize perf_cgroup_switch()
        perf/x86/amd/uncore: Add memory controller support
        perf/x86/amd/uncore: Add group exclusivity
        perf/x86/amd/uncore: Use rdmsr if rdpmc is unavailable
        perf/x86/amd/uncore: Move discovery and registration
        perf/x86/amd/uncore: Refactor uncore management
        perf/core: Allow reading package events from perf_event_read_local
        perf/x86/cstate: Allow reading the package statistics from local CPU
        perf/x86/intel/pt: Fix kernel-doc comments
        perf/x86/rapl: Annotate 'struct rapl_pmus' with __counted_by
        perf/core: Rename perf_proc_update_handler() -> perf_event_max_sample_rate_handler(), for readability
        perf/x86/rapl: Fix "Using plain integer as NULL pointer" Sparse warning
        perf/x86/rapl: Use local64_try_cmpxchg in rapl_event_update()
        perf/x86/rapl: Stop doing cpu_relax() in the local64_cmpxchg() loop in rapl_event_update()
        perf/core: Bail out early if the request AUX area is out of bound
        perf/x86/intel: Extend the ref-cycles event to GP counters
        perf/x86/intel: Fix broken fixed event constraints extension
        ...
| * | perf: Disallow mis-matched inherited group reads [Peter Zijlstra, 2023-10-19, 1 file, -0/+1]

      Because group consistency is non-atomic between parent (filedesc) and
      children (inherited) events, it is possible for PERF_FORMAT_GROUP
      read() to try and sum non-matching counter groups -- with nonsensical
      results.

      Add group_generation to distinguish the case where a parent group
      removes and adds an event and thus has the same number, but a
      different configuration of events as inherited groups.

      This became a problem when commit fa8c269353d5 ("perf/core: Invert
      perf_read_group() loops") flipped the order of child_list and
      sibling_list. Previously it would iterate the group (sibling_list)
      first, and for each sibling traverse the child_list. In this order,
      only the group composition of the parent is relevant. By flipping the
      order the group composition of the child (inherited) events becomes
      an issue and the mis-match in group composition becomes evident.

      That said, even prior to this commit, reading a group that is not
      equally inherited was not broken, but it still made no sense.

      (Ab)use ECHILD as error return to indicate issues with child process
      group composition.

      Fixes: fa8c269353d5 ("perf/core: Invert perf_read_group() loops")
      Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20231018115654.GK33217@noisy.programming.kicks-ass.net
* | | perf: Add branch_sample_call_stack [Kan Liang, 2023-10-27, 1 file, -0/+5]

      Add a helper function to check the call stack sample type. A later
      patch will invoke the function in several places.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20231025201626.3000228-3-kan.liang@linux.intel.com
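      For reference, a minimal sketch of such a helper (the exact body in
      include/linux/perf_event.h may differ slightly):

        static inline bool branch_sample_call_stack(const struct perf_event *event)
        {
                return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK;
        }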
* | | perf: Add branch stack counters [Kan Liang, 2023-10-27, 1 file, -1/+16]

      Currently, the additional information of a branch entry is stored in
      a u64 space. With more and more information added, the space is
      running out. For example, the occurrences of events will be added
      for each branch.

      Two places were suggested for appending the counters:
      https://lore.kernel.org/lkml/20230802215814.GH231007@hirez.programming.kicks-ass.net/

      One place is right after the flags of each branch entry. That
      changes the existing struct perf_branch_entry, and the later
      arch-specific implementation has to be really careful to
      consistently pick the right struct.

      The other place is right after the entire struct perf_branch_stack.
      The disadvantage is that the pointer to the extra space has to be
      recorded, and the common interface perf_sample_save_brstack() has to
      be updated. But the latter is much more straightforward, and should
      be easy to understand and maintain. It is what this patch
      implements; see the layout sketch after this entry.

      Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to
      indicate the event occurrences recorded in the branch info. The
      "u64 counters" may store the occurrences of several events. The
      number of events/counters and the width of each counter should be
      exposed via sysfs as a reference for the perf tool. Define the
      branch_counter_nr and branch_counter_width ABI here. The support
      will be implemented later in an Intel-specific patch.

      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20231025201626.3000228-1-kan.liang@linux.intel.com
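      A conceptual sketch of the chosen layout (illustrative only, not the
      exact UAPI definition):

        /*
         * struct perf_branch_stack {
         *         u64                       nr;
         *         u64                       hw_idx;
         *         struct perf_branch_entry  entries[nr];
         * };
         * u64 counters[nr];    appended right after the stack; each u64
         *                      packs the occurrence counts of the events
         *                      requested via PERF_SAMPLE_BRANCH_COUNTERS
         */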
* | perf: Optimize perf_cgroup_switch() [Peter Zijlstra, 2023-10-12, 1 file, -0/+1]

      Namhyung reported that bd2756811766 ("perf: Rewrite core context
      handling") regresses context switch overhead when perf-cgroup is in
      use together with 'slow' PMUs like uncore.

      Specifically, perf_cgroup_switch()'s perf_ctx_disable() /
      ctx_sched_out() etc. all iterate the full list of active PMUs for
      that CPU, even if they don't have cgroup events.

      Previously there was cgrp_cpuctx_list which linked the relevant PMUs
      together, but that got lost in the rework. Instead of re-introducing
      a similar list, let the perf_event_pmu_context iteration skip those
      that do not have cgroup events. This avoids growing multiple
      versions of the perf_event_pmu_context iteration.

      Measured performance (on a slightly different patch):

      Before)

        $ taskset -c 0 ./perf bench sched pipe -l 10000 -G AAA,BBB
        # Running 'sched/pipe' benchmark:
        # Executed 10000 pipe operations between two processes
             Total time: 0.901 [sec]
              90.128700 usecs/op
                  11095 ops/sec

      After)

        $ taskset -c 0 ./perf bench sched pipe -l 10000 -G AAA,BBB
        # Running 'sched/pipe' benchmark:
        # Executed 10000 pipe operations between two processes
             Total time: 0.065 [sec]
               6.560100 usecs/op
                 152436 ops/sec

      Fixes: bd2756811766 ("perf: Rewrite core context handling")
      Reported-by: Namhyung Kim <namhyung@kernel.org>
      Debugged-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20231009210425.GC6307@noisy.programming.kicks-ass.net
* | perf/core: Rename perf_proc_update_handler() -> perf_event_max_sample_rate_handler(), for readability [Xiu Jianfeng, 2023-10-05, 1 file, -1/+1]

      Follow the naming pattern of the other sysctl handlers in perf.

      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20230721090607.172002-1-xiujianfeng@huawei.com
* Merge tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux [Linus Torvalds, 2023-09-01, 1 file, -1/+2]

      Pull RISC-V updates from Palmer Dabbelt:

       - Support for the new "riscv,isa-extensions" and "riscv,isa-base"
         device tree interfaces for probing extensions
       - Support for userspace access to the performance counters
       - Support for more instructions in kprobes
       - Crash kernels can be allocated above 4GiB
       - Support for KCFI
       - Support for ELFs in !MMU configurations
       - ARCH_KMALLOC_MINALIGN has been reduced to 8
       - mmap() defaults to sv48-sized addresses, with longer addresses
         hidden behind a hint (similar to Arm and Intel)
       - Also various fixes and cleanups

      * tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits)
        lib/Kconfig.debug: Restrict DEBUG_INFO_SPLIT for RISC-V
        riscv: support PREEMPT_DYNAMIC with static keys
        riscv: Move create_tmp_mapping() to init sections
        riscv: Mark KASAN tmp* page tables variables as static
        riscv: mm: use bitmap_zero() API
        riscv: enable DEBUG_FORCE_FUNCTION_ALIGN_64B
        riscv: remove redundant mv instructions
        RISC-V: mm: Document mmap changes
        RISC-V: mm: Update pgtable comment documentation
        RISC-V: mm: Add tests for RISC-V mm
        RISC-V: mm: Restrict address space for sv39,sv48,sv57
        riscv: enable DMA_BOUNCE_UNALIGNED_KMALLOC for !dma_coherent
        riscv: allow kmalloc() caches aligned to the smallest value
        riscv: support the elf-fdpic binfmt loader
        binfmt_elf_fdpic: support 64-bit systems
        riscv: Allow CONFIG_CFI_CLANG to be selected
        riscv/purgatory: Disable CFI
        riscv: Add CFI error handling
        riscv: Add ftrace_stub_graph
        riscv: Add types to indirectly called assembly functions
        ...
| * perf: Fix wrong comment about default event_idx [Alexandre Ghiti, 2023-08-16, 1 file, -1/+2]

      Since commit c719f56092ad ("perf: Fix and clean up initialization of
      pmu::event_idx"), the event_idx default implementation has returned
      0, not idx + 1, so fix the comment, which could be misleading.

      Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
      Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
      Reviewed-by: Atish Patra <atishp@rivosinc.com>
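      For reference, the default implementation in kernel/events/core.c is
      essentially:

        static int perf_event_idx_default(struct perf_event *event)
        {
                return 0;
        }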
* | Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux [Linus Torvalds, 2023-08-28, 1 file, -3/+19]

      Pull arm64 updates from Will Deacon:
       "I think we have a bit less than usual on the architecture side, but
        that's somewhat balanced out by a large crop of perf/PMU driver
        updates and extensions to our selftests.

        CPU features and system registers:
         - Advertise hinted conditional branch support (FEAT_HBC) to
           userspace
         - Avoid false positive "SANITY CHECK" warning when xCR registers
           differ outside of the length field

        Documentation:
         - Fix macro name typo in SME documentation

        Entry code:
         - Unmask exceptions earlier on the system call entry path

        Memory management:
         - Don't bother clearing PTE_RDONLY for dirty ptes in
           pte_wrprotect() and pte_modify()

        Perf and PMU drivers:
         - Initial support for Coresight TRBE devices on ACPI systems (the
           coresight driver changes will come later)
         - Fix hw_breakpoint single-stepping when called from bpf
         - Fixes for DDR PMU on i.MX8MP SoC
         - Add NUMA-awareness to Hisilicon PCIe PMU driver
         - Fix locking dependency issue in Arm DMC620 PMU driver
         - Workaround Hisilicon erratum 162001900 in the SMMUv3 PMU driver
         - Add support for Arm CMN-700 r3 parts to the CMN PMU driver
         - Add support for recent Arm Cortex CPU PMUs
         - Update Hisilicon PMU maintainers

        Selftests:
         - Add a bunch of new features to the hwcap test (JSCVT, PMULL,
           AES, SHA1, etc)
         - Fix SSVE test to leave streaming-mode after grabbing the signal
           context
         - Add new test for SVE vector-length changes with SME enabled

        Miscellaneous:
         - Allow compiler to warn on suspicious looking system register
           expressions
         - Work around SDEI firmware bug by aborting any running handlers
           on a kernel crash
         - Fix some harmless warnings when building with W=1
         - Remove some unused function declarations
         - Other minor fixes and cleanup"

      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (62 commits)
        drivers/perf: hisi: Update HiSilicon PMU maintainers
        arm_pmu: acpi: Add a representative platform device for TRBE
        arm_pmu: acpi: Refactor arm_spe_acpi_register_device()
        kselftest/arm64: Fix hwcaps selftest build
        hw_breakpoint: fix single-stepping when using bpf_overflow_handler
        arm64/sysreg: refactor deprecated strncpy
        kselftest/arm64: add jscvt feature to hwcap test
        kselftest/arm64: add pmull feature to hwcap test
        kselftest/arm64: add AES feature check to hwcap test
        kselftest/arm64: add SHA1 and related features to hwcap test
        arm64: sysreg: Generate C compiler warnings on {read,write}_sysreg_s arguments
        kselftest/arm64: build BTI tests in output directory
        perf/imx_ddr: don't enable counter0 if none of 4 counters are used
        perf/imx_ddr: speed up overflow frequency of cycle
        drivers/perf: hisi: Schedule perf session according to locality
        kselftest/arm64: fix a memleak in zt_regs_run()
        perf/arm-dmc620: Fix dmc620_pmu_irqs_lock/cpu_hotplug_lock circular lock dependency
        perf/smmuv3: Add MODULE_ALIAS for module auto loading
        perf/smmuv3: Enable HiSilicon Erratum 162001900 quirk for HIP08/09
        kselftest/arm64: Size sycall-abi buffers for the actual maximum VL
        ...
| * | hw_breakpoint: fix single-stepping when using bpf_overflow_handler [Tomislav Novak, 2023-08-18, 1 file, -3/+19]

      Arm platforms use is_default_overflow_handler() to determine if the
      hw_breakpoint code should single-step over the breakpoint trigger or
      let the custom handler deal with it.

      Since bpf_overflow_handler() currently isn't recognized as a default
      handler, attaching a BPF program to a PERF_TYPE_BREAKPOINT event
      causes it to keep firing (the instruction triggering the data abort
      exception is never skipped). For example:

        # bpftrace -e 'watchpoint:0x10000:4:w { print("hit") }' -c ./test
        Attaching 1 probe...
        hit
        hit
        [...]
        ^C

      (./test performs a single 4-byte store to 0x10000)

      This patch replaces the check with uses_default_overflow_handler(),
      which accounts for the bpf_overflow_handler() case by also testing
      if one of the perf_event_output functions gets invoked indirectly,
      via orig_default_handler.

      Signed-off-by: Tomislav Novak <tnovak@meta.com>
      Tested-by: Samuel Gosselin <sgosselin@google.com> # arm64
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/linux-arm-kernel/20220923203644.2731604-1-tnovak@fb.com/
      Link: https://lore.kernel.org/r/20230605191923.1219974-1-tnovak@meta.com
      Signed-off-by: Will Deacon <will@kernel.org>
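      A condensed sketch of the new check (simplified from
      include/linux/perf_event.h; the kernel's version differs in detail):

        static inline bool __is_default_overflow_handler(perf_overflow_handler_t h)
        {
                return h == perf_event_output_forward ||
                       h == perf_event_output_backward;
        }

        /* With CONFIG_BPF_SYSCALL, also look through bpf_overflow_handler() */
        static inline bool uses_default_overflow_handler(struct perf_event *event)
        {
                if (likely(is_default_overflow_handler(event)))
                        return true;

                return __is_default_overflow_handler(event->orig_overflow_handler);
        }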
* | perf: Remove unused extern declaration arch_perf_get_page_size() [YueHaibing, 2023-07-26, 1 file, -4/+0]

      Commit 8af26be06272 ("perf/core: Fix arch_perf_get_page_size()")
      left this declaration behind.

      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20230725135038.25060-1-yuehaibing@huawei.com
* | perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability [James Clark, 2023-07-26, 1 file, -4/+3]

      Since commit bd2756811766 ("perf: Rewrite core context handling")
      the relationship between perf_event_context and PMUs has changed,
      so the error scenario that PERF_PMU_CAP_HETEROGENEOUS_CPUS
      originally silenced no longer exists.

      Remove the capability to avoid the impression that it still
      influences any perf core behavior, and shift the following
      capability bits down to fill the unused space. This change should
      be a no-op.

      Signed-off-by: James Clark <james.clark@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20230724134500.970496-5-james.clark@arm.com
* | perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA [Ravi Bangoria, 2023-07-26, 1 file, -1/+2]

      Add PERF_MEM_LVLNUM_NA wherever PERF_MEM_NA is used to set default
      values.

      Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230725150206.184-3-ravi.bangoria@amd.com
* Merge tag 'cxl-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl [Linus Torvalds, 2023-07-01, 1 file, -0/+1]

      Pull CXL updates from Dan Williams:
       "The highlights in terms of new functionality are support for the
        standard CXL Performance Monitor definition that appeared in CXL
        3.0, support for device sanitization (wiping all data from a
        device), secure-erase (re-keying encryption of user data), and
        support for firmware update. The firmware update support is
        notable as it reuses the simple sysfs_upload interface to just
        cat(1) a blob to a sysfs file and pipe that to the device.

        Additionally there are a substantial number of cleanups and
        reorganizations to get ready for RCH error handling (RCH ==
        Restricted CXL Host == current shipping hardware generation /
        pre CXL-2.0 topologies) and type-2 (accelerator / vendor specific)
        devices. For vendor specific devices they implement a subset of
        what the generic type-3 (generic memory expander) driver expects.
        As a result the rework decouples optional infrastructure from the
        core driver context.

        For RCH topologies, where the specification working group did not
        want to confuse pre-CXL-aware operating systems, many of the
        standard registers are hidden, which makes supporting standard bus
        features like AER (PCIe Advanced Error Reporting) difficult. The
        rework arranges for the driver to help the PCI-AER core. Bjorn is
        on board with this direction but a late regression discovery means
        the completion of this functionality needs to cook a bit longer,
        so it is code reorganizations only for now.

        Summary:
         - Add infrastructure for supporting background commands along
           with support for device sanitization and firmware update
         - Introduce a CXL performance monitoring unit driver based on the
           common definition in the specification.
         - Land some preparatory cleanup and refactoring for the
           anticipated arrival of CXL type-2 (accelerator devices) and CXL
           RCH (CXL-v1.1 topology) error handling.
         - Rework CPU cache management with respect to region
           configuration (device hotplug or other dynamic changes to
           memory interleaving)
         - Fix region reconfiguration vs CXL decoder ordering rules"

      * tag 'cxl-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (51 commits)
        cxl: Fix one kernel-doc comment
        cxl/pci: Use correct flag for sanitize polling
        docs: perf: Minimal introduction the the CXL PMU device and driver
        perf: CXL Performance Monitoring Unit driver
        tools/testing/cxl: add firmware update emulation to CXL memdevs
        tools/testing/cxl: Use named effects for the Command Effect Log
        tools/testing/cxl: Fix command effects for inject/clear poison
        cxl: add a firmware update mechanism using the sysfs firmware loader
        cxl/test: Add Secure Erase opcode support
        cxl/mem: Support Secure Erase
        cxl/test: Add Sanitize opcode support
        cxl/mem: Wire up Sanitization support
        cxl/mbox: Add sanitization handling machinery
        cxl/mem: Introduce security state sysfs file
        cxl/mbox: Allow for IRQ_NONE case in the isr
        Revert "cxl/port: Enable the HDM decoder capability for switch ports"
        cxl/memdev: Formalize endpoint port linkage
        cxl/pci: Unconditionally unmask 256B Flit errors
        cxl/region: Manage decoder target_type at decoder-attach time
        cxl/hdm: Default CXL_DEVTYPE_DEVMEM decoders to CXL_DECODER_DEVMEM
        ...
| * perf: Allow a PMU to have a parent [Jonathan Cameron, 2023-05-30, 1 file, -0/+1]

      Some PMUs have well-defined parents, such as PCI devices. As
      device_initialize() and device_add() both happen within
      pmu_dev_alloc(), which is called from perf_pmu_register(), there is
      no opportunity to set the parent from within a driver.

      Add a struct device *parent field to struct pmu and use that to set
      the parent.

      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Link: https://lore.kernel.org/r/20230526095824.16336-2-Jonathan.Cameron@huawei.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
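      A hypothetical driver sketch of how the new field is meant to be
      used (example_pmu and the probe function are made up; only the
      parent assignment comes from this patch):

        struct example_pmu {
                struct pmu pmu;
                /* ... driver state ... */
        };

        static int example_pmu_probe(struct pci_dev *pdev)
        {
                struct example_pmu *epmu;

                epmu = devm_kzalloc(&pdev->dev, sizeof(*epmu), GFP_KERNEL);
                if (!epmu)
                        return -ENOMEM;

                /* ... fill in epmu->pmu callbacks ... */
                epmu->pmu.parent = &pdev->dev;  /* the new field */
                return perf_pmu_register(&epmu->pmu, "example_pmu", -1);
        }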
* | Merge tag 'perf-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip [Linus Torvalds, 2023-06-27, 1 file, -0/+10]

      Pull perf events updates from Ingo Molnar:

       - Rework & fix the event forwarding logic by extending the core
         interface. This fixes AMD PMU events that have to be forwarded
         from the core PMU to the IBS PMU.
       - Add self-tests to test AMD IBS invocation via core PMU events
       - Clean up Intel FixCntrCtl MSR encoding & handling

      * tag 'perf-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf: Re-instate the linear PMU search
        perf/x86/intel: Define bit macros for FixCntrCtl MSR
        perf test: Add selftest to test IBS invocation via core pmu events
        perf/core: Remove pmu linear searching code
        perf/ibs: Fix interface via core pmu events
        perf/core: Rework forwarding of {task|cpu}-clock events
| * | perf/core: Rework forwarding of {task|cpu}-clock events [Ravi Bangoria, 2023-05-08, 1 file, -0/+10]

      Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock
      and cpu-clock events are interfaced through it but internally get
      forwarded to their own PMUs.

      Rework this by overwriting event->attr.type in perf_swevent_init(),
      which will cause perf_init_event() to retry with the updated type,
      so the event automatically gets forwarded to the right PMU. With
      the change, the SW PMU no longer needs to be treated specially and
      can be included in the 'pmu_idr' list.

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20230504110003.2548-2-ravi.bangoria@amd.com
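      A simplified sketch of the retry mechanism (close to, but not
      verbatim, kernel/events/core.c):

        static int perf_swevent_init(struct perf_event *event)
        {
                u64 event_id = event->attr.config;

                if (event->attr.type == PERF_TYPE_SOFTWARE) {
                        switch (event_id) {
                        case PERF_COUNT_SW_CPU_CLOCK:
                                event->attr.type = perf_cpu_clock.type;
                                return -ENOENT; /* perf_init_event() retries */
                        case PERF_COUNT_SW_TASK_CLOCK:
                                event->attr.type = perf_task_clock.type;
                                return -ENOENT;
                        }
                }
                /* ... regular software event setup ... */
                return 0;
        }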
* / perf/core: Drop __weak attribute from arch_perf_update_userpage() prototype [Marc Zyngier, 2023-06-16, 1 file, -3/+3]

      Reiji reports that the arm64 implementation of
      arch_perf_update_userpage() is now ignored and replaced by the dummy
      stub in core code. This seems to happen since the PMUv3 driver was
      moved to drivers/perf.

      As it turns out, dropping the __weak attribute from the *prototype*
      of the function solves the problem. You're right, this doesn't seem
      to make much sense. And yet... It appears that both symbols get
      flagged as weak, and the first one to appear in the link order wins:

        $ nm drivers/perf/arm_pmuv3.o | grep arch_perf_update_userpage
        0000000000001db0 W arch_perf_update_userpage

      Dropping the attribute from the prototype restores the expected
      behaviour, and arm64 is able to enjoy arch_perf_update_userpage()
      again.

      Fixes: 7755cec63ade ("arm64: perf: Move PMUv3 driver to drivers/perf")
      Fixes: f1ec3a517b43 ("kernel/events: Add a missing prototype for arch_perf_update_userpage()")
      Reported-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Reiji Watanabe <reijiw@google.com>
      Link: https://lkml.kernel.org/r/20230616114831.3186980-1-maz@kernel.org
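      The pattern this fix establishes, sketched (signatures paraphrased
      from the tree at the time):

        /* include/linux/perf_event.h: plain prototype, no __weak */
        void arch_perf_update_userpage(struct perf_event *event,
                                       struct perf_event_mmap_page *userpg,
                                       u64 now);

        /* kernel/events/core.c: __weak only on the fallback definition */
        void __weak arch_perf_update_userpage(struct perf_event *event,
                                              struct perf_event_mmap_page *userpg,
                                              u64 now)
        {
        }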
* perf/core: Introduce perf_prepare_header() [Namhyung Kim, 2023-01-18, 1 file, -1/+15]

      Factor out perf_prepare_header() so that perf_prepare_sample() can
      be called without a header when one is not needed. It also checks
      filtered_sample_type to avoid duplicate work when
      perf_prepare_sample() is called twice (or more).

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Song Liu <song@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230118060559.615653-8-namhyung@kernel.org
* perf/core: Add perf_sample_save_brstack() helper [Namhyung Kim, 2023-01-18, 1 file, -26/+40]

      When we save the branch stack to the perf sample data, we need to
      update the sample flags and the dynamic size. To make sure this is
      done consistently, add the perf_sample_save_brstack() helper and
      convert all call sites.

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230118060559.615653-5-namhyung@kernel.org
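      The helper is essentially the following (paraphrased; the signature
      later grows a counters argument with the branch-counters patch
      above):

        static inline void perf_sample_save_brstack(struct perf_sample_data *data,
                                                    struct perf_event *event,
                                                    struct perf_branch_stack *brs)
        {
                int size = sizeof(u64); /* nr */

                if (branch_sample_hw_index(event))
                        size += sizeof(u64); /* hw_idx */
                size += brs->nr * sizeof(struct perf_branch_entry);

                data->br_stack = brs;
                data->dyn_size += size;
                data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
        }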
* perf/core: Add perf_sample_save_raw_data() helper [Namhyung Kim, 2023-01-18, 1 file, -5/+28]

      When we save the raw_data to the perf sample data, we need to update
      the sample flags and the dynamic size. To make sure this is done
      consistently, add the perf_sample_save_raw_data() helper and convert
      all call sites.

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230118060559.615653-4-namhyung@kernel.org
* perf/core: Add perf_sample_save_callchain() helper [Namhyung Kim, 2023-01-18, 1 file, -1/+15]

      When we save the callchain to the perf sample data, we need to
      update the sample flags and the dynamic size. To ensure this is done
      consistently, add the perf_sample_save_callchain() helper and
      convert all call sites.

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Song Liu <song@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230118060559.615653-3-namhyung@kernel.org
* perf/core: Save the dynamic parts of sample data size [Namhyung Kim, 2023-01-18, 1 file, -0/+2]

      The perf sample data can be divided into parts. The
      event->header_size and event->id_header_size keep the static part of
      the sample data, which is determined by the sample_type flags. But
      other parts like CALLCHAIN and BRANCH_STACK change dynamically, so
      their size depends on the actual data.

      In preparation for handling repeated calls of perf_prepare_sample(),
      save the dynamic size in the perf sample data to avoid duplicate
      work.

      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Song Liu <song@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230118060559.615653-2-namhyung@kernel.org
* perf/core: Change the layout of perf_sample_data [Namhyung Kim, 2023-01-09, 1 file, -15/+19]

      The layout of perf_sample_data is designed to minimize cache-line
      accesses. perf_sample_data_init() used to initialize a couple of
      fields unconditionally, so they were placed together at the head.
      But it has since been changed to set the fields according to the
      actual sample_type flags.

      The main user (the perf tools) always sets IP, TID, TIME and PERIOD.
      Also group relevant fields like addr, phys_addr and data_page_size.

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20221229204101.1099430-1-namhyung@kernel.org
* perf: Add PMU_FORMAT_ATTR_SHOW [Kan Liang, 2023-01-09, 1 file, -1/+4]

      The macro PMU_FORMAT_ATTR facilitates the definition of both the
      "show" function and "format_attr". But it only works for a
      non-hybrid platform. For a hybrid platform, the name
      "format_attr_hybrid_" is used.

      The definition of the "show" function can be shared between a
      non-hybrid platform and a hybrid platform. Add a new macro
      PMU_FORMAT_ATTR_SHOW. No functional change. PMU_FORMAT_ATTR_SHOW
      will be used in the following patch.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230104201349.1451191-1-kan.liang@linux.intel.com
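      For reference, the new macro is essentially the "show" half of
      PMU_FORMAT_ATTR (paraphrased from include/linux/perf_event.h):

        #define PMU_FORMAT_ATTR_SHOW(_name, _format)                    \
        static ssize_t                                                  \
        _name##_show(struct device *dev,                                \
                     struct device_attribute *attr,                     \
                     char *page)                                        \
        {                                                               \
                BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE);             \
                return sprintf(page, _format "\n");                     \
        }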
* perf: Rewrite core context handling [Peter Zijlstra, 2022-10-27, 1 file, -31/+94]

      There have been various issues and limitations with the way perf
      uses (task) contexts to track events. Most notable is the single
      hardware PMU task context, which has resulted in a number of yucky
      things (both proposed and merged). Notably:

       - HW breakpoint PMU
       - ARM big.little PMU / Intel ADL PMU
       - Intel Branch Monitoring PMU
       - AMD IBS PMU
       - S390 cpum_cf PMU
       - PowerPC trace_imc PMU

      *Current design:*

      Currently we have a per task and per cpu perf_event_contexts:

        task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
             ^                                  |    ^     |    ^
             `----------------------------------'    |     `--> pmu ---'
                                                     v           ^
                                                perf_event ------'

      Each task has an array of pointers to a perf_event_context. Each
      perf_event_context has a direct relation to a PMU and a group of
      events for that PMU. The task related perf_event_context's have a
      pointer back to that task.

      Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
      includes a perf_event_context, which again has a direct relation to
      that PMU, and a group of events for that PMU.

      The perf_cpu_context also tracks which task context is currently
      associated with that CPU and includes a few other things like the
      hrtimer for rotation etc.

      Each perf_event is then associated with its PMU and one
      perf_event_context.

      *Proposed design:*

      The new design proposed by this patch reduces to a single task
      context and a single CPU context but adds some intermediate
      data-structures:

        task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
             ^                           |   ^        ^
             `---------------------------'   |        |
                                             |    perf_cpu_pmu_context <--.
                                             |     `----.    ^            |
                                             |          |    |            |
                                             |          v    v            |
                                             |     ,--> perf_event_pmu_context
                                             |     |                      |
                                             |     |                      |
                                             v     v                      |
                                        perf_event ---> pmu --------------'

      With the new design, perf_event_context will hold all events for
      all pmus in the (respective pinned/flexible) rbtrees. This can be
      achieved by adding pmu to the rbtree key:

        {cpu, pmu, cgroup, group_index}

      Each perf_event_context carries a list of perf_event_pmu_context
      which is used to hold per-pmu-per-context state. For example, it
      keeps track of currently active events for that pmu, a pmu specific
      task_ctx_data, a flag to tell whether rotation is required or not
      etc.

      Additionally, perf_cpu_pmu_context is used to hold per-pmu-per-cpu
      state like hrtimer details to drive the event rotation, a pointer
      to the perf_event_pmu_context of the currently running task and
      some other ancillary information.

      Each perf_event is associated with its pmu, perf_event_context and
      perf_event_pmu_context.

      Further optimizations to the current implementation are possible.
      For example, ctx_resched() can be optimized to reschedule only
      single pmu events.

      Much thanks to Ravi for picking this up and pushing it towards
      completion.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
      Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20221008062424.313-1-ravi.bangoria@amd.com
* perf: Fix missing SIGTRAPs [Peter Zijlstra, 2022-10-17, 1 file, -4/+15]

      Marco reported:

      Due to the implementation of how SIGTRAP are delivered if
      perf_event_attr::sigtrap is set, we've noticed 3 issues:

        1. Missing SIGTRAP due to a race with event_sched_out() (more
           details below).
        2. Hardware PMU events being disabled due to returning 1 from
           perf_event_overflow(). The only way to re-enable the event is
           for user space to first "properly" disable the event and then
           re-enable it.
        3. The inability to automatically disable an event after a
           specified number of overflows via PERF_EVENT_IOC_REFRESH.

      The worst of the 3 issues is problem (1), which occurs when a
      pending_disable is "consumed" by a racing event_sched_out(),
      observed as follows:

        CPU0                            | CPU1
        --------------------------------+---------------------------
        __perf_event_overflow()         |
          perf_event_disable_inatomic() |
            pending_disable = CPU0      | ...
                                        | _perf_event_enable()
                                        |   event_function_call()
                                        |     task_function_call()
                                        |       /* sends IPI to CPU0 */
        <IPI>                           | ...
          __perf_event_enable()         +---------------------------
            ctx_resched()
              task_ctx_sched_out()
                ctx_sched_out()
                  group_sched_out()
                    event_sched_out()
                      pending_disable = -1
        </IPI>
        <IRQ-work>
          perf_pending_event()
            perf_pending_event_disable()
              /* Fails to send SIGTRAP because no pending_disable! */
        </IRQ-work>

      In the above case, not only is that particular SIGTRAP missed, but
      also all future SIGTRAPs because 'event_limit' is not reset back to
      1.

      To fix, rework pending delivery of SIGTRAP via IRQ-work by
      introduction of a separate 'pending_sigtrap', no longer using
      'event_limit' and 'pending_disable' for its delivery.

      Additionally; and different to Marco's proposed patch:

       - recognise that pending_disable effectively duplicates oncpu for
         the case where it is set. As such, change the irq_work handler
         to use ->oncpu to target the event and use pending_* as boolean
         toggles.

       - observe that SIGTRAP targets the ctx->task, so the context
         switch optimization that carries contexts between tasks is
         invalid. If the irq_work were delayed enough to hit after a
         context switch the SIGTRAP would be delivered to the wrong task.

       - observe that if the event gets scheduled out
         (rotation/migration/context-switch/...) the irq-work would be
         insufficient to deliver the SIGTRAP when the event gets
         scheduled back in (the irq-work might still be pending on the
         old CPU).

         Therefore have event_sched_out() convert the pending sigtrap
         into a task_work which will deliver the signal at
         return_to_user.

      Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Debugged-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Marco Elver <elver@google.com>
      Debugged-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Marco Elver <elver@google.com>
      Tested-by: Marco Elver <elver@google.com>
* perf: Fix lockdep_assert_event_ctx() [Peter Zijlstra, 2022-10-04, 1 file, -1/+1]

      I'm a flamin' moron; because even after Mark told me it should be
      '&&' I still got it wrong in the final commit.

      Fixes: f3c0eba28704 ("perf: Add a few assertions")
      Reported-by: Borislav Petkov <bp@alien8.de>
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Borislav Petkov <bp@alien8.de>
      Link: https://lkml.kernel.org/r/YvvIWmDBWdIUCMZj@FVFF77S0Q05N
* perf: Use sample_flags for raw_data [Namhyung Kim, 2022-09-27, 1 file, -3/+2]

      Use the new sample_flags to indicate whether the raw data field is
      filled by the PMU driver. Although this could be checked against
      NULL, follow the same rule as the other fields.

      Remove the raw field from perf_sample_data_init() to minimize the
      number of cache lines touched.

      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220921220032.2858517-2-namhyung@kernel.org
* perf: Use sample_flags for addr [Namhyung Kim, 2022-09-27, 1 file, -2/+6]

      Use the new sample_flags to indicate whether the addr field is
      filled by the PMU driver. As most PMU drivers pass 0, set the flag
      only when there is a non-zero value. And use 0 in
      perf_sample_output() if the field is not filled already.

      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220921220032.2858517-1-namhyung@kernel.org
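      The driver-side pattern this series converges on, sketched (the
      surrounding record/condition names are hypothetical):

        if (record->addr) {             /* only flag meaningful values */
                data.addr = record->addr;
                data.sample_flags |= PERF_SAMPLE_ADDR;
        }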
* perf: Add a few assertions [Peter Zijlstra, 2022-09-07, 1 file, -0/+17]

      While auditing 6b959ba22d34 ("perf/core: Fix reentry problem in
      perf_output_read_group()"), a few spots were found that wanted
      assertions.

      Notably, for_each_sibling_event() relies on exclusion from
      modification. This would normally mean holding either ctx->lock or
      ctx->mutex; however, due to how things are constructed, disabling
      IRQs is a valid and sufficient substitute for ctx->lock.

      Another possible site to add assertions would be the various
      pmu::{add,del,read,..}() methods, but that's not trivially
      expressible in C -- the best option is wrappers, but those are easy
      enough to forget.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* perf/core: Assert PERF_EVENT_FLAG_ARCH does not overlap with generic flags [Anshuman Khandual, 2022-09-07, 1 file, -0/+2]

      This just ensures that PERF_EVENT_FLAG_ARCH does not overlap with
      generic hardware event flags.

      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: James Clark <james.clark@arm.com>
      Link: https://lkml.kernel.org/r/20220907091924.439193-3-anshuman.khandual@arm.com
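      The assertion is of this shape (a sketch; the exact generic flag
      checked may differ):

        static_assert((PERF_EVENT_FLAG_USER_READ_CNT & PERF_EVENT_FLAG_ARCH) == 0);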
* perf/core: Expand PERF_EVENT_FLAG_ARCH [Anshuman Khandual, 2022-09-07, 1 file, -1/+1]

      Two hardware event flags on the x86 platform have overshot
      PERF_EVENT_FLAG_ARCH (0x0000ffff): PERF_X86_EVENT_PEBS_LAT_HYBRID
      (0x20000) and PERF_X86_EVENT_AMD_BRS (0x10000). Let's expand the
      PERF_EVENT_FLAG_ARCH mask to accommodate those flags, and also
      create room for two more in the future.

      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: James Clark <james.clark@arm.com>
      Link: https://lkml.kernel.org/r/20220907091924.439193-2-anshuman.khandual@arm.com
* perf: Consolidate branch sample filter helpers [Anshuman Khandual, 2022-09-07, 1 file, -0/+26]

      Besides the branch type filtering requests,
      'event.attr.branch_sample_type' also contains various flags
      indicating which additional information should be captured, along
      with the base branch record. These flags help configure the
      underlying hardware, and capture the branch records appropriately
      when required, e.g. after a PMU interrupt. But first, this moves an
      existing helper perf_sample_save_hw_index() into the header before
      adding some more helpers for other branch sample filter flags.

      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220906084414.396220-1-anshuman.khandual@arm.com
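      The consolidated helpers all follow one shape; a representative
      sketch (see include/linux/perf_event.h for the full set):

        static inline bool branch_sample_hw_index(const struct perf_event *event)
        {
                return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_HW_INDEX;
        }

        static inline bool branch_sample_no_flags(const struct perf_event *event)
        {
                return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_NO_FLAGS;
        }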
* perf: Use sample_flags for txn [Kan Liang, 2022-09-06, 1 file, -2/+1]

      Use the new sample_flags to indicate whether the txn field is filled
      by the PMU driver.

      Remove the txn field from perf_sample_data_init() to minimize the
      number of cache lines touched.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220901130959.1285717-7-kan.liang@linux.intel.com
* perf: Use sample_flags for data_src [Kan Liang, 2022-09-06, 1 file, -2/+1]

      Use the new sample_flags to indicate whether the data_src field is
      filled by the PMU driver.

      Remove the data_src field from perf_sample_data_init() to minimize
      the number of cache lines touched.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220901130959.1285717-6-kan.liang@linux.intel.com
* perf: Use sample_flags for weight [Kan Liang, 2022-09-06, 1 file, -2/+1]

      Use the new sample_flags to indicate whether the weight field is
      filled by the PMU driver.

      Remove the weight field from perf_sample_data_init() to minimize the
      number of cache lines touched.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220901130959.1285717-5-kan.liang@linux.intel.com
* perf: Use sample_flags for branch stack [Kan Liang, 2022-09-06, 1 file, -2/+2]

      Use the new sample_flags to indicate whether the branch stack is
      filled by the PMU driver.

      Remove the br_stack from perf_sample_data_init() to minimize the
      number of cache lines touched.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220901130959.1285717-4-kan.liang@linux.intel.com
* perf: Add sample_flags to indicate the PMU-filled sample data [Kan Liang, 2022-09-06, 1 file, -0/+2]

      On some platforms, some data, e.g. timestamps, can be retrieved from
      the PMU driver. Usually, the data from the PMU driver is more
      accurate. The perf kernel should output the PMU-filled sample data
      when it's available.

      To check the availability of the PMU-filled sample data, the perf
      kernel currently initializes the related fields in
      perf_sample_data_init(). When outputting a sample, perf checks
      whether the field was updated by the PMU driver. If yes, the updated
      value is output. If not, perf calculates the value in software, or
      just outputs the initialized value if no software fallback is
      available either.

      With more and more data being provided by PMU drivers, more fields
      have to be initialized in perf_sample_data_init(). That increases
      the number of cache lines touched in perf_sample_data_init() and is
      harmful to performance.

      Add new "sample_flags" to indicate the PMU-filled sample data. The
      PMU driver should set the corresponding PERF_SAMPLE_ flag when the
      field is updated. The initialization of the corresponding field is
      not required anymore. The following patches will make use of it and
      remove the corresponding fields from perf_sample_data_init(), which
      will further minimize the number of cache lines touched.

      Only clear the sample flags that have already been done by the PMU
      driver in perf_prepare_sample() for PERF_RECORD_SAMPLE. For the
      other PERF_RECORD_ event types, the sample data is not available.

      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220901130959.1285717-2-kan.liang@linux.intel.com
* perf/hw_breakpoint: Optimize list of per-task breakpoints [Marco Elver, 2022-08-30, 1 file, -1/+2]

      On a machine with 256 CPUs, running the recently added perf
      breakpoint benchmark results in:

        | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
        | # Running 'breakpoint/thread' benchmark:
        | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
        |      Total time: 236.418 [sec]
        |
        |   123134.794271 usecs/op
        |  7880626.833333 usecs/op/cpu

      The benchmark tests inherited breakpoint perf events across many
      threads.

      Looking at a perf profile, we can see that the majority of the time
      is spent in various hw_breakpoint.c functions, which execute within
      the 'nr_bp_mutex' critical sections, which then results in
      contention on that mutex as well:

          37.27%  [kernel]       [k] osq_lock
          34.92%  [kernel]       [k] mutex_spin_on_owner
          12.15%  [kernel]       [k] toggle_bp_slot
          11.90%  [kernel]       [k] __reserve_bp_slot

      The culprit here is task_bp_pinned(), which has a runtime complexity
      of O(#tasks) due to storing all task breakpoints in the same list
      and iterating through that list looking for a matching task.
      Clearly, this does not scale to thousands of tasks.

      Instead, make use of the "rhashtable" variant "rhltable" which
      stores multiple items with the same key in a list. This results in
      average runtime complexity of O(1) for task_bp_pinned().

      With the optimization, the benchmark shows:

        | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
        | # Running 'breakpoint/thread' benchmark:
        | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
        |      Total time: 0.208 [sec]
        |
        |      108.422396 usecs/op
        |     6939.033333 usecs/op/cpu

      On this particular setup that's a speedup of ~1135x.

      While one option would be to make task_struct a breakpoint list
      node, this would only further bloat task_struct for infrequently
      used data. Furthermore, after all optimizations in this series,
      there's no evidence it would result in better performance: later
      optimizations make the time spent looking up entries in the hash
      table negligible (we'll reach the theoretical ideal performance
      i.e. no constraints).

      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com
* perf/core: Add speculation info to branch entries [Sandipan Das, 2022-08-27, 1 file, -0/+1]

      Add a new "spec" bitfield to branch entries for providing
      speculation information. This will be populated using hints provided
      by branch sampling features on supported hardware. The following
      cases are covered:

       * No branch speculation information is available
       * Branch is speculative but taken on the wrong path
       * Branch is non-speculative but taken on the correct path
       * Branch is speculative and taken on the correct path

      Suggested-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Sandipan Das <sandipan.das@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/834088c302faf21c7b665031dd111f424e509a64.1660211399.git.sandipan.das@amd.com
* perf/core: Add a new read format to get a number of lost samples [Namhyung Kim, 2022-06-28, 1 file, -0/+2]

      Sometimes we want to know an accurate number of samples even if some
      were lost. Currently PERF_RECORD_LOST is generated for a ring buffer
      which might be shared with other events, so it's hard to know the
      per-event lost count.

      Add an event->lost_samples field and PERF_FORMAT_LOST to retrieve it
      from userspace.

      Original-patch-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220616180623.1358843-1-namhyung@kernel.org
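      A userspace sketch of reading the new count (assuming only
      PERF_FORMAT_LOST is set in attr.read_format, so the read layout is
      {value, lost}; fd is a previously opened perf event fd):

        #include <linux/perf_event.h>
        #include <unistd.h>

        struct read_format {
                __u64 value;    /* counter value */
                __u64 lost;     /* dropped samples for this event */
        };

        /* fd = perf_event_open(&attr, ...) with
         * attr.read_format = PERF_FORMAT_LOST; then: */
        struct read_format rf;
        read(fd, &rf, sizeof(rf));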
* ACPI: Add perf low power callback [Stephane Eranian, 2022-04-05, 1 file, -0/+6]

      Add an optional callback needed by some PMU features, e.g. AMD BRS,
      to give the perf_events code a chance to change its state before a
      CPU goes to low power and after it comes back.

      The callback is a no-op when the PERF_NEEDS_LOPWR_CB flag is not
      set. This flag must be set in the arch-specific perf_event.h header
      whenever needed. When not set, there is no impact on the ACPI code.

      Signed-off-by: Stephane Eranian <eranian@google.com>
      [peterz: build fix]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220322221517.2510440-9-eranian@google.com
* perf/core: Add perf_clear_branch_entry_bitfields() helper [Stephane Eranian, 2022-04-05, 1 file, -0/+16]

      Make it simpler to reset all the info fields on the
      perf_branch_entry by adding an inline helper function. The goal is
      to centralize the initialization to avoid missing a field in case
      more are added.

      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220322221517.2510440-2-eranian@google.com
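      The helper is essentially (paraphrased; the field list has since
      grown, e.g. the "spec" bitfield added above):

        static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *br)
        {
                br->mispred = 0;
                br->predicted = 0;
                br->in_tx = 0;
                br->abort = 0;
                br->cycles = 0;
                br->type = 0;
                br->reserved = 0;
        }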
* perf: Fix wrong name in comment for struct perf_cpu_context [Alexandru Elisei, 2022-02-08, 1 file, -1/+1]

      Commit 0793a61d4df8 ("performance counters: core code") added the
      perf subsystem (then called Performance Counters) to Linux, creating
      the struct perf_cpu_context. The comment for the struct referred to
      it as a "struct perf_counter_cpu_context".

      Commit cdd6c482c9ff ("perf: Do the big rename: Performance Counters
      -> Performance Events") changed the comment to refer to a "struct
      perf_event_cpu_context", which was still the wrong name for the
      struct. Change the comment to say "struct perf_cpu_context".

      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220127161759.53553-3-alexandru.elisei@arm.com
* perf: Fix perf_event_read_local() time [Peter Zijlstra, 2022-01-18, 1 file, -12/+3]

      Time readers that cannot take locks (due to NMI etc.) currently make
      use of perf_event::shadow_ctx_time, which, for that event gives:

        time' = now + (time - timestamp)

      or, alternatively arranged:

        time' = time + (now - timestamp)

      IOW, the progression of time since the last time the
      shadow_ctx_time was updated.

      There are problems with this:

       A) the shadow_ctx_time is per-event, even though the ctx_time it
          reflects is obviously per context. The direct consequence of
          this is that the context needs to iterate all events all the
          time to keep the shadow_ctx_time in sync.

       B) even with the prior point, the context itself might not be
          active, meaning its time should not advance to begin with.

       C) shadow_ctx_time isn't consistently updated when ctx_time is.

      There are 3 users of this stuff, that suffer differently from this:

       - calc_timer_values()
         - perf_output_read()
         - perf_event_update_userpage()   /* A */

       - perf_event_read_local()          /* A,B */

      In particular, perf_output_read() doesn't suffer at all, because
      it's sample driven and hence only relevant when the event is
      actually running.

      The same was supposed to be true for perf_event_update_userpage(),
      after all self-monitoring implies the context is active *HOWEVER*,
      as per commit f79256532682 ("perf/core: fix userpage->time_enabled
      of inactive events") this goes wrong when combined with counter
      overcommit: in that case those events that do not get scheduled
      when the context becomes active (task events typically) miss out on
      the EVENT_TIME update and ENABLED time is inflated (for a little
      while) with the time the context was inactive. Once the event gets
      rotated in, this gets corrected, leading to a non-monotonic
      timeflow.

      perf_event_read_local() made things even worse; it can request time
      at any point, suffering all the problems
      perf_event_update_userpage() does and more. Because while
      perf_event_update_userpage() is limited by the context being
      active, perf_event_read_local() users have no such constraint.

      Therefore, completely overhaul things and do away with
      perf_event::shadow_ctx_time. Instead have regular context time
      updates keep track of this offset directly and provide
      perf_event_time_now() to complement perf_event_time().

      perf_event_time_now() will, in addition to being context wide, also
      take into account if the context is active. For inactive contexts,
      it will not advance time. This latter property means the cgroup
      perf_cgroup_info context needs to grow additional state to track
      this.

      Additionally, since all this is strictly per-cpu, we can use
      barrier() to order context activity vs context time.

      Fixes: 7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local()
      interface, a.k.a. "bpf: perf event change needed for subsequent bpf
      helpers"")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Song Liu <song@kernel.org>
      Tested-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/r/YcB06DasOBtU0b00@hirez.programming.kicks-ass.net
* Merge tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip [Linus Torvalds, 2022-01-12, 1 file, -12/+32]

      Pull perf updates from Borislav Petkov:
       "Cleanup of the perf/kvm interaction."

      * tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf: Drop guest callback (un)register stubs
        KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c
        KVM: arm64: Hide kvm_arm_pmu_available behind CONFIG_HW_PERF_EVENTS=y
        KVM: arm64: Convert to the generic perf callbacks
        KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
        KVM: Move x86's perf guest info callbacks to generic KVM
        KVM: x86: More precisely identify NMI from guest when handling PMI
        KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable
        perf/core: Use static_call to optimize perf_guest_info_callbacks
        perf: Force architectures to opt-in to guest callbacks
        perf: Add wrappers for invoking guest callbacks
        perf/core: Rework guest callbacks to prepare for static_call support
        perf: Drop dead and useless guest "support" from arm, csky, nds32 and riscv
        perf: Stop pretending that perf can handle multiple guest callbacks
        KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
        KVM: x86: Register perf callbacks after calling vendor's hardware_setup()
        perf: Protect perf_guest_cbs with RCU
| * perf: Drop guest callback (un)register stubs [Sean Christopherson, 2021-11-17, 1 file, -5/+0]

      Drop perf's stubs for (un)registering guest callbacks now that KVM
      registration of callbacks is hidden behind GUEST_PERF_EVENTS=y. The
      only other user is x86 XEN_PV, and x86 unconditionally selects
      PERF_EVENTS.

      No functional change intended.

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Link: https://lore.kernel.org/r/20211111020738.2512932-18-seanjc@google.com