summaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel/cpu
Commit message (Collapse)AuthorAgeFilesLines
* x86_64, asm: Work around AMD SYSRET SS descriptor attribute issueAndy Lutomirski2015-04-261-0/+3
| | | | | | | | | | | | | | | | | | | AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET with SS == 0 results in an invalid usermode state in which SS is apparently equal to __USER_DS but causes #SS if used. Work around the issue by setting SS to __KERNEL_DS __switch_to, thus ensuring that SYSRET never happens with SS set to NULL. This was exposed by a recent vDSO cleanup. Fixes: e7d6eefaaa44 x86/vdso32/syscall.S: Do not load __USER32_DS to %ss Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Brian Gerst <brgerst@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2015-04-185-49/+106
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "This update has mostly fixes, but also other bits: - perf tooling fixes - PMU driver fixes - Intel Broadwell PMU driver HW-enablement for LBR callstacks - a late coming 'perf kmem' tool update that enables it to also analyze page allocation data. Note, this comes with MM tracepoint changes that we believe to not break anything: because it changes the formerly opaque 'struct page *' field that uniquely identifies pages to 'pfn' which identifies pages uniquely too, but isn't as opaque and can be used for other purposes as well" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/pt: Fix and clean up error handling in pt_event_add() perf/x86/intel: Add Broadwell support for the LBR callstack perf/x86/intel/rapl: Fix energy counter measurements but supporing per domain energy units perf/x86/intel: Fix Core2,Atom,NHM,WSM cycles:pp events perf/x86: Fix hw_perf_event::flags collision perf probe: Fix segfault when probe with lazy_line to file perf probe: Find compilation directory path for lazy matching perf probe: Set retprobe flag when probe in address-based alternative mode perf kmem: Analyze page allocator events also tracing, mm: Record pfn instead of pointer to struct page
| * perf/x86/intel/pt: Fix and clean up error handling in pt_event_add()Ingo Molnar2015-04-181-18/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dan Carpenter reported that pt_event_add() has buggy error handling logic: it returns 0 instead of -EBUSY when it fails to start a newly added event. Furthermore, the control flow in this function is messy, with cleanup labels mixed with direct returns. Fix the bug and clean up the code by converting it to a straight fast path for the regular non-failing case, plus a clear sequence of cascading goto labels to do all cleanup. NOTE: I materially changed the existing clean up logic in the pt_event_start() failure case to use the direct perf_aux_output_end() path, not pt_event_del(), because perf_aux_output_end() is enough here. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150416103830.GB7847@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Add Broadwell support for the LBR callstackKan Liang2015-04-171-1/+1
| | | | | | | | | | | | | | | | | | | | Same as Haswell, Broadwell also support the LBR callstack. Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1427962377-40955-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel/rapl: Fix energy counter measurements but supporing per ↵Jacob Pan2015-04-171-21/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | domain energy units RAPL energy hardware unit can vary within a single CPU package, e.g. HSW server DRAM has a fixed energy unit of 15.3 uJ (2^-16) whereas the unit on other domains can be enumerated from power unit MSR. There might be other variations in the future, this patch adds per cpu model quirk to allow special handling of certain cpus. hw_unit is also removed from per cpu data since it is not per cpu and the sampling rate for energy counter is typically not high. Without this patch, DRAM domain on HSW servers will be counted 4x higher than the real energy counter. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Stephane Eranian <eranian@google.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1427405325-780-1-git-send-email-jacob.jun.pan@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Fix Core2,Atom,NHM,WSM cycles:pp eventsPeter Zijlstra2015-04-171-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ingo reported that cycles:pp didn't work for him on some machines. It turns out that in this commit: af4bdcf675cf perf/x86/intel: Disallow flags for most Core2/Atom/Nehalem/Westmere events Andi forgot to explicitly allow that event when he disabled event flags for PEBS on those uarchs. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: af4bdcf675cf ("perf/x86/intel: Disallow flags for most Core2/Atom/Nehalem/Westmere events") Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86: Fix hw_perf_event::flags collisionPeter Zijlstra2015-04-171-9/+9
| | | | | | | | | | | | | | | | | | | | | | Somehow we ended up with overlapping flags when merging the RDPMC control flag - this is bad, fix it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | x86: mtrr: if: remove use of seq_printf return valueJoe Perches2015-04-151-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: Joe Perches <joe@perches.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'perf-core-for-linus' of ↵Linus Torvalds2015-04-1415-231/+4606
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf changes from Ingo Molnar: "Core kernel changes: - One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively. (Right now it's limited to root-only, but in the future we might allow unprivileged use as well.) (Alexei Starovoitov) - Another non-trivial feature is per event clockid support: this allows, amongst other things, the selection of different clock sources for event timestamps traced via perf. This feature is sought by people who'd like to merge perf generated events with external events that were measured with different clocks: - cluster wide profiling - for system wide tracing with user-space events, - JIT profiling events etc. Matching perf tooling support is added as well, available via the -k, --clockid <clockid> parameter to perf record et al. (Peter Zijlstra) Hardware enablement kernel changes: - x86 Intel Processor Trace (PT) support: which is a hardware tracer on steroids, available on Broadwell CPUs. The hardware trace stream is directly output into the user-space ring-buffer, using the 'AUX' data format extension that was added to the perf core to support hardware constraints such as the necessity to have the tracing buffer physically contiguous. This patch-set was developed for two years and this is the result. A simple way to make use of this is to use BTS tracing, the PT driver emulates BTS output - available via the 'intel_bts' PMU. More explicit PT specific tooling support is in the works as well - will probably be ready by 4.2. (Alexander Shishkin, Peter Zijlstra) - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware feature of Intel Xeon CPUs that allows the measurement and allocation/partitioning of caches to individual workloads. These kernel changes expose the measurement side as a new PMU driver, which exposes various QoS related PMU events. (The partitioning change is work in progress and is planned to be merged as a cgroup extension.) (Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P Waskiewicz Jr) - x86 Intel Haswell LBR call stack support: this is a new Haswell feature that allows the hardware recording of call chains, plus tooling support. To activate this feature you have to enable it via the new 'lbr' call-graph recording option: perf record --call-graph lbr perf report or: perf top --call-graph lbr This hardware feature is a lot faster than stack walk or dwarf based unwinding, but has some limitations: - It reuses the current LBR facility, so LBR call stack and branch record can not be enabled at the same time. - It is only available for user-space callchains. (Yan, Zheng) - x86 Intel Broadwell CPU support and various event constraints and event table fixes for earlier models. (Andi Kleen) - x86 Intel HT CPUs event scheduling workarounds. This is a complex CPU bug affecting the SNB,IVB,HSW families that results in counter value corruption. The mitigation code is automatically enabled and is transparent. (Maria Dimakopoulou, Stephane Eranian) The perf tooling side had a ton of changes in this cycle as well, so I'm only able to list the user visible changes here, in addition to the tooling changes outlined above: User visible changes affecting all tools: - Improve support of compressed kernel modules (Jiri Olsa) - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo) - Bash completion for subcommands (Yunlong Song) - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa) - Support missing -f to override perf.data file ownership. (Yunlong Song) - Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo) User visible changes in individual tools: 'perf data': New tool for converting perf.data to other formats, initially for the CTF (Common Trace Format) from LTTng (Jiri Olsa, Sebastian Siewior) 'perf diff': Add --kallsyms option (David Ahern) 'perf list': Allow listing events with 'tracepoint' prefix (Yunlong Song) Sort the output of the command (Yunlong Song) 'perf kmem': Respect -i option (Jiri Olsa) Print big numbers using thousands' group (Namhyung Kim) Allow -v option (Namhyung Kim) Fix alignment of slab result table (Namhyung Kim) 'perf probe': Support multiple probes on different binaries on the same command line (Masami Hiramatsu) Support unnamed union/structure members data collection. (Masami Hiramatsu) Check kprobes blacklist when adding new events. (Masami Hiramatsu) 'perf record': Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra) Support recording running/enabled time (Andi Kleen) 'perf sched': Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song) 'perf report' and 'perf top': Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo) Indicate which callchain entries are annotated in the TUI hists browser (Arnaldo Carvalho de Melo) Add pid/tid filtering to 'report' and 'script' commands (David Ahern) Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT events (Arnaldo Carvalho de Melo) 'perf stat': Report unsupported events properly (Suzuki K. Poulose) Output running time and run/enabled ratio in CSV mode (Andi Kleen) 'perf trace': Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo) Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo) Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo) Dump stack on segfaults (Arnaldo Carvalho de Melo) No need to explicitely enable evsels for workload started from perf, let it be enabled via perf_event_attr.enable_on_exec, removing some events that take place in the 'perf trace' before a workload is really started by it. (Arnaldo Carvalho de Melo) Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo) There's also been a ton of infrastructure work done, such as the split-out of perf's build system into tools/build/ and other changes - see the shortlog and changelog for details" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits) perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init() perf evlist: Fix type for references to data_head/tail perf probe: Check the orphaned -x option perf probe: Support multiple probes on different binaries perf buildid-list: Fix segfault when show DSOs with hits perf tools: Fix cross-endian analysis perf tools: Fix error path to do closedir() when synthesizing threads perf tools: Fix synthesizing fork_event.ppid for non-main thread perf tools: Add 'I' event modifier for exclude_idle bit perf report: Don't call map__kmap if map is NULL. perf tests: Fix attr tests perf probe: Fix ARM 32 building error perf tools: Merge all perf_event_attr print functions perf record: Add clockid parameter perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10 perf sched replay: Support using -f to override perf.data file ownership perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task perf sched replay: Fix the segmentation fault problem caused by pr_err in threads perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations ...
| * perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()Ingo Molnar2015-04-121-23/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dan Carpenter pointed out that the control flow in pt_pmu_hw_init() is a bit messy: for example the kfree(de_attrs) is entirely superfluous. Another problem is the inconsistent mixing of label based and direct return error handling. Add modern, label based error handling instead and clean up the code a bit as well. Note that we'll still do a kfree(NULL) in the normal case - this does not matter as this is an init path and kfree() returns early if it sees a NULL. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20150409090805.GG17605@mwanda Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel/pt: Fix the 32-bit buildIngo Molnar2015-04-022-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a 32-bit build I got: arch/x86/kernel/cpu/perf_event_intel_pt.c:413:5: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] arch/x86/kernel/cpu/perf_event_intel_bts.c:162:24: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] Fix it. The code should probably be (re-)tested on 32-bit systems to make sure all is fine. Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Avoid rewriting DEBUGCTL with the same value for LBRsAndi Kleen2015-04-021-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | perf with LBRs on has a tendency to rewrite the DEBUGCTL MSR with the same value. Add a little optimization to skip the unnecessary write. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1426871484-21285-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Streamline LBR MSR handling in PMIAndi Kleen2015-04-023-10/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The perf PMI currently does unnecessary MSR accesses when LBRs are enabled. We use LBR freezing, or when in callstack mode force the LBRs to only filter on ring 3. So there is no need to disable the LBRs explicitely in the PMI handler. Also we always unnecessarily rewrite LBR_SELECT in the LBR handler, even though it can never change. 5) | /* write_msr: MSR_LBR_SELECT(1c8), value 0 */ 5) | /* read_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */ 5) | /* write_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */ 5) | /* write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f */ 5) | /* write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0 */ 5) | /* write_msr: MSR_LBR_SELECT(1c8), value 0 */ 5) | /* read_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */ 5) | /* write_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */ This patch: - Avoids disabling already frozen LBRs unnecessarily in the PMI - Avoids changing LBR_SELECT in the PMI Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1426871484-21285-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86: Only dump PEBS register when PEBS has been detectedAndi Kleen2015-04-021-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | Technically PEBS_ENABLED is only guaranteed to exist when we detected PEBS. So add a check for this to the PMU dump function. I don't think it can happen on a real CPU, but could in a VM. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1425059312-18217-4-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86: Dump DEBUGCTL in PMU dumpAndi Kleen2015-04-021-1/+5
| | | | | | | | | | | | | | | | | | | | | | LBRs and LBR freezing are controlled through the DEBUGCTL MSR. So dump the state of DEBUGCTL too when dumping the PMU state. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1425059312-18217-3-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Reset more state in PMU resetAndi Kleen2015-04-021-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PMU reset code didn't quite keep up with newer PMU features. Improve it a bit to really reset a modern PMU: - Clear all overflow status - Clear LBRs and freezing state - Disable fixed counters too Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1425059312-18217-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Make the HT bug workaround conditional on HT enabledStephane Eranian2015-04-022-21/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch disables the PMU HT bug when Hyperthreading (HT) is disabled. We cannot do this test immediately when perf_events is initialized. We need to wait until the topology information is setup properly. As such, we register a later initcall, check the topology and potentially disable the workaround. To do this, we need to ensure there is no user of the PMU. At this point of the boot, the only user is the NMI watchdog, thus we disable it during the switch and re-enable it right after. Having the workaround disabled when it is not needed provides some benefits by limiting the overhead is time and space. The workaround still ensures correct scheduling of the corrupting memory events (0xd0, 0xd1, 0xd2) when HT is off. Those events can only be measured on counters 0-3. Something else the current kernel did not handle correctly. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: maria.n.dimakopoulou@gmail.com Link: http://lkml.kernel.org/r/1416251225-17721-13-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Limit to half counters when the HT workaround is enabled, to ↵Stephane Eranian2015-04-022-2/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | avoid exclusive mode starvation This patch limits the number of counters available to each CPU when the HT bug workaround is enabled. This is necessary to avoid situation of counter starvation. Such can arise from configuration where one HT thread, HT0, is using all 4 counters with corrupting events which require exclusion the the sibling HT, HT1. In such case, HT1 would not be able to schedule any event until HT0 is done. To mitigate this problem, this patch artificially limits the number of counters to 2. That way, we can gurantee that at least 2 counters are not in exclusive mode and therefore allow the sibling thread to schedule events of the same type (system vs. per-thread). The 2 counters are not determined in advance. We simply set the limit to two events per HT. This helps mitigate starvation in case of events with specific counter constraints such a PREC_DIST. Note that this does not elimintate the starvation is all cases. But it is better than not having it. (Solution suggested by Peter Zjilstra.) Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: maria.n.dimakopoulou@gmail.com Link: http://lkml.kernel.org/r/1416251225-17721-11-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Fix intel_get_event_constraints() for dynamic constraintsStephane Eranian2015-04-021-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | With dynamic constraint, we need to restart from the static constraints each time the intel_get_event_constraints() is called. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-10-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Enforce HT bug workaround with PEBS for SNB/IVB/HSWMaria Dimakopoulou2015-04-022-11/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch modifies the PEBS constraint tables for SNB/IVB/HSW such that corrupting events supporting PEBS activate the HT workaround. Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Stephane Eranian <eranian@google.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-9-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Enforce HT bug workaround for SNB/IVB/HSWMaria Dimakopoulou2015-04-021-9/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patches activates the HT bug workaround for the SNB/IVB/HSW processors. This covers non-PEBS mode. Activation is done thru the constraint tables. Both client and server processors needs this workaround. Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Stephane Eranian <eranian@google.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-8-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Implement cross-HT corruption bug workaroundMaria Dimakopoulou2015-04-023-13/+331
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements a software workaround for a HW erratum on Intel SandyBridge, IvyBridge and Haswell processors with Hyperthreading enabled. The errata are documented for each processor in their respective specification update documents: - SandyBridge: BJ122 - IvyBridge: BV98 - Haswell: HSD29 The bug causes silent counter corruption across hyperthreads only when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3). Counters measuring those events may leak counts to the sibling counter. For instance, counter 0, thread 0 measuring event 0xd0, may leak to counter 0, thread 1, regardless of the event measured there. The size of the leak is not predictible. It all depends on the workload and the state of each sibling hyper-thread. The corrupting events do undercount as a consequence of the leak. The leak is compensated automatically only when the sibling counter measures the exact same corrupting event AND the workload is on the two threads is the same. Given, there is no way to guarantee this, a work-around is necessary. Furthermore, there is a serious problem if the leaked count is added to a low-occurrence event. In that case the corruption on the low occurrence event can be very large, e.g., orders of magnitude. There is no HW or FW workaround for this problem. The bug is very easy to reproduce on a loaded system. Here is an example on a Haswell client, where CPU0, CPU4 are siblings. We load the CPUs with a simple triad app streaming large floating-point vector. We use 0x81d0 corrupting event (MEM_UOPS_RETIRED:ALL_LOADS) and 0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given we are not using the LBR, the 0x20cc event should be zero. $ taskset -c 0 triad & $ taskset -c 4 triad & $ perf stat -a -C 0 -e r81d0 sleep 100 & $ perf stat -a -C 4 -r20cc sleep 10 Performance counter stats for 'system wide': 139 277 291 r20cc 10,000969126 seconds time elapsed In this example, 0x81d0 and r20cc ar eusing sinling counters on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and corrupts it from 0 to 139 millions occurrences. This patch provides a software workaround to this problem by modifying the way events are scheduled onto counters by the kernel. The patch forces cross-thread mutual exclusion between counters in case a corrupting event is measured by one of the hyper-threads. If thread 0, counter 0 is measuring event 0xd0, then nothing can be measured on counter 0, thread 1. If no corrupting event is measured on any hyper-thread, event scheduling proceeds as before. The same example run with the workaround enabled, yield the correct answer: $ taskset -c 0 triad & $ taskset -c 4 triad & $ perf stat -a -C 0 -e r81d0 sleep 100 & $ perf stat -a -C 4 -r20cc sleep 10 Performance counter stats for 'system wide': 0 r20cc 10,000969126 seconds time elapsed The patch does provide correctness for all non-corrupting events. It does not "repatriate" the leaked counts back to the leaking counter. This is planned for a second patch series. This patch series makes this repatriation more easy by guaranteeing the sibling counter is not measuring any useful event. The patch introduces dynamic constraints for events. That means that events which did not have constraints, i.e., could be measured on any counters, may now be constrained to a subset of the counters depending on what is going on the sibling thread. The algorithm is similar to a cache coherency protocol. We call it XSU in reference to Exclusive, Shared, Unused, the 3 possible states of a PMU counter. As a consequence of the workaround, users may see an increased amount of event multiplexing, even in situtations where there are fewer events than counters measured on a CPU. Patch has been tested on all three impacted processors. Note that when HT is off, there is no corruption. However, the workaround is still enabled, yet not costing too much. Adding a dynamic detection of HT on turned out to be complex are requiring too much to code to be justified. This patch addresses the issue when PEBS is not used. A subsequent patch fixes the problem when PEBS is used. Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> [spinlock_t -> raw_spinlock_t] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Stephane Eranian <eranian@google.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-7-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel: Add cross-HT counter exclusion infrastructureMaria Dimakopoulou2015-04-022-5/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new shared_regs style structure to the per-cpu x86 state (cpuc). It is used to coordinate access between counters which must be used with exclusion across HyperThreads on Intel processors. This new struct is not needed on each PMU, thus is is allocated on demand. Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> [peterz: spinlock_t -> raw_spinlock_t] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Stephane Eranian <eranian@google.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-6-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86: Add 'index' param to get_event_constraint() callbackStephane Eranian2015-04-024-10/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds an index parameter to the get_event_constraint() x86_pmu callback. It is expected to represent the index of the event in the cpuc->event_list[] array. When the callback is used for fake_cpuc (evnet validation), then the index must be -1. The motivation for passing the index is to use it to index into another cpuc array. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: maria.n.dimakopoulou@gmail.com Link: http://lkml.kernel.org/r/1416251225-17721-5-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86: Add 3 new scheduling callbacksMaria Dimakopoulou2015-04-022-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds 3 new PMU model specific callbacks during the event scheduling done by x86_schedule_events(). ->start_scheduling(): invoked when entering the schedule routine. ->stop_scheduling(): invoked at the end of the schedule routine ->commit_scheduling(): invoked for each committed event To be used optionally by model-specific code. Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Stephane Eranian <eranian@google.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-4-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86: Vectorize cpuc->kfree_on_onlineStephane Eranian2015-04-024-6/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the cpuc->kfree_on_online a vector to accommodate more than one entry and add the second entry to be used by a later patch. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86: Rename x86_pmu::er_flags to 'flags'Stephane Eranian2015-04-022-15/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Because it will be used for more than just tracking the presence of extra registers. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: maria.n.dimakopoulou@gmail.com Link: http://lkml.kernel.org/r/1416251225-17721-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * Merge branch 'perf/urgent' into perf/core, before applying dependent patchesIngo Molnar2015-04-021-5/+5
| |\ | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | perf/x86/intel/bts: Add BTS PMU driverAlexander Shishkin2015-04-025-3/+540
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for Branch Trace Store (BTS) via kernel perf event infrastructure. The difference with the existing implementation of BTS support is that this one is a separate PMU that exports events' trace buffers to userspace by means of AUX area of the perf buffer, which is zero-copy mapped into userspace. The immediate benefit is that the buffer size can be much bigger, resulting in fewer interrupts and no kernel side copying is involved and little to no trace data loss. Also, kernel code can be traced with this driver. The old way of collecting BTS traces still works. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1422614435-114702-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | perf/x86/intel/pt: Add Intel PT PMU driverAlexander Shishkin2015-04-025-0/+1238
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for Intel Processor Trace (PT) to kernel's perf events. PT is an extension of Intel Architecture that collects information about software execuction such as control flow, execution modes and timings and formats it into highly compressed binary packets. Even being compressed, these packets are generated at hundreds of megabytes per second per core, which makes it impractical to decode them on the fly in the kernel. This driver exports trace data by through AUX space in the perf ring buffer, which is zero-copy mapped into userspace for faster data retrieval. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1422614392-114498-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | perf/x86: Mark Intel PT and LBR/BTS as mutually exclusiveAlexander Shishkin2015-04-023-0/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Intel PT cannot be used at the same time as LBR or BTS and will cause a general protection fault if they are used together. In order to avoid fixing up GPs in the fast path, instead we disallow creating LBR/BTS events when PT events are present and vice versa. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-12-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86: Add Intel Processor Trace (INTEL_PT) cpu feature detectionAlexander Shishkin2015-04-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Intel Processor Trace is an architecture extension that allows for program flow tracing. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-11-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | perf: Add per event clockid supportPeter Zijlstra2015-03-271-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While thinking on the whole clock discussion it occurred to me we have two distinct uses of time: 1) the tracking of event/ctx/cgroup enabled/running/stopped times which includes the self-monitoring support in struct perf_event_mmap_page. 2) the actual timestamps visible in the data records. And we've been conflating them. The first is all about tracking time deltas, nobody should really care in what time base that happens, its all relative information, as long as its internally consistent it works. The second however is what people are worried about when having to merge their data with external sources. And here we have the discussion on MONOTONIC vs MONOTONIC_RAW etc.. Where MONOTONIC is good for correlating between machines (static offset), MONOTNIC_RAW is required for correlating against a fixed rate hardware clock. This means configurability; now 1) makes that hard because it needs to be internally consistent across groups of unrelated events; which is why we had to have a global perf_clock(). However, for 2) it doesn't really matter, perf itself doesn't care what it writes into the buffer. The below patch makes the distinction between these two cases by adding perf_event_clock() which is used for the second case. It further makes this configurable on a per-event basis, but adds a few sanity checks such that we cannot combine events with different clocks in confusing ways. And since we then have per-event configurability we might as well retain the 'legacy' behaviour as a default. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | perf/x86: Remove redundant calls to perf_pmu_{dis|en}able()David Ahern2015-03-271-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | perf_pmu_disable() is called before pmu->add() and perf_pmu_enable() is called afterwards. No need to call these inside of x86_pmu_add() as well. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1424281543-67335-1-git-send-email-dsahern@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | Merge branch 'perf/x86' into perf/core, because it's readyIngo Molnar2015-03-273-1/+1419
| |\ \ | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | perf: Remove type specific target pointersPeter Zijlstra2015-03-231-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only reason CQM had to use a hard-coded pmu type was so it could use cqm_target in hw_perf_event. Do away with the {tp,bp,cqm}_target pointers and provide a non type specific one. This allows us to do away with that silly pmu type as well. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Vince Weaver <vince@deater.net> Cc: acme@kernel.org Cc: acme@redhat.com Cc: hpa@zytor.com Cc: jolsa@redhat.com Cc: kanaka.d.juvva@intel.com Cc: matt.fleming@intel.com Cc: tglx@linutronix.de Cc: torvalds@linux-foundation.org Cc: vikas.shivappa@linux.intel.com Link: http://lkml.kernel.org/r/20150305211019.GU21418@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | perf/x86/intel: Fix Makefile to actually build the cqm driverMatt Fleming2015-03-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Someone fat fingered a merge conflict and lost the Makefile hunk. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <acme@redhat.com> Cc: <hpa@zytor.com> Cc: <jolsa@redhat.com> Cc: <kanaka.d.juvva@intel.com> Cc: <tglx@linutronix.de> Cc: <torvalds@linux-foundation.org> Cc: <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1424976420.15321.35.camel@mfleming-mobl1.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | perf/x86/intel: Enable conflicting event scheduling for CQMMatt Fleming2015-02-251-46/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can leverage the workqueue that we use for RMID rotation to support scheduling of conflicting monitoring events. Allowing events that monitor conflicting things is done at various other places in the perf subsystem, so there's precedent there. An example of two conflicting events would be monitoring a cgroup and simultaneously monitoring a task within that cgroup. This uses the cache_groups list as a queuing mechanism, where every event that reaches the front of the list gets the chance to be scheduled in, possibly descheduling any conflicting events that are running. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1422038748-21397-10-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | perf/x86/intel: Perform rotation on Intel CQM RMIDsMatt Fleming2015-02-251-48/+623
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are many use cases where people will want to monitor more tasks than there exist RMIDs in the hardware, meaning that we have to perform some kind of multiplexing. We do this by "rotating" the RMIDs in a workqueue, and assigning an RMID to a waiting event when the RMID becomes unused. This scheme reserves one RMID at all times for rotation. When we need to schedule a new event we give it the reserved RMID, pick a victim event from the front of the global CQM list and wait for the victim's RMID to drop to zero occupancy, before it becomes the new reserved RMID. We put the victim's RMID onto the limbo list, where it resides for a "minimum queue time", which is intended to save ourselves an expensive smp IPI when the RMID is unlikely to have a occupancy value below __intel_cqm_threshold. If we fail to recycle an RMID, even after waiting the minimum queue time then we need to increment __intel_cqm_threshold. There is an upper bound on this threshold, __intel_cqm_max_threshold, which is programmable from userland as /sys/devices/intel_cqm/max_recycling_threshold. The comments above __intel_cqm_rmid_rotate() have more details. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1422038748-21397-9-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | perf/x86/intel: Support task events with Intel CQMMatt Fleming2015-02-251-21/+174
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for task events as well as system-wide events. This change has a big impact on the way that we gather LLC occupancy values in intel_cqm_event_read(). Currently, for system-wide (per-cpu) events we defer processing to userspace which knows how to discard all but one cpu result per package. Things aren't so simple for task events because we need to do the value aggregation ourselves. To do this, we defer updating the LLC occupancy value in event->count from intel_cqm_event_read() and do an SMP cross-call to read values for all packages in intel_cqm_event_count(). We need to ensure that we only do this for one task event per cache group, otherwise we'll report duplicate values. If we're a system-wide event we want to fallback to the default perf_event_count() implementation. Refactor this into a common function so that we don't duplicate the code. Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an event's task (if the event isn't per-cpu) inside of the Intel CQM PMU driver. This task information is only availble in the upper layers of the perf infrastructure. Other perf backends stash the target task in event->hw.*target so we need to do something similar. The task is used to determine whether events should share a cache group and an RMID. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Cc: linux-api@vger.kernel.org Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | perf/x86/intel: Implement LRU monitoring ID allocation for CQMMatt Fleming2015-02-251-8/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible to run into issues with re-using unused monitoring IDs because there may be stale cachelines associated with that ID from a previous allocation. This can cause the LLC occupancy values to be inaccurate. To attempt to mitigate this problem we place the IDs on a least recently used list, essentially a FIFO. The basic idea is that the longer the time period between ID re-use the lower the probability that stale cachelines exist in the cache. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1422038748-21397-7-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | perf/x86/intel: Add Intel Cache QoS Monitoring supportMatt Fleming2015-02-251-0/+530
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Future Intel Xeon processors support a Cache QoS Monitoring feature that allows tracking of the LLC occupancy for a task or task group, i.e. the amount of data in pulled into the LLC for the task (group). Currently the PMU only supports per-cpu events. We create an event for each cpu and read out all the LLC occupancy values. Because this results in duplicate values being written out to userspace, we also export a .per-pkg event file so that the perf tools only accumulate values for one cpu per package. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1422038748-21397-6-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | x86: Add support for Intel Cache QoS Monitoring (CQM) detectionPeter P Waskiewicz Jr2015-02-251-0/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for the new Cache QoS Monitoring (CQM) feature found in future Intel Xeon processors. It includes the new values to track CQM resources to the cpuinfo_x86 structure, plus the CPUID detection routines for CQM. CQM allows a process, or set of processes, to be tracked by the CPU to determine the cache usage of that task group. Using this data from the CPU, software can be written to extract this data and report cache usage and occupancy for a particular process, or group of processes. More information about Cache QoS Monitoring can be found in the Intel (R) x86 Architecture Software Developer Manual, section 17.14. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chris Webb <chris@arachsys.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jacob Shin <jacob.w.shin@gmail.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Honeyman <stevenhoneyman@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1422038748-21397-5-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | Merge branch 'perf/urgent' into perf/core, to pick up fixes and to refresh ↵Ingo Molnar2015-03-272-2/+8
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | the tree Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/intel: Add INST_RETIRED.ALL workaroundsAndi Kleen2015-03-273-0/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On Broadwell INST_RETIRED.ALL cannot be used with any period that doesn't have the lowest 6 bits cleared. And the period should not be smaller than 128. This is erratum BDM11 and BDM55: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/5th-gen-core-family-spec-update.pdf BDM11: When using a period < 100; we may get incorrect PEBS/PMI interrupts and/or an invalid counter state. BDM55: When bit0-5 of the period are !0 we may get redundant PEBS records on overflow. Add a new callback to enforce this, and set it for Broadwell. How does this handle the case when an app requests a specific period with some of the bottom bits set? Short answer: Any useful instruction sampling period needs to be 4-6 orders of magnitude larger than 128, as an PMI every 128 instructions would instantly overwhelm the system and be throttled. So the +-64 error from this is really small compared to the period, much smaller than normal system jitter. Long answer (by Peterz): IFF we guarantee perf_event_attr::sample_period >= 128. Suppose we start out with sample_period=192; then we'll set period_left to 192, we'll end up with left = 128 (we truncate the lower bits). We get an interrupt, find that period_left = 64 (>0 so we return 0 and don't get an overflow handler), up that to 128. Then we trigger again, at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get an overflow). We increment with sample_period so we get left = 128. We fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an overflow). And on and on. So while the individual interrupts are 'wrong' we get then with interval=256,128 in exactly the right ratio to average out at 192. And this works for everything >=128. So the num_samples*fixed_period thing is still entirely correct +- 127, which is good enough I'd say, as you already have that error anyhow. So no need to 'fix' the tools, al we need to do is refuse to create INST_RETIRED:ALL events with sample_period < 128. Signed-off-by: Andi Kleen <ak@linux.intel.com> [ Updated comments and changelog a bit. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1424225886-18652-3-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/intel: Add Broadwell core supportAndi Kleen2015-03-271-0/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add Broadwell support for Broadwell to perf. The basic support is very similar to Haswell. We use the new cache event list added for Haswell earlier. The only differences are a few bits related to remote nodes. To avoid an extra, mostly identical, table these are patched up in the initialization code. The constraint list has one new event that needs to be handled over Haswell. Includes code and testing from Kan Liang. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1424225886-18652-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/intel: Add new cache events table for HaswellAndi Kleen2015-03-271-2/+192
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Haswell offcore events are quite different from Sandy Bridge. Add a new table to handle Haswell properly. Note that the offcore bits listed in the SDM are not quite correct (this is currently being fixed). An uptodate list of bits is in the patch. The basic setup is similar to Sandy Bridge. The prefetch columns have been removed, as prefetch counting is not very reliable on Haswell. One L1 event that is not in the event list anymore has been also removed. - data reads do not include code reads (comparable to earlier Sandy Bridge tables) - data counts include speculative execution (except L1 write, dtlb, bpu) - remote node access includes both remote memory, remote cache, remote mmio. - prefetches are not included in the counts for consistency (different from Sandy Bridge, which includes prefetches in the remote node) Signed-off-by: Andi Kleen <ak@linux.intel.com> [ Removed the HSM30 comments; we don't have them for SNB/IVB either. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1424225886-18652-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | Merge tag 'v4.0-rc1' into perf/core, to refresh the treeIngo Molnar2015-02-267-41/+47
| |\ \ \ \ | | |_|/ / | |/| | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/intel: Expose LBR callstack to user space toolingPeter Zijlstra2015-02-182-12/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With LBR call stack feature enable, there are three callchain options. Enable the 3rd callchain option (LBR callstack) to user space tooling. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-api@vger.kernel.org Link: http://lkml.kernel.org/r/20141105093759.GQ10501@worktop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/intel: Discard zero length call entries in LBR call stackYan, Zheng2015-02-181-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "Zero length call" uses the attribute of the call instruction to push the immediate instruction pointer on to the stack and then pops off that address into a register. This is accomplished without any matching return instruction. It confuses the hardware and make the recorded call stack incorrect. We can partially resolve this issue by: decode call instructions and discard any zero length call entry in the LBR stack. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-16-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>