Merge tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/…

…pub/scm/linux/kernel/git/acme/linux Pull perf tools updates from Arnaldo Carvalho de Melo: "Miscellaneous: - Add Ian Rogers to MAINTAINERS as a perf tools reviewer. - Add support for retire latency feature (pipeline stall of a instruction compared to the previous one, in cycles) present on some Intel processors. - Add 'perf c2c' report option to show false sharing with adjacent cachelines, to be used in machines with cacheline prefetching, where accesses to a cacheline brings the next one too. - Skip 'perf test bpf' when the required kernel-debuginfo package isn't installed. - Avoid d3-flame-graph package dependency in 'perf script flamegraph', making this feature more generally available. - Add JSON metric events to present CPI stall cycles in Power10. - Assorted improvements/refactorings on the JSON metrics parsing code. perf lock contention: - Add -o/--lock-owner option: $ sudo ./perf lock contention -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome The owner is unknown in most cases. Filtering only for the mutex locks, it will more likely get the owners. - -S/--callstack-filter is to limit display entries having the given string in the callstack: $ sudo ./perf lock contention -abv -S net sleep 1 ... contended total wait max wait avg wait type caller 5 70.20 us 16.13 us 14.04 us spinlock __dev_queue_xmit+0xb6d 0xffffffffa5dd1c60 _raw_spin_lock+0x30 0xffffffffa5b8f6ed __dev_queue_xmit+0xb6d 0xffffffffa5cd8267 ip6_finish_output2+0x2c7 0xffffffffa5cdac14 ip6_finish_output+0x1d4 0xffffffffa5cdb477 ip6_xmit+0x457 0xffffffffa5d1fd17 inet6_csk_xmit+0xd7 0xffffffffa5c5f4aa __tcp_transmit_skb+0x54a 0xffffffffa5c6467d tcp_keepalive_timer+0x2fd Please note that to have the -b option (BPF) working above one has to build with BUILD_BPF_SKEL=1. - Add more 'perf test' entries to test these new features. perf script: - Add 'cgroup' field for 'perf script' output: $ perf record --all-cgroups -- true $ perf script -F comm,pid,cgroup true 337112 /user.slice/user-657345.slice/[email protected]/... true 337112 /user.slice/user-657345.slice/[email protected]/... true 337112 /user.slice/user-657345.slice/[email protected]/... true 337112 /user.slice/user-657345.slice/[email protected]/... - Add support for showing branch speculation information in 'perf script' and in the 'perf report' raw dump (-D). perf record: - Fix 'perf record' segfault with --overwrite and --max-size. perf test/bench: - Switch basic BPF filtering test to use syscall tracepoint to avoid the variable number of probes inserted when using the previous probe point (do_epoll_wait) that happens on different CPU architectures. - Fix DWARF unwind test by adding non-inline to expected function in a backtrace. - Use 'grep -c' where the longer form 'grep | wc -l' was being used. - Add getpid and execve benchmarks to 'perf bench syscall'. Intel PT: - Add support for synthesizing "cycle" events from Intel PT traces as we support "instruction" events when Intel PT CYC packets are available. This enables much more accurate profiles than when using the regular 'perf record -e cycles' (the default) when the workload lasts for very short periods (<10ms). - .plt symbol handling improvements, better handling IBT (in the past MPX) done in the context of decoding Intel PT processor traces, IFUNC symbols on x86_64, static executables, understanding .plt.got symbols on x86_64. - Add a 'perf test' to test symbol resolution, part of the .plt improvements series, this tests things like symbol size in contexts where only the symbol start is available (kallsyms), etc. - Better handle auxtrace/Intel PT data when using pipe mode (perf record sleep 1|perf report). - Fix symbol lookup with kcore with multiple segments match stext, getting the symbol resolution to just show DSOs as unknown. ARM: - Timestamp improvements for ARM64 systems with ETMv4 (Embedded Trace Macrocell v4). - Ensure ARM64 CoreSight timestamps don't go backwards. - Document that ARM64 SPE (Statistical Profiling Extension) is used with 'perf c2c/mem'. - Add raw decoding for ARM64 SPEv1.2 previous branch address. - Update neoverse-n2-v2 ARM vendor events (JSON tables): topdown L1, TLB, cache, branch, PE utilization and instruction mix metrics. - Update decoder code for OpenCSD version 1.4, on ARM64 systems. - Fix command line auto-complete of CPU events on aarch64. Build: - Fix 'perf probe' and 'perf test' when libtraceevent isn't linked, as several tests use tracepoints, those should be skipped. - More fallout fixes for the removal of tools/lib/traceevent/. - Fix build error when linking with libpfm" * tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (114 commits) perf tests stat_all_metrics: Change true workload to sleep workload for system wide check perf vendor events power10: Add JSON metric events to present CPI stall cycles in powerpc perf intel-pt: Synthesize cycle events perf c2c: Add report option to show false sharing in adjacent cachelines perf record: Fix segfault with --overwrite and --max-size perf stat: Avoid merging/aggregating metric counts twice perf tools: Fix perf tool build error in util/pfm.c perf tools: Fix auto-complete on aarch64 perf lock contention: Support old rw_semaphore type perf lock contention: Add -o/--lock-owner option perf lock contention: Fix to save callstack for the default modified perf test bpf: Skip test if kernel-debuginfo is not present perf probe: Update the exit error codes in function try_to_find_probe_trace_event perf script: Fix missing Retire Latency fields option documentation perf event x86: Add retire_lat when synthesizing PERF_SAMPLE_WEIGHT_STRUCT perf test x86: Support the retire_lat (Retire Latency) sample_type check perf test bpf: Check for libtraceevent support perf script: Support Retire Latency perf report: Support Retire Latency perf lock contention: Support filters for different aggregation ...
kawasaki · Feb 23, 2023 · 0df8218 · 0df8218
2 parents b72b5fe + f9fa077
commit 0df8218
Show file tree

Hide file tree

Showing 129 changed files with 3,210 additions and 1,092 deletions.
diff --git a/MAINTAINERS b/MAINTAINERS
@@ -16323,6 +16323,7 @@ R:	Mark Rutland <[email protected]>
 R:	Alexander Shishkin <[email protected]>
 R:	Jiri Olsa <[email protected]>
 R:	Namhyung Kim <[email protected]>
+R:	Ian Rogers <[email protected]>
 L:	[email protected]
 L:	[email protected]
 S:	Supported

diff --git a/tools/arch/x86/include/uapi/asm/unistd_32.h b/tools/arch/x86/include/uapi/asm/unistd_32.h
@@ -1,16 +1,25 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __NR_perf_event_open
-# define __NR_perf_event_open 336
+#ifndef __NR_execve
+#define __NR_execve 11
 #endif
-#ifndef __NR_futex
-# define __NR_futex 240
+#ifndef __NR_getppid
+#define __NR_getppid 64
+#endif
+#ifndef __NR_getpgid
+#define __NR_getpgid 132
 #endif
 #ifndef __NR_gettid
-# define __NR_gettid 224
+#define __NR_gettid 224
+#endif
+#ifndef __NR_futex
+#define __NR_futex 240
 #endif
 #ifndef __NR_getcpu
-# define __NR_getcpu 318
+#define __NR_getcpu 318
+#endif
+#ifndef __NR_perf_event_open
+#define __NR_perf_event_open 336
 #endif
 #ifndef __NR_setns
-# define __NR_setns 346
+#define __NR_setns 346
 #endif
diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h
@@ -1,16 +1,25 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __NR_perf_event_open
-# define __NR_perf_event_open 298
+#ifndef __NR_execve
+#define __NR_execve 59
 #endif
-#ifndef __NR_futex
-# define __NR_futex 202
+#ifndef __NR_getppid
+#define __NR_getppid 110
+#endif
+#ifndef __NR_getpgid
+#define __NR_getpgid 121
 #endif
 #ifndef __NR_gettid
-# define __NR_gettid 186
+#define __NR_gettid 186
 #endif
-#ifndef __NR_getcpu
-# define __NR_getcpu 309
+#ifndef __NR_futex
+#define __NR_futex 202
+#endif
+#ifndef __NR_perf_event_open
+#define __NR_perf_event_open 298
 #endif
 #ifndef __NR_setns
 #define __NR_setns 308
 #endif
+#ifndef __NR_getcpu
+#define __NR_getcpu 309
+#endif
diff --git a/tools/build/Makefile.build b/tools/build/Makefile.build
@@ -53,6 +53,7 @@ build-file := $(dir)/Build
 
 quiet_cmd_flex  = FLEX    $@
 quiet_cmd_bison = BISON   $@
+quiet_cmd_test  = TEST    $@
 
 # Create directory unless it exists
 quiet_cmd_mkdir = MKDIR   $(dir $@)

diff --git a/tools/perf/.gitignore b/tools/perf/.gitignore
@@ -38,6 +38,7 @@ arch/*/include/generated/
 trace/beauty/generated/
 pmu-events/pmu-events.c
 pmu-events/jevents
+pmu-events/metric_test.log
 feature/
 libapi/
 libbpf/

diff --git a/tools/perf/Documentation/itrace.txt b/tools/perf/Documentation/itrace.txt
@@ -1,4 +1,5 @@
 		i	synthesize instructions events
+		y	synthesize cycles events
 		b	synthesize branches events (branch misses for Arm SPE)
 		c	synthesize branches events (calls only)
 		r	synthesize branches events (returns only)
@@ -25,7 +26,7 @@
 		A	approximate IPC
 		Z	prefer to ignore timestamps (so-called "timeless" decoding)
 
-	The default is all events i.e. the same as --itrace=ibxwpe,
+	The default is all events i.e. the same as --itrace=iybxwpe,
 	except for perf script where it is --itrace=ce
 
 	In addition, the period (default 100000, except for perf script where it is 1)

diff --git a/tools/perf/Documentation/perf-bench.txt b/tools/perf/Documentation/perf-bench.txt
@@ -18,7 +18,7 @@ COMMON OPTIONS
 --------------
 -r::
 --repeat=::
-Specify amount of times to repeat the run (default 10).
+Specify number of times to repeat the run (default 10).
 
 -f::
 --format=::

diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
@@ -22,7 +22,11 @@ you to track down the cacheline contentions.
 On Intel, the tool is based on load latency and precise store facility events
 provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
 with thresholding feature. On AMD, the tool uses IBS op pmu (due to hardware
-limitations, perf c2c is not supported on Zen3 cpus).
+limitations, perf c2c is not supported on Zen3 cpus). On Arm64 it uses SPE to
+sample load and store operations, therefore hardware and kernel support is
+required. See linkperf:perf-arm-spe[1] for a setup guide. Due to the
+statistical nature of Arm SPE sampling, not every memory operation will be
+sampled.
 
 These events provide:
   - memory address of the access
@@ -121,11 +125,17 @@ REPORT OPTIONS
 	perf c2c record --call-graph lbr.
 	Disabled by default. In common cases with call stack overflows,
 	it can recreate better call stacks than the default lbr call stack
-	output. But this approach is not full proof. There can be cases
+	output. But this approach is not foolproof. There can be cases
 	where it creates incorrect call stacks from incorrect matches.
 	The known limitations include exception handing such as
 	setjmp/longjmp will have calls/returns not match.
 
+--double-cl::
+	Group the detection of shared cacheline events into double cacheline
+	granularity. Some architectures have an Adjacent Cacheline Prefetch
+	feature, which causes cacheline sharing to behave like the cacheline
+	size is doubled.
+
 C2C RECORD
 ----------
 The perf c2c record command setup options related to HITM cacheline analysis
@@ -333,4 +343,4 @@ Check Joe's blog on c2c tool for detailed use case explanation:
 
 SEE ALSO
 --------
-linkperf:perf-record[1], linkperf:perf-mem[1]
+linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]
diff --git a/tools/perf/Documentation/perf-intel-pt.txt b/tools/perf/Documentation/perf-intel-pt.txt
@@ -101,12 +101,12 @@ data is available you can use the 'perf script' tool with all itrace sampling
 options, which will list all the samples.
 
 	perf record -e intel_pt//u ls
-	perf script --itrace=ibxwpe
+	perf script --itrace=iybxwpe
 
 An interesting field that is not printed by default is 'flags' which can be
 displayed as follows:
 
-	perf script --itrace=ibxwpe -F+flags
+	perf script --itrace=iybxwpe -F+flags
 
 The flags are "bcrosyiABExghDt" which stand for branch, call, return, conditional,
 system, asynchronous, interrupt, transaction abort, trace begin, trace end,
@@ -147,16 +147,17 @@ displayed as follows:
 There are two ways that instructions-per-cycle (IPC) can be calculated depending
 on the recording.
 
-If the 'cyc' config term (see config terms section below) was used, then IPC is
-calculated using the cycle count from CYC packets, otherwise MTC packets are
-used - refer to the 'mtc' config term.  When MTC is used, however, the values
-are less accurate because the timing is less accurate.
+If the 'cyc' config term (see config terms section below) was used, then IPC
+and cycle events are calculated using the cycle count from CYC packets, otherwise
+MTC packets are used - refer to the 'mtc' config term.  When MTC is used, however,
+the values are less accurate because the timing is less accurate.
 
 Because Intel PT does not update the cycle count on every branch or instruction,
 the values will often be zero.  When there are values, they will be the number
 of instructions and number of cycles since the last update, and thus represent
-the average IPC since the last IPC for that event type.  Note IPC for "branches"
-events is calculated separately from IPC for "instructions" events.
+the average IPC cycle count since the last IPC for that event type.
+Note IPC for "branches" events is calculated separately from IPC for "instructions"
+events.
 
 Even with the 'cyc' config term, it is possible to produce IPC information for
 every change of timestamp, but at the expense of accuracy.  That is selected by
@@ -900,11 +901,12 @@ Having no option is the same as
 
 which, in turn, is the same as
 
-	--itrace=cepwx
+	--itrace=cepwxy
 
 The letters are:
 
 	i	synthesize "instructions" events
+	y	synthesize "cycles" events
 	b	synthesize "branches" events
 	x	synthesize "transactions" events
 	w	synthesize "ptwrite" events
@@ -927,16 +929,26 @@ The letters are:
 "Instructions" events look like they were recorded by "perf record -e
 instructions".
 
+"Cycles" events look like they were recorded by "perf record -e cycles"
+(ie., the default). Note that even with CYC packets enabled and no sampling,
+these are not fully accurate, since CYC packets are not emitted for each
+instruction, only when some other event (like an indirect branch, or a
+TNT packet representing multiple branches) happens causes a packet to
+be emitted. Thus, it is more effective for attributing cycles to functions
+(and possibly basic blocks) than to individual instructions, although it
+is not even perfect for functions (although it becomes better if the noretcomp
+option is active).
+
 "Branches" events look like they were recorded by "perf record -e branches". "c"
 and "r" can be combined to get calls and returns.
 
 "Transactions" events correspond to the start or end of transactions. The
 'flags' field can be used in perf script to determine whether the event is a
 transaction start, commit or abort.
 
-Note that "instructions", "branches" and "transactions" events depend on code
-flow packets which can be disabled by using the config term "branch=0".  Refer
-to the config terms section above.
+Note that "instructions", "cycles", "branches" and "transactions" events
+depend on code flow packets which can be disabled by using the config term
+"branch=0".  Refer to the config terms section above.
 
 "ptwrite" events record the payload of the ptwrite instruction and whether
 "fup_on_ptw" was used.  "ptwrite" events depend on PTWRITE packets which are
@@ -1821,6 +1833,36 @@ Can be compiled and traced:
  $
 
 
+Pipe mode
+---------
+Pipe mode is a problem for Intel PT and possibly other auxtrace users.
+It's not recommended to use a pipe as data output with Intel PT because
+of the following reason.
+
+Essentially the auxtrace buffers do not behave like the regular perf
+event buffers.  That is because the head and tail are updated by
+software, but in the auxtrace case the data is written by hardware.
+So the head and tail do not get updated as data is written.
+
+In the Intel PT case, the head and tail are updated only when the trace
+is disabled by software, for example:
+    - full-trace, system wide : when buffer passes watermark
+    - full-trace, not system-wide : when buffer passes watermark or
+                                    context switches
+    - snapshot mode : as above but also when a snapshot is made
+    - sample mode : as above but also when a sample is made
+
+That means finished-round ordering doesn't work.  An auxtrace buffer
+can turn up that has data that extends back in time, possibly to the
+very beginning of tracing.
+
+For a perf.data file, that problem is solved by going through the trace
+and queuing up the auxtrace buffers in advance.
+
+For pipe mode, the order of events and timestamps can presumably
+be messed up.
+
+
 EXAMPLE
 -------
 

diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
@@ -232,7 +232,7 @@ This can be overridden by setting the kernel.perf_event_paranoid
 sysctl to -1, which allows non root to use these events.
 
 For accessing trace point events perf needs to have read access to
-/sys/kernel/debug/tracing, even when perf_event_paranoid is in a relaxed
+/sys/kernel/tracing, even when perf_event_paranoid is in a relaxed
 setting.
 
 TRACING

diff --git a/tools/perf/Documentation/perf-lock.txt b/tools/perf/Documentation/perf-lock.txt
@@ -172,6 +172,11 @@ CONTENTION OPTIONS
 --lock-addr::
 	Show lock contention stat by address
 
+-o::
+--lock-owner::
+	Show lock contention stat by owners.  Implies --threads and
+	requires --use-bpf.
+
 -Y::
 --type-filter=<value>::
 	Show lock contention only for given lock types (comma separated list).
@@ -187,6 +192,12 @@ CONTENTION OPTIONS
 --lock-filter=<value>::
 	Show lock contention only for given lock addresses or names (comma separated list).
 
+-S::
+--callstack-filter=<value>::
+	Show lock contention only if the callstack contains the given string.
+	Note that it matches the substring so 'rq' would match both 'raw_spin_rq_lock'
+	and 'irq_enter_rcu'.
+
 
 SEE ALSO
 --------

diff --git a/tools/perf/Documentation/perf-mem.txt b/tools/perf/Documentation/perf-mem.txt
@@ -23,6 +23,11 @@ Note that on Intel systems the memory latency reported is the use-latency,
 not the pure load (or store latency). Use latency includes any pipeline
 queueing delays in addition to the memory subsystem latency.
 
+On Arm64 this uses SPE to sample load and store operations, therefore hardware
+and kernel support is required. See linkperf:perf-arm-spe[1] for a setup guide.
+Due to the statistical nature of SPE sampling, not every memory operation will
+be sampled.
+
 OPTIONS
 -------
 <command>...::
@@ -93,4 +98,4 @@ all perf record options.
 
 SEE ALSO
 --------
-linkperf:perf-record[1], linkperf:perf-report[1]
+linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-arm-spe[1]
diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt
@@ -222,7 +222,7 @@ probe syntax, 'SRC' means the source file path, 'ALN' is start line number,
 and 'ALN2' is end line number in the file. It is also possible to specify how
 many lines to show by using 'NUM'. Moreover, 'FUNC@SRC' combination is good
 for searching a specific function when several functions share same name.
-So, "source.c:100-120" shows lines between 100th to l20th in source.c file. And "func:10+20" shows 20 lines from 10th line of func function.
+So, "source.c:100-120" shows lines between 100th to 120th in source.c file. And "func:10+20" shows 20 lines from 10th line of func function.
 
 LAZY MATCHING
 -------------

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
@@ -115,6 +115,8 @@ OPTIONS
 	- p_stage_cyc: On powerpc, this presents the number of cycles spent in a
 	  pipeline stage. And currently supported only on powerpc.
 	- addr: (Full) virtual address of the sampled instruction
+	- retire_lat: On X86, this reports pipeline stall of this instruction compared
+	  to the previous instruction in cycles. And currently supported only on X86
 
 	By default, comm, dso and symbol keys are used.
 	(i.e. --sort comm,dso,symbol)
@@ -507,7 +509,7 @@ include::itrace.txt[]
 	perf record --call-graph lbr.
 	Disabled by default. In common cases with call stack overflows,
 	it can recreate better call stacks than the default lbr call stack
-	output. But this approach is not full proof. There can be cases
+	output. But this approach is not foolproof. There can be cases
 	where it creates incorrect call stacks from incorrect matches.
 	The known limitations include exception handing such as
 	setjmp/longjmp will have calls/returns not match.

diff --git a/tools/perf/Documentation/perf-script-perl.txt b/tools/perf/Documentation/perf-script-perl.txt
@@ -55,7 +55,7 @@ Traces meant to be processed using a script should be recorded with
 the above option: -a to enable system-wide collection.
 
 The format file for the sched_wakeup event defines the following fields
-(see /sys/kernel/debug/tracing/events/sched/sched_wakeup/format):
+(see /sys/kernel/tracing/events/sched/sched_wakeup/format):
 
 ----
  format:

diff --git a/tools/perf/Documentation/perf-script-python.txt b/tools/perf/Documentation/perf-script-python.txt
@@ -319,7 +319,7 @@ So those are the essential steps in writing and running a script.  The
 process can be generalized to any tracepoint or set of tracepoints
 you're interested in - basically find the tracepoint(s) you're
 interested in by looking at the list of available events shown by
-'perf list' and/or look in /sys/kernel/debug/tracing/events/ for
+'perf list' and/or look in /sys/kernel/tracing/events/ for
 detailed event and field info, record the corresponding trace data
 using 'perf record', passing it the list of interesting events,
 generate a skeleton script using 'perf script -g python' and modify the
@@ -449,7 +449,7 @@ Traces meant to be processed using a script should be recorded with
 the above option: -a to enable system-wide collection.
 
 The format file for the sched_wakeup event defines the following fields
-(see /sys/kernel/debug/tracing/events/sched/sched_wakeup/format):
+(see /sys/kernel/tracing/events/sched/sched_wakeup/format):
 
 ----
  format: