Merge tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tools updates from Arnaldo Carvalho de Melo:
 "Miscellaneous:

   - Add Ian Rogers to MAINTAINERS as a perf tools reviewer.

   - Add support for retire latency feature (pipeline stall of an
     instruction compared to the previous one, in cycles) present on
     some Intel processors.

   - Add 'perf c2c' report option to show false sharing with adjacent
     cachelines, to be used in machines with cacheline prefetching,
     where accesses to a cacheline bring the next one too.

   - Skip 'perf test bpf' when the required kernel-debuginfo package
     isn't installed.

   - Avoid d3-flame-graph package dependency in 'perf script flamegraph',
     making this feature more generally available.

   - Add JSON metric events to present CPI stall cycles in Power10.

   - Assorted improvements/refactorings on the JSON metrics parsing
     code.

  perf lock contention:

   - Add -o/--lock-owner option:

        $ sudo ./perf lock contention -abo -- ./perf bench sched pipe
        # Running 'sched/pipe' benchmark:
        # Executed 1000000 pipe operations between two processes

             Total time: 4.766 [sec]

               4.766540 usecs/op
                 209795 ops/sec
         contended   total wait     max wait     avg wait          pid   owner

               403    565.32 us     26.81 us      1.40 us           -1   Unknown
                 4     27.99 us      8.57 us      7.00 us      1583145   sched-pipe
                 1      8.25 us      8.25 us      8.25 us      1583144   sched-pipe
                 1      2.03 us      2.03 us      2.03 us         5068   chrome

         The owner is unknown in most cases. Filtering for mutex locks
         only makes it more likely that the owners are found.

   - Add -S/--callstack-filter option to limit the display to entries
     that have the given string in the callstack:

        $ sudo ./perf lock contention -abv -S net sleep 1
        ...
         contended   total wait     max wait     avg wait         type   caller

                 5     70.20 us     16.13 us     14.04 us     spinlock   __dev_queue_xmit+0xb6d
                                0xffffffffa5dd1c60  _raw_spin_lock+0x30
                                0xffffffffa5b8f6ed  __dev_queue_xmit+0xb6d
                                0xffffffffa5cd8267  ip6_finish_output2+0x2c7
                                0xffffffffa5cdac14  ip6_finish_output+0x1d4
                                0xffffffffa5cdb477  ip6_xmit+0x457
                                0xffffffffa5d1fd17  inet6_csk_xmit+0xd7
                                0xffffffffa5c5f4aa  __tcp_transmit_skb+0x54a
                                0xffffffffa5c6467d  tcp_keepalive_timer+0x2fd

     Please note that to have the -b option (BPF) working above one has
     to build with BUILD_BPF_SKEL=1.

   - Add more 'perf test' entries to test these new features.
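
The averages in the -o/--lock-owner table above can be cross-checked by hand: the avg wait column should be the total wait divided by the contended count. A minimal Python sketch, using the rows copied from that output (the `rows` list and the `avg_wait_us` helper are illustrative names, not part of perf):

```python
# Rows copied from the 'perf lock contention -abo' output above:
# (contended, total wait in us, reported avg wait in us, owner)
rows = [
    (403, 565.32, 1.40, "Unknown"),
    (4, 27.99, 7.00, "sched-pipe"),
    (1, 8.25, 8.25, "sched-pipe"),
    (1, 2.03, 2.03, "chrome"),
]


def avg_wait_us(contended, total_us):
    """Average wait per contention event, in microseconds."""
    return total_us / contended


for contended, total_us, reported, owner in rows:
    computed = avg_wait_us(contended, total_us)
    print(f"{owner:>12}: computed {computed:.2f} us, reported {reported:.2f} us")
```

The computed and reported values agree to the two decimals shown; small differences beyond that are expected because the table itself is already rounded.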

  perf script:

   - Add 'cgroup' field for 'perf script' output:

        $ perf record --all-cgroups -- true
        $ perf script -F comm,pid,cgroup
                  true 337112  /user.slice/user-657345.slice/[email protected]/...
                  true 337112  /user.slice/user-657345.slice/[email protected]/...
                  true 337112  /user.slice/user-657345.slice/[email protected]/...
                  true 337112  /user.slice/user-657345.slice/[email protected]/...

   - Add support for showing branch speculation information in 'perf
     script' and in the 'perf report' raw dump (-D).

  perf record:

   - Fix 'perf record' segfault with --overwrite and --max-size.

  perf test/bench:

   - Switch basic BPF filtering test to use syscall tracepoint to avoid
     the variable number of probes inserted when using the previous
     probe point (do_epoll_wait) that happens on different CPU
     architectures.

   - Fix DWARF unwind test by adding non-inline to expected function in
     a backtrace.

   - Use 'grep -c' where the longer form 'grep | wc -l' was being used.

   - Add getpid and execve benchmarks to 'perf bench syscall'.

  Intel PT:

   - Add support for synthesizing "cycle" events from Intel PT traces as
     we support "instruction" events when Intel PT CYC packets are
     available. This enables much more accurate profiles than when using
     the regular 'perf record -e cycles' (the default) when the workload
     lasts for very short periods (<10ms).

   - .plt symbol handling improvements, better handling IBT (in the past
     MPX) done in the context of decoding Intel PT processor traces,
     IFUNC symbols on x86_64, static executables, understanding .plt.got
     symbols on x86_64.

   - Add a 'perf test' to test symbol resolution, part of the .plt
     improvements series, this tests things like symbol size in contexts
     where only the symbol start is available (kallsyms), etc.

   - Better handle auxtrace/Intel PT data when using pipe mode (perf
     record sleep 1|perf report).

   - Fix symbol lookup with kcore when multiple segments match stext,
     which caused symbol resolution to just show DSOs as unknown.

  ARM:

   - Timestamp improvements for ARM64 systems with ETMv4 (Embedded Trace
     Macrocell v4).

   - Ensure ARM64 CoreSight timestamps don't go backwards.

   - Document that ARM64 SPE (Statistical Profiling Extension) is used
     with 'perf c2c/mem'.

   - Add raw decoding for ARM64 SPEv1.2 previous branch address.

   - Update neoverse-n2-v2 ARM vendor events (JSON tables): topdown L1,
     TLB, cache, branch, PE utilization and instruction mix metrics.

   - Update decoder code for OpenCSD version 1.4, on ARM64 systems.

   - Fix command line auto-complete of CPU events on aarch64.

  Build:

   - Fix 'perf probe' and 'perf test' when libtraceevent isn't linked,
     as several tests use tracepoints, those should be skipped.

   - More fallout fixes for the removal of tools/lib/traceevent/.

   - Fix build error when linking with libpfm"

* tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (114 commits)
  perf tests stat_all_metrics: Change true workload to sleep workload for system wide check
  perf vendor events power10: Add JSON metric events to present CPI stall cycles in powerpc
  perf intel-pt: Synthesize cycle events
  perf c2c: Add report option to show false sharing in adjacent cachelines
  perf record: Fix segfault with --overwrite and --max-size
  perf stat: Avoid merging/aggregating metric counts twice
  perf tools: Fix perf tool build error in util/pfm.c
  perf tools: Fix auto-complete on aarch64
  perf lock contention: Support old rw_semaphore type
  perf lock contention: Add -o/--lock-owner option
  perf lock contention: Fix to save callstack for the default modified
  perf test bpf: Skip test if kernel-debuginfo is not present
  perf probe: Update the exit error codes in function try_to_find_probe_trace_event
  perf script: Fix missing Retire Latency fields option documentation
  perf event x86: Add retire_lat when synthesizing PERF_SAMPLE_WEIGHT_STRUCT
  perf test x86: Support the retire_lat (Retire Latency) sample_type check
  perf test bpf: Check for libtraceevent support
  perf script: Support Retire Latency
  perf report: Support Retire Latency
  perf lock contention: Support filters for different aggregation
  ...
torvalds committed Feb 23, 2023
2 parents b72b5fe + f9fa077 commit 0df8218
Showing 129 changed files with 3,210 additions and 1,092 deletions.
1 change: 1 addition & 0 deletions MAINTAINERS
@@ -16323,6 +16323,7 @@ R: Mark Rutland <[email protected]>
R: Alexander Shishkin <[email protected]>
R: Jiri Olsa <[email protected]>
R: Namhyung Kim <[email protected]>
R: Ian Rogers <[email protected]>
L: [email protected]
L: [email protected]
S: Supported
23 changes: 16 additions & 7 deletions tools/arch/x86/include/uapi/asm/unistd_32.h
@@ -1,16 +1,25 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __NR_perf_event_open
# define __NR_perf_event_open 336
#ifndef __NR_execve
#define __NR_execve 11
#endif
#ifndef __NR_futex
# define __NR_futex 240
#ifndef __NR_getppid
#define __NR_getppid 64
#endif
#ifndef __NR_getpgid
#define __NR_getpgid 132
#endif
#ifndef __NR_gettid
# define __NR_gettid 224
#define __NR_gettid 224
#endif
#ifndef __NR_futex
#define __NR_futex 240
#endif
#ifndef __NR_getcpu
# define __NR_getcpu 318
#define __NR_getcpu 318
#endif
#ifndef __NR_perf_event_open
#define __NR_perf_event_open 336
#endif
#ifndef __NR_setns
# define __NR_setns 346
#define __NR_setns 346
#endif
23 changes: 16 additions & 7 deletions tools/arch/x86/include/uapi/asm/unistd_64.h
@@ -1,16 +1,25 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __NR_perf_event_open
# define __NR_perf_event_open 298
#ifndef __NR_execve
#define __NR_execve 59
#endif
#ifndef __NR_futex
# define __NR_futex 202
#ifndef __NR_getppid
#define __NR_getppid 110
#endif
#ifndef __NR_getpgid
#define __NR_getpgid 121
#endif
#ifndef __NR_gettid
# define __NR_gettid 186
#define __NR_gettid 186
#endif
#ifndef __NR_getcpu
# define __NR_getcpu 309
#ifndef __NR_futex
#define __NR_futex 202
#endif
#ifndef __NR_perf_event_open
#define __NR_perf_event_open 298
#endif
#ifndef __NR_setns
#define __NR_setns 308
#endif
#ifndef __NR_getcpu
#define __NR_getcpu 309
#endif
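
These fallback definitions supply the numbers that a raw syscall(2) invocation needs when the libc headers do not provide them. As a rough illustration (in Python via ctypes, not the C that perf itself uses), the sketch below calls getppid by number. It assumes a 64-bit process on Linux x86_64, where `__NR_getppid` is 110 as defined above; on any other platform the hypothetical `raw_getppid()` helper returns None rather than guessing a number.

```python
import ctypes
import os
import platform
import sys

NR_GETPPID_X86_64 = 110  # value from the unistd_64.h hunk above


def raw_getppid():
    """Call getppid through syscall(2) by number; None off Linux/x86_64."""
    is_linux_x86_64 = (
        sys.platform.startswith("linux")
        and platform.machine() == "x86_64"
        and sys.maxsize > 2**32  # 64-bit process, so the 64-bit table applies
    )
    if not is_linux_x86_64:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(NR_GETPPID_X86_64))


if __name__ == "__main__":
    print("raw:", raw_getppid(), "libc wrapper:", os.getppid())
```

On a matching system both values should be the same parent PID. The 32-bit header uses different numbers for the same calls (64 for getppid), which is why the two tables are kept separately.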
1 change: 1 addition & 0 deletions tools/build/Makefile.build
@@ -53,6 +53,7 @@ build-file := $(dir)/Build

quiet_cmd_flex = FLEX $@
quiet_cmd_bison = BISON $@
quiet_cmd_test = TEST $@

# Create directory unless it exists
quiet_cmd_mkdir = MKDIR $(dir $@)
1 change: 1 addition & 0 deletions tools/perf/.gitignore
@@ -38,6 +38,7 @@ arch/*/include/generated/
trace/beauty/generated/
pmu-events/pmu-events.c
pmu-events/jevents
pmu-events/metric_test.log
feature/
libapi/
libbpf/
3 changes: 2 additions & 1 deletion tools/perf/Documentation/itrace.txt
@@ -1,4 +1,5 @@
i synthesize instructions events
y synthesize cycles events
b synthesize branches events (branch misses for Arm SPE)
c synthesize branches events (calls only)
r synthesize branches events (returns only)
@@ -25,7 +26,7 @@
A approximate IPC
Z prefer to ignore timestamps (so-called "timeless" decoding)

The default is all events i.e. the same as --itrace=ibxwpe,
The default is all events i.e. the same as --itrace=iybxwpe,
except for perf script where it is --itrace=ce

In addition, the period (default 100000, except for perf script where it is 1)
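
The change to the default --itrace string in the hunk above can be read letter by letter. A small Python sketch, assuming only the letters visible in this excerpt (the `ITRACE_LETTERS` table and `describe_itrace` helper are illustrative, and the sketch ignores the period suffixes that real --itrace values may carry):

```python
# Letters taken from the itrace.txt hunk above; other letters exist
# but are not shown in this excerpt, so they are reported as such.
ITRACE_LETTERS = {
    "i": "instructions",
    "y": "cycles",
    "b": "branches",
    "c": "branches (calls only)",
    "r": "branches (returns only)",
}


def describe_itrace(opts):
    """Map each letter of an --itrace value to the event it synthesizes."""
    return [(ch, ITRACE_LETTERS.get(ch, "not shown in this excerpt")) for ch in opts]


old_default = "ibxwpe"
new_default = "iybxwpe"
added = [ch for ch in new_default if ch not in old_default]
print(added)
print(describe_itrace("iyb"))
```

The only letter added to the default is 'y', which is the new cycles event.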
2 changes: 1 addition & 1 deletion tools/perf/Documentation/perf-bench.txt
@@ -18,7 +18,7 @@ COMMON OPTIONS
--------------
-r::
--repeat=::
Specify amount of times to repeat the run (default 10).
Specify number of times to repeat the run (default 10).

-f::
--format=::
16 changes: 13 additions & 3 deletions tools/perf/Documentation/perf-c2c.txt
@@ -22,7 +22,11 @@ you to track down the cacheline contentions.
On Intel, the tool is based on load latency and precise store facility events
provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
with thresholding feature. On AMD, the tool uses IBS op pmu (due to hardware
limitations, perf c2c is not supported on Zen3 cpus).
limitations, perf c2c is not supported on Zen3 cpus). On Arm64 it uses SPE to
sample load and store operations, therefore hardware and kernel support is
required. See linkperf:perf-arm-spe[1] for a setup guide. Due to the
statistical nature of Arm SPE sampling, not every memory operation will be
sampled.

These events provide:
- memory address of the access
@@ -121,11 +125,17 @@ REPORT OPTIONS
perf c2c record --call-graph lbr.
Disabled by default. In common cases with call stack overflows,
it can recreate better call stacks than the default lbr call stack
output. But this approach is not full proof. There can be cases
output. But this approach is not foolproof. There can be cases
where it creates incorrect call stacks from incorrect matches.
The known limitations include exception handing such as
setjmp/longjmp will have calls/returns not match.

--double-cl::
Group the detection of shared cacheline events into double cacheline
granularity. Some architectures have an Adjacent Cacheline Prefetch
feature, which causes cacheline sharing to behave like the cacheline
size is doubled.

C2C RECORD
----------
The perf c2c record command setup options related to HITM cacheline analysis
@@ -333,4 +343,4 @@ Check Joe's blog on c2c tool for detailed use case explanation:

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]
linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]
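
The idea behind --double-cl described above can be sketched with address masking. This is not perf's implementation; it assumes 64-byte cachelines and that adjacent lines are paired on a 128-byte aligned boundary, and the `cacheline` helper and the two addresses are hypothetical.

```python
CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def cacheline(addr, double=False):
    """Base address of the cacheline (or aligned pair) containing addr."""
    size = CACHELINE_SIZE * 2 if double else CACHELINE_SIZE
    return addr & ~(size - 1)


a, b = 0x1000, 0x1040  # two hypothetical addresses in adjacent 64-byte lines
print(cacheline(a) == cacheline(b))              # distinct single cachelines
print(cacheline(a, True) == cacheline(b, True))  # same double cacheline
```

At single-cacheline granularity the two accesses look unrelated; at double-cacheline granularity they fall into the same group, which is the case the new option is meant to expose.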
66 changes: 54 additions & 12 deletions tools/perf/Documentation/perf-intel-pt.txt
@@ -101,12 +101,12 @@ data is available you can use the 'perf script' tool with all itrace sampling
options, which will list all the samples.

perf record -e intel_pt//u ls
perf script --itrace=ibxwpe
perf script --itrace=iybxwpe

An interesting field that is not printed by default is 'flags' which can be
displayed as follows:

perf script --itrace=ibxwpe -F+flags
perf script --itrace=iybxwpe -F+flags

The flags are "bcrosyiABExghDt" which stand for branch, call, return, conditional,
system, asynchronous, interrupt, transaction abort, trace begin, trace end,
@@ -147,16 +147,17 @@ displayed as follows:
There are two ways that instructions-per-cycle (IPC) can be calculated depending
on the recording.

If the 'cyc' config term (see config terms section below) was used, then IPC is
calculated using the cycle count from CYC packets, otherwise MTC packets are
used - refer to the 'mtc' config term. When MTC is used, however, the values
are less accurate because the timing is less accurate.
If the 'cyc' config term (see config terms section below) was used, then IPC
and cycle events are calculated using the cycle count from CYC packets, otherwise
MTC packets are used - refer to the 'mtc' config term. When MTC is used, however,
the values are less accurate because the timing is less accurate.

Because Intel PT does not update the cycle count on every branch or instruction,
the values will often be zero. When there are values, they will be the number
of instructions and number of cycles since the last update, and thus represent
the average IPC since the last IPC for that event type. Note IPC for "branches"
events is calculated separately from IPC for "instructions" events.
the average IPC cycle count since the last IPC for that event type.
Note IPC for "branches" events is calculated separately from IPC for "instructions"
events.
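
As a worked illustration of the paragraph above: a sample carries zero counts when no cycle update has arrived, and otherwise carries the instructions and cycles accumulated since the last update, so the ratio is an average over that interval. A minimal Python sketch with made-up numbers (the `ipc` helper and the `samples` list are hypothetical, not perf data structures):

```python
def ipc(insn_cnt, cyc_cnt):
    """Average IPC over the interval since the last cycle-count update.

    Returns None when no cycle update arrived (cycle count is zero).
    """
    if cyc_cnt == 0:
        return None
    return insn_cnt / cyc_cnt


# Hypothetical (instructions, cycles) pairs attached to successive samples.
samples = [(0, 0), (0, 0), (120, 80), (0, 0), (45, 90)]
print([ipc(i, c) for i, c in samples])
```

Only the third and fifth samples yield a value here; the others fall between cycle-count updates.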

Even with the 'cyc' config term, it is possible to produce IPC information for
every change of timestamp, but at the expense of accuracy. That is selected by
@@ -900,11 +901,12 @@ Having no option is the same as

which, in turn, is the same as

--itrace=cepwx
--itrace=cepwxy

The letters are:

i synthesize "instructions" events
y synthesize "cycles" events
b synthesize "branches" events
x synthesize "transactions" events
w synthesize "ptwrite" events
@@ -927,16 +929,26 @@ The letters are:
"Instructions" events look like they were recorded by "perf record -e
instructions".

"Cycles" events look like they were recorded by "perf record -e cycles"
(ie., the default). Note that even with CYC packets enabled and no sampling,
these are not fully accurate, since CYC packets are not emitted for each
instruction, only when some other event (like an indirect branch, or a
TNT packet representing multiple branches) happens causes a packet to
be emitted. Thus, it is more effective for attributing cycles to functions
(and possibly basic blocks) than to individual instructions, although it
is not even perfect for functions (although it becomes better if the noretcomp
option is active).

"Branches" events look like they were recorded by "perf record -e branches". "c"
and "r" can be combined to get calls and returns.

"Transactions" events correspond to the start or end of transactions. The
'flags' field can be used in perf script to determine whether the event is a
transaction start, commit or abort.

Note that "instructions", "branches" and "transactions" events depend on code
flow packets which can be disabled by using the config term "branch=0". Refer
to the config terms section above.
Note that "instructions", "cycles", "branches" and "transactions" events
depend on code flow packets which can be disabled by using the config term
"branch=0". Refer to the config terms section above.

"ptwrite" events record the payload of the ptwrite instruction and whether
"fup_on_ptw" was used. "ptwrite" events depend on PTWRITE packets which are
@@ -1821,6 +1833,36 @@ Can be compiled and traced:
$


Pipe mode
---------
Pipe mode is a problem for Intel PT and possibly other auxtrace users.
It's not recommended to use a pipe as data output with Intel PT because
of the following reason.

Essentially the auxtrace buffers do not behave like the regular perf
event buffers. That is because the head and tail are updated by
software, but in the auxtrace case the data is written by hardware.
So the head and tail do not get updated as data is written.

In the Intel PT case, the head and tail are updated only when the trace
is disabled by software, for example:
- full-trace, system wide : when buffer passes watermark
- full-trace, not system-wide : when buffer passes watermark or
context switches
- snapshot mode : as above but also when a snapshot is made
- sample mode : as above but also when a sample is made

That means finished-round ordering doesn't work. An auxtrace buffer
can turn up that has data that extends back in time, possibly to the
very beginning of tracing.

For a perf.data file, that problem is solved by going through the trace
and queuing up the auxtrace buffers in advance.

For pipe mode, the order of events and timestamps can presumably
be messed up.
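
The ordering problem described above can be shown with a toy model. This is a deliberately simplified sketch, not perf's ordered-events code: each inner list stands for one round of timestamps, and `flush_per_round` is a hypothetical consumer that trusts the finished-round assumption.

```python
def flush_per_round(rounds):
    """Emit each round's timestamps sorted, assuming no later round
    contains anything older (the finished-round assumption)."""
    out = []
    for batch in rounds:
        out.extend(sorted(batch))
    return out


def is_ordered(timestamps):
    """True if the timestamps never go backwards."""
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))


regular = [[1, 2, 3], [4, 5], [6, 7]]
# Hypothetical auxtrace case: the last buffer reaches back to time 2.
with_auxtrace = [[1, 3], [4, 5], [2, 6, 7]]

print(is_ordered(flush_per_round(regular)))
print(is_ordered(flush_per_round(with_auxtrace)))
# With a perf.data file everything can be queued first and ordered globally.
print(is_ordered(sorted(t for batch in with_auxtrace for t in batch)))
```

The streaming consumer stays ordered only while the assumption holds. Queuing everything in advance restores the order, but that option is not available when reading from a pipe.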


EXAMPLE
-------

2 changes: 1 addition & 1 deletion tools/perf/Documentation/perf-list.txt
@@ -232,7 +232,7 @@ This can be overridden by setting the kernel.perf_event_paranoid
sysctl to -1, which allows non root to use these events.

For accessing trace point events perf needs to have read access to
/sys/kernel/debug/tracing, even when perf_event_paranoid is in a relaxed
/sys/kernel/tracing, even when perf_event_paranoid is in a relaxed
setting.
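
A script that needs the tracing directory can probe both locations named in this diff. The sketch below is an assumption about how a user script might cope with either layout, not a description of how perf locates tracefs; `TRACING_DIRS` and `find_tracing_dir` are hypothetical names, and the fallback order is a choice made for the example.

```python
import os

# Paths named in the diff: the tracefs mount point first, then the
# older debugfs location as a fallback (the order is an assumption).
TRACING_DIRS = ("/sys/kernel/tracing", "/sys/kernel/debug/tracing")


def find_tracing_dir(readable=lambda path: os.access(path, os.R_OK)):
    """Return the first tracing directory the caller can read, else None."""
    for path in TRACING_DIRS:
        if readable(path):
            return path
    return None


print(find_tracing_dir())
```

A result of None usually means the process lacks read access, which is the situation the documentation paragraph above warns about.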

TRACING
11 changes: 11 additions & 0 deletions tools/perf/Documentation/perf-lock.txt
@@ -172,6 +172,11 @@ CONTENTION OPTIONS
--lock-addr::
Show lock contention stat by address

-o::
--lock-owner::
Show lock contention stat by owners. Implies --threads and
requires --use-bpf.

-Y::
--type-filter=<value>::
Show lock contention only for given lock types (comma separated list).
@@ -187,6 +192,12 @@ CONTENTION OPTIONS
--lock-filter=<value>::
Show lock contention only for given lock addresses or names (comma separated list).

-S::
--callstack-filter=<value>::
Show lock contention only if the callstack contains the given string.
Note that it matches the substring so 'rq' would match both 'raw_spin_rq_lock'
and 'irq_enter_rcu'.
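
The substring behaviour noted above is easy to check. A minimal Python sketch of the matching rule only (the `callstack_matches` helper is illustrative and not perf code), using the symbol names from the documentation text and from the callstack in the commit message:

```python
def callstack_matches(callstack, needle):
    """True if any symbol in the callstack contains needle as a substring."""
    return any(needle in symbol for symbol in callstack)


print(callstack_matches(["raw_spin_rq_lock"], "rq"))
print(callstack_matches(["irq_enter_rcu"], "rq"))
print(callstack_matches(["_raw_spin_lock", "__dev_queue_xmit"], "rq"))
```

The first two calls match and the third does not, so a short filter string such as 'rq' may select more callstacks than intended.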


SEE ALSO
--------
7 changes: 6 additions & 1 deletion tools/perf/Documentation/perf-mem.txt
@@ -23,6 +23,11 @@ Note that on Intel systems the memory latency reported is the use-latency,
not the pure load (or store latency). Use latency includes any pipeline
queueing delays in addition to the memory subsystem latency.

On Arm64 this uses SPE to sample load and store operations, therefore hardware
and kernel support is required. See linkperf:perf-arm-spe[1] for a setup guide.
Due to the statistical nature of SPE sampling, not every memory operation will
be sampled.

OPTIONS
-------
<command>...::
@@ -93,4 +98,4 @@ all perf record options.

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-report[1]
linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-arm-spe[1]
2 changes: 1 addition & 1 deletion tools/perf/Documentation/perf-probe.txt
@@ -222,7 +222,7 @@ probe syntax, 'SRC' means the source file path, 'ALN' is start line number,
and 'ALN2' is end line number in the file. It is also possible to specify how
many lines to show by using 'NUM'. Moreover, 'FUNC@SRC' combination is good
for searching a specific function when several functions share same name.
So, "source.c:100-120" shows lines between 100th to l20th in source.c file. And "func:10+20" shows 20 lines from 10th line of func function.
So, "source.c:100-120" shows lines between 100th to 120th in source.c file. And "func:10+20" shows 20 lines from 10th line of func function.
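
The two range forms in the sentence above can be turned into explicit line intervals. A small Python sketch of that reading (the `parse_line_range` helper is hypothetical, and it assumes the count in the '+' form includes the starting line, so "10+20" covers lines 10 through 29):

```python
def parse_line_range(spec):
    """Parse the part after ':' in a line spec, either 'A-B' or 'A+N'.

    Returns (first_line, last_line), both inclusive.
    """
    if "-" in spec:
        start, end = spec.split("-", 1)
        return int(start), int(end)
    if "+" in spec:
        start, count = spec.split("+", 1)
        return int(start), int(start) + int(count) - 1
    raise ValueError(f"unsupported line range: {spec!r}")


print(parse_line_range("100-120"))
print(parse_line_range("10+20"))
```

For "func:10+20" the line numbers are relative to the start of the function, whereas for "source.c:100-120" they are absolute line numbers in the file.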

LAZY MATCHING
-------------
4 changes: 3 additions & 1 deletion tools/perf/Documentation/perf-report.txt
@@ -115,6 +115,8 @@ OPTIONS
- p_stage_cyc: On powerpc, this presents the number of cycles spent in a
pipeline stage. And currently supported only on powerpc.
- addr: (Full) virtual address of the sampled instruction
- retire_lat: On X86, this reports pipeline stall of this instruction compared
to the previous instruction in cycles. And currently supported only on X86

By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
@@ -507,7 +509,7 @@ include::itrace.txt[]
perf record --call-graph lbr.
Disabled by default. In common cases with call stack overflows,
it can recreate better call stacks than the default lbr call stack
output. But this approach is not full proof. There can be cases
output. But this approach is not foolproof. There can be cases
where it creates incorrect call stacks from incorrect matches.
The known limitations include exception handing such as
setjmp/longjmp will have calls/returns not match.
2 changes: 1 addition & 1 deletion tools/perf/Documentation/perf-script-perl.txt
@@ -55,7 +55,7 @@ Traces meant to be processed using a script should be recorded with
the above option: -a to enable system-wide collection.

The format file for the sched_wakeup event defines the following fields
(see /sys/kernel/debug/tracing/events/sched/sched_wakeup/format):
(see /sys/kernel/tracing/events/sched/sched_wakeup/format):

----
format:
4 changes: 2 additions & 2 deletions tools/perf/Documentation/perf-script-python.txt
@@ -319,7 +319,7 @@ So those are the essential steps in writing and running a script. The
process can be generalized to any tracepoint or set of tracepoints
you're interested in - basically find the tracepoint(s) you're
interested in by looking at the list of available events shown by
'perf list' and/or look in /sys/kernel/debug/tracing/events/ for
'perf list' and/or look in /sys/kernel/tracing/events/ for
detailed event and field info, record the corresponding trace data
using 'perf record', passing it the list of interesting events,
generate a skeleton script using 'perf script -g python' and modify the
@@ -449,7 +449,7 @@ Traces meant to be processed using a script should be recorded with
the above option: -a to enable system-wide collection.

The format file for the sched_wakeup event defines the following fields
(see /sys/kernel/debug/tracing/events/sched/sched_wakeup/format):
(see /sys/kernel/tracing/events/sched/sched_wakeup/format):

----
format: