Perf - Event Documentation
Furthermore, perf provides support for scripting: a Python script can define trace_begin() and trace_end() handlers, and the fields printed per event can be selected with -F:

  # perf script -F cpu,event,ip

  def trace_begin():
      print("in trace_begin")

  def trace_end():
      print("in trace_end")
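A minimal, self-contained sketch of such a handler script, assuming perf's Python scripting API (run with `perf script -s handler.py`; the trace_unhandled hook is the generic fallback for events without a dedicated handler):

```python
# Sketch of a perf Python handler script; run it with:
#   perf script -s handler.py

def trace_begin():
    # called once before any events are processed
    print("in trace_begin")

def trace_end():
    # called once after all events are processed
    print("in trace_end")

def trace_unhandled(event_name, context, event_fields_dict):
    # fallback for events without a dedicated handler; print a few fields
    cpu = event_fields_dict.get("common_cpu", -1)
    print("%-30s cpu=%d" % (event_name, cpu))
```

Per-tracepoint handlers (e.g. `kmem__kmalloc`) can be added alongside these if specific events should be processed differently.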
Step 2: Gather profiling data with the ‘cycles’ event, attaching to the task with pid=582:
# perf record -e cycles -p 582 -- sleep 20
If the CPU uses dynamic frequency scaling, we can rely on the PMU cycle counter rather than time-based profiling for more accurate results.

Step 3: Generate the perf report and find hotspot functions:

# perf report
# Overhead Command Shared Object Symbol
# ........ ......... ................. ...............................
#
93.00% cpu_hl_t1 [kernel.kallsyms] [k] test_thread
1.94% cpu_hl_t1 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.67% cpu_hl_t1 [kernel.kallsyms] [k] _raw_spin_unlock_irq
Agenda
● Statistical profiling on Arm platforms
○ Fundamental mechanism (for statistical profiling)
○ Profile with timer
○ Profile with PMU
● Using perf with tracing tools
○ Profile with ftrace
○ Profile with probes
○ Profile with CoreSight
● Debugging stories
Profile with ftrace
perf can work with ftrace as a wrapper, enabling the function or function_graph tracer for function tracing; another mode is to enable tracepoints and gather statistics for trace events:

  perf ftrace -a --trace-funcs __kmalloc
  perf record -e kmem:kmalloc -- sleep 5
  perf report
  perf script

Based on ftrace, perf also provides the advanced tool perf sched to trace and measure scheduling latency:

  perf sched record -- sleep 1
  perf sched latency

OpenCSD
At runtime, perf saves the compressed trace data into the perf.data file, alongside metadata for the ETM configuration.

When reporting the CoreSight trace data, perf decodes the trace into packets and generates synthesized samples.

[Figure: decoding flow — trace packets (ID, start_addr, end_addr, …) are decoded and synthesized into branch samples carrying the PID]
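As a concrete, hedged example of recording and decoding CoreSight trace with perf (the sink name tmc_etr0 is platform-specific and illustrative, not from the slides):

```shell
# Record userspace ETM trace for a single-threaded program;
# the @tmc_etr0 sink name varies per platform
perf record -e cs_etm/@tmc_etr0/u --per-thread -- ./sort

# Decode and report, synthesizing one instruction sample
# per 1000 instructions of trace
perf report --itrace=i1000i
```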
  while (s) {
      s = 0;
      for (i = 1; i < n; i++) {
          if (a[i] < a[i - 1]) {
              t = a[i];
              a[i] = a[i - 1];
              a[i - 1] = t;
              s = 1;
          }
      }
  }

[Figure: one swap step of the inner loop, e.g. 2 5 4 3 7 → 2 4 3 5 7]
Optimization with compiler flag -O3

Run the instrumented binary and collect the execution profile:

# ./sort_instrumented
Bubble sorting array of 30000 elements
45105 ms

Alternatively, the compiler can use profiling data collected at runtime as feedback; this avoids the instrumented build.

Step 2: Read the CoreSight trace data and inject synthetic last-branch samples:

# perf inject -i perf.data -o inj.data \
      --itrace=il64 --strip

Step 4: Rebuild the binary with the training data:

# gcc -O3 -fauto-profile=sort.gcov sort.c \
      -o sort_autofdo

# taskset -c 2 ./sort_autofdo
Bubble sorting array of 30000 elements
6609 ms
Thank You
For further information: www.linaro.org
All trainees here today can send any questions about today’s
session, at any point in the future, to [email protected] .
The story - Performance profiling for CPU cache
When I profile my program, it seems there is no performance degradation introduced by the software architecture design or by other software factors such as locking. But the data throughput still doesn’t look good enough; how can I find further performance improvements?

● During performance optimization, the software architecture design and locking-related optimizations are normally the best places to start… but the gains will eventually plateau.

● If the performance issue is related to data throughput or SMP performance, we may need to improve the cache profile.

● We use a synthetic test case to demonstrate the debugging flow, using PMU events to gather statistics and analyse CPU cache behaviour.
Statistics for cache hardware events
We can use the ‘cache-references’ event to count cache accesses during the 10 seconds, and the ‘cache-misses’ event to count cache misses. Large counts indicate that the case puts heavy pressure on the cache.

Because the two events are enabled in the same group, their values can be compared directly, and perf reports the cache-miss percentage: 4.048%. This means roughly one cache miss per 25 cache accesses on average.
5756626419 cache-references
233027636 cache-misses # 4.048 % of all cache refs
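The slides show only the counter output; a hedged guess at the command that produces a grouped count like this (the pid and window length are illustrative):

```shell
# Enable both events in one group ({...}) so perf can derive
# the miss ratio; count for a 10-second window on an existing task
perf stat -e '{cache-references,cache-misses}' -p 582 -- sleep 10
```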
cpu_thread1:

    do {
        val = shared.a;
        (void)val;
    } while (1);
    return 0;

cpu_thread2:

    do {
        shared.b += i;
        i++;
    } while (1);
    return 0;
The ‘cpu_thread1’ and ‘cpu_thread2’ threads access data in the same structure. If the two threads run on different CPUs and rely on snooping for cache coherency, ‘cpu_thread1’ will see a cache-line invalidation after every data modification by ‘cpu_thread2’; as a result, ‘cpu_thread1’ observes many cache misses.
Optimization: cache line alignment

    volatile struct share_struct {
        unsigned int a;
        unsigned int b ___cacheline_aligned;
    } shared;

Add the attribute ___cacheline_aligned to member b of the structure, so that b is allocated in a separate cache line.
10669660594 cache-references
833994 cache-misses # 0.008 % of all cache refs
# cd $KERNEL_DIR
# make VF=1 -C tools/perf/
Aside: Build perf tool - cont.
Method 2: Cross-Compilation perf for ARM64 on x86 PC
# export CROSS_COMPILE=aarch64-linux-gnu-
# export ARCH=arm64
# export CSINCLUDES=my-opencsd/decoder/include/
# export CSLIBS=my-opencsd/decoder/lib/builddir
# export LD_LIBRARY_PATH=$CSLIBS
# cd $KERNEL_DIR
# make LDFLAGS=-static NO_LIBELF=1 NO_JVMTI=1 VF=1 -C tools/perf/