Perf - Event Documentation
Furthermore, perf provides support for scripting: a Python script can define trace_begin() and trace_end() handlers, and the fields printed per event can be selected with -F:

  # perf script -F cpu,event,ip

  def trace_begin():
      print("in trace_begin")

  def trace_end():
      print("in trace_end")
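A minimal, self-contained sketch of such a handler script, assuming perf's Python scripting API (run with `perf script -s handler.py`; the trace_unhandled hook is the generic fallback for events without a dedicated handler):

```python
# Sketch of a perf Python handler script; run it with:
#   perf script -s handler.py

def trace_begin():
    # called once before any events are processed
    print("in trace_begin")

def trace_end():
    # called once after all events are processed
    print("in trace_end")

def trace_unhandled(event_name, context, event_fields_dict):
    # fallback for events without a dedicated handler; print a few fields
    cpu = event_fields_dict.get("common_cpu", -1)
    print("%-30s cpu=%d" % (event_name, cpu))
```

Per-tracepoint handlers (e.g. `kmem__kmalloc`) can be added alongside these if specific events should be processed differently.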
Step 2: Gather profiling data with the ‘cycles’ event, attaching to the task with pid=582:
# perf record -e cycles -p 582 -- sleep 20
If the CPU uses dynamic frequency scaling, we can rely on the PMU cycle counter rather than time-based profiling for more accurate results.

Step 3: Generate the perf report and find hotspot functions:

# perf report
# Overhead Command Shared Object Symbol
# ........ ......... ................. ...............................
#
93.00% cpu_hl_t1 [kernel.kallsyms] [k] test_thread
1.94% cpu_hl_t1 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.67% cpu_hl_t1 [kernel.kallsyms] [k] _raw_spin_unlock_irq
Agenda
● Statistical profiling on Arm platforms
○ Fundamental mechanism (for statistical profiling)
○ Profile with timer
○ Profile with PMU
● Using perf with tracing tools
○ Profile with ftrace
○ Profile with probes
○ Profile with CoreSight
● Debugging stories
Profile with ftrace
perf can work with ftrace as a wrapper, enabling the function or function_graph tracer for function tracing; another mode is to enable tracepoints and gather statistics for trace events:

  perf ftrace -a --trace-funcs __kmalloc
  perf record -e kmem:kmalloc -- sleep 5
  perf report
  perf script

Based on ftrace, perf also provides the advanced tool perf sched to trace and measure scheduling latency:

  perf sched record -- sleep 1
  perf sched latency

OpenCSD
At runtime, perf saves the compressed trace data into the perf.data file, alongside metadata for the ETM configuration.

When reporting the CoreSight trace data, perf decodes the trace into packets and generates synthesized samples.

[Figure: decoding flow — trace packets (ID, start_addr, end_addr, …) are decoded and synthesized into branch samples carrying the PID]
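As a concrete, hedged example of recording and decoding CoreSight trace with perf (the sink name tmc_etr0 is platform-specific and illustrative, not from the slides):

```shell
# Record userspace ETM trace for a single-threaded program;
# the @tmc_etr0 sink name varies per platform
perf record -e cs_etm/@tmc_etr0/u --per-thread -- ./sort

# Decode and report, synthesizing one instruction sample
# per 1000 instructions of trace
perf report --itrace=i1000i
```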
  while (s) {
      s = 0;
      for (i = 1; i < n; i++) {
          if (a[i] < a[i - 1]) {
              t = a[i];
              a[i] = a[i - 1];
              a[i - 1] = t;
              s = 1;
          }
      }
  }

[Figure: one swap step of the inner loop, e.g. 2 5 4 3 7 → 2 4 3 5 7]
Optimization with compiler flag -O3

Run the instrumented binary and collect the execution profile:

# ./sort_instrumented
Bubble sorting array of 30000 elements
45105 ms

Alternatively, the compiler can use profiling data collected at runtime as feedback; this avoids the instrumented build.

Step 2: Read the CoreSight trace data and inject synthetic last-branch samples:

# perf inject -i perf.data -o inj.data \
      --itrace=il64 --strip

Step 4: Rebuild the binary with the training data:

# gcc -O3 -fauto-profile=sort.gcov sort.c \
      -o sort_autofdo

# taskset -c 2 ./sort_autofdo
Bubble sorting array of 30000 elements
6609 ms
Thank You
For further information: www.linaro.org
All trainees here today can send any questions about today’s
session, at any point in the future, to [email protected] .
The story - Performance profiling for CPU cache
When I profile my program, it seems there is no performance degradation introduced by the software architecture design or by other software factors such as locking. But the data throughput still doesn’t look good enough; how can I find further performance improvements?

● During performance optimization, the software architecture design and locking-related optimizations are normally the best places to start… but the gains will eventually plateau.

● If the performance issue is related to data throughput or SMP performance, we may need to improve the cache profile.

● We use a synthetic test case to demonstrate the debugging flow, using PMU events to gather statistics and analyse CPU cache behaviour.
Statistics for cache hardware events
We can use the ‘cache-references’ event to count cache accesses during the 10 seconds, and the ‘cache-misses’ event to count cache misses. Large counts indicate that the case puts heavy pressure on the cache.

Because the two events are enabled in the same group, their values can be compared directly, and perf reports the cache-miss percentage: 4.048%. This means roughly one cache miss per 25 cache accesses on average.
5756626419 cache-references
233027636 cache-misses # 4.048 % of all cache refs
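The slides show only the counter output; a hedged guess at the command that produces a grouped count like this (the pid and window length are illustrative):

```shell
# Enable both events in one group ({...}) so perf can derive
# the miss ratio; count for a 10-second window on an existing task
perf stat -e '{cache-references,cache-misses}' -p 582 -- sleep 10
```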
cpu_thread1:

    do {
        val = shared.a;
        (void)val;
    } while (1);
    return 0;

cpu_thread2:

    do {
        shared.b += i;
        i++;
    } while (1);
    return 0;
The ‘cpu_thread1’ and ‘cpu_thread2’ threads access data in the same structure. If the two threads run on different CPUs and rely on snooping for cache coherency, ‘cpu_thread1’ will see a cache-line invalidation after every data modification by ‘cpu_thread2’; as a result, ‘cpu_thread1’ observes many cache misses.
Optimization: cache line alignment

    volatile struct share_struct {
        unsigned int a;
        unsigned int b ___cacheline_aligned;
    } shared;

Add the attribute ___cacheline_aligned to member b of the structure, so that b is allocated in a separate cache line.
10669660594 cache-references
833994 cache-misses # 0.008 % of all cache refs
# cd $KERNEL_DIR
# make VF=1 -C tools/perf/
Aside: Build perf tool - cont.
Method 2: Cross-Compilation perf for ARM64 on x86 PC
# export CROSS_COMPILE=aarch64-linux-gnu-
# export ARCH=arm64
# export CSINCLUDES=my-opencsd/decoder/include/
# export CSLIBS=my-opencsd/decoder/lib/builddir
# export LD_LIBRARY_PATH=$CSLIBS
# cd $KERNEL_DIR
# make LDFLAGS=-static NO_LIBELF=1 NO_JVMTI=1 VF=1 -C tools/perf/