Using perf On Arm platforms

Leo Yan & Daniel Thompson


Linaro Support and Solutions Engineering
Introduction
perf is a performance analysis tool for Linux; it covers both hardware-level and software features for efficient performance profiling.

We will review the fundamental mechanism behind perf, then go through the different usages it delivers, focusing on Arm-related hardware features. We will conclude the session by discussing some examples.

We will finish this material in 50 minutes.


Agenda
● Statistical profiling on Arm platforms
○ Fundamental mechanism (for statistical profiling)
○ Profile with timer
○ Profile with PMU
● Using perf with tracing tools
○ Profile with ftrace
○ Profile with probes
○ Profile with CoreSight
● Debugging stories
perf - a family of useful tools
perf is a powerful profiling tool; primarily it exploits the CPU performance counters, but it can also gather information from other sources (including hrtimers, static tracepoints and dynamic probes).

perf is a family of useful tools collected into a single binary: it is a profiling tool that gathers statistics and reports the results, it can act as a wrapper for ftrace and eBPF, and it also includes benchmark suites for memory and scheduling performance testing, etc.

Profiling and tracing: perf top, perf stat, perf record, perf probe, perf ftrace, perf list, perf sched
Reports: perf report, perf script, perf annotate, perf data, perf diff, perf evlist, perf inject
Benchmark suites: perf bench
Profiling events
perf supports different kinds of profiling events, especially for statistical profiling and performance monitoring.

At the most basic end, a timer (clock event) can be used to periodically sample the PC; however, profiling can also be triggered by other hardware events such as I$ or D$ misses, branch instructions, etc. perf can also rely on hardware breakpoints for profiling.

perf also supports software events for kernel software statistics, such as counting context switches, ftrace tracepoints, etc.

The perf list command is used to quickly check what events are supported on your system:

# perf list
  cache-misses                               [Hardware event]
  [...]
  cpu-clock                                  [Software event]
  context-switches OR cs                     [Software event]
  [...]
  mem:<addr>[/len][:access]                  [Hardware breakpoint]
  9p:9p_client_req                           [Tracepoint event]
  [...]
Profiling modes
perf performance profiling can be free-running, counting cycles, cache misses and branch mispredictions (e.g. perf stat), or it can interrupt after every N samples to allow statistical profiling (e.g. perf record), which can also capture context info.

Different profilers have different levels of overhead: the statistical profiler has low overhead, while a tracing profiler is more accurate but has high overhead.

[Diagram: free-run profiling counts across the whole program execution period and reads the statistics once at the end; sampling-based profiling interrupts every N samples during execution.]
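As a quick, hedged illustration of the two modes (standard perf usage; ./my_app is a placeholder workload):

# Free-run counting: totals are read once when the workload exits
perf stat -e cycles,cache-misses -- ./my_app

# Sampling: interrupt at ~999 Hz and record context for later analysis
perf record -e cycles -F 999 -- ./my_app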
Interfaces between kernel and user space
The user space program uses the system call perf_event_open() to open an event and uses fcntl() to set the blocking mode; a read() on a counter returns the current value of the counter, and this is how free-running counters are read (e.g. perf stat).

The sampling counter generates event samples and stores them in a ring buffer, which is made available to user space using mmap(). The data can be saved into a perf.data file with perf record.

[Diagram: perf record and perf stat in user space talk to the kernel through sys_perf_event_open, read and mmap; software and hardware events generate samples (ID, PID, ...) delivered via interrupt into the ring buffer, and perf record writes them to perf.data.]
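To make the interface concrete, here is a minimal sketch adapted from the perf_event_open(2) man page example (not part of the original slides); it opens one free-running cycle counter for the calling thread and reads it back with read():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

/* Thin wrapper: glibc does not export perf_event_open() directly */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;     /* free-running cycle counter */
    pe.disabled = 1;                          /* start disabled, enable explicitly */
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);  /* pid=0: this task, cpu=-1: any CPU */
    if (fd == -1) {
        perror("perf_event_open");
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... workload under measurement ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));          /* read() returns the current counter value */

    printf("cycles: %lld\n", count);
    close(fd);
    return 0;
}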
Control tracing scope for counters
perf organizes counters into counter groups; a counter group is scheduled onto the CPU as a unit, so the values of the member counters can be meaningfully compared, added, divided (to get ratios), etc.

perf events can be system wide, or they can be attached to specific CPUs or specific tasks; profiling can be per-thread or per-CPU; perf events can also be restricted to the times when the CPU is in user, kernel or hypervisor mode.

[Diagram: per-thread profiling follows task0 as it migrates between CPU0 and CPU1; per-CPU profiling observes everything running on one CPU, split into user, kernel and hypervisor mode.]

perf record -e cs_etm/@826000.etr/u --per-thread ./main
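As a hedged illustration of scope control (the event modifiers and the '{...}' group syntax are standard perf usage; the workloads are placeholders):

# Count user-mode and kernel-mode cycles separately, system wide
perf stat -a -e cycles:u,cycles:k -- sleep 10

# Schedule cycles and instructions as one counter group so their ratio is meaningful
perf stat -e '{cycles,instructions}' -- ./my_app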
Profiling result analysis
The perf data can be investigated with perf report. It explores the tracer configuration info and the sample data in the perf file and connects them with the Dynamic Shared Objects (DSOs) for analysis.

DSOs are referred to by build id and cached in the folder ~/.debug/; they can be archived with perf archive, and the resulting tar file can be used on another platform for cross-analysis.

Example for statistics result:

# Samples: 32K of event 'cache-misses'
# Event count (approx.): 14284599
#
# Overhead  Command     Shared Object      Symbol
# ........  ..........  .................  ........................
#
    67.20%  sched-pipe  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     3.19%  sched-pipe  [kernel.kallsyms]  [k] pipe_read
     2.28%  sched-pipe  [kernel.kallsyms]  [k] mutex_lock
     2.15%  sched-pipe  [kernel.kallsyms]  [k] copy_page_from_iter
     1.99%  sched-pipe  [kernel.kallsyms]  [k] el0_svc_naked
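A hedged sketch of the cross-analysis flow (perf archive and the ~/.debug build-id cache are standard perf behaviour; the paths are placeholders):

# On the target: record, then bundle the DSOs referenced by perf.data
perf record -a -e cycles -- sleep 10
perf archive

# On the analysis host: unpack the DSOs into the build-id cache, then report
tar xf perf.data.tar.bz2 -C ~/.debug
perf report -i perf.data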


Annotation with source code
perf annotate maps profile information to source code; it displays the source code alongside the assembly code if the object file has debug symbols, otherwise it displays only the assembly.

The displayed information is straightforward to review and it is easy to associate lines of source code with percentage information. Pressing Enter digs into a deeper function and pressing q jumps back to the upper function. Pressing a in the perf report context annotates a specific function.

(The original slides show a screenshot as the example for perf annotate.)
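A hedged command-line equivalent for non-interactive use (the symbol and vmlinux path are placeholders; --stdio and --vmlinux are standard perf annotate options):

# perf annotate --stdio --vmlinux=./vmlinux _raw_spin_unlock_irqrestore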
Post process with scripts
perf script reads the input file and displays the detailed trace of the workload with the specified fields, e.g. pid, cpu, time, etc.

perf script -F cpu,event,ip

Furthermore, perf supports post-processing with Python or Perl scripts that aggregate and extract useful information from the raw perf stream:

perf script -s syscall-enter.py

Example for dumping syscall invocations:

import os
import sys

from perf_trace_context import *
from Core import *

def trace_begin():
    print "in trace_begin"

def trace_end():
    print "in trace_end"

def raw_syscalls__sys_enter(event_name, context,
                            common_cpu, common_secs,
                            common_nsecs, common_pid,
                            common_comm, id, args):
    print "id=%d, args=%s\n" % (id, args)


Profile with timer
perf includes support for time-based profiling using hrtimers; it is an intuitive way to understand how the code consumes time.

perf provides two time-based profilers, cpu-clock and task-clock: cpu-clock is wall-clock based and samples are taken at regular intervals relative to wall time; task-clock samples the run time of the specific task.

If the sampling frequency is the same as some repeating event within the profiled code, then the profile will be misleading since the interrupt will always hit the same bit of code. Deliberately selecting a rate such as 99 Hz, which is neither a multiple of 10 nor a power of 2, makes this unlikely.

Profile with CPU clock at 99 Hz:

# perf top -F 99 -ns comm,dso
  59.62%  22  perf         [kernel]
  36.15%  12  perf         perf
   3.72%  28  swapper      [kernel]
   0.51%  14  kworker/1:1  [kernel]

Profile with task clock at 99 Hz:

# perf record -e task-clock -F 99 uname
Quick review for Arm PMU
Nowadays, modern CPUs provide a performance monitoring unit (PMU) to count CPU clock cycles, cache and branch events for profiling. A PMU is useful for observing performance and can monitor right down to the CPU microarchitecture level.

We can enable multiple PMU events in one perf command, but there is a limit on the maximum number of events that can be counted at the same time (e.g. Cortex-A53 supports up to 6 event counters plus 1 cycle counter).

perf includes a general framework to expose PMU events; this keeps the PMU driver simple in the kernel, with the complexity in user space.

[Diagram: each CPU has its own PMU with a cycle counter (CPU_CLK) and several performance counters; each PMU raises its own interrupt (SPI_0 for CPU0, SPI_1 for CPU1, ...).]
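A hedged way to see how the PMU is exposed to perf (on many arm64 systems the PMU appears under /sys/bus/event_source/ with a name such as armv8_pmuv3_0; the exact name varies by platform and kernel version):

# ls /sys/bus/event_source/devices/
# ls /sys/bus/event_source/devices/armv8_pmuv3_0/events/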
Profile with PMU
perf has defined standard event names for profiling instruction, cache and branch related hardware events:

perf stat -a -e \
    cache-references,cache-misses -- sleep 10

perf provides comparisons between metrics so we can easily get a ratio, e.g. comparing 'cache-misses' to 'cache-references' gives the cache miss percentage.

Performance counter stats for 'system wide':

  5756626419  cache-references
   233027636  cache-misses   # 4.048 % of all cache refs

  10.004134787 seconds time elapsed

The perf standard events don't cover all the hardware events provided by the PMU; we can use raw mode to explore more hardware events, e.g. we can directly access Cortex-A53 events by raw ID number: 0x03 is 'L1 data cache refill' and 0x04 is 'L1 data cache access'.

perf stat -a -e r04,r03 -- sleep 10

The Arm standard events cover cache profiling at the L1 cache level. For L2 cache profiling we can use raw mode to access the related events and aggregate the statistics of all CPUs sharing the same L2 cache (see the example below).
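A hedged example of the L2 case (on ARMv8 the common PMU events 0x16 and 0x17 count L2 data cache accesses and refills respectively; check your core's TRM before relying on these numbers):

# perf stat -a -e r16,r17 -- sleep 10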
Example for profiling hotspot with PMU
Step 1: use 'top' to browse which program consumes more CPU bandwidth than expected:

# top
  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
  582 root  20   0     0    0    0 D   2.3  0.0  0:02.70  cpu_hl_t1

Step 2: gather profiling data with the 'cycles' event, attaching to the task with pid=582:

# perf record -e cycles -p 582 -- sleep 20

Step 3: generate the perf report and find the hotspot functions:

# perf report
# Overhead  Command    Shared Object      Symbol
# ........  .........  .................  ...............................
#
    93.00%  cpu_hl_t1  [kernel.kallsyms]  [k] test_thread
     1.94%  cpu_hl_t1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.67%  cpu_hl_t1  [kernel.kallsyms]  [k] _raw_spin_unlock_irq

Note: if the CPU uses dynamic frequency scaling, we can rely on the PMU cycle counter rather than time-based profiling for more accurate results.
Agenda
● Statistical profiling on Arm platforms
○ Fundamental mechanism (for statistical profiling)
○ Profile with timer
○ Profile with PMU
● Using perf with tracing tools
○ Profile with ftrace
○ Profile with probes
○ Profile with CoreSight
● Debugging stories
Profile with ftrace
perf can work with ftrace as a wrapper to enable the function or function_graph tracer for function tracing; another mode is to enable tracepoints and gather statistics on trace events:

perf ftrace -a --trace-funcs __kmalloc
perf record -e kmem:kmalloc -- sleep 5

Based on ftrace, perf provides an advanced tool, perf sched, to trace and measure scheduling latency:

perf sched record -- sleep 1
perf sched latency

# perf sched latency


-----------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |
-----------------------------------------------------------------------------------------------------------------
kworker/1:1-eve:46 | 0.913 ms | 22 | avg: 0.039 ms | max: 0.043 ms | max at: 5824.869919 s
kworker/2:1-eve:44 | 1.542 ms | 42 | avg: 0.038 ms | max: 0.042 ms | max at: 5824.833924 s
kworker/3:1-eve:95 | 0.924 ms | 23 | avg: 0.037 ms | max: 0.043 ms | max at: 5824.845919 s
kworker/0:1-eve:100 | 0.209 ms | 3 | avg: 0.034 ms | max: 0.043 ms | max at: 5824.881921 s
perf:3172 | 4.130 ms | 1 | avg: 0.025 ms | max: 0.025 ms | max at: 5825.800291 s
rcu_preempt:10 | 0.035 ms | 5 | avg: 0.020 ms | max: 0.050 ms | max at: 5824.825915 s
sleep:3173 | 4.780 ms | 5 | avg: 0.012 ms | max: 0.032 ms | max at: 5825.798935 s
-----------------------------------------------------------------------------------------------------------------
TOTAL: | 12.667 ms | 105 |
---------------------------------------------------
Profile with probes
Kprobes and uprobes provide dynamic event tracing in the kernel and in user space applications respectively; the probes can be added or removed on the fly.

Though we can use the ftrace sysfs nodes to enable probes, perf probe is more convenient: it enables probes without disassembly and easily connects the tracing with the source code for analysis.

The perf probe --line command is a convenient way to check the available probe points mapped to source code:
# perf probe --line "update_min_vruntime" \
-s $KERNEL_SRC
Profile with probes - cont.
perf probe --vars tells us the available variables at a given probe point.

By following the probe syntax we can define probe points with the command perf probe --add; in the example the probe is enabled by specifying a function name and a relative offset. (The original slides show this as a screenshot.)
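A hedged reconstruction of that flow (the function, offset and variable names are placeholders; --vars, --add and --del are standard perf probe options):

# perf probe --vars update_min_vruntime
# perf probe --add 'update_min_vruntime+8 vruntime'
# perf record -e probe:update_min_vruntime -a -- sleep 5
# perf probe --del update_min_vruntime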
Integrate CoreSight with perf    http://connect.linaro.org/resource/las16/las16-210/
CoreSight is a hardware IP which can trace program flow and thus can facilitate hardware-assisted tracing and profiling.

To integrate CoreSight with the perf framework, the CoreSight framework registers the Embedded Trace Macrocell (ETM) as a PMU event, cs_etm, with the perf core; the perf command specifies a sink to indicate where to record the trace data:

perf record -e cs_etm/@826000.etr/ \
    --per-thread ./main

perf report

The OpenCSD libraries need to be linked into the perf build for CoreSight trace decoding.

[Diagram: the per-core ETMs (Core0, Core1, ...) feed through a funnel into the ETF and ETR sinks; perf report uses OpenCSD for decoding.]
Limitations for CoreSight profiling
CoreSight ETM is used to trace program flow: branch instructions, exceptions and return instructions, etc. The perf tool can therefore decode the CoreSight trace data to recover the program flow.

CoreSight ETM tracing can be limited with perf options, e.g. the k and u event modifiers to trace only kernel space or only user space, and the --filter option to specify a tracing address range:

perf record -e cs_etm/@826000.etr/k \
    --filter 'filter 0xffffff800856bc50/0x60' \
    --per-thread ./main

Currently the ETM can only support --per-thread mode: when the task is scheduled onto a CPU its ETM is enabled, and after the task is scheduled out the corresponding ETM is disabled.

perf record -e cs_etm/@826000.etr/ \
    --per-thread ./main

Currently we are working on support for CPU-wide trace scenarios; until this is completed we can manually enable all tracing sources for all CPUs from the sysfs nodes.
Decode trace data with OpenCSD
Compared to a general PMU device, CoreSight trace output is compressed data, so perf cannot directly generate sample-based structures from it.

At run time perf saves the compressed data into the perf file alongside metadata describing the ETM configuration.

When reporting CoreSight trace data, perf decodes the trace data into packets and generates synthesized samples. Finally the samples can be used for statistics.

[Diagram: perf.data for CoreSight contains a header, CoreSight metadata and CoreSight trace data; perf report / perf script use OpenCSD decoding to turn the packets into synthesized branch samples (ID, PID, start_addr, end_addr, ...).]
Profiling with CoreSight
After decoding the CoreSight trace data it is straightforward for the perf tool to generate branch samples with the branch end address and the next start address, so the branch samples can be used for profiling. CoreSight then works like a normal PMU device and the results are output with the perf report and perf script commands.

# perf record -e cs_etm/@825000.etf/k --filter 'start 0xffffff80089278e8,stop 0xffffff8008928084' \
    --per-thread ./timectxsw

# perf report --vmlinux=./userdata/vmlinux


# Samples: 328K of event 'instructions:k'
# Event count (approx.): 1624347
#
# Children Self Command Shared Object Symbol
# ........ ........ ......... ................. ......................
#
1.26% 1.26% timectxsw [kernel.kallsyms] [.] 0xffffff80080eb994
0.99% 0.99% timectxsw [kernel.kallsyms] [.] 0xffffff800812ec44
0.91% 0.91% timectxsw [kernel.kallsyms] [.] 0xffffff80080eb9d4
0.89% 0.89% timectxsw [kernel.kallsyms] [.] 0xffffff80080ea8cc
Post process CoreSight trace data
perf script can send the CoreSight sample stream to a Python script, so we can use Python's flexibility to post-process the trace data, e.g. disassembling the trace with symbol files to get a readable program flow.

[Diagram: branch samples (start_addr, end_addr, ...) are fed to a Python script which uses objdump and vmlinux for disassembly.]
# perf script -s arm-cs-trace-disasm.py -F cpu,event,ip,addr,sym -- -d objdump -k ./vmlinux
ARM CoreSight Trace Data Assembler Dump
ffff000008a5f2dc <etm4_enable_hw+0x344>:
ffff000008a5f2dc: 340000a0 cbz w0, ffff000008a5f2f0 <etm4_enable_hw+0x358>
ffff000008a5f2f0 <etm4_enable_hw+0x358>:
ffff000008a5f2f0: f9400260 ldr x0, [x19]
ffff000008a5f2f4: d5033f9f dsb sy
ffff000008a5f2f8: 913ec000 add x0, x0, #0xfb0
ffff000008a5f2fc: b900001f str wzr, [x0]
ffff000008a5f300: f9400bf3 ldr x19, [sp, #16]
ffff000008a5f304: a8c27bfd ldp x29, x30, [sp], #32
ffff000008a5f308: d65f03c0 ret
Agenda
● Statistical profiling on Arm platforms
○ Fundamental mechanism (for statistical profiling)
○ Profile with timer
○ Profile with PMU
● Using perf with tracing tools
○ Profile with ftrace
○ Profile with probes
○ Profile with CoreSight
● Debugging stories
The story - perf works with compiler for optimization
I want to optimize the performance of my program, especially some small pieces of algorithm code. Are there any advanced methods for performance optimization on Arm platforms?

● The algorithm code might have complex logic, so it has many branch instructions and dependencies during execution.

● The compiler is good at instruction scheduling and reordering at compilation time, and it provides the -O3 option for static optimization.

● The compiler cannot know the program's runtime execution behaviour, so perf profiling data can be fed back to the compiler to explore more advanced optimization methods.
Bubble sort example code    https://gcc.gnu.org/wiki/AutoFDO/Tutorial

#define ARRAY_LEN 30000

void bubble_sort (int *a, int n) {
  int i, t, s = 1;

  while (s) {
    s = 0;
    for (i = 1; i < n; i++) {
      if (a[i] < a[i - 1]) {
        t = a[i];
        a[i] = a[i - 1];
        a[i - 1] = t;
        s = 1;
      }
    }
  }
}

[Diagram: successive passes move the largest remaining element to the end, e.g. 7 2 5 4 3 → 2 5 4 3 7 → 2 4 3 5 7 → ...]
Optimization with compiler flag -O3

Compile the code without optimization:

# gcc sort.c -o sort
# ./sort
Bubble sorting array of 30000 elements
35308 ms

Compile the code with the -O3 flag:

# gcc -O3 sort.c -o sort_optimized
# ./sort_optimized
Bubble sorting array of 30000 elements
6621 ms
Feedback-Directed Optimization
Feedback-Directed Optimization (FDO):

Build an instrumented version of the program for profiling:
# gcc sort.c -o sort_instrumented \
    -fprofile-generate

Run the instrumented binary and collect the execution profile:
# ./sort_instrumented
Bubble sorting array of 30000 elements
45105 ms

Rebuild the program with the feedback:
# gcc -O3 sort.c -o sort_fdo \
    -fprofile-use=sort.gcda
# ./sort_fdo
Bubble sorting array of 30000 elements
6613 ms

FDO needs the instrumented build, which runs with poor performance, to generate the training data set; this makes it difficult to apply in production. Alternatively, the compiler can rely on profiling data gathered at run time as the feedback, which avoids the instrumented build.
AutoFDO with perf
Automatic feedback-directed optimization (AutoFDO) simplifies the deployment of FDO by using sampling from the hardware performance monitor.

Since perf can collect branch-related information, the samples can be converted into gcov-format training data, which in the end can be used by the compiler for AutoFDO with low overhead.

[Diagram: the FDO flow is instrumented binary → collect *.gcda → rebuild binary; the perf + AutoFDO flow is perf profiling → convert to *.gcov → rebuild binary.]
Arm doesn’t have last branch stack records ...
Statistical profiling helps identify that a particular code block is a bottleneck, but it gives no idea which code paths were executed to cause the bottleneck.

perf record provides -b for sampling the branch stack to log branches continuously; this feature requires hardware support, e.g. the last branch records (LBR) on Intel CPUs, and it can be used for feedback optimization.

# perf record -b -e cycles:u ./sort
# create_gcov --binary=./sort \
    --profile=perf.data --gcov=sort.gcov \
    -gcov_version=1

Though the Arm PMU provides branch-related event counting for statistical profiling, it doesn't provide branch stack sampling; as a result it cannot support the -b option for last branch records.

static int armpmu_event_init(struct perf_event *event)
{
        [ ... ]

        /* does not support taken branch sampling */
        if (has_branch_stack(event))
                return -EOPNOTSUPP;

        if (armpmu->map_event(event) == -ENOENT)
                return -ENOENT;

        return __hw_perf_event_init(event);
}
Inject samples for CoreSight trace data
By decoding the branch packets, perf inject can generate instruction samples at an interval of N instructions with the option --itrace=iN. Besides the instruction samples, it can also artificially add a last branch stack with the option --itrace=ilN.

[Diagram: the decoded branch samples (start_addr, end_addr, ...) are synthesized into instruction samples (ip, ...), each carrying a last branch stack built from the preceding branches.]

# perf report --itrace=i100il16 -k ./vmlinux --stdio
# Samples: 2K of event 'instructions'
# Event count (approx.): 2359
#
# Overhead  Command  Source Shared Object  Source Symbol           Target Symbol           Basic Block Cycles
# ........  .......  ....................  ......................  ......................  ..................
#
     8.82%  ls       ls                    [.] 0x0000aaaaaf096d10  [.] 0x0000aaaaaf096d40  -
     8.82%  ls       ls                    [.] 0x0000aaaaaf096f24  [.] 0x0000aaaaaf096ce8  -
     8.82%  ls       ls                    [.] 0x0000aaaaaf0971e0  [.] 0x0000aaaaaf096f18  -
     8.77%  ls       ls                    [.] 0x0000aaaaaf0969e8  [.] 0x0000aaaaaf0971b8  -
     8.77%  ls       ls                    [.] 0x0000aaaaaf096d6c  [.] 0x0000aaaaaf0969d0  -
Use CoreSight for AutoFDO
Step 1: Capture CoreSight samples for the program:

# perf record -e cs_etm/@825000.etf/u \
    --per-thread taskset -c 2 ./sort
Bubble sorting array of 30000 elements
39044 ms

Step 2: Read the CoreSight trace data and inject synthetic last branch samples:

# perf inject -i perf.data -o inj.data \
    --itrace=il64 --strip

Step 3: Convert the perf data into gcov format:

# create_gcov --binary=./sort \
    --profile=inj.data --gcov=sort.gcov \
    -gcov_version=1

Step 4: Rebuild the binary with the training data:

# gcc -O3 -fauto-profile=sort.gcov sort.c \
    -o sort_autofdo
# taskset -c 2 ./sort_autofdo
Bubble sorting array of 30000 elements
6609 ms
Thank You
For further information: www.linaro.org

This training presentation comes with a lifetime warranty.

All trainees here today can send any questions about today’s
session, at any point in the future, to [email protected] .
The story - Performance profiling for CPU cache
When I profile the performance of my program, it seems there is no performance degradation introduced by the software architecture design or by other software factors like locking. But the data throughput still doesn't look good enough; how can I explore further performance improvements?

● During performance optimization, the software architecture design and locking-related optimizations are normally the best places to start... but they will eventually plateau.

● If the performance issue is related to data throughput or SMP performance, we might need to improve the cache profile.

● We use one synthetic test case to demonstrate the debugging flow, using PMU events for statistics and analysis of the CPU cache.
Statistics for cache hardware events
We can use the event 'cache-references' to count cache accesses during the 10 seconds; the event 'cache-misses' is used to count cache misses. The large counts indicate that the test case puts heavy pressure on the cache.

Because the two events are enabled in the same group, their values can be compared and perf reports the ratio as the cache miss percentage: 4.048%. This means roughly one cache miss for every 25 cache accesses on average.

# perf stat -a -e cache-references,cache-misses -- sleep 10

Performance counter stats for 'system wide':

  5756626419  cache-references
   233027636  cache-misses   # 4.048 % of all cache refs

  10.004134787 seconds time elapsed
Record and report cache event samples
Step 1: Record perf data for the cache events:

# perf record -a -e cache-references,cache-misses -- sleep 10

Step 2: Generate a report for every event:

# perf report --stdio

From the 'cache-references' samples we can see that the two threads 'cpu_thread1' and 'cpu_thread2' are the main consumers of the cache:

# Samples: 80K of event 'cache-references'
# Event count (approx.): 5818036534
#
# Overhead  Command    Shared Object      Symbol
# ........  .........  .................  ..............................
#
    54.39%  cpu_hl_t1  [kernel.kallsyms]  [k] cpu_thread1
    45.17%  cpu_hl_t2  [kernel.kallsyms]  [k] cpu_thread2
     0.09%  swapper    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore

From the 'cache-misses' samples we can see that the thread 'cpu_thread1' suffers heavily from cache misses:

# Samples: 47K of event 'cache-misses'
# Event count (approx.): 220719660
#
# Overhead  Command    Shared Object      Symbol
# ........  .........  .................  ..............................
#
    99.41%  cpu_hl_t1  [kernel.kallsyms]  [k] cpu_thread1
     0.23%  swapper    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.08%  cpu_hl_t2  [kernel.kallsyms]  [k] cpu_thread2
Review data structure
volatile struct share_struct {
    unsigned int a;
    unsigned int b;
} shared;

static int cpu_thread1(void *data)
{
    unsigned int val;

    do {
        val = shared.a;
        (void)val;
    } while (1);

    return 0;
}

static int cpu_thread2(void *data)
{
    unsigned int val, i = 0;

    do {
        shared.b += i;
        i++;
    } while (1);

    return 0;
}

The 'cpu_thread1' and 'cpu_thread2' threads access data in the same structure. If the two threads run on different CPUs they rely on snooping for cache coherency: 'cpu_thread1' sees a cache line invalidation after every modification by 'cpu_thread2', which results in 'cpu_thread1' taking many cache misses (classic false sharing).
Optimization: cache line alignment

volatile struct share_struct {
    unsigned int a;
    unsigned int b ___cacheline_aligned;
} shared;

Adding the attribute ___cacheline_aligned to member b allocates a separate cache line for b, so the two threads no longer share a line.

With this change the cache miss percentage decreases from 4.048% to 0.008%:

# perf stat -a -e cache-references,cache-misses -- sleep 10

Performance counter stats for 'system wide':

  10669660594  cache-references
       833994  cache-misses   # 0.008 % of all cache refs

  10.008088798 seconds time elapsed
Aside: Build perf tool
Method 1: compiling perf natively on a Debian/ARM64 platform
# apt-get install flex bison libelf-dev libaudit-dev libdw-dev libunwind* \
python-dev binutils-dev libnuma-dev libgtk2.0-dev libbfd-dev libelf1 \
libperl-dev libnuma-dev libslang2 libslang2-dev libunwind8 libunwind8-dev \
binutils-multiarch-dev elfutils libiberty-dev libncurses5-dev

# git clone https://github.com/Linaro/OpenCSD
# cd OpenCSD/decoder/build/linux/
# make DEBUG=1 LINUX64=1 && make install

# cd $KERNEL_DIR
# make VF=1 -C tools/perf/
Aside: Build perf tool - cont.
Method 2: cross-compiling perf for ARM64 on an x86 PC
# export CROSS_COMPILE=aarch64-linux-gnu-
# export ARCH=arm64

# git clone https://github.com/Linaro/OpenCSD my-opencsd
# cd my-opencsd/decoder/build/linux/
# make DEBUG=1 LINUX64=1

# export CSINCLUDES=my-opencsd/decoder/include/
# export CSLIBS=my-opencsd/decoder/lib/builddir
# export LD_LIBRARY_PATH=$CSLIBS

# cd $KERNEL_DIR
# make LDFLAGS=-static NO_LIBELF=1 NO_JVMTI=1 VF=1 -C tools/perf/
