KernelRecipes Perf Events
at Netflix
Brendan Gregg
Senior Performance Architect
Sep 2017
Case Study: ZFS is eating my CPU
• Easy to debug using Netflix Vector & flame graphs
• How I expected it to look:
Case Study: ZFS is eating my CPU (cont.)
• How it really looked:
Application (truncated)
Flame Graphs
Heat Maps
…
Completely
CPU Flame Graph
[Figure legend: Kernel (C), JVM (C++), User (C), Java]
Why Linux perf?
• Available
– Linux, open source
• Low overhead
– Tunable sampling, ring buffers
• Accurate
– Application-based samplers don't know what's really RUNNING; e.g., Java and epoll
• No blind spots
– See user, library, kernel with CPU sampling
– With some work: hardirqs & SMI as well
• No sample skew
– Unlike Java safepoint skew
Why is this so important
• We typically scale microservices based on %CPU
– Small %CPU improvements can mean big $avings
• perf does lots more, but we spend ~95% of our time looking
at CPU profiles, and 5% on everything else
– With new BPF capabilities (off-CPU analysis), that might go from 95 to 90%
CPU profiling should be easy, but…
JIT runtimes
no frame pointers
no debuginfo
stale symbol maps
container namespaces
…
2. perf Basics
perf (aka "perf_events")
• The official Linux profiler
– In the linux-tools-common package
– Source code & docs in Linux: tools/perf
See 'perf help COMMAND' for more information on a specific command. from Linux 4.13
perf Basic Workflow
1. list -> find events
2. stat -> count them
3. record -> write event data to file
4. report -> browse summary
5. script -> event dump for post processing
Basic Workflow Example
# perf list sched:*
[…]
sched:sched_process_exec [Tracepoint event]
[…]
(1. Found an event of interest.)
# perf stat -e sched:sched_process_exec -a -- sleep 10
Performance counter stats for 'system wide':
19 sched:sched_process_exec
10.001327817 seconds time elapsed
(2. 19 per 10 sec is a very low rate, so safe to record.)
# perf record -e sched:sched_process_exec -a -g -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.212 MB perf.data (21 samples) ]
(3. 21 samples captured.)
# perf report -n --stdio
# Children Self Samples Trace output
# ........ ........ ............ .................................................
4.76% 4.76% 1 filename=/bin/bash pid=7732 old_pid=7732
|
---_start
return_from_SYSCALL_64
do_syscall_64
sys_execve
do_execveat_common.isra.35
[…]
(4. Summary style may be sufficient, or,)
# perf script
sleep 7729 [003] 632804.699184: sched:sched_process_exec: filename=/bin/sleep pid=7729 old_pid=7729
44b97e do_execveat_common.isra.35 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
44bc01 sys_execve (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
203acb do_syscall_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
acd02b return_from_SYSCALL_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
c30 _start (/lib/x86_64-linux-gnu/ld-2.23.so)
[…]
(5. Script output in time order.)
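The reasoning in step 2 (count first, then decide whether recording is safe) can be sketched as a small shell helper. The function name and the default limit of 1000 events/sec are assumptions for illustration, not values from the talk; choose a limit that suits your system.

```shell
#!/bin/sh
# Hypothetical helper: decide whether an event rate looks low enough to record.
# Usage: safe_to_record COUNT SECONDS [MAX_PER_SEC]
# MAX_PER_SEC defaults to 1000, an arbitrary illustrative threshold.
safe_to_record() {
    count=$1
    secs=$2
    max=${3:-1000}
    awk -v c="$count" -v s="$secs" -v m="$max" 'BEGIN {
        rate = c / s
        printf "%.1f events/sec: %s\n", rate, (rate <= m ? "ok to record" : "too frequent")
    }'
}

# Numbers from the perf stat output above: 19 events in 10 seconds.
safe_to_record 19 10
```

The count and duration come straight from the perf stat summary, so the check costs nothing extra before moving on to perf record.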
perf stat/record Format
• These have three main parts: action, event, scope.
• e.g., profiling on-CPU stack traces:
Event: 99 Hertz
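As a minimal sketch of the action/event/scope split, the helper below assembles a perf command line from those three parts and prints it without running perf. The function name is hypothetical, and which flag belongs to which part is my reading of the slide (for example, treating -g as part of the action).

```shell
#!/bin/sh
# Hypothetical helper: assemble a perf command line from its three parts.
# It only prints the command; it does not execute perf.
perf_cmd() {
    action=$1   # e.g. "record -g" (record stack traces) or "stat"
    event=$2    # e.g. "-F 99" (timed sampling) or "-e block:block_rq_insert"
    scope=$3    # e.g. "-a -- sleep 10" (all CPUs) or "-p 1234"
    echo "perf $action $event $scope"
}

# Profiling on-CPU stack traces at 99 Hertz, system-wide, for 10 seconds.
perf_cmd "record -g" "-F 99" "-a -- sleep 10"
```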
• Hardware events
– CPU Performance Monitoring Counters (PMCs)
• Tracepoints
– Statically defined in software
• Dynamic tracing
– Created using uprobes (user) or kprobes (kernel)
– Can do kernel line tracing with local variables (needs kernel debuginfo)
perf Events: Map
perf Events: List
# perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
[…]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
[…]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
[…]
skb:kfree_skb [Tracepoint event]
skb:consume_skb [Tracepoint event]
skb:skb_copy_datagram_iovec [Tracepoint event]
net:net_dev_xmit [Tracepoint event]
net:net_dev_queue [Tracepoint event]
net:netif_receive_skb [Tracepoint event]
net:netif_rx [Tracepoint event]
[…]
perf Scope
• System-wide: all CPUs (-a)
• Target PID (-p PID)
• Target command (…)
• Specific CPUs (-C …)
• User-level only (<event>:u)
• Kernel-level only (<event>:k)
• A custom filter to match variables (--filter …)
• This cgroup (container) only (--cgroup …)
One-Liners: Listing Events
# Listing all currently known events:
perf list
# Detailed CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -dp PID
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Sample CPU stack traces for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds:
perf record -F 99 -ag -- sleep 10
# Sample CPU stacks, once every 10,000 Level 1 data cache misses, for 5 secs:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 secs:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
[Figure: stack samples (A, B) taken over time; on-CPU and off-CPU periods, with syscall, block, and interrupt marked.]
perf Record
(Sampling full stack traces at 99 Hertz)
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n --stdio
1.40% 162 java [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--63.21%-- try_to_wake_up
| |
| |--63.91%-- default_wake_function
| | |
| | |--56.11%-- __wake_up_common
| | | __wake_up_locked
| | | ep_poll_callback
| | | __wake_up_common
| | | __wake_up_sync_key
| | | |
| | | |--59.19%-- sock_def_readable
[…78,000 lines truncated…]
perf Reporting
• perf report summarizes by combining common paths
• Previous output truncated 78,000 lines of summary
• The following is what a mere 8,000 lines looks like…
perf report
… as a Flame Graph
Flame Graphs
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
• Flame Graphs:
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random, or hue can be a dimension
• e.g., software type, or difference between two profiles for
non-regression testing ("differential flame graphs")
– interpretation: top edge is on-CPU, beneath it is ancestry
• Just a Perl program to convert perf stacks into SVG
– Includes JavaScript: open in a browser for interactivity
• Easy to get working http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
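The merging that the x-axis sort enables can be sketched with standard tools. The folded format (semicolon-separated frames, root first, followed by a sample count) is what stackcollapse-perf.pl emits and flamegraph.pl reads; the helper name and the sample stacks below are made up for illustration.

```shell
#!/bin/sh
# Collapse one-stack-per-line input ("root;child;leaf") into "stack count" lines.
# sort groups identical stacks (and orders them alphabetically); uniq -c counts them.
# Assumes frame names contain no spaces.
fold_stacks() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Hand-written sample: five stack samples, root frame first.
printf '%s\n' \
    'main;read_file;sys_read' \
    'main;parse;match_regex' \
    'main;parse;match_regex' \
    'main;parse;match_regex' \
    'main;read_file;sys_read' | fold_stacks
```

Five samples collapse into two lines, one per unique stack, which is why a flame graph of a very large profile can still fit on one screen.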
flamegraph.pl Options
$ flamegraph.pl --help
USAGE: flamegraph.pl [options] infile > outfile.svg
eg,
flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg
perf Flame Graph Workflow (Linux 2.6+)
list events (perf list) -> count events (perf stat) -> capture stacks (perf record)
Typical workflow: perf.data -> stackcollapse-perf.pl -> flamegraph.pl (flame graph visualization)
perf Flame Graph Workflow (Linux 4.5+)
list events (perf list) -> count events (perf stat) -> capture stacks (perf record)
Typical workflow: perf.data -> awk -> flamegraph.pl (flame graph visualization)
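The awk stage of the Linux 4.5+ workflow can be sketched as follows. It assumes perf report is asked for folded call graphs grouped by process name, so that awk only has to prefix each folded stack with that name. The perf report options shown in the comment and the sample input are assumptions for illustration and may need adjusting for your perf version.

```shell
#!/bin/sh
# awk stage: remember the process name from each summary line (starts with a
# space), then prefix it to the folded stack lines (start with a digit).
fold_report() {
    awk '/^ / { comm = $3 } /^[0-9]/ { print comm ";" $2, $1 }'
}

# Intended use on a live system (assumes perf 4.5+ and an existing perf.data):
#   perf report --stdio --no-children -n -g folded,0,caller,count -s comm | fold_report

# Hand-written sample in the assumed shape of that report output:
printf '%s\n' \
    '   100.00%            29  sed' \
    '28 __libc_start_main;main;process_files;execute_program' \
    '1 __libc_start_main;main' | fold_report
```

The output is already in the folded form that flamegraph.pl reads, so the separate stackcollapse step is not needed.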
Flame Graph Optimizations
Linux 2.6: capture stacks (perf record) -> write samples (perf.data) -> reads samples (perf script) -> write text -> stackcollapse-perf.pl -> folded output -> flamegraph.pl
Linux 4.5: capture stacks (perf record) -> write samples (perf.data) -> reads samples (perf report -g folded) -> folded report -> awk -> folded output -> flamegraph.pl
Linux 4.9: count stacks (BPF) with profile.py, not perf -> folded output -> flamegraph.pl
Gotchas
When we've tried to use perf
• Stacks don't work (missing)
• Symbols don't work (hex numbers)
• Instruction profiling looks bogus
• PMCs don't work in VM guests
• Containers break things
• Overhead is too high
How to really get started
1. Get "perf" to work: install the linux-tools-common and
linux-tools-`uname -r` packages, or compile in the Linux source: tools/perf
2. Get stack walking to work
3. Fix symbol translation
4. Get IPC to work
5. Test perf under load
Steps 2 to 5 are the "gotchas"…
Gotcha #1 Broken Stacks
perf record -F 99 -a -g -- sleep 30
perf report -n --stdio
|
|--96.78%-- re_search_stub
| rpl_re_search
| match_regex
| do_subst
| execute_program
| process_files
| main
| __libc_start_main
|
--3.22%-- rpl_re_search
match_regex
do_subst
execute_program
process_files
main
__libc_start_main
(not broken)
Identifying Broken Stacks
78.50% 409 sed libc-2.19.so [.] 0x00000000000dd7d4
|
|--3.65%-- 0x7f2516d5d10d
|
|--2.19%-- 0x7f2516d0332f
|
|--1.22%-- 0x7f2516cffbd2
|
|--1.22%-- 0x7f2516d5d5ad
(broken)
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
|--1.33%-- 0x40a4a1
| |
| |--60.01%-- 0x40659a
| | 0x408dd8
| | 0x408ed1
| | 0x402689
| | 0x7fa1cd08aec5
(probably not broken; missing symbols, but that's another problem)
Broken Stacks Flame Graph
• Application support
– https://github.com/jvm-profiling-tools/async-profiler
• Our current preference is (A), but (C) is also promising
– So how do we fix the frame pointer…
gcc -fno-omit-frame-pointer
• Once upon a time, x86 had fewer registers, and the frame
pointer register was reused for general purpose to improve
performance. This breaks system stack walking.
• gcc provides -fno-omit-frame-pointer to fix this
– Please make this the default in gcc!
Java -XX:+PreserveFramePointer
• I hacked frame pointers in the JVM (JDK-8068945) and Oracle rewrote
it as -XX:+PreserveFramePointer. Lets perf do FP stack walks of Java.
Involved changes like this, fixing x86-64 function prologues:
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
@@ -5236,6 +5236,7 @@
// We always push rbp, so that on return to interpreter rbp, will be
// restored correctly and we can correct the stack.
push(rbp);
+ mov(rbp, rsp);
// Remove word for ebp
framesize -= wordSize;
--- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
+++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
[...]
Java
(no symbols)
Gotcha #2 Missing Symbols
• Missing symbols should be obvious in perf report/script:
71.79% 334 sed sed [.] 0x000000000001afc1
|
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
(broken)
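One rough way to gauge how widespread missing symbols are is to count call-graph entries that are bare hex addresses. The sketch below assumes text shaped like the perf report --stdio excerpts above; the function name is hypothetical, the sample input is hand-written, and other perf versions or output modes may mark unresolved frames differently.

```shell
#!/bin/sh
# Count call-graph entries that are bare hex addresses (no symbol resolved)
# versus all entries, reading perf report --stdio style text on stdin.
count_hex_frames() {
    awk '
        # Skip blank lines and bare tree connectors; the frame is the last field.
        NF && $NF != "|" {
            total++
            if ($NF ~ /^0x[0-9a-f]+$/) hex++
        }
        END { printf "%d of %d frames unresolved\n", hex, total }
    '
}

# Hand-written sample: two unresolved frames, two resolved ones.
printf '%s\n' \
    '|--11.65%-- 0x40a447' \
    '| 0x40659a' \
    '| main' \
    '|' \
    '| __libc_start_main' | count_hex_frames
```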
• Automation: jmaps
– We use scripts to find Java processes and dump their map files, paying attention to file
ownership etc
– https://github.com/brendangregg/FlameGraph/blob/master/jmaps
– Needs to run as close as possible to the profile, to minimize symbol churn
# perf record -F 99 -a -g -- sleep 30; jmaps
Java Flame Graph: Stacks & Symbols
[Figure: flame graph generated with flamegraph.pl --color=java. Legend: Kernel (C), Java, User (C), JVM (C++)]
Java: Inlining
A. Disabling inlining:
– -XX:-Inline No inlining
– Many more Java frames
– 80% slower (in this case)
– May not be necessary: inlined flame
graphs often make enough sense
– Or tune -XX:MaxInlineSize and
-XX:InlineSmallCode to reveal more
frames, without costing much perf: can
even go faster!
• perf may not use the most recent symbol in the log
– We tidy logs before using them:
https://raw.githubusercontent.com/brendangregg/Misc/master/perf_events/perfmaptidy.pl
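The idea behind tidying can be sketched in awk: when the same start address appears more than once in a map file, keep only the newest entry. This is a simplified illustration, not the perfmaptidy.pl logic itself, which may handle more cases (such as overlapping address ranges). It assumes the /tmp/perf-PID.map layout of one "START SIZE symbol" entry per line with later lines being newer; the symbol names in the sample are made up.

```shell
#!/bin/sh
# Keep only the newest entry for each start address in a perf map file.
# Assumes one "START SIZE symbol" entry per line, with later lines being newer.
tidy_map() {
    awk '
        !($1 in latest) { order[++n] = $1 }   # remember first-seen order of addresses
        { latest[$1] = $0 }                   # later lines overwrite earlier ones
        END { for (i = 1; i <= n; i++) print latest[order[i]] }
    '
}

# Hand-written sample: address 7f00a000 was reused by a newer compiled method.
printf '%s\n' \
    '7f00a000 40 Lcom/example/Old;::run' \
    '7f00b000 80 Lcom/example/Other;::call' \
    '7f00a000 60 Lcom/example/New;::run' | tidy_map
```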
16 NOPs in a loop
[Figure: four candidate instruction profiles, labeled (A), (B), (C), (D)]
Instruction Profiling
# perf annotate -i perf.data.noplooper --stdio
Percent | Source code & Disassembly of noplooper
--------------------------------------------------------
: Disassembly of section .text:
:
: 00000000004004ed <main>:
0.00 : 4004ed: push %rbp
0.00 : 4004ee: mov %rsp,%rbp
20.86 : 4004f1: nop
0.00 : 4004f2: nop
0.00 : 4004f3: nop
0.00 : 4004f4: nop
19.84 : 4004f5: nop
0.00 : 4004f6: nop
0.00 : 4004f7: nop
0.00 : 4004f8: nop
18.73 : 4004f9: nop
0.00 : 4004fa: nop
0.00 : 4004fb: nop
0.00 : 4004fc: nop
19.08 : 4004fd: nop
0.00 : 4004fe: nop
0.00 : 4004ff: nop
0.00 : 400500: nop
21.49 : 400501: jmp 4004f1 <main+0x4>
(Go home instruction pointer, you're drunk)
PEBS
• I believe this is due to parallel and out-of-order execution of
micro-ops: the sampled IP is the resumption instruction, not
what is currently executing. And skid.
• PEBS may help: Intel's Precise Event Based Sampling
• perf_events has support:
– perf record -e cycles:pp
– The 'p' can be specified multiple times:
• 0 - SAMPLE_IP can have arbitrary skid
• 1 - SAMPLE_IP must have constant skid
• 2 - SAMPLE_IP requested to have 0 skid
• 3 - SAMPLE_IP must have 0 skid
– … from tools/perf/Documentation/perf-list.txt
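The mapping from precise level to event modifier can be sketched as a small helper. The function name is hypothetical, and whether a given level is accepted depends on the processor and the environment, so treat the result as a request rather than a guarantee.

```shell
#!/bin/sh
# Hypothetical helper: build an event name with N 'p' modifiers (precise level 0 to 3).
precise_event() {
    event=$1
    level=$2
    case $level in
        0) echo "$event" ;;
        1) echo "$event:p" ;;
        2) echo "$event:pp" ;;
        3) echo "$event:ppp" ;;
        *) echo "precise level must be 0-3" >&2; return 1 ;;
    esac
}

precise_event cycles 2    # prints cycles:pp
```

The result can be passed to perf, for example: perf record -e "$(precise_event cycles 2)" -a -- sleep 10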
Gotcha #4 VM Guests
• Using PMCs from most VM guests:
# perf stat -a -d sleep 5
architectural
set
# perf script
[…]
java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
[…]
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
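The threshold of 200 in the block completion filter comes from counting in 512-byte sectors: 100 Kbytes is 200 sectors. A minimal sketch of that arithmetic, with a hypothetical helper name, is below. Note that ">= 200" matches "at least 100 Kbytes" exactly, whereas "> 200" excludes requests of exactly 100 Kbytes.

```shell
#!/bin/sh
# Convert a size in Kbytes to a count of 512-byte sectors.
kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Build a filter expression for "at least 100 Kbytes".
echo "nr_sector >= $(kb_to_sectors 100)"    # prints: nr_sector >= 200
```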
One-Liners: Dynamic Tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry (--add optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use --del):
perf probe -d tcp_sendmsg
# Show avail vars for the tcp_sendmsg(), plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Add a tracepoint for tcp_sendmsg() line 81 with local var seglen (debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for %cx register:
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) var is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size > 0, and state != TCP_ESTABLISHED(1) (debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
Copy-n-paste!
All other instances (of the same kernel version):
# perf probe 'tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8'
Failed to find path of kernel module.
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg with size=%cx:u64 skc_state=+18(%si):u8)
http://www.brendangregg.com/blog/2014-07-01/perf-heat-maps.html
There's still a lot more to perf…
• Using PMCs
• perf scripting interface
• perf + eBPF
• perf sched
• perf timechart
• perf trace
• perf c2c (new!)
• perf ftrace (new!)
• …
Links & References
• perf_events
• Kernel source: tools/perf/Documentation
• https://perf.wiki.kernel.org/index.php/Main_Page
• http://www.brendangregg.com/perf.html
• http://web.eece.maine.edu/~vweaver/projects/perf_events/
• Mailing list http://vger.kernel.org/vger-lists.html#linux-perf-users
• perf-tools: https://github.com/brendangregg/perf-tools
• PMU tools: https://github.com/andikleen/pmu-tools
• perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html
• Java frame pointer patch
• http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-December/016477.html
• https://bugs.openjdk.java.net/browse/JDK-8068945
• Node.js: http://techblog.netflix.com/2014/11/nodejs-in-flames.html
• Methodology: http://www.brendangregg.com/methodology.html
• Flame graphs: http://www.brendangregg.com/flamegraphs.html
• Heat maps: http://www.brendangregg.com/heatmaps.html
• eBPF: http://lwn.net/Articles/603983/
Thank You
– Questions?
– http://www.brendangregg.com
– http://slideshare.net/brendangregg
– @brendangregg