Linux Profiling at Netflix: Using perf_events (aka "perf")
2015
Brendan Gregg
Senior Performance Architect
Performance Engineering Team
[email protected] @brendangregg
This Talk
• This talk is about Linux profiling using perf_events
– How to get CPU profiling to work, and overcome gotchas
– A tour of perf_events and its features
• This is based on our use of perf_events at Netflix
• Massive Amazon EC2 Linux cloud
– Tens of thousands of instances
– Autoscale by ~3k each day
– CentOS and Ubuntu, Java and Node.js
• FreeBSD for content delivery
– Approx 33% of the US Internet traffic at night
• Performance is critical
– Customer satisfaction: >50M subscribers
– $$$ price/performance
– Develop tools for cloud-wide analysis, and
make them open source: NetflixOSS
– Use server tools as needed
Agenda
1. Why We Need Linux Profiling
2. Crash Course
3. CPU Profiling
4. Gotchas
– Stacks (gcc, Java)
– Symbols (Node.js, Java)
– Guest PMCs
– PEBS
– Overheads
5. Tracing
1. Why We Need Linux Profiling
Why We Need Linux Profiling
• Our primary motivation is simple:
Understand CPU usage quickly and completely
• Quickly: [screenshot of Netflix Vector]
• Completely: [flame graph spanning the kernel (C), the JVM (C++), and Java]
Value for Netflix
• Uses for CPU profiling:
– Help with incident response
– Non-regression testing
– Software evaluations
– Identify performance tuning targets
– Part of CPU workload characterization
• Built into Netflix Vector
– A near real-time instance analysis tool (will be NetflixOSS)
Workload Characterization
• For CPUs:
1. Who
2. Why
3. What
4. How
Workload Characterization
• For CPUs: [diagram mapping CPU tools to the Who, Why, What, and How questions]
Note: sleep 10 is a dummy command to set the duration
perf Actions
• Count events (perf stat …)
– Uses an efficient in-kernel counter, and prints the results!
• Sample events (perf record …)
– Records details of every event to a dump file (perf.data)
• Timestamp, CPU, PID, instruction pointer, …
– This incurs higher overhead, relative to the rate of events!
– Include the call graph (stack trace) using -g!
• Other actions include:
– List events (perf list)
– Report from a perf.data file (perf report)
– Dump a perf.data file as text (perf script)
– top style profiling (perf top)
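A minimal sketch of the count and sample actions listed above (the event, sampling rate, and duration here are arbitrary examples, not recommendations):
# Count: in-kernel counters, low overhead; totals are printed on exit:
perf stat -e context-switches -a -- sleep 10
# Sample: per-event details written to perf.data; -g adds stack traces:
perf record -F 99 -a -g -- sleep 10
perf report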
perf Actions: Workflow
• Typical workflow: perf record → perf.data → perf report (text UI) or perf script (dump) → stackcollapse-perf.pl → flamegraph.pl → flame graph visualization
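A hedged end-to-end sketch of that workflow (output file names are arbitrary; stackcollapse-perf.pl and flamegraph.pl are from the FlameGraph repository covered later):
perf record -F 99 -a -g -- sleep 30       # writes perf.data
perf report                               # summarize it as a text UI
perf script > out.stacks                  # dump it as text for post-processing
./stackcollapse-perf.pl out.stacks | ./flamegraph.pl > out.svg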
perf Events
• Custom timers
– e.g., 99 Hertz (samples per second)
• Hardware events
– CPU Performance Monitoring Counters (PMCs)
• Tracepoints
– Statically defined in software
• Dynamic tracing
– Created using uprobes (user) or kprobes (kernel)
– Can do kernel line tracing with local variables (needs kernel
debuginfo)
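An illustrative sketch using one event of each kind (the event names are examples from perf list; the kprobe requires root and a kernel that exports the tcp_sendmsg symbol):
perf stat -e instructions,cycles -a -- sleep 10            # hardware events (PMCs)
perf stat -e context-switches -a -- sleep 10               # software event
perf record -e sched:sched_process_exec -a -- sleep 10     # tracepoint
perf probe --add tcp_sendmsg                               # create a dynamic (kprobe) event
perf record -e probe:tcp_sendmsg -a -- sleep 10
perf probe --del tcp_sendmsg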
perf Events: Map
perf Events: List
# perf list!
List of pre-defined events (to be used in -e):!
cpu-cycles OR cycles [Hardware event]!
instructions [Hardware event]!
cache-references [Hardware event]!
cache-misses [Hardware event]!
branch-instructions OR branches [Hardware event]!
branch-misses [Hardware event]!
bus-cycles [Hardware event]!
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]!
stalled-cycles-backend OR idle-cycles-backend [Hardware event]!
[…]!
cpu-clock [Software event]!
task-clock [Software event]!
page-faults OR faults [Software event]!
context-switches OR cs [Software event]!
cpu-migrations OR migrations [Software event]!
[…]!
L1-dcache-loads [Hardware cache event]!
L1-dcache-load-misses [Hardware cache event]!
L1-dcache-stores [Hardware cache event]!
[…] !
skb:kfree_skb [Tracepoint event]!
skb:consume_skb [Tracepoint event]!
skb:skb_copy_datagram_iovec [Tracepoint event]!
net:net_dev_xmit [Tracepoint event]!
net:net_dev_queue [Tracepoint event]!
net:netif_receive_skb [Tracepoint event]!
net:netif_rx [Tracepoint event]!
[…]!
perf Scope
• System-wide: all CPUs (-a)
• Target PID (-p PID)
• Target command (…)
• Specific CPUs (-C …)
• User-level only (<event>:u)
• Kernel-level only (<event>:k)
• A custom filter to match variables (--filter …)
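A few hedged examples of narrowing scope (the PID, CPU, and filter values are placeholders):
perf stat -e instructions:u -p 1234 -- sleep 10                     # one PID, user-level only
perf record -F 99 -g -C 0 -- sleep 10                               # CPU 0 only
perf record -e block:block_rq_insert --filter 'nr_sector > 8' -a    # in-kernel filter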
[diagram: timed stack samples (A, B) over time; sampling captures on-CPU stacks, but not off-CPU time spent blocked in a syscall or waiting on interrupts]
perf Screenshot
• Sampling full stack traces at 99 Hertz:
# perf record -F 99 -ag -- sleep 30!
[ perf record: Woken up 9 times to write data ]!
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]!
# perf report -n --stdio!
1.40% 162 java [kernel.kallsyms] [k] _raw_spin_lock
|!
--- _raw_spin_lock!
| !
|--63.21%-- try_to_wake_up!
| | !
| |--63.91%-- default_wake_function!
| | | !
| | |--56.11%-- __wake_up_common!
| | | __wake_up_locked!
| | | ep_poll_callback!
| | | __wake_up_common!
| | | __wake_up_sync_key!
| | | | !
| | | |--59.19%-- sock_def_readable!
[…78,000 lines truncated…]!
perf Reporting
• perf report summarizes by combining common paths
• Previous output truncated 78,000 lines of summary
• The following is what a mere 8,000 lines looks like…
perf report [screenshot: the ~8,000 line text output]
… as a Flame Graph [the same profile rendered as a flame graph]
Flame Graphs
git clone --depth 1 https://github.com/brendangregg/FlameGraph!
cd FlameGraph!
perf record -F 99 -a -g -- sleep 30!
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg!
• Flame Graphs:
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random, or hue can be a dimension
• e.g., software type, or difference between two profiles for
non-regression testing ("differential flame graphs")
– interpretation: top edge is on-CPU, beneath it is ancestry
• Just a Perl program to convert perf stacks into SVG
– Includes JavaScript: open in a browser for interactivity
• Easy to get working
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
4. Gotchas
When you try to use perf
• Stacks don't work (missing)
• Symbols don't work (hex numbers)
• Can't profile Java
• Can't profile Node.js/io.js
• PMCs aren't available
• Dynamic tracing function arguments don't work
• perf locks up
How to really get started
1. Get "perf" to work
2. Get stack walking to work
3. Fix symbol translation
4. Get IPC to work
5. Test perf under load
Gotcha #1 Broken Stacks
• Java stack traces are only one or two frames deep, ending in junk values:
# perf script!
java 4579 cpu-clock: !
7f417908c10b [unknown] (/tmp/perf-4458.map)!
java 4579 cpu-clock: !
7f4179101c97 [unknown] (/tmp/perf-4458.map)!
!
java 4579 cpu-clock: !
7f41792fc65f [unknown] (/tmp/perf-4458.map)!
a2d53351ff7da603 [unknown] ([unknown])!
!
java 4579 cpu-clock: !
7f4179349aec [unknown] (/tmp/perf-4458.map)!
!
java 4579 cpu-clock: !
7f4179101d0f [unknown] (/tmp/perf-4458.map)!
!
java 4579 cpu-clock: !
7f417908c194 [unknown] (/tmp/perf-4458.map)!
[…]!
Fixed Java Stacks
• With JDK-8068945, stacks are full, and go all the way to start_thread()
• This is what the CPUs are really running: inlined frames are not present
# perf script!
[…]!
java 8131 cpu-clock: !
7fff76f2dce1 [unknown] ([vdso])!
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…!
7fd301861e46 [unknown] (/tmp/perf-8131.map)!
7fd30184def8 [unknown] (/tmp/perf-8131.map)!
7fd30174f544 [unknown] (/tmp/perf-8131.map)!
7fd30175d3a8 [unknown] (/tmp/perf-8131.map)!
7fd30166d51c [unknown] (/tmp/perf-8131.map)!
7fd301750f34 [unknown] (/tmp/perf-8131.map)!
7fd3016c2280 [unknown] (/tmp/perf-8131.map)!
7fd301b02ec0 [unknown] (/tmp/perf-8131.map)!
7fd3016f9888 [unknown] (/tmp/perf-8131.map)!
7fd3016ece04 [unknown] (/tmp/perf-8131.map)!
7fd30177783c [unknown] (/tmp/perf-8131.map)!
7fd301600aa8 [unknown] (/tmp/perf-8131.map)!
7fd301a4484c [unknown] (/tmp/perf-8131.map)!
7fd3010072e0 [unknown] (/tmp/perf-8131.map)!
7fd301007325 [unknown] (/tmp/perf-8131.map)!
7fd301007325 [unknown] (/tmp/perf-8131.map)!
7fd3010004e7 [unknown] (/tmp/perf-8131.map)!
7fd3171df76a JavaCalls::call_helper(JavaValue*,…!
7fd3171dce44 JavaCalls::call_virtual(JavaValue*…!
7fd3171dd43a JavaCalls::call_virtual(JavaValue*…!
7fd31721b6ce thread_entry(JavaThread*, Thread*)…!
7fd3175389e0 JavaThread::thread_main_inner() (/…!
7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…!
7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…!
7fd317a7e182 start_thread (/lib/x86_64-linux-gn…!
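A hedged sketch of putting this into practice (it assumes a JDK where the JDK-8068945 fix is available as -XX:+PreserveFramePointer, and the third-party perf-map-agent tool for the symbol map; MyApp and the script path are placeholders):
java -XX:+PreserveFramePointer MyApp &                  # keep frame pointers so perf can walk stacks
perf record -F 99 -p $(pgrep -n java) -g -- sleep 30
# write /tmp/perf-<pid>.map so perf can resolve JIT-compiled symbols
./perf-map-agent/bin/create-java-perf-map.sh $(pgrep -n java)
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > java.svg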
Fixed Stacks Flame Graph
• [flame graph: Java frames are now present, but without symbols (hex addresses only)]
Gotcha #2 Missing Symbols
• Missing symbols should be obvious in perf report/script:
71.79% 334 sed sed [.] 0x000000000001afc1!
| !
|--11.65%-- 0x40a447!
| 0x40659a!
| 0x408dd8!
| 0x408ed1!
| 0x402689!
| 0x7fa1cd08aec5!
[flame graph with symbols fixed: Kernel, JVM, and Java frames all resolved]
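For JIT runtimes, perf resolves symbols via a map file at /tmp/perf-<pid>.map: one line per symbol, giving a hex start address, size, and name. A purely illustrative example (the addresses and names are made up; perf-map-agent is one tool that generates this file for Java):
# /tmp/perf-8131.map (illustrative contents)
7fd301861e46 1a8 Lcom/example/Foo;::doWork
7fd30184def8 2f0 Lcom/example/Bar;::handleRequest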
Java: Inlining
• Disabling inlining (flame graph labelled "No inlining"):
– -XX:-Inline
– Many more Java frames
– 80% slower (in this case)
• Not really necessary
– Inlined flame graphs often
make enough sense
– Or tune -XX:MaxInlineSize
and -XX:InlineSmallCode a
little to reveal more frames,
without costing much perf
– Can even go faster!
Node.js: Stacks & Symbols
• Covered previously on the Netflix Tech Blog
http://techblog.netflix.com/2014/11/nodejs-in-flames.html
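A hedged sketch for Node.js (it assumes a Node.js/io.js build recent enough to support the V8 --perf-basic-prof option, which writes JIT symbols to /tmp/perf-<pid>.map; app.js is a placeholder):
node --perf-basic-prof app.js &
perf record -F 99 -p $! -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > node.svg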
Gotcha #3 Guest PMCs
• Using PMCs from a Xen guest (currently):
# perf stat -a -d sleep 5!
!
Performance counter stats for 'system wide':!
!
10003.718595 task-clock (msec) # 2.000 CPUs utilized [100.00%]!
323 context-switches # 0.032 K/sec [100.00%]!
17 cpu-migrations # 0.002 K/sec [100.00%]!
233 page-faults # 0.023 K/sec !
<not supported> cycles !
<not supported> stalled-cycles-frontend !
<not supported> stalled-cycles-backend !
<not supported> instructions !
<not supported> branches !
<not supported> branch-misses !
<not supported> L1-dcache-loads !
<not supported> L1-dcache-load-misses !
<not supported> LLC-loads !
<not supported> LLC-load-misses !
!
5.001607197 seconds time elapsed!
Guest PMCs
• Without PMCs, %CPU is ambiguous. We can't measure:
– Instructions Per Cycle (IPC)
– CPU cache hits/misses
– MMU TLB hits/misses
– Branch misprediction
– Stall cycles
• Should be fixable: hypervisors can expose PMCs
– At the very least, enough PMCs for IPC to work:
INST_RETIRED.ANY_P & CPU_CLK_UNHALTED.THREAD_P
• In the meantime
– I'm using a physical server for PMC analysis
– Also reading some MSRs in the cloud (next slide)
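For reference, a hedged sketch of what the IPC measurement looks like on a host where PMCs are available (perf's generic instructions and cycles events roughly correspond to the two counters named above):
perf stat -e instructions,cycles -a -- sleep 10
# perf stat prints "insns per cycle"; IPC = instructions / cycles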
MSRs
• Model Specific Registers (MSRs) may be exposed when
PMCs are not
• Better than nothing. Can solve some issues.
# ./showboost!
CPU MHz : 2500!
Turbo MHz : 2900 (10 active)!
Turbo Ratio : 116% (10 active)!
CPU 0 summary every 5 seconds...!
!
TIME C0_MCYC C0_ACYC UTIL RATIO MHz!
17:28:03 4226511637 4902783333 33% 116% 2900!
17:28:08 4397892841 5101713941 35% 116% 2900!
17:28:13 4550831380 5279462058 36% 116% 2900!
17:28:18 4680962051 5429605341 37% 115% 2899!
17:28:23 4782942155 5547813280 38% 115% 2899!
[...]!
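A hedged sketch of reading CPU frequency MSRs directly with msr-tools, which is similar in spirit to what a tool like showboost reports (this is an assumption about one way to do it, not its implementation; 0xE7/0xE8 are IA32_MPERF/IA32_APERF on Intel, and this needs root plus the msr kernel module):
modprobe msr
rdmsr -p 0 0xe7        # IA32_MPERF: counts at the base (non-turbo) frequency
rdmsr -p 0 0xe8        # IA32_APERF: counts at the actual frequency
# sample both twice; delta(APERF)/delta(MPERF) gives the turbo ratio while not idle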
Gotcha #4 Instruction Profiling
# perf annotate -i perf.data.noplooper --stdio!
Percent | Source code & Disassembly of noplooper!
--------------------------------------------------------!
: Disassembly of section .text:!
:!
: 00000000004004ed <main>:!
0.00 : 4004ed: push %rbp!
0.00 : 4004ee: mov %rsp,%rbp!
20.86 : 4004f1: nop!
0.00 : 4004f2: nop!
0.00 : 4004f3: nop!
0.00 : 4004f4: nop!
19.84 : 4004f5: nop!
0.00 : 4004f6: nop!
0.00 : 4004f7: nop!
0.00 : 4004f8: nop!
18.73 : 4004f9: nop!
0.00 : 4004fa: nop!
0.00 : 4004fb: nop!
0.00 : 4004fc: nop!
19.08 : 4004fd: nop!
 0.00 : 4004fe: nop!
 0.00 : 4004ff: nop!
 0.00 : 400500: nop!
21.49 : 400501: jmp 4004f1 <main+0x4>!
("Go home instruction pointer, you're drunk")
PEBS
• I believe this is sample "skid", plus parallel and out-of-
order execution of micro-ops: the sampled IP is the
resumption instruction, not what is currently executing.
• PEBS may help: Intel's Precise Event Based Sampling
• perf_events has support:
– perf record -e cycles:pp!
– The 'p' can be specified multiple times:
• 0 - SAMPLE_IP can have arbitrary skid
• 1 - SAMPLE_IP must have constant skid
• 2 - SAMPLE_IP requested to have 0 skid
• 3 - SAMPLE_IP must have 0 skid
– … from tools/perf/Documentation/perf-list.txt
Gotcha #5 Overhead
• Overhead is relative to the rate of events instrumented
• perf stat does in-kernel counts, with relatively low
CPU overhead
• perf record writes perf.data, which has slightly
higher CPU overhead, plus file system and disk I/O
• Test before use
– In the lab
– Run perf stat to understand rate, before perf record
• Also consider --filter, to filter events in-kernel
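A hedged example of that workflow (the tracepoint is arbitrary; the point is to count cheaply first, then record):
# 1. check the event rate with in-kernel counting (low overhead)
perf stat -e block:block_rq_insert -a -- sleep 10
# 2. if the rate looks manageable, record it (higher overhead: writes perf.data)
perf record -e block:block_rq_insert -a -g -- sleep 10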
5. Tracing
Profiling vs Tracing
• Profiling takes samples. Tracing records every event.
• There are many tracers for Linux (SystemTap, ktap, etc),
but only two in mainline: perf_events and ftrace
• The tracing stack:
– one-liners
– many front-end tools: perf, perf-tools
– tracing frameworks: perf_events, ftrace, eBPF, …
– tracing instrumentation: tracepoints, kprobes, uprobes
Tracing Example
# perf record -e block:block_rq_insert -a!
^C[ perf record: Woken up 1 times to write data ]!
[ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ]!
!
# perf script!
[…]!
java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]!
java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]!
java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]!
java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]!
[…]!
• Related: turning perf trace data into latency heat maps:
http://www.brendangregg.com/blog/2014-07-01/perf-heat-maps.html
One-Liners: Static Tracing
# Trace new processes, until Ctrl-C:!
perf record -e sched:sched_process_exec -a!
!
# Trace all context-switches with stack traces, for 1 second:!
perf record -e context-switches -ag -- sleep 1!
!
# Trace CPU migrations, for 10 seconds:!
perf record -e migrations -a -- sleep 10!
!
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:!
perf record -e syscalls:sys_enter_connect -ag!
!
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:!
perf record -e block:block_rq_insert -ag!
!
# Trace all block device issues and completions (has timestamps), until Ctrl-C:!
perf record -e block:block_rq_issue -e block:block_rq_complete -a!
!
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:!
perf record -e block:block_rq_complete --filter 'nr_sector > 200'!
!
# Trace all block completions, synchronous writes only, until Ctrl-C:!
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'!
!
# Trace all block completions, all types of writes, until Ctrl-C:!
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'!
!
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:!
perf record -e 'ext4:*' -o /tmp/perf.data -a!
Tracepoint Variables
• Some previous one-liners used variables with --filter
• The ftrace interface has a way to print them:
# cat /sys/kernel/debug/tracing/events/block/block_rq_insert/format
name: block_rq_insert
ID: 884
format:
    field:unsigned short common_type;           offset:0;  size:2;  signed:0;
    field:unsigned char common_flags;           offset:2;  size:1;  signed:0;
    field:unsigned char common_preempt_count;   offset:3;  size:1;  signed:0;
    field:int common_pid;                       offset:4;  size:4;  signed:1;

    field:dev_t dev;                            offset:8;  size:4;  signed:0;
    field:sector_t sector;                      offset:16; size:8;  signed:0;
    field:unsigned int nr_sector;               offset:24; size:4;  signed:0;
    field:unsigned int bytes;                   offset:28; size:4;  signed:0;
    field:char rwbs[8];                         offset:32; size:8;  signed:1;
    field:char comm[16];                        offset:40; size:16; signed:1;
    field:__data_loc char[] cmd;                offset:56; size:4;  signed:1;

print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)),
((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, REC->bytes, __get_str(cmd),
(unsigned long long)REC->sector, REC->nr_sector, REC->comm
• The dev, sector, nr_sector, bytes, rwbs, comm, and cmd fields are the variables usable in --filter expressions; the print fmt line shows the format string internals
One-Liners: Dynamic Tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry (--add optional):!
perf probe --add tcp_sendmsg!
!
# Remove the tcp_sendmsg() tracepoint (or use --del):!
perf probe -d tcp_sendmsg!
!
# Add a tracepoint for the kernel tcp_sendmsg() function return:!
perf probe 'tcp_sendmsg%return'!
!
# Show avail vars for the tcp_sendmsg(), plus external vars (needs debuginfo):!
perf probe -V tcp_sendmsg --externs!
!
# Show available line probes for tcp_sendmsg() (needs debuginfo):!
perf probe -L tcp_sendmsg!
!
# Add a tracepoint for tcp_sendmsg() line 81 with local var seglen (debuginfo):!
perf probe 'tcp_sendmsg:81 seglen'!
!
# Add a tracepoint for do_sys_open() with the filename as a string (debuginfo):!
perf probe 'do_sys_open filename:string'!
!
# Add a tracepoint for myfunc() return, and include the retval as a string:!
perf probe 'myfunc%return +0($retval):string'!
!
# Add a tracepoint for the user-level malloc() function from libc:!
perf probe -x /lib64/libc.so.6 malloc!
!
# List currently available dynamic probes:!
perf probe -l!
One-Liners: Advanced Dynamic Tracing
# Add a tracepoint for tcp_sendmsg(), with three entry regs (platform specific):!
perf probe 'tcp_sendmsg %ax %dx %cx'!
!
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for %cx register:!
perf probe 'tcp_sendmsg bytes=%cx'!
!
# Trace previously created probe when the bytes (alias) var is greater than 100:!
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'!
!
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:!
perf probe 'tcp_sendmsg%return $retval'!
!
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (debuginfo):!
perf probe 'tcp_sendmsg size'!
!
# Add a tracepoint for tcp_sendmsg(), with size and socket state (debuginfo):!
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'!
!
# Trace previous probe when size > 0, and state != TCP_ESTABLISHED(1) (debuginfo):!
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a!
Copy-n-paste!
All other instances (of the same kernel version):
# perf probe 'tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8'!
Failed to find path of kernel module.!
Added new event:!
probe:tcp_sendmsg (on tcp_sendmsg with size=%cx:u64 skc_state=+18(%si):u8)!
!
You can now use it in all perf tools, such as:!
!
    perf record -e probe:tcp_sendmsg -aR sleep 1!
[heat map: I/O latency over time, from low to high; a low-latency band of cache hits and a higher-latency band of device I/O]
Linux Profiling Future
• eBPF is being integrated, and provides the final missing piece
of tracing infrastructure: efficient kernel programming
– perf_events + eBPF?
– ftrace + eBPF?
– Other tracers + eBPF?
• At Netflix, the future is Vector, and more self-service
automation of perf_events
Summary & Your Action Items
• Short term: get full CPU profiling to work
A. Automate perf CPU profiles with flame graphs. See this talk!
B. … or use Netflix Vector when it is open sourced
C. … or ask performance monitoring vendors for this
– Most importantly, you should expect that full CPU profiles are
available at your company. The ROI is worth it.
• Long term: PMCs & tracing
– Use perf_events to profile other targets: CPU cycles, file system I/O, disk I/O, memory usage, …
• Go forth and profile!
The "real" checklist reminder:
1. Get "perf" to work
2. Get stack walking to work
3. Fix symbol translation
4. Get IPC to work
5. Test perf under load
Links & References
• perf_events
• Kernel source: tools/perf/Documentation
• https://perf.wiki.kernel.org/index.php/Main_Page
• http://www.brendangregg.com/perf.html
• http://web.eece.maine.edu/~vweaver/projects/perf_events/
• Mailing list http://vger.kernel.org/vger-lists.html#linux-perf-users
• perf-tools: https://github.com/brendangregg/perf-tools
• PMU tools: https://github.com/andikleen/pmu-tools
• perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html
• Java frame pointer patch
• http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-December/016477.html
• https://bugs.openjdk.java.net/browse/JDK-8068945
• Node.js: http://techblog.netflix.com/2014/11/nodejs-in-flames.html
• Methodology: http://www.brendangregg.com/methodology.html
• Flame graphs: http://www.brendangregg.com/flamegraphs.html
• Heat maps: http://www.brendangregg.com/heatmaps.html
• eBPF: http://lwn.net/Articles/603983/
Thanks
• Questions?
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• [email protected]
– Performance and Reliability Engineering
• @brendangregg