JavaOne 2016: Java Flame Graphs
[Figure: flame graph color legend: C++ (JVM), Java, C (User), Java (Inlined), C (Kernel)]
Cloud
• Tens of thousands of AWS EC2 instances
• Mostly Java (Oracle JVM)
[Figure: instance diagram. Labels: Auto Scaling Group; Instance: usually Ubuntu Linux; CloudWatch, servo; Node.js, …; GC and latency, …; Tomcat; thread dump; Atlas, Vector, S3 logging; logs, sar, trace, perf, perf-tools (BPF soon); hystrix, servo]
The Problem with Profilers
Java Profilers
[Figure: Java profilers. Labels: Kernel, libraries, JVM; Java; GC]
• Visibility
– Java method execution
– Object usage
– GC logs
– Custom Java context
• Typical problems:
– Sampling often happens at safepoints/yield points (skew)
– Method tracing has massive observer effect
– Misidentifies RUNNING as on-CPU (e.g., epoll)
– Doesn't include or profile GC or JVM CPU time
– Tree views are not proportional, so they are not quick to comprehend
• Inaccurate (skewed) and incomplete profiles
System Profilers
[Figure: system profilers. Labels: Java, Kernel, TCP/IP, JVM, GC, Locks, epoll, Idle, Time, thread]
• Visibility
– JVM (C++)
– GC (C++)
– libraries (C)
– kernel (C)
• Typical problems (x86):
– Stacks missing for Java
– Symbols missing for Java methods
• Other architectures (e.g., SPARC) have fared better
• Profile everything except Java
Workaround
• Capture both:
[Figure: a Java profile and a system profile. Labels: Kernel, Java, JVM, GC]
Solution
• Fix system profiling, see everything:
– Java methods
– JVM (C++)
– GC (C++)
– libraries (C)
– kernel (C)
– Other apps
[Figure: a single profile. Labels: Kernel, Java, GC, JVM]
• Minor Problems:
– 0-3% CPU overhead to enable frame pointers (usually <1%).
– Symbol dumps can consume a burst of CPU
• Complete and accurate (asynchronous) profiling
Saving 13M CPU Minutes Per Day
• http://techblog.netflix.com/2016/04/saving-13-million-computational-minutes.html
System Example
GC internals, visualized:
CPU Profiling
• Record stacks at a timed interval: simple and effective
– Pros: Low (deterministic) overhead
– Cons: Coarse accuracy, but usually sufficient
[Figure: stack samples over time (A, with B running on top of A). Labels: syscall, on-CPU, off-CPU time, block, interrupt]
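Timed sampling can be illustrated with a small simulation. This is a sketch only: the timeline, the 10 ms interval, and the function names are made up for illustration, and a real profiler samples from a timer interrupt rather than by walking a list.

```python
from collections import Counter

# Hypothetical timeline of one thread: (start_ms, end_ms, stack or None).
# None marks off-CPU time (e.g., blocked in a syscall), which a CPU
# profiler does not sample.
TIMELINE = [
    (0, 30, ("A",)),
    (30, 50, ("A", "B")),
    (50, 80, None),
    (80, 100, ("A", "B")),
]

def stack_at(timeline, t_ms):
    """Return the stack running at time t_ms, or None if off-CPU."""
    for start, end, stack in timeline:
        if start <= t_ms < end:
            return stack
    return None

def sample(timeline, interval_ms, duration_ms):
    """Record the on-CPU stack every interval_ms; skip off-CPU instants."""
    counts = Counter()
    for t_ms in range(0, duration_ms, interval_ms):
        stack = stack_at(timeline, t_ms)
        if stack is not None:
            counts[stack] += 1
    return counts

samples = sample(TIMELINE, interval_ms=10, duration_ms=100)
```

The sample counts approximate on-CPU time per stack. A longer interval lowers overhead and coarsens accuracy, which is the trade-off listed above.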
Stack Traces
• A code path snapshot. e.g., from jstack(1):
$ jstack 1819
[…]
"main" prio=10 tid=0x00007ff304009000
nid=0x7361 runnable [0x00007ff30d4f9000]
java.lang.Thread.State: RUNNABLE
at Func_abc.func_c(Func_abc.java:6)     <- running
at Func_abc.func_b(Func_abc.java:16)    <- parent
at Func_abc.func_a(Func_abc.java:23)    <- grandparent
at Func_abc.main(Func_abc.java:27)      <- great-grandparent
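A stack trace like this one can be reduced to a single "folded" line, root first and semicolon separated, which is the input format flame graph tools commonly consume. A minimal sketch, assuming the jstack layout shown above; the regular expression and function name are illustrative.

```python
import re

JSTACK_THREAD = """\
"main" prio=10 tid=0x00007ff304009000 nid=0x7361 runnable [0x00007ff30d4f9000]
   java.lang.Thread.State: RUNNABLE
        at Func_abc.func_c(Func_abc.java:6)
        at Func_abc.func_b(Func_abc.java:16)
        at Func_abc.func_a(Func_abc.java:23)
        at Func_abc.main(Func_abc.java:27)
"""

# Matches "at Class.method(" and captures "Class.method".
FRAME_RE = re.compile(r"^\s*at\s+([^\s(]+)\(")

def fold_jstack_thread(text):
    """Return the thread's frames as 'root;...;leaf'."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(1))
    # jstack prints the running frame first; folded stacks start at the root.
    return ";".join(reversed(frames))

folded = fold_jstack_thread(JSTACK_THREAD)
```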
System Profilers
• Linux
– perf_events (aka "perf")
• Oracle Solaris
– DTrace
• OS X
– Instruments
• Windows
– XPerf, WPA (which now has flame graphs!)
• And many others…
Linux perf_events
• Standard Linux profiler
– Provides the perf command (multi-tool)
– Usually added by the linux-tools-common package, etc.
• Many event sources:
– Timer-based sampling
– Hardware events
– Tracepoints
– Dynamic tracing
• Can sample stacks of (almost) everything on CPU
– Can miss hard interrupt ISRs, but these should be near-zero. They can
be measured if needed (I wrote my own tools)
perf Profiling
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n --stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
Full perf report Output
… as a Flame Graph
Flame Graphs
• Flame Graphs:
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random (default), or a dimension
• Currently made from Perl + SVG + JavaScript
– Multiple d3 versions are being developed
• References:
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– http://queue.acm.org/detail.cfm?id=2927301
– "The Flame Graph" CACM, June 2016
• Easy to make
– Converters for many profilers
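The merging and the alphabetical x-axis can be sketched as follows. This is an illustration of the layout idea, not the Perl implementation: the folded input, the "all" root frame, and the function names are assumptions.

```python
def build_tree(folded):
    """Merge folded stacks ({'a;b;c': count}) into one tree of frames."""
    root = {"name": "all", "value": 0, "children": {}}
    for stack, count in folded.items():
        node = root
        node["value"] += count
        for frame in stack.split(";"):
            node = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}})
            node["value"] += count
    return root

def layout(node, x=0, depth=0, out=None):
    """Assign each frame a rectangle: (name, depth, x offset, width)."""
    if out is None:
        out = []
    out.append((node["name"], depth, x, node["value"]))
    child_x = x
    # Children are sorted alphabetically, not by time, so that identical
    # frames end up adjacent and merge into one rectangle.
    for name in sorted(node["children"]):
        child = node["children"][name]
        layout(child, child_x, depth + 1, out)
        child_x += child["value"]
    return out

# Hypothetical folded stacks with sample counts.
FOLDED = {"main;b;c": 2, "main;a": 3, "main;b": 1, "main;b;c;d": 1}
rects = layout(build_tree(FOLDED))
```

Each rectangle's width is its sample count and its depth is the stack depth, matching the axes described above.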
Flame Graph Interpretation
g()
e() f()
d()
c() i()
b() h()
a()
Flame Graph Interpretation (1/3)
Top edge shows who is running on-CPU,
and how much (width)
g()
e() f()
d()
c() i()
b() h()
a()
Flame Graph Interpretation (2/3)
Top-down shows ancestry
e.g., from g():
g()
e() f()
d()
c() i()
b() h()
a()
Flame Graph Interpretation (3/3)
Widths are proportional to presence in samples
e.g., comparing b() to h() (incl. children)
g()
e() f()
d()
c() i()
b() h()
a()
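The three readings can be checked numerically on the a() to i() example. The sample counts below are invented for illustration; only the call structure follows the figure. The sketch counts per function name, which matches per-rectangle widths here because each function appears at one position only.

```python
from collections import Counter

# Hypothetical sample counts for the a()..i() example, as folded stacks.
FOLDED = {
    "a;b;c;d;e": 2,
    "a;b;c;d;f": 1,
    "a;b;c;d;f;g": 3,
    "a;b;c": 1,
    "a;h;i": 2,
    "a;h": 1,
}

def self_and_total(folded):
    """Return (on-CPU samples per function, samples including children)."""
    self_counts, total_counts = Counter(), Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        self_counts[frames[-1]] += count      # top edge: running on-CPU
        for frame in set(frames):             # set(): count recursion once
            total_counts[frame] += count      # width: presence in samples
    return self_counts, total_counts

self_counts, total_counts = self_and_total(FOLDED)
all_samples = sum(FOLDED.values())
b_percent = 100.0 * total_counts["b"] / all_samples
h_percent = 100.0 * total_counts["h"] / all_samples
```

With these counts, b() and its children are present in 7 of 10 samples and h() in 3 of 10, so b() is drawn more than twice as wide.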
Mixed-Mode Flame Graphs
• Hues:
– green == Java
– aqua == Java (inlined)
  • if included
– red == system
– yellow == C++
• Intensity:
– Randomized to differentiate frames
– Or hashed on function name
[Figure: mixed-mode flame graph. Labels: Kernel, Java, JVM]
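A palette along these lines can be sketched as below. The RGB values, the type names, and the use of CRC32 as the hash are assumptions for illustration; they are not the values the flame graph software uses.

```python
import zlib

# Base RGB per frame type (illustrative values only).
HUES = {
    "java": (60, 200, 60),       # green
    "inlined": (60, 200, 200),   # aqua
    "system": (200, 60, 60),     # red
    "cpp": (200, 200, 60),       # yellow
}

def frame_color(name, frame_type):
    """Pick the hue from the frame type and the intensity from the name."""
    r, g, b = HUES[frame_type]
    # crc32 is stable across runs, so a function keeps the same shade in
    # every flame graph; Python's built-in hash() of str is not stable.
    factor = 0.75 + 0.25 * (zlib.crc32(name.encode("utf-8")) % 100) / 100
    return (int(r * factor), int(g * factor), int(b * factor))

color = frame_color("java/util/HashMap::get", "java")
```

Hashing on the function name keeps colors consistent between two flame graphs, which helps when comparing them side by side.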
Differential Flame Graphs
• Hues:
– red == more samples
– blue == fewer samples
• Intensity:
– Degree of difference
• Compares two profiles
• Can show other metrics: e.g., CPI
• Other types exist
– flamegraphdiff
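The red/blue coloring can be sketched from two folded profiles. This is a simplified illustration with invented counts and color math; it does not normalize for different total sample counts, which a real comparison may need.

```python
def diff_profiles(before, after):
    """Return {stack: delta}, where delta = after - before sample counts."""
    stacks = set(before) | set(after)
    return {s: after.get(s, 0) - before.get(s, 0) for s in stacks}

def diff_color(delta, max_abs_delta):
    """Red for more samples, blue for fewer; saturation grows with |delta|."""
    if delta == 0 or max_abs_delta == 0:
        return (255, 255, 255)
    fade = int(200 * (1 - abs(delta) / max_abs_delta))  # 0 = most saturated
    return (255, fade, fade) if delta > 0 else (fade, fade, 255)

# Hypothetical profiles taken before and after a change.
BEFORE = {"main;a": 5, "main;b": 5}
AFTER = {"main;a": 9, "main;b": 3, "main;c": 1}

deltas = diff_profiles(BEFORE, AFTER)
largest = max(abs(d) for d in deltas.values())
colors = {stack: diff_color(d, largest) for stack, d in deltas.items()}
```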
Flame Graph Search
• Color: magenta to show matched frames
[Figure: flame graph with matched frames highlighted. Label: search button]
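Search can be thought of as a regular expression applied to every frame, with the matched share of samples reported. A sketch under assumptions: the folded input and function name are hypothetical, and each stack is counted once even if several of its frames match.

```python
import re

def search(folded, pattern):
    """Return the percentage of samples whose stack has a matching frame."""
    regex = re.compile(pattern)
    total = sum(folded.values())
    matched = sum(
        count for stack, count in folded.items()
        # any(): count a stack once, even with several matching frames
        if any(regex.search(frame) for frame in stack.split(";"))
    )
    return 100.0 * matched / total if total else 0.0

# Hypothetical folded stacks with sample counts.
FOLDED = {
    "main;java/util/HashMap::get": 3,
    "main;java/util/HashMap::put;java/util/HashMap::resize": 1,
    "main;epoll_wait": 6,
}

hashmap_percent = search(FOLDED, "HashMap")
```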
Flame Charts
• Final note: these are useful, but are not flame graphs
[Figure label: Java stacks (but no symbols)]
Stack Depth
• perf had a 127 frame limit
• Now tunable in Linux 4.8
– sysctl -w kernel.perf_event_max_stack=512
– Thanks Arnaldo Carvalho de Melo!
[Figure: flame graphs of a Java microservice with a stack depth of > 900. Labels: broken stacks; perf_event_max_stack=1024]
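One rough way to spot stacks that hit the limit is to look for folded stacks whose depth reaches it. This is a heuristic sketch only: it assumes a truncated stack is recorded with exactly the maximum number of frames, and the function name and inputs are hypothetical.

```python
def likely_truncated(folded, max_stack=127):
    """Return stacks whose depth reaches the limit (possibly cut off)."""
    return [stack for stack in folded
            if len(stack.split(";")) >= max_stack]

# Hypothetical input: one very deep stack and one shallow stack.
deep = ";".join("f%d" % i for i in range(127))
FOLDED = {deep: 4, "main;a": 10}

suspects = likely_truncated(FOLDED)
```

If many samples show up here, raising kernel.perf_event_max_stack as shown above is the first thing to try.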
Symbols
Fixing Symbols
• For JIT'd code, Linux perf already looks for an
externally provided symbol file: /tmp/perf-PID.map, and
warns if it doesn't exist
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
# perf script
java 14025 [017] 8048.157085: cpu-clock: 7fd781253265 Ljava/util/HashMap;::get (/tmp/perf-12149.map)
[…]
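The map file is plain text with one symbol per line: start address and size in hex, then the symbol name. The lookup perf performs can be sketched as below. The two map entries are invented so that the address from the output above falls inside the first one.

```python
import bisect

# Hypothetical /tmp/perf-PID.map contents: START SIZE symbolname (hex).
PERF_MAP = """\
7fd781253200 a0 Ljava/util/HashMap;::get
7fd7812532a0 40 Ljava/util/HashMap;::hash
"""

def parse_perf_map(text):
    """Return a sorted list of (start, size, name) entries."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)   # the name may itself contain spaces
        if len(parts) != 3:
            continue
        entries.append((int(parts[0], 16), int(parts[1], 16), parts[2]))
    entries.sort()
    return entries

def resolve(entries, addr):
    """Return the symbol covering addr, or '[unknown]'."""
    starts = [entry[0] for entry in entries]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = entries[i]
        if addr < start + size:
            return name
    return "[unknown]"

entries = parse_perf_map(PERF_MAP)
symbol = resolve(entries, 0x7FD781253265)
```

Without the file, the same address can only be printed as [unknown], as in the first output above.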
Stacks & Symbols
[Figure: Java mixed-mode flame graph. Labels: Kernel, Java, JVM, GC]
Stacks & Symbols (zoom)
Inlining
• Many frames may be missing (inlined)
– Flame graph may still make enough sense
• Inlining can be tuned
– -XX:-Inline to disable, but can be 80% slower!
– -XX:MaxInlineSize and -XX:InlineSmallCode
can be tuned a little to reveal more frames
• Can even improve performance!
Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
1. Check Java Version
• Need JDK8u60 or better
– for -XX:+PreserveFramePointer
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
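The version check can be scripted. A sketch, assuming the version string format shown above (1.8.0_60); treating any version numbered 9 or later as supported is an added assumption, and the function name is illustrative.

```python
import re

# Captures: first number, optional second number, optional update number.
VERSION_RE = re.compile(r'version "(\d+)(?:\.(\d+))?(?:\.\d+)?(?:_(\d+))?')

def supports_preserve_frame_pointer(version_output):
    """Return True if `java -version` output reports JDK8u60 or later."""
    match = VERSION_RE.search(version_output)
    if match is None:
        return False
    first, second, update = match.group(1), match.group(2), match.group(3)
    if first == "1":
        # Legacy scheme: 1.<major>.0_<update>, e.g. 1.8.0_60
        return (int(second or 0), int(update or 0)) >= (8, 60)
    # Assumption: newer version schemes (9, 11.0.2, ...) qualify.
    return int(first) >= 9

ok = supports_preserve_frame_pointer('java version "1.8.0_60"')
```

Note that `java -version` writes to stderr, so a wrapper script would need to capture that stream.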
[Figure: screenshot. Labels: Select Metrics; Flame Graphs; Near real-time, per-second metrics]
Netflix Vector
• Open source, on-demand, instance analysis tool
– https://github.com/netflix/vector
• Shows various real-time metrics
• Flame graph support currently in development
– Automating previous steps
– Using it internally already
– Also developing a new d3 front end
Advanced Analysis
Linux perf_events Coverage
GC
TCP Events
• TCP transmit, using dynamic tracing:
# perf probe tcp_sendmsg
# perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks
# perf probe --del tcp_sendmsg
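The out.stacks file then needs folding before it can be rendered. A simplified sketch of that step: the sample text below is invented and trimmed down, real perf script output varies with the fields requested, and this is not the converter that ships with the flame graph software.

```python
from collections import Counter

# Hypothetical, simplified perf script output: a header line per sample,
# then indented frames (leaf first), then a blank line.
PERF_SCRIPT = """\
java 8131 [003] 100.000001: probe:tcp_sendmsg: (ffffffff815c2d40)
        ffffffff815c2d41 tcp_sendmsg ([kernel.kallsyms])
        ffffffff815e8c1d inet_sendmsg ([kernel.kallsyms])
        7fd301861e46 Lcom/example/Sender;::send (/tmp/perf-8131.map)

java 8131 [003] 100.000002: probe:tcp_sendmsg: (ffffffff815c2d40)
        ffffffff815c2d41 tcp_sendmsg ([kernel.kallsyms])
        ffffffff815e8c1d inet_sendmsg ([kernel.kallsyms])
        7fd301861e46 Lcom/example/Sender;::send (/tmp/perf-8131.map)
"""

def collapse(text):
    """Fold each sample into 'comm;root;...;leaf' and count duplicates."""
    counts = Counter()
    comm, frames = None, []
    for line in text.splitlines() + [""]:    # trailing "" flushes the last
        if not line.strip():
            if comm is not None:
                counts[";".join([comm] + frames[::-1])] += 1
            comm, frames = None, []
        elif not line[0].isspace():
            comm = line.split()[0]           # assumes no spaces in comm
        else:
            parts = line.split()             # [address, symbol..., (dso)]
            symbol = " ".join(parts[1:-1]) or "[unknown]"
            # ';' is the folded separator, so drop it from Java symbols.
            frames.append(symbol.replace(";", ""))
    return counts

folded = collapse(PERF_SCRIPT)
```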
zoomed:
Java Package Flame Graph
• Sample on-CPU instruction pointer only (no stack)
– Don't need -XX:+PreserveFramePointer
• y-axis: package name hierarchy
– java / util / ArrayList / ::size
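Building the package hierarchy from a sampled symbol can be sketched as below. It assumes symbols in the "Lpackage/Class;::method" form seen in the perf script output earlier; the sample list and function names are hypothetical.

```python
from collections import Counter

def package_stack(symbol):
    """'Ljava/util/ArrayList;::size' -> 'java;util;ArrayList;::size'"""
    if symbol.startswith("L") and ";::" in symbol:
        class_path, method = symbol[1:].split(";::", 1)
        return ";".join(class_path.split("/") + ["::" + method])
    return symbol   # non-Java symbols are left as a single frame

def fold_ip_samples(symbols):
    """Count instruction-pointer samples per package hierarchy."""
    return Counter(package_stack(symbol) for symbol in symbols)

# Hypothetical on-CPU instruction pointer samples, already symbolized.
IP_SAMPLES = [
    "Ljava/util/ArrayList;::size",
    "Ljava/util/ArrayList;::size",
    "Ljava/util/HashMap;::get",
]

folded = fold_ip_samples(IP_SAMPLES)
```

Because each sample is a single instruction pointer, the "stack" here is synthetic: the y-axis shows the package path, not caller ancestry.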
Thanks
• Questions?
• http://techblog.netflix.com
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• @brendangregg