KernelRecipes Perf Events

This document discusses using the Linux perf tool at Netflix to profile CPU performance. It provides an overview of why Netflix needs Linux perf, describes the basic workflow and commands of perf like perf list, stat, record, report and script. It also covers perf events like hardware events, tracepoints, and dynamic tracing as well as common issues with CPU profiling like JIT runtimes and missing symbols.


Using Linux perf

at Netflix

Brendan Gregg
Senior Performance Architect
Sep 2017
Case Study: ZFS is eating my CPU
•  Easy to debug using Netflix Vector & flame graphs
•  How I expected it to look:
Case Study: ZFS is eating my CPU (cont.)
•  How it really looked:
Application (truncated)

38% kernel time (why?)

Case Study: ZFS is eating my CPU (cont.)
Zoomed:

•  ZFS ARC (adaptive replacement cache) reclaim.

•  But… ZFS is not in use. No pools, datasets, or ARC buffers.
•  CPU time is in random entropy, picking which (empty) list to evict.

Bug: https://github.com/zfsonlinux/zfs/issues/6531
Agenda
1.  Why Netflix Needs Linux Profiling
2.  perf Basics
3.  CPU Profiling & Gotchas
–  Stacks (gcc, Java)
–  Symbols (Node.js, Java)
–  Guest PMCs
–  PEBS
–  Overheads

4.  perf Advanced

1. Why Netflix Needs Linux Profiling
Understand CPU usage quickly and completely
Quickly
E.g., Netflix Vector (self-service UI):

Flame Graphs
Heat Maps
…
Completely
[Figure: CPU flame graph, with frames labeled Kernel (C), JVM (C++), User (C), and Java]
Why Linux perf?
•  Available
–  Linux, open source
•  Low overhead
–  Tunable sampling, ring buffers
•  Accurate
–  Application-based samplers don't know what's really RUNNING; e.g., Java and epoll
•  No blind spots
–  See user, library, kernel with CPU sampling
–  With some work: hardirqs & SMI as well
•  No sample skew
–  Unlike Java safepoint skew
Why is this so important
•  We typically scale microservices based on %CPU
–  Small %CPU improvements can mean big $avings

•  CPU profiling is used by many activities
–  Explaining regressions in new software versions
–  Incident response
–  3rd party software evaluations
–  Identify performance tuning targets
–  Part of CPU workload characterization

•  perf does lots more, but we spend ~95% of our time looking at CPU profiles, and 5% on everything else
–  With new BPF capabilities (off-CPU analysis), that might go from 95 to 90%
CPU profiling should be easy, but…

JIT runtimes
no frame pointers
no debuginfo
stale symbol maps
container namespaces
…
2. perf Basics
perf (aka "perf_events")
•  The official Linux profiler
–  In the linux-tools-common package
–  Source code & docs in Linux: tools/perf

•  Supports many profiling/tracing features:
–  CPU Performance Monitoring Counters (PMCs)
–  Statically defined tracepoints
–  User and kernel dynamic tracing
–  Kernel line and local variable tracing
–  Efficient in-kernel counts and filters
–  Stack tracing, libunwind
–  Code annotation

•  Some bugs in the past; has been stable for us

[Image: perf_events ponycorn]

A Multitool of Subcommands
# perf
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
archive Create archive with object files with build-ids found in perf.data file
bench General framework for benchmark suites
buildid-cache Manage build-id cache.
buildid-list List the buildids in a perf.data file
c2c Shared Data C2C/HITM Analyzer.
config Get and set variables in a configuration file.
data Data file related processing
diff Read perf.data files and display the differential profile
evlist List the event names in a perf.data file
ftrace simple wrapper for kernel's ftrace functionality
inject Filter to augment the events stream with additional information
kallsyms Searches running kernel for symbols
kmem Tool to trace/measure kernel memory properties
kvm Tool to trace/measure kvm guest os
list List all symbolic event types
lock Analyze lock events
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
test Runs sanity tests.
timechart Tool to visualize total system behavior during a workload
top System profiling tool.
probe Define new dynamic tracepoints
trace strace inspired tool

See 'perf help COMMAND' for more information on a specific command.

(Output from Linux 4.13)
perf Basic Workflow
1.  list -> find events
2.  stat -> count them
3.  record -> write event data to file
4.  report -> browse summary
5.  script -> event dump for post processing
Basic Workflow Example
# perf list 'sched:*'
[…]
sched:sched_process_exec [Tracepoint event]
[…]
1.  found an event of interest

# perf stat -e sched:sched_process_exec -a -- sleep 10
Performance counter stats for 'system wide':
19 sched:sched_process_exec
10.001327817 seconds time elapsed
2.  19 per 10 sec is a very low rate, so safe to record

# perf record -e sched:sched_process_exec -a -g -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.212 MB perf.data (21 samples) ]
3.  21 samples captured

# perf report -n --stdio
# Children Self Samples Trace output
# ........ ........ ............ .................................................
4.76% 4.76% 1 filename=/bin/bash pid=7732 old_pid=7732
|
---_start
return_from_SYSCALL_64
do_syscall_64
sys_execve
do_execveat_common.isra.35
[…]
4.  summary style may be sufficient, or,

# perf script
sleep 7729 [003] 632804.699184: sched:sched_process_exec: filename=/bin/sleep pid=7729 old_pid=7729
44b97e do_execveat_common.isra.35 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
44bc01 sys_execve (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
203acb do_syscall_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
acd02b return_from_SYSCALL_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
c30 _start (/lib/x86_64-linux-gnu/ld-2.23.so)
[…]
5.  script output in time order
perf stat/record Format
•  These have three main parts: action, event, scope.
•  e.g., profiling on-CPU stack traces:

perf record -F 99 -a -g -- sleep 10

Action: record stack traces (perf record, -g)
Event: 99 Hertz (-F 99)
Scope: all CPUs (-a)

Note: sleep 10 is a dummy command to set the duration

perf Actions
•  Count events (perf stat …)
–  Uses an efficient in-kernel counter, and prints the results

•  Sample events (perf record …)
–  Records details of every event to a dump file (perf.data)
•  Timestamp, CPU, PID, instruction pointer, …
–  This incurs higher overhead, relative to the rate of events
–  Include the call graph (stack trace) using -g

•  Other actions include:
–  List events (perf list)
–  Report from a perf.data file (perf report)
–  Dump a perf.data file as text (perf script)
–  top style profiling (perf top)
perf Events
•  Custom timers
–  e.g., 99 Hertz (samples per second)

•  Hardware events
–  CPU Performance Monitoring Counters (PMCs)

•  Tracepoints
–  Statically defined in software

•  Dynamic tracing
–  Created using uprobes (user) or kprobes (kernel)
–  Can do kernel line tracing with local variables (needs kernel debuginfo)
perf Events: Map
perf Events: List
# perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
[…]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
[…]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
[…]
skb:kfree_skb [Tracepoint event]
skb:consume_skb [Tracepoint event]
skb:skb_copy_datagram_iovec [Tracepoint event]
net:net_dev_xmit [Tracepoint event]
net:net_dev_queue [Tracepoint event]
net:netif_receive_skb [Tracepoint event]
net:netif_rx [Tracepoint event]
[…]
perf Scope
•  System-wide: all CPUs (-a)
•  Target PID (-p PID)
•  Target command (…)
•  Specific CPUs (-C …)
•  User-level only (<event>:u)
•  Kernel-level only (<event>:k)
•  A custom filter to match variables (--filter …)
•  This cgroup (container) only (--cgroup …)
One-Liners: Listing Events
# Listing all currently known events:
perf list

# Searching for "sched" tracepoints:

perf list | grep sched

# Listing sched tracepoints:

perf list 'sched:*'

Dozens of perf one-liners:

http://www.brendangregg.com/perf.html#OneLiners
One-Liners: Counting Events
# CPU counter statistics for the specified command:
perf stat command

# CPU counter statistics for the entire system, for 5 seconds:

perf stat -a sleep 5

# Detailed CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -dp PID

# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command

# Count system calls for the specified PID, until Ctrl-C:

perf stat -e 'syscalls:sys_enter_*' -p PID

# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10

# Show system calls by process, refreshing every 2 seconds:

perf top -e raw_syscalls:sys_enter -ns comm
One-Liners: Profiling Events
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command

# Sample CPU stack traces for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10

# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds:
perf record -F 99 -ag -- sleep 10

# Sample CPU stacks, once every 10,000 Level 1 data cache misses, for 5 secs:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5

# Sample CPU stack traces, once every 100 last level cache misses, for 5 secs:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5

# Sample on-CPU kernel instructions, for 5 seconds:

perf record -e cycles:k -a -- sleep 5

# Sample on-CPU user instructions, for 5 seconds:

perf record -e cycles:u -a -- sleep 5
One-Liners: Reporting
# Show perf.data in an ncurses browser (TUI) if possible:
perf report

# Show perf.data with a column for sample count:

perf report -n

# Show perf.data as a text report, with data coalesced and percentages:

perf report --stdio

# List all raw events from perf.data:

perf script

# List all raw events from perf.data, with customized fields:

perf script -F comm,tid,pid,time,cpu,event,ip,sym,dso

# Dump raw contents from perf.data as hex (for debugging):

perf script -D

# Disassemble and annotate instructions with percentages (needs debuginfo):

perf annotate --stdio
3. CPU Profiling
CPU Profiling
•  Record stacks at a timed interval: simple and effective
–  Pros: Low (deterministic) overhead
–  Cons: Coarse accuracy, but usually sufficient

[Figure: stack samples (A, B) taken over time, showing on-CPU and off-CPU periods; labels: syscall, block, interrupt]
perf Record
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n --stdio
1.40% 162 java [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--63.21%-- try_to_wake_up
| |
| |--63.91%-- default_wake_function
| | |
| | |--56.11%-- __wake_up_common
| | | __wake_up_locked
| | | ep_poll_callback
| | | __wake_up_common
| | | __wake_up_sync_key
| | | |
| | | |--59.19%-- sock_def_readable
[…78,000 lines truncated…]

Sampling full stack traces at 99 Hertz
perf Reporting
•  perf report summarizes by combining common paths
•  Previous output truncated 78,000 lines of summary
•  The following is what a mere 8,000 lines looks like…
perf report
… as a Flame Graph
Flame Graphs
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg

•  Flame Graphs:
–  x-axis: alphabetical stack sort, to maximize merging
–  y-axis: stack depth
–  color: random, or hue can be a dimension
•  e.g., software type, or difference between two profiles for non-regression testing ("differential flame graphs")
–  interpretation: top edge is on-CPU, beneath it is ancestry
•  Just a Perl program to convert perf stacks into SVG
–  Includes JavaScript: open in a browser for interactivity
•  Easy to get working: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
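The folding step in the pipeline above (turning perf script output into one line per unique stack) can be sketched as follows. This is a simplified illustration only, not the FlameGraph implementation: fold_stacks is a hypothetical name, and it assumes each sample is a non-indented header line followed by indented "ADDR SYMBOL (DSO)" frame lines, leaf first, ended by a blank line. stackcollapse-perf.pl handles many more cases.

```shell
#!/bin/sh
# fold_stacks: hypothetical, simplified stand-in for stackcollapse-perf.pl.
# Reads `perf script` text on stdin and prints one line per unique stack:
#   comm;root_frame;...;leaf_frame COUNT
# Limitation: symbols that contain spaces are cut at the first space.
fold_stacks() {
    awk '
        # Sample header: not indented; the first field is the command name.
        /^[^ \t]/ { comm = $1; stack = ""; next }
        # Frame line: perf prints the leaf first, so prepend each frame
        # to end up with root-to-leaf order.
        /^[ \t]+[0-9a-f]+ / { stack = ";" $2 stack; next }
        # A blank line ends the current sample.
        /^[ \t]*$/ { if (comm != "") { n[comm stack]++; comm = "" } }
        END {
            if (comm != "") n[comm stack]++   # input without a trailing blank line
            for (s in n) print s, n[s]
        }
    ' | sort
}
```

Under those assumptions, `perf script | fold_stacks` produces the "folded" text that flamegraph.pl reads, with identical stacks merged and counted.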
flamegraph.pl Options
$ flamegraph.pl --help
USAGE: flamegraph.pl [options] infile > outfile.svg

--title TEXT # change title text

--subtitle TEXT # second level title (optional)
--width NUM # width of image (default 1200)
--height NUM # height of each frame (default 16)
--minwidth NUM # omit smaller functions (default 0.1 pixels)
--fonttype FONT # font type (default "Verdana")
--fontsize NUM # font size (default 12)
--countname TEXT # count type label (default "samples")
--nametype TEXT # name type label (default "Function:")
--colors PALETTE # set color palette. choices are: hot (default), mem,
# io, wakeup, chain, java, js, perl, red, green, blue,
# aqua, yellow, purple, orange
--hash # colors are keyed by function name hash
--cp # use consistent palette (palette.map)
--reverse # generate stack-reversed flame graph
--inverted # icicle graph
--negate # switch differential hues (blue<->red)
--notes TEXT # add notes comment in SVG (for debugging)
--help # this message

eg,
flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg
perf Flame Graph Workflow (Linux 2.6+)
Typical workflow:
1.  list events: perf list
2.  count events: perf stat
3.  capture stacks: perf record (writes perf.data)
4.  text UI: perf report, or dump profile: perf script
5.  flame graph visualization: perf script, then stackcollapse-perf.pl, then flamegraph.pl
perf Flame Graph Workflow (Linux 4.5+)
Typical workflow:
1.  list events: perf list
2.  count events: perf stat
3.  capture stacks: perf record (writes perf.data)
4.  text UI: perf report, or dump profile: perf script
5.  flame graph visualization: summary from perf report -g folded, then awk, then flamegraph.pl
Flame Graph Optimizations
•  Linux 2.6: perf record (capture stacks, write samples) -> perf.data -> perf script (reads samples, write text) -> stackcollapse-perf.pl (folded output) -> flamegraph.pl
•  Linux 4.5: perf record (capture stacks, write samples) -> perf.data -> perf report -g folded (reads samples, folded report) -> awk (folded output) -> flamegraph.pl
•  Linux 4.9: profile.py, not perf (count stacks using BPF, folded output) -> flamegraph.pl
Gotchas
When we've tried to use perf
•  Stacks don't work (missing)
•  Symbols don't work (hex numbers)
•  Instruction profiling looks bogus
•  PMCs don't work in VM guests
•  Containers break things
•  Overhead is too high
How to really get started
1.  Get "perf" to work
–  Install the linux-tools-common and linux-tools-`uname -r` packages
–  Or compile in the Linux source: tools/perf
2.  Get stack walking to work
3.  Fix symbol translation
4.  Get IPC to work
5.  Test perf under load

The "gotchas"…
Gotcha #1 Broken Stacks
perf record -F 99 -a -g -- sleep 30
perf report -n --stdio

1.  Take a CPU profile
2.  Run perf report
3.  If stacks are often < 3 frames, or don't reach "thread start" or "main", they are probably broken. Fix them.
Identifying Broken Stacks
28.10% 146 sed libc-2.19.so [.] re_search_internal
|
--- re_search_internal
|
|--12.25%-- 0x3   <- broken
| 0x100007
|
|--11.65%-- 0x40a447   <- probably not broken
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
|
|--1.33%-- 0x40a4a1
| |
| |--60.01%-- 0x40659a
| | 0x408dd8
| | 0x408ed1
| | 0x402689
| | 0x7fa1cd08aec5

The longer stacks are missing symbols, but that's another problem.
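The "fewer than 3 frames" rule of thumb can be checked mechanically against perf script output. The sketch below is a rough heuristic under stated assumptions: shallow_stacks is a hypothetical helper name, and it assumes the same perf script layout as above (a non-indented header per sample, indented frame lines, a blank line between samples). A shallow stack is only a hint; some legitimate stacks are short.

```shell
#!/bin/sh
# shallow_stacks [MIN]: hypothetical helper. Reads `perf script` text on
# stdin and reports how many samples have fewer than MIN frames (default 3).
shallow_stacks() {
    awk -v min="${1:-3}" '
        function tally() { total++; if (depth < min) shallow++ }
        # New sample header: close any sample still open, then start counting.
        /^[^ \t]/ { if (insample) tally(); insample = 1; depth = 0; next }
        # Indented "ADDR SYMBOL (DSO)" line: one more frame in this stack.
        /^[ \t]+[0-9a-f]+ / { depth++; next }
        # Blank line: end of the current sample.
        /^[ \t]*$/ { if (insample) tally(); insample = 0 }
        END {
            if (insample) tally()
            if (total > 0)
                printf "%d/%d samples (%.0f%%) have fewer than %d frames\n",
                       shallow, total, 100 * shallow / total, min
        }
    '
}
```

A typical use would be `perf script | shallow_stacks`; a high percentage suggests stack walking needs fixing before the profile can be trusted.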
Broken Stacks Flame Graph

[Figure: flame graph showing broken Java stacks (missing frame pointer). Colors: Java == green, system == red, C++ == yellow]
Fixing Broken Stacks
•  Either:
A.  Fix frame pointer-based stack walking (the default)
–  Pros: simple, supports any system stack walker
–  Cons: might cost a little extra CPU to make available
B.  Use libunwind and DWARF: perf record --call-graph dwarf
–  Pros: more debug info
–  Cons: not on older kernels, and inflates instance size
–  … there's also ORC on the latest kernel
C.  Application support
–  https://github.com/jvm-profiling-tools/async-profiler
•  Our current preference is (A), but (C) is also promising
–  So how do we fix the frame pointer…
gcc -fno-omit-frame-pointer
•  Once upon a time, x86 had fewer registers, and the frame
pointer register was reused for general purpose to improve
performance. This breaks system stack walking.
•  gcc provides -fno-omit-frame-pointer to fix this
–  Please make this the default in gcc!
Java -XX:+PreserveFramePointer
•  I hacked frame pointers in the JVM (JDK-8068945) and Oracle rewrote it as -XX:+PreserveFramePointer. Lets perf do FP stack walks of Java.
•  Involved changes like this, fixing x86-64 function prologues:
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
@@ -5236,6 +5236,7 @@
// We always push rbp, so that on return to interpreter rbp, will be
// restored correctly and we can correct the stack.
push(rbp);
+ mov(rbp, rsp);
// Remove word for ebp
framesize -= wordSize;
--- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
+++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
[...]

•  Costs some overhead to use. Usually <1%. Rare cases 10%.

Broken Java Stacks
•  Check with "perf script" to see stack samples
•  These are 1 or 2 levels deep (junk values)
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…

java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)

java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)

java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])

java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)

java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
[…]
Fixed Java Stacks
•  With -XX:+PreserveFramePointer stacks are full, and go all the way to start_thread()
•  This is what the CPUs are really running: inlined frames are not present
# perf script
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
7fd30184def8 [unknown] (/tmp/perf-8131.map)
7fd30174f544 [unknown] (/tmp/perf-8131.map)
7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
7fd30166d51c [unknown] (/tmp/perf-8131.map)
7fd301750f34 [unknown] (/tmp/perf-8131.map)
7fd3016c2280 [unknown] (/tmp/perf-8131.map)
7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
7fd3016f9888 [unknown] (/tmp/perf-8131.map)
7fd3016ece04 [unknown] (/tmp/perf-8131.map)
7fd30177783c [unknown] (/tmp/perf-8131.map)
7fd301600aa8 [unknown] (/tmp/perf-8131.map)
7fd301a4484c [unknown] (/tmp/perf-8131.map)
7fd3010072e0 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd3010004e7 [unknown] (/tmp/perf-8131.map)
7fd3171df76a JavaCalls::call_helper(JavaValue*,…
7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
7fd31721b6ce thread_entry(JavaThread*, Thread*)…
7fd3175389e0 JavaThread::thread_main_inner() (/…
7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
Fixed Stacks Flame Graph

[Figure: flame graph with fixed stacks; the Java frames have no symbols]
Gotcha #2 Missing Symbols
•  Missing symbols should be obvious in perf report/script:
71.79% 334 sed sed [.] 0x000000000001afc1   <- broken
|
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5

12.06% 62 sed sed [.] re_search_internal

•  For JIT (Java, Node.js, …):
A.  Create a /tmp/perf-PID.map file. perf already looks for this
•  Map format is "START SIZE symbolname"
B.  Or use a symbol logger, e.g., tools/perf/jvmti.
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
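To make the "START SIZE symbolname" map format concrete, here is a minimal sketch of the lookup it enables. It assumes START and SIZE are hexadecimal without a 0x prefix and that each line holds one symbol; lookup_sym is a hypothetical helper name for illustration, not how perf itself resolves symbols.

```shell
#!/bin/sh
# lookup_sym MAPFILE HEXADDR: hypothetical helper. Prints the name of the
# first map entry whose range [START, START+SIZE) contains HEXADDR, or
# "[unknown]" if no entry matches. Runs in a subshell, so `exit` only
# ends the function.
lookup_sym() (
    map=$1
    addr=$((0x$2))
    while read -r start size name; do
        [ -n "$size" ] || continue          # skip blank or malformed lines
        lo=$((0x$start))
        hi=$((lo + 0x$size))
        if [ "$addr" -ge "$lo" ] && [ "$addr" -lt "$hi" ]; then
            printf '%s\n' "$name"           # name keeps any embedded spaces
            exit 0
        fi
    done < "$map"
    printf '[unknown]\n'
    exit 1
)
```

For example, `lookup_sym /tmp/perf-8131.map 7fd301861e46` would print the matching symbol if the map file existed and covered that address.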
Java Symbols
•  perf-map-agent
–  Agent attaches and writes the map file on demand (previous versions attached on Java start, and wrote continually)
–  https://github.com/jvm-profiling-tools/perf-map-agent (was https://github.com/jrudolph/perf-map-agent)

•  Automation: jmaps
–  We use scripts to find Java processes and dump their map files, paying attention to file ownership etc
–  https://github.com/brendangregg/FlameGraph/blob/master/jmaps
–  Needs to run as close as possible to the profile, to minimize symbol churn
# perf record -F 99 -a -g -- sleep 30; jmaps
Java Flame Graph: Stacks & Symbols
flamegraph.pl --color=java

[Figure: Java flame graph with stacks and symbols; frames labeled Kernel (C), Java, User (C), and JVM (C++)]
Java: Inlining
A.  Disabling inlining:
–  -XX:-Inline
–  Many more Java frames
–  80% slower (in this case)
–  May not be necessary: inlined flame graphs often make enough sense
–  Or tune -XX:MaxInlineSize and -XX:InlineSmallCode to reveal more frames, without costing much perf: can even go faster!

B.  Symbol agents can uninline
–  perf-map-agent unfoldall
–  We sometimes need and use this

[Figure: flame graph with no inlining]
Node.js: Stacks & Symbols
•  Frame pointer stacks work
•  Symbols currently via a logger
–  --perf-basic-prof: everything. We found it can log over 1 Gbyte/day.
–  --perf-basic-prof-only-functions: tries to only log symbols we care about.

•  perf may not use the most recent symbol in the log
–  We tidy logs before using them: https://raw.githubusercontent.com/brendangregg/Misc/master/perf_events/perfmaptidy.pl

•  Future V8 versions may support on-demand symbol dumps
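One way to picture the tidying step: when the same start address is logged more than once, keep only the most recent entry. The sketch below illustrates that idea under the assumption that the log uses the "START SIZE symbolname" map format; tidy_map is a hypothetical name, and this is not a port of perfmaptidy.pl, which may handle further cases such as overlapping ranges.

```shell
#!/bin/sh
# tidy_map [FILE...]: hypothetical helper. For each START address keeps only
# the last line seen, and prints entries in the order the addresses first
# appeared. Reads stdin when no file is given.
tidy_map() {
    awk '
        !($1 in last) { order[++n] = $1 }   # remember first-seen order
        { last[$1] = $0 }                   # later entries replace earlier ones
        END { for (i = 1; i <= n; i++) print last[order[i]] }
    ' "$@"
}
```

The output can be written to a temporary file and then moved over the original map, so a profiler never reads a half-written file.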

Gotcha #3 Instruction Profiling
16 NOPs in a loop. Let's profile instructions to see which are hot! (have I lost my mind?)
# perf annotate -i perf.data.noplooper --stdio
Percent | Source code & Disassembly of noplooper
--------------------------------------------------------
: Disassembly of section .text:
:
: 00000000004004ed <main>:
0.00 : 4004ed: push %rbp
0.00 : 4004ee: mov %rsp,%rbp
20.86 : 4004f1: nop
0.00 : 4004f2: nop
0.00 : 4004f3: nop
0.00 : 4004f4: nop
19.84 : 4004f5: nop
0.00 : 4004f6: nop
0.00 : 4004f7: nop
0.00 : 4004f8: nop
18.73 : 4004f9: nop
0.00 : 4004fa: nop
0.00 : 4004fb: nop
0.00 : 4004fc: nop
19.08 : 4004fd: nop
0.00 : 4004fe: nop
0.00 : 4004ff: nop
0.00 : 400500: nop
21.49 : 400501: jmp 4004f1 <main+0x4>
Instruction Profiling
•  Even distribution (A)? Or something else?

[Figure: four candidate sample distributions, labeled (A), (B), (C), and (D)]
Instruction Profiling
# perf annotate -i perf.data.noplooper --stdio
Percent | Source code & Disassembly of noplooper
--------------------------------------------------------
: Disassembly of section .text:
:
: 00000000004004ed <main>:
0.00 : 4004ed: push %rbp
0.00 : 4004ee: mov %rsp,%rbp
20.86 : 4004f1: nop
0.00 : 4004f2: nop
0.00 : 4004f3: nop
0.00 : 4004f4: nop
19.84 : 4004f5: nop
0.00 : 4004f6: nop
0.00 : 4004f7: nop
0.00 : 4004f8: nop
18.73 : 4004f9: nop
0.00 : 4004fa: nop
0.00 : 4004fb: nop
0.00 : 4004fc: nop
19.08 : 4004fd: nop
0.00 : 4004fe: nop
0.00 : 4004ff: nop
0.00 : 400500: nop
21.49 : 400501: jmp 4004f1 <main+0x4>

Go home instruction pointer, you're drunk
PEBS
•  I believe this is due to parallel and out-of-order execution of micro-ops: the sampled IP is the resumption instruction, not what is currently executing. And skid.
•  PEBS may help: Intel's Precise Event Based Sampling
•  perf_events has support:
–  perf record -e cycles:pp
–  The 'p' can be specified multiple times:
•  0 - SAMPLE_IP can have arbitrary skid
•  1 - SAMPLE_IP must have constant skid
•  2 - SAMPLE_IP requested to have 0 skid
•  3 - SAMPLE_IP must have 0 skid
–  … from tools/perf/Documentation/perf-list.txt
Gotcha #4 VM Guests
•  Using PMCs from most VM guests:
# perf stat -a -d sleep 5

Performance counter stats for 'system wide':

10003.718595 task-clock (msec) # 2.000 CPUs utilized [100.00%]

323 context-switches # 0.032 K/sec [100.00%]
17 cpu-migrations # 0.002 K/sec [100.00%]
233 page-faults # 0.023 K/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
<not supported> L1-dcache-loads
<not supported> L1-dcache-load-misses
<not supported> LLC-loads
<not supported> LLC-load-misses

5.001607197 seconds time elapsed

VM Guest PMCs
•  Without PMCs, %CPU is ambiguous. We need IPC.
–  Can't measure instructions per cycle (IPC), cache hits/misses, MMU/TLB events, etc.

•  Is fixable: e.g., Xen can enable PMCs (vpmu boot option)
–  I added vpmu support for subsets, e.g., vpmu=arch for the Intel architectural set (7 PMCs only)
–  http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
–  Now available on the largest AWS EC2 instance types

[Figure: the architectural set]
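As a small illustration of why the PMCs matter, IPC is simply instructions divided by cycles, and both come from perf stat. The sketch below computes it from perf stat text, and reports when the counters are not supported, as in the guest output shown earlier. It assumes the plain "COUNT EVENT" line layout with event names exactly "cycles" and "instructions"; ipc_from_stat is a hypothetical name, and newer perf versions may print different event names.

```shell
#!/bin/sh
# ipc_from_stat: hypothetical helper. Reads `perf stat` text on stdin and
# prints instructions per cycle to two decimals, or "PMCs unavailable"
# (exit status 1) when the counters are missing or not supported.
ipc_from_stat() {
    awk '
        /<not supported>/ && ($3 == "cycles" || $3 == "instructions") { missing = 1 }
        $2 == "cycles"       { gsub(/,/, "", $1); cyc = $1 + 0 }   # strip thousands separators
        $2 == "instructions" { gsub(/,/, "", $1); ins = $1 + 0 }
        END {
            if (missing || cyc + 0 == 0) { print "PMCs unavailable"; exit 1 }
            printf "%.2f\n", ins / cyc
        }
    '
}
```

perf stat prints its results to stderr by default, so a typical use would be `perf stat -e cycles,instructions -a sleep 5 2>&1 | ipc_from_stat`.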

VM Guest MSRs
•  Model Specific Registers (MSRs) may be exposed when PMCs are not
•  Better than nothing. Can solve some issues.

# ./showboost
CPU MHz : 2500
Turbo MHz : 2900 (10 active)
Turbo Ratio : 116% (10 active)
CPU 0 summary every 5 seconds...

TIME C0_MCYC C0_ACYC UTIL RATIO MHz

17:28:03 4226511637 4902783333 33% 116% 2900
17:28:08 4397892841 5101713941 35% 116% 2900
17:28:13 4550831380 5279462058 36% 116% 2900
17:28:18 4680962051 5429605341 37% 115% 2899
17:28:23 4782942155 5547813280 38% 115% 2899

[...]

–  showboost is from my msr-cloud-tools collection (on github)

VM Guest PEBS
•  Not possible yet in Xen
–  please fix

•  Ditto for LBR, BTS, processor trace

Gotcha #5 Containers
•  perf from the host can't find symbol files in different mount
namespaces
•  We currently work around it
–  http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/

•  Should be fixed in 4.14

–  Krister Johansen's patches
Gotcha #6 Overhead
•  Overhead is relative to the rate of events instrumented
•  perf stat does in-kernel counts: relatively low overhead
•  perf record writes perf.data, which has slightly higher CPU overhead, plus file system and disk I/O
•  Test before use
–  In the lab
–  Run perf stat to understand rate, before perf record

•  Also consider --filter, to filter events in-kernel
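The "run perf stat to understand rate" check can be scripted. The sketch below turns perf stat text into events per second for one event, assuming the plain "COUNT EVENT" and "N seconds time elapsed" lines shown in the earlier workflow example; event_rate is a hypothetical helper name, and what counts as a safe rate remains a judgment call.

```shell
#!/bin/sh
# event_rate EVENT: hypothetical helper. Reads `perf stat` text on stdin and
# prints EVENT occurrences per second, to one decimal place. Exits with
# status 1 if no elapsed time was found.
event_rate() {
    awk -v ev="$1" '
        $2 == ev { gsub(/,/, "", $1); count = $1 + 0 }   # strip thousands separators
        /seconds time elapsed/ { secs = $1 + 0 }
        END {
            if (secs > 0) printf "%.1f\n", count / secs
            else exit 1
        }
    '
}
```

For the workflow example earlier (19 events in about 10 seconds), this prints 1.9, a rate low enough that recording every event with stacks should add little overhead.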

4. perf Advanced
perf for Tracing Events
Tracepoints
# perf record -e block:block_rq_insert -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ]

# perf script
[…]
java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
[…]

Output fields: process PID [CPU] timestamp: eventname: format string

Example: java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]

The format string is defined in include/trace/events/block.h (the kernel source may be the only docs):

DECLARE_EVENT_CLASS(block_rq,
[...]
TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->rwbs, __entry->bytes, __get_str(cmd),
(unsigned long long)__entry->sector,
__entry->nr_sector, __entry->comm)

Also see: cat /sys/kernel/debug/tracing/events/block/block_rq_insert/format

One-Liners: Static Tracing
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Trace all context-switches with stack traces, for 1 second:
perf record -e context-switches -ag -- sleep 1

# Trace CPU migrations, for 10 seconds:

perf record -e migrations -a -- sleep 10

# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag

# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag

# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a

# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'

# Trace all block completions, synchronous writes only, until Ctrl-C:

perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'

# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
One-Liners: Dynamic Tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry (--add optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use --del):
perf probe -d tcp_sendmsg

# Add a tracepoint for the kernel tcp_sendmsg() function return:

perf probe 'tcp_sendmsg%return'

# Show avail vars for the tcp_sendmsg(), plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs

# Show available line probes for tcp_sendmsg() (needs debuginfo):

perf probe -L tcp_sendmsg

# Add a tracepoint for tcp_sendmsg() line 81 with local var seglen (debuginfo):
perf probe 'tcp_sendmsg:81 seglen'

# Add a tracepoint for do_sys_open() with the filename as a string (debuginfo):

perf probe 'do_sys_open filename:string'

# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc

# List currently available dynamic probes:

perf probe -l
One-Liners: Advanced Dynamic Tracing
# Add a tracepoint for tcp_sendmsg(), with three entry regs (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'

# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for %cx register:
perf probe 'tcp_sendmsg bytes=%cx'

# Trace previously created probe when the bytes (alias) var is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'

# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'

# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (debuginfo):

perf probe 'tcp_sendmsg size'

# Add a tracepoint for tcp_sendmsg(), with size and socket state (debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'

# Trace previous probe when size > 0, and state != TCP_ESTABLISHED(1) (debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a

•  Kernel debuginfo is an onerous requirement for the Netflix cloud

•  We can use registers instead (as above). But which registers?
The Rosetta Stone of Registers
One server instance with kernel debuginfo, and -nv (dry run, verbose):
# perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
[…]
Added new event:
Writing event: p:probe/tcp_sendmsg tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8
probe:tcp_sendmsg (on tcp_sendmsg with size skc_state=sk->__sk_common.skc_state)

You can now use it in all perf tools, such as:

perf record -e probe:tcp_sendmsg -aR sleep 1

Copy-n-paste!
All other instances (of the same kernel version):
# perf probe 'tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8'
Failed to find path of kernel module.
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg with size=%cx:u64 skc_state=+18(%si):u8)

You can now use it in all perf tools, such as:

perf record -e probe:tcp_sendmsg -aR sleep 1

Masami Hiramatsu was investigating a way to better automate this
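The copy-and-paste step above can be partly scripted by pulling the register-based definition out of the "Writing event:" line. This is a small sketch under stated assumptions: portable_probe is a hypothetical name, it only handles entry probes (definitions starting with "p:"), and it relies on the verbose output format shown above.

```shell
#!/bin/sh
# portable_probe: hypothetical helper. Reads `perf probe -nv` output on stdin
# and prints the register-based probe definition from each
# "Writing event: p:GROUP/NAME ..." line, without the "p:GROUP/NAME " prefix.
portable_probe() {
    sed -n 's/^Writing event: p:[^ ]* //p'
}
```

On the instance with debuginfo, `perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state' 2>&1 | portable_probe` would print a definition that can be passed to perf probe on instances of the same kernel version.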

perf Visualizations: Block I/O Latency Heat Map
•  We automated this for analyzing disk I/O latency issues

[Figure: two latency heat maps. SSD I/O (fast, with queueing); HDD I/O (random, modes)]

http://www.brendangregg.com/blog/2014-07-01/perf-heat-maps.html
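As a rough sketch of the data preparation behind such a heat map: block I/O latency can be derived by pairing each block_rq_issue event with its block_rq_complete event and subtracting the timestamps. The example below assumes the perf script line layout shown earlier for block_rq_insert (comm, PID, [CPU], timestamp, event name, device, flags, then the sector), and the sample lines are invented for illustration. The function name latencies is hypothetical, and process names containing spaces would break the field positions.

```shell
#!/bin/sh
# Hypothetical sample of `perf script` output (values invented for illustration).
sample='dd 1234 [000] 100.000100: block_rq_issue: 202,1 R 0 () 4783360 + 88 [dd]
swapper 0 [000] 100.000600: block_rq_complete: 202,1 R () 4783360 + 88 [0]'

# Print "completion_time_us latency_us" for each issue/complete pair,
# matched on device and sector.
latencies() {
    awk '
        { gsub(/:/, "") }                        # strip colons from timestamp and event name
        $5 ~ /issue/    { start[$6, $10] = $4 }  # issue lines: sector is field 10
        $5 ~ /complete/ {                        # complete lines: sector is field 9
            if (($6, $9) in start) {
                printf "%.0f %.0f\n", $4 * 1000000, ($4 - start[$6, $9]) * 1000000
                delete start[$6, $9]
            }
        }'
}

printf '%s\n' "$sample" | latencies
```

The two-column output (time, latency) is the kind of input a heat map generator can bucket into time and latency ranges.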
There's still a lot more to perf…
•  Using PMCs
•  perf scripting interface
•  perf + eBPF
•  perf sched
•  perf timechart
•  perf trace
•  perf c2c (new!)
•  perf ftrace (new!)
•  …
Links & References
•  perf_events
•  Kernel source: tools/perf/Documentation
•  https://perf.wiki.kernel.org/index.php/Main_Page
•  http://www.brendangregg.com/perf.html
•  http://web.eece.maine.edu/~vweaver/projects/perf_events/
•  Mailing list http://vger.kernel.org/vger-lists.html#linux-perf-users
•  perf-tools: https://github.com/brendangregg/perf-tools
•  PMU tools: https://github.com/andikleen/pmu-tools
•  perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html
•  Java frame pointer patch
•  http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-December/016477.html
•  https://bugs.openjdk.java.net/browse/JDK-8068945
•  Node.js: http://techblog.netflix.com/2014/11/nodejs-in-flames.html
•  Methodology: http://www.brendangregg.com/methodology.html
•  Flame graphs: http://www.brendangregg.com/flamegraphs.html
•  Heat maps: http://www.brendangregg.com/heatmaps.html
•  eBPF: http://lwn.net/Articles/603983/
Thank You
–  Questions?
–  http://www.brendangregg.com
–  http://slideshare.net/brendangregg
–  [email protected]
–  @brendangregg
