Performance Tuning in Production
James Gough
Sadiq Jaffer
Kirk Pepperdine
Richard Warburton
#PerfTuningInProd @Jim__Gough, @opsian, @RichardWarburto, @kcpeppe
Session Overview
• Optimizing Java: a brief tour of the JVM
• Moving to G1GC
• Production Profiling: What, Why and How
Optimizing Java: A JVM Tour
James Gough
@Jim__Gough
http://jamesgough.net
This Talk
• Who Am I
• Creating Bytecode
• Classloading
• Profiling Code
• Runtime Optimisations
• JITWatch
About Me
• Started programming BASIC on the C64
• Worked as a Java and Web Developer
• Helped to design and test JSR 310
• Spent 4 years training Java and C++
• Written a book called Optimizing Java
• Work at Morgan Stanley
• Building Client Facing Technology
Creating Bytecode
[Diagram: Java source code -> javac -> .class file (class file creation), which is then passed to the JVM]
The anatomy of a classfile
Component                      Description
Magic Number                   0xCAFEBABE
Version of Class File Format   The minor and major versions of the class file
Constant Pool                  Pool of constants for the class
Access Flags                   For example whether the class is abstract, static, etc.
This Class                     The name of the current class
Super Class                    The name of the super class
Interfaces                     Any interfaces implemented by the class
Fields                         Any fields in the class
Methods                        Any methods in the class
Attributes                     Any attributes of the class (e.g. name of the source file, etc.)
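The first three entries in the table can be checked directly, because they occupy the first eight bytes of every classfile. The following is a minimal sketch rather than a full parser; the class name ClassfileHeader and the helper readHeader are made up for illustration, and main assumes the compiled class is readable as a classpath resource.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ClassfileHeader {
    /** Reads the first three header fields and returns them as {magic, minor, major}. */
    static int[] readHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt();            // 4 bytes: 0xCAFEBABE
        int minor = data.readUnsignedShort();  // 2 bytes: minor version
        int major = data.readUnsignedShort();  // 2 bytes: major version
        return new int[] { magic, minor, major };
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this class's own .class file is available on the classpath.
        try (InputStream in = ClassfileHeader.class.getResourceAsStream("ClassfileHeader.class")) {
            if (in == null) {
                throw new IOException("ClassfileHeader.class not found on the classpath");
            }
            int[] header = readHeader(in);
            System.out.printf("magic=0x%08X minor=%d major=%d%n", header[0], header[1], header[2]);
        }
    }
}
```

The major version printed depends on the javac release (or target level) used to compile the class.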
The anatomy of a classfile
My Very Cute Animal Turns Savage In Full Moon Areas
M V C A T S I F M A
Magic Version Constant Access This Super Interfaces Fields Methods Attributes
Type Descriptors
• Describe signatures
• Common in javap output
• E.g.
• ()Ljava/lang/String;
• (I)V
• (Ljava/lang/String;I)J

Descriptor   Type
B            byte
C            char
D            double
F            float
I            int
J            long
L<type>;     Reference type
S            short
Z            boolean
[            Array-of
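One way to get a feel for descriptors without reading javap output is to ask the JDK to build them. The sketch below uses java.lang.invoke.MethodType, which is part of the standard library; the class name DescriptorDemo and the helper descriptor are made up for illustration.

```java
import java.lang.invoke.MethodType;

public class DescriptorDemo {
    /** Returns the JVM method descriptor for the given return and parameter types. */
    static String descriptor(Class<?> returnType, Class<?>... params) {
        return MethodType.methodType(returnType, params).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        System.out.println(descriptor(String.class));                  // ()Ljava/lang/String;
        System.out.println(descriptor(void.class, int.class));         // (I)V
        System.out.println(descriptor(long.class, String.class, int.class)); // (Ljava/lang/String;I)J
    }
}
```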
How Bytecode is executed
[Diagram: Java source code -> javac -> .class file (class file creation); the classloader then brings the .class file into the JVM (continuous, Just In Time, compilation)]
Classloaders
• Classes are loaded just before they are needed
• proven by the painful ClassNotFoundException
• Loads the classfile into a Class object
• mechanism for representing classes in the VM
• Example used in watching-classloader
• https://github.com/jpgough/watching-classloader
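As a small illustration of on-demand loading, the sketch below asks the classloader for a class by name and reports whether it could be found. It is not taken from the watching-classloader project; the class name LoadingDemo, the helper canLoad and the name com.example.DoesNotExist are made up for illustration.

```java
public class LoadingDemo {
    /** Attempts to load the named class; returns false if the classloader cannot find it. */
    static boolean canLoad(String className) {
        try {
            Class<?> loaded = Class.forName(className); // triggers loading if not already loaded
            return loaded != null;
        } catch (ClassNotFoundException e) {
            return false; // the failure surfaces only at the point the class is requested
        }
    }

    public static void main(String[] args) {
        System.out.println(canLoad("java.util.ArrayList"));      // a JDK class
        System.out.println(canLoad("com.example.DoesNotExist")); // hypothetical missing class
    }
}
```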
Interpreting Bytecode
[Diagram: Java source code -> javac -> .class file (class file creation); inside the JVM: classloader -> method cache -> interpreter (continuous, Just In Time, compilation)]
Interpreting Bytecode
• Bytecode initially fully interpreted
• Each bytecode instruction is converted to machine instructions as it is executed
• Time not spent compiling code that is only used once
How does Interpreting Help?
• Provides the opportunity to observe code execution paths
• may not be the same for each execution of the app
• The profiler observes the execution and looks for the best optimisations
• Code is compiled after hitting a threshold
• Configurable
• JVM can revert optimisation decisions
Profiling Code
[Diagram: Java source code -> javac -> .class file (class file creation); inside the JVM: classloader -> method cache -> interpreter, feeding a JIT compiler driven by Profile Guided Optimisation (continuous, Just In Time, compilation)]
Profiling Code
• Looking for loops or frequent execution of code blocks
• Barometer used to count the number of executions
• Threshold is reached and mode changes to tracing
• Tracing follows the execution path involving that method
• proactively looking for optimisation opportunities
• often stored as an intermediate representation
• traces are used in the code generation phase
The HotSpot JVM
[Diagram: Java source code -> javac -> .class file (class file creation); inside the JVM: classloader -> method cache -> interpreter, with a JIT compiler made up of a profiler and an emitter (continuous, Just In Time, compilation)]
Viewing Code Compilation
java -XX:-TieredCompilation -XX:+PrintCompilation HelloWorld 2> /dev/null
Columns: time offset (ms), compilation task ID, flags, method name (size of the method's bytecode)
321 40 sun.nio.cs.StreamEncoder::isOpen (5 bytes)
322 41 sun.nio.cs.StreamEncoder::implFlushBuffer (15 bytes)
327 42 sun.nio.cs.StreamEncoder::writeBytes (132 bytes)
331 43 ! java.io.PrintStream::write (69 bytes)
335 44 s java.io.BufferedOutputStream::write (67 bytes)
337 46 java.nio.Buffer::clear (20 bytes)
337 47 java.lang.String::indexOf (7 bytes)
338 48 ! java.io.PrintStream::println (24 bytes)
338 49 java.io.PrintStream::print (13 bytes)
343 50 ! java.io.PrintStream::write (83 bytes)
346 51 ! java.io.PrintStream::newLine (73 bytes)
347 52 java.io.BufferedWriter::newLine (9 bytes)
347 53 % HelloWorld::main @ 2 (23 bytes)
! method has exception handler(s)
s method declared synchronized
n native method (no compilation, generate wrapper)
% on-stack replacement used
Inlining
• Calling a method has an overhead
• creation of a new stack frame
• copying values required to the stack frame
• returning from the stack frame post execution
• Consider a method call in a for loop
public class HelloWorld {
    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            System.err.println("Hello World");
        }
    }
}
Inlining
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining HelloWorld 2> /dev/null
@ 40 java.io.BufferedOutputStream::flush (12 bytes) inline (hot)
\-> TypeProfile (19272/19272 counts) = java/io/BufferedOutputStream
@ 1 java.io.BufferedOutputStream::flushBuffer (29 bytes) inline (hot)
@ 20 java.io.FileOutputStream::write (12 bytes) inline (hot)
\-> TypeProfile (4696/4696 counts) = java/io/FileOutputStream
@ 8 java.io.FileOutputStream::writeBytes (0 bytes) native method
@ 8 java.io.OutputStream::flush (1 bytes) inline (hot)
\-> TypeProfile (7047/7047 counts) = java/io/FileOutputStream
!m @ 13 java.io.PrintStream::println (24 bytes)
@ 6 java.io.PrintStream::print (13 bytes)
!m @ 9 java.io.PrintStream::write (83 bytes) callee is too large
!m @ 10 java.io.PrintStream::newLine (73 bytes) callee is too large
!m @ 13 java.io.PrintStream::println (24 bytes)
@ 6 java.io.PrintStream::print (13 bytes)
!m @ 9 java.io.PrintStream::write (83 bytes) callee is too large
!m @ 10 java.io.PrintStream::newLine (73 bytes) callee is too large
!m @ 13 java.io.PrintStream::println (24 bytes) already compiled into a big method
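To make the effect of inlining concrete, here is a hand-written before and after. It only illustrates the transformation; the JIT performs it on its own intermediate form, not on source code, and the class and method names are made up for illustration.

```java
public class InliningSketch {
    private final int value;

    InliningSketch(int value) {
        this.value = value;
    }

    // Small method: the kind of callee that is a typical inlining candidate.
    int getValue() {
        return value;
    }

    // As written: one method call per loop iteration.
    static long sumWithCalls(InliningSketch[] items) {
        long sum = 0;
        for (InliningSketch item : items) {
            sum += item.getValue();
        }
        return sum;
    }

    // Hand-inlined equivalent: the call is replaced by the callee's body.
    static long sumInlined(InliningSketch[] items) {
        long sum = 0;
        for (InliningSketch item : items) {
            sum += item.value;
        }
        return sum;
    }

    public static void main(String[] args) {
        InliningSketch[] items = { new InliningSketch(1), new InliningSketch(2), new InliningSketch(3) };
        System.out.println(sumWithCalls(items) == sumInlined(items));
    }
}
```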
Common Subexpression Elimination
• Compiler hunts through code for common expressions
• if analysis shows the results are identical, replaces them with a single variable
• Relies on data flow analysis of the program
• which is done during the profiling and tracing part
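A hand-written sketch of the transformation, assuming a repeated subexpression (a + b) whose operands do not change between uses. The names are made up for illustration, and the compiler works on its own intermediate form rather than on source code.

```java
public class CseSketch {
    // As written: (a + b) is evaluated twice.
    static int before(int a, int b, int c, int d) {
        int x = (a + b) * c;
        int y = (a + b) * d;
        return x + y;
    }

    // After elimination: the common subexpression is computed once and reused.
    static int after(int a, int b, int c, int d) {
        int sum = a + b;
        int x = sum * c;
        int y = sum * d;
        return x + y;
    }

    public static void main(String[] args) {
        System.out.println(before(1, 2, 3, 4) == after(1, 2, 3, 4));
    }
}
```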
Dead Code Elimination
• Removes code that is never executed
• shrinks the size of the program
• avoid executing irrelevant operations
• Dynamic dead code elimination
• eliminated based on the possible set of values
• determined at runtime
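The sketch below shows, by hand, what removing dead code amounts to: an unused computation and a branch that can never be taken both disappear. Whether a particular JIT eliminates these exact patterns is an assumption; the names are made up for illustration.

```java
public class DeadCodeSketch {
    // As written: contains work whose result is never used and an unreachable branch.
    static int before(int x) {
        int unused = x * 31;   // computed but never read
        if (x != x) {          // can never be true for an int
            return -1;         // never executed
        }
        return x + 1;
    }

    // After elimination: only the code that affects the result remains.
    static int after(int x) {
        return x + 1;
    }

    public static void main(String[] args) {
        System.out.println(before(5) == after(5));
    }
}
```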
Register Allocation
• Identification of variables suitable for registers
• to avoid cache misses
• improve execution speed of the program
• Uses data from the trace to make an informed decision
Loop-Invariant Code Motion
• Involves removal of code from loops
• for code that doesn’t impact the outcome of the loop
• moved above the loop to avoid unnecessary execution
• Hoisted code can now be cached in a register
• improving performance of the loop execution
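A hand-written illustration of hoisting, assuming a value (a * b) that is the same on every iteration. The names are made up for illustration; the compiler applies the equivalent change to its intermediate form.

```java
public class HoistingSketch {
    // As written: a * b is recomputed on every iteration although it never changes.
    static int before(int[] data, int a, int b) {
        int total = 0;
        for (int i = 0; i < data.length; i++) {
            int scale = a * b;
            total += data[i] * scale;
        }
        return total;
    }

    // After code motion: the invariant computation sits above the loop.
    static int after(int[] data, int a, int b) {
        int scale = a * b;
        int total = 0;
        for (int i = 0; i < data.length; i++) {
            total += data[i] * scale;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = { 1, 2, 3 };
        System.out.println(before(data, 2, 3) == after(data, 2, 3));
    }
}
```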
Escape Analysis
• Introduced in later versions of Java 6
• Analyses code to determine whether an object reference
• is returned from or otherwise leaves the scope of the method
• is stored in global variables
• Allocates unescaped objects on the stack (in HotSpot, by replacing the object with its fields, known as scalar replacement)
• avoids the cost of garbage collection
• prevents workload pressures on Eden
• helps counter the GC impact of high infant mortality
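The difference between an escaping and a non-escaping object can be shown with a tiny example. Whether the allocation is actually removed depends on the JIT and on inlining decisions, so treat this as a sketch; the class and method names are made up for illustration.

```java
public class EscapeSketch {
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // The Point never leaves this method, so it is a candidate for the optimisation.
    static int distanceSquared(int x, int y) {
        Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }

    // The Point is returned to the caller, so it escapes and must be a real heap object.
    static Point create(int x, int y) {
        return new Point(x, y);
    }

    public static void main(String[] args) {
        System.out.println(distanceSquared(3, 4));
        System.out.println(create(1, 2).x);
    }
}
```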
Loop Unrolling
// Before unrolling
private static final String[] RESPONSES =
    { "Yes", "No", "Maybe" };

public void processResponses() {
    for (String response : RESPONSES) {
        process(response);
    }
}

// After unrolling (conceptually)
private static final String[] RESPONSES =
    { "Yes", "No", "Maybe" };

public void processResponses() {
    process(RESPONSES[0]);
    process(RESPONSES[1]);
    process(RESPONSES[2]);
}
Loop Unrolling
@Benchmark
public long intStride1()
{
    long sum = 0;
    for (int i = 0; i < MAX; i++)
    {
        sum += data[i];
    }
    return sum;
}

@Benchmark
public long longStride1()
{
    long sum = 0;
    for (long l = 0; l < MAX; l++)
    {
        sum += data[(int) l];
    }
    return sum;
}

Benchmark                         Mode   Cnt  Score     Error    Units
LoopUnrollingCounter.intStride1   thrpt  200  2423.818  ± 2.547  ops/s
LoopUnrollingCounter.longStride1  thrpt  200  1469.833  ± 0.721  ops/s
Excerpt From: Benjamin J. Evans, James Gough, and Chris Newland. “Optimizing Java.” iBooks.
Loop Unrolling
• Can unroll loops with int, char or short counters
• Can remove safepoint checks
• Removes back branches and branch prediction cost
• Reduces the work needed by each “iteration”
Monomorphic Dispatch
• When HotSpot encounters a virtual call site, often only one type will ever be seen there
• e.g. there's only one implementing class for an interface
• HotSpot can optimize vtable lookup
• Subclasses have the same vtable structure as their parent
• HotSpot can collapse the child into the parent
• Classloading tricks can invalidate monomorphic dispatch
• The class word in the header is checked
• If changed then this optimisation is backed out
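The situation described above looks like this in source form: an interface call site that, in this program, only ever sees a single implementation. Shape, Square and totalArea are made-up names, and whether the JIT actually devirtualises the call depends on the profile it gathers.

```java
public class MonomorphicSketch {
    interface Shape {
        double area();
    }

    // The only Shape implementation in this program.
    static final class Square implements Shape {
        private final double side;

        Square(double side) {
            this.side = side;
        }

        @Override
        public double area() {
            return side * side;
        }
    }

    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape shape : shapes) {
            total += shape.area(); // virtual call site that only ever receives Square
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Square(3) };
        System.out.println(totalArea(shapes));
    }
}
```

Loading a second Shape implementation later and passing it to totalArea is the kind of change that would make this call site no longer monomorphic.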
Code Cache
[Diagram: Java source code -> javac -> .class file (class file creation); inside the JVM: classloader -> method cache -> interpreter, with the JIT compiler (profiler and emitter) writing native code into the code cache (continuous, Just In Time, compilation)]
Code Cache
• The code cache contains the JIT native compiled code
• Code is JIT'd on a per method basis
• 1. This occurs when an entry counter is exceeded
• 2. Intermediate Representation (IR) is built
• 3. Optimisations are applied
• 4. JIT turns IR into native code
• Pointers are swizzled to use the native code
• native code is executed on the next call
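The code cache can be observed from inside a running JVM through the standard management API. The sketch below lists the non-heap memory pools; pool names are JVM-specific, and the assumption here is that on HotSpot the code cache appears among them under a name such as "Code Cache" or one or more "CodeHeap" pools. The class name NonHeapPools is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class NonHeapPools {
    /** Returns one line per non-heap memory pool with its current usage in bytes. */
    static List<String> describeNonHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.NON_HEAP) {
                continue;
            }
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            long used = (usage == null) ? -1 : usage.getUsed();
            lines.add(pool.getName() + ": used=" + used + " bytes");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeNonHeapPools()) {
            System.out.println(line);
        }
    }
}
```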
Introduction to JITWatch
Summary
• Java has carried a brand name of being slow
• Java can emit instructions comparable to C++
• javac doesn’t do much optimisation
• We can make better decisions from profiling at runtime
• JITWatch makes life easier
Performance Landscape
[Diagram: the JVM pipeline (javac, .class file, classloader, method cache, interpreter, JIT compiler with profiler and emitter, code cache) set within the wider performance landscape: hardware, garbage collection, executing code quality, and databases/networks/IO-bound operations]