Use new JVM releases! Java 1.6 reaches EOL (end of life) in autumn 2012!
JVM KEY: -verbose:gc
USE TOOLS
JVM KEY: -XX:+PrintGCDetails
VisualGC
try changing the thread stack size (by default each thread takes 200-300 KB for its stack)
JVM KEY: -Xss<size>
Hack: you can try to replace the HotSpot DLL in 1.6 with HotSpot v23 from JVM v7
JVM v6 ships HotSpot v20, which lacks many critical performance fixes!
try increasing the heap size
JVM KEY: -Xmx<size>
the more memory the GC has, the better it works!
The default GC is chosen according to your platform characteristics! CMS and G1 are never chosen by default!
Serial GC
JVM KEY: -XX:+UseSerialGC
Parallel GC
JVM KEY: -XX:+UseParallelGC
JVM KEY: -XX:+UseParallelOldGC
the only GC that supports NUMA!
the old gen collection was not turned on until 7u2; since 7u2 it is chosen automatically
CMS (Concurrent Mark Sweep)
JVM KEY: -XX:+UseConcMarkSweepGC
does not unload classes by default
JVM KEY: -XX:+CMSClassUnloadingEnabled
Fully supported by Oracle since 7u4!!!
Not recommended for JVM 6!
At present (June 2012) it knows nothing about NUMA!
JVM KEY: -XX:+UseG1GC
the default heap region size is 1 MB
tuning
JVM KEY: -XX:G1HeapRegionSize=n
the maximum value is 32 MB!
set the maximum GC pause (a soft parameter)
JVM KEY: -XX:MaxGCPauseMillis=<milliseconds>
set the allowed time interval between GCs (a soft parameter)
JVM KEY: -XX:GCPauseIntervalMillis=<milliseconds>
good productivity is desired
Throughput
Latency
Glossary
Garbage-First GC (G1) recommended if..
pause length < 0.5-1 s
minimal tuning is desired
the heap size is more than 5-8 GB
the heap is used more than 50%
Java Performance Tuning (2nd Edition), Jack Shirazi
http://www.amazon.com/gp/product/0596003773
Java Performance, Charlie Hunt
http://www.amazon.com/Java-Performance-Charlie-Hunt/dp/0137142528
Books
GC Oracle HotSpot
the object allocation rate varies seriously over time (for instance 500 MB/s by day, 10 MB/s by night)
try changing the GC algorithm
less heap fragmentation is desired (fewer Full GCs)
your currently chosen GC works well
not recommended if..
strict requirement for pauses shorter than 100 ms
the maximal throughput is desired (use ParallelGC)
Information sources and authors of used materials
common GC tuning
[email protected] https://shipilev.net http://www.linkedin.com/in/alekseyshipilev https://shipilev.net/pub/talks/jeeconf-May2012-perfMethodology.pdf https://shipilev.net/pub/talks/jugru-June2012-perfMethodology-hi. v
Aleksey Shipilev JVM problems?
do you need maximal throughput?
still choosing GC?
do you have a heap smaller than 2 GB?
do you need shorter pauses?
do you have a strict requirement for pauses < 20-30 ms?
Materials
Java Performance
don't be afraid to use GC logging in production! the possible overhead is very low!
JVM KEY: -XX:+PrintGCDetails
JVM KEY: -XX:+PrintGCTimeStamps
JVM KEY: -XX:+PrintGCDateStamps
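The counters that the GC log prints can also be read from inside a running application. The following is a minimal sketch, assuming only the standard `java.lang.management` API; the class name `GcStats` and its methods are hypothetical names chosen for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not from the mind map): reads GC counters via the platform MXBeans.
final class GcStats {

    // Sums the collection counts of all collectors; a collector that reports -1 (undefined) is skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector: its name, how many collections ran, and the accumulated time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time ms=" + gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
    }
}
```

The collector names that are printed depend on which GC the JVM selected, so the output differs between platforms and flag sets.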
Sergey Kuksenko
[email protected] http://ru.linkedin.com/pub/sergey-kuksenko/0/b49/b81
The Main information sources
USE TOOLS
JVM KEY: -XX:+PrintHeapAtGC
JVM KEY: -XX:+PrintTenuringDistribution
JVM KEY: -Xloggc:<file>
PrintGCStats
Authors
GChisto
VisualGC
JVM KEY: -XgcPrio:deterministic
JVM KEY: -XpauseTarget=<milliseconds>
Oracle JRockit [email protected] http://ru.linkedin.com/in/iwanowww http://vimeo.com/43574752 http://www.slideshare.net/iwanowww/g1-gc-hotspot-jvm
try changing the GC algorithm
DeterministicGC
Vladimir Ivanov
Garbage Collector
USE TOOLS
JVM KEY: -XX:+PrintCompilation
MXBeans (VisualVM)
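The JIT MXBean that VisualVM displays can be queried directly as well. This is a minimal sketch, assuming the standard `java.lang.management.CompilationMXBean`; the class name `JitStats` and the `-1` fallback value are assumptions made for this example.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not from the mind map): reads JIT statistics via the platform MXBean.
final class JitStats {

    // Returns the accumulated JIT compilation time in milliseconds,
    // or -1 when the JVM has no compiler or does not monitor compilation time.
    static long totalCompilationMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1L;
        }
        return jit.getTotalCompilationTime();
    }

    public static void main(String[] args) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        System.out.println("JIT compiler: " + (jit == null ? "none" : jit.getName()));
        System.out.println("total compilation time, ms: " + totalCompilationMillis());
    }
}
```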
Materials
THE DEFAULT MODE DEPENDS ON THE PLATFORM CHARACTERISTICS!!!
Must be the first switch provided on the command line.
It enables more aggressive optimization. Choose this mode if you need faster execution.
JIT
a method is compiled into native code after 10000 calls (by default)
slow launch?
JVM KEY: -XX:+TieredCompilation
JVM KEY: -XX:MaxInlineSize=<bytecode size>
maybe try a more aggressive inlining strategy?
the default value is 35 (a very small number)
The default value varies with the platform on which the JVM is running.
JVM KEY: -XX:CompileThreshold=<number of calls>
the default value is 10000
Igor Maznitsa
[email protected] http://www.igormaznitsa.com http://ru.linkedin.com/in/igormaznitsa
The mind map preparation and translation
try changing the work mode (select the server mode for the JVM)
JVM KEY: -server
It's a good idea to make warm-up calls for your methods to be compiled into native code
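A warm-up phase can be as simple as calling the hot method more often than the compile threshold before real traffic arrives. The sketch below assumes the default threshold of 10000 calls quoted in this map; `WarmUp`, `checksum` and `warmUp` are hypothetical names, and whether the method is actually compiled still depends on the JVM and its flags.

```java
// Hypothetical example (not from the mind map): warm up a hot method before serving requests.
final class WarmUp {

    // Stand-in for a hot method of the application.
    static long checksum(int[] data) {
        long sum = 0;
        for (int v : data) {
            sum = sum * 31 + v;
        }
        return sum;
    }

    // Calls the hot method 'calls' times; the result is accumulated and returned
    // so that the JIT cannot discard the calls as dead code.
    static long warmUp(int calls) {
        int[] sample = {1, 2, 3, 4, 5, 6, 7, 8};
        long sink = 0;
        for (int i = 0; i < calls; i++) {
            sink += checksum(sample);
        }
        return sink;
    }

    public static void main(String[] args) {
        // 20000 is an assumed value, chosen only because it exceeds the default threshold of 10000.
        long sink = warmUp(20_000);
        System.out.println("warm-up finished, sink=" + sink);
    }
}
```

Running it with `-XX:+PrintCompilation` is one way to check whether `checksum` was compiled during the warm-up.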
JVM KEY: -XX:InlineSmallCode=<size of native code in bytes>
JVM KEY: -XX:FreqInlineSize=<bytecode size>
The default value varies with the platform on which the JVM is running.
JVM KEY: -client
sudo apt-get install sysstat
mpstat - Report processors related statistics
Must be the first switch provided on the command line. Choose this mode if you need faster start-up.
CPU
Oracle JVM has a lot of keys to tune JIT compilation
JVM KEY: -verbose:class
MXBeans
JVM KEY: -noverify
JVM KEY: -Xshare:on
netstat - Print network connections, routing tables, interface statistics
sudo apt-get install iptraf
sudo apt-get install bwm-ng
iptraf - Interactive Colorful IP LAN Monitor
USE TOOLS
Network and IO
Classload
bwm-ng - a live bandwidth monitor for network and disk io
sar - collect, report, or save system activity information
top - display Linux tasks
strace - trace system calls and signals
try disabling class verification
try switching on class data sharing
Ubuntu
USE TOOLS
Profilers
sudo apt-get install linux-tools
perf - Performance analysis tools for Linux
System
too complex algorithms
simplify the algorithms
try algorithms with smaller "performance constants" (ArrayList instead of LinkedList, for instance)
check that you have caching in appropriate places
maybe it is better to create new objects instead of caching in your case?
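The ArrayList versus LinkedList hint can be tried with indexed access, where `ArrayList.get(i)` is constant-time and `LinkedList.get(i)` walks the list. This is a rough sketch, not a rigorous benchmark: `ListConstants`, `fill`, `sumByIndex` and the list size are assumptions for illustration, and timings from a single unwarmed run are only indicative.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical example (not from the mind map): same algorithm, different data structure.
final class ListConstants {

    // Appends 0..n-1 to the given list and returns it.
    static List<Integer> fill(List<Integer> target, int n) {
        for (int i = 0; i < n; i++) {
            target.add(i);
        }
        return target;
    }

    // Sums the list through indexed access; cheap per element on ArrayList, a list walk on LinkedList.
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        return sum;
    }

    // Measures one run of sumByIndex in nanoseconds (crude timing, no warm-up).
    static long timeNanos(List<Integer> list) {
        long start = System.nanoTime();
        long sum = sumByIndex(list);
        long elapsed = System.nanoTime() - start;
        if (sum < 0) {
            throw new IllegalStateException("unexpected negative sum");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 20_000; // assumed size, small enough to finish quickly with LinkedList
        System.out.println("ArrayList,  ns: " + timeNanos(fill(new ArrayList<Integer>(), n)));
        System.out.println("LinkedList, ns: " + timeNanos(fill(new LinkedList<Integer>(), n)));
    }
}
```

For trustworthy numbers a benchmark harness with warm-up iterations would be needed; the sketch only shows that both structures give the same result at very different cost.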
oprofile - a system-wide profiler for Linux systems
vmstat - Report virtual memory statistics
sudo apt-get install numactl
numastat - Print statistics about NUMA memory allocation
TOOLS
algorithmic problems?
data (anti)caching polling
active idle
http://java.net/projects/gchisto
http://java.net/projects/printgcstats/
GChisto - a garbage collection log visualization tool
PrintGCStats - a tool to report garbage collection statistics from HotSpot GC
VisualGC - a plugin for VisualVM
jstack - Stack Trace
jrmc - Oracle JRockit Mission Control profiler
GC
Java
Executor
http://visualvm.java.net/
VisualVM - a visual tool integrating several command line JDK tools
Java Performance Mind-Map (the info is compiled from presentations and the internet.) IT IS NOT AN OFFICIAL DOCUMENT AND MAY HAVE ERRORS! USE IT AT YOUR OWN RISK!
v 1.0.8
too big %usr (mpstat)
spinloops
They look like 100% CPU load
solstudio
USE TOOLS
vtune
perf
hardware counters
try large memory pages
JVM KEY: -XX:+UseLargePages
Out of memory, but usage does not grow
JVM KEY: -XX:PermSize
JVM KEY: -XX:MaxPermSize
Usage grows until out of memory
JVM KEY: -XX:+UseConcMarkSweepGC
JVM KEY: -XX:+CMSPermGenSweepingEnabled
JVM KEY: -XX:+CMSClassUnloadingEnabled
TLB (translation lookaside buffer)
netstat
sar
iptraf
bwm-ng
check network cables and their characteristics!
try decreasing the number of read/write operations
try decreasing the data size of read/write operations
try data compression
try buffering, Bandwidth-Delay Product, MTU
try changing network interfaces to faster ones
try using virtual network interfaces (move your application components into the cloud)
vmstat
mpstat
USE TOOLS
Troubles with PermGen
USE TOOLS
busstat (Solaris)
network utilization too high?
memory bandwidth
try faster memory (dual channel memory instead of single channel memory)
several channels in the IMC (integrated memory controller)
solstudio
USE TOOLS
vtune
perf
hardware counters
try compressed 32-bit pointers
JVM KEY: -XX:+UseCompressedOops
for 64-bit systems!
USE TOOLS
scheduler utilization too high?
shrink your data sets (remember that RAM is a slow entity!)
too big %sys (mpstat)
JVM KEY: -XX:AllocatePrefetchStyle=<N>
capacity
try enabling/disabling the JVM software prefetcher
try limiting the number of threads in your application
top
sar
0 - no prefetch instructions are generated
1 - execute prefetch instructions after each allocation (DEFAULT)
2 - use TLAB allocation watermark pointer to gate when prefetch instructions are executed
JVM KEY: -XX:AllocatePrefetchLines=<number of lines>
USE TOOLS
try adding physical memory
try decreasing memory per process
swappiness
swapping?
JVM KEY: -XX:AllocatePrefetchDistance=<distance in bytes>
try enabling/disabling the hardware prefetcher
Temporal locality
Spatial locality
try block decompositions
try more compact data structures
Complex java.util collections may take 14-30 times more memory per item than the primitive representation!
since Java 6u21
since Java 6u20
GC IS NOT SWAPPING FRIENDLY!!!!
strace
perf
oprofile
tune the kernel
open bugs for the kernel
USE TOOLS
JVM KEY: -XX:+UseCompressedStrings
strings
JVM KEY: -XX:+OptimizeStringConcat
JVM KEY: -XX:+UseStringCache
Java profilers
USE TOOLS
perf
solstudio
hardware counters
memory problems? kernel calls?
mpstat sar
caches USE TOOLS
plain shared memory (no guarantees)
HB (happens-before) via volatile (guarantees that changes are visible)
device communication?
too big %irq,%soft (mpstat)
primitives
Atomics (CAS) (guarantee atomic changes)
Spin-loops
Spin-locks
Locks
Wait-locks
They generate 100% CPU load
synchronized
java.util.concurrent.locks.ReentrantLock
There is a bug in java.util.concurrent.locks.ReentrantReadWriteLock; it has been fixed since 7b25!
consistency
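The CAS primitive listed above is usually used in a retry loop: read the current value, compute the new one, and publish it only if nobody changed the value in between. Below is a minimal sketch of such a loop, assuming `java.util.concurrent.atomic.AtomicLong`; the class `CasMax`, which tracks a running maximum, is a hypothetical example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example (not from the mind map): lock-free running maximum built on CAS.
final class CasMax {

    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Raises the stored maximum to 'candidate' if it is larger; retries when another thread wins the race.
    void offer(long candidate) {
        long current;
        do {
            current = max.get();
            if (candidate <= current) {
                return; // nothing to update, no write is issued
            }
        } while (!max.compareAndSet(current, candidate));
    }

    long get() {
        return max.get();
    }

    public static void main(String[] args) {
        CasMax tracker = new CasMax();
        tracker.offer(3);
        tracker.offer(10);
        tracker.offer(7);
        System.out.println("max = " + tracker.get());
    }
}
```

Under heavy contention the loop may spin several times, which is the trade-off against blocking locks that this branch of the map describes.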
try balancing IRQ processing; maybe only one CPU handles interrupts in the system
check the number of timers in your system
system items?
iostat
sar
caching
buffering
USE TOOLS
many disk operations?
decrease number of disk operations DON'T USE SSD!
choose the right primitive for interthread communication
coherence
expected number of conflicts
expected conflict length
too big %iowait (mpstat)
try noncoherency checks
use a light condition check before hard operations
it doesn't work? it's a hard operation
Locks
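One common reading of "light check before a hard operation" is to test a cheap volatile read first and take the lock only when that check fails. The sketch below illustrates this with lazy initialization; it is an assumption about what the map means, and `LazyConfig`, its fields and the `"loaded"` placeholder are hypothetical.

```java
// Hypothetical example (not from the mind map): cheap volatile check before taking a lock.
final class LazyConfig {

    private static final Object LOCK = new Object();
    private static volatile String value; // read without locking on the fast path
    private static int loads;             // guarded by LOCK, counts executions of the hard path

    static String get() {
        String local = value;             // light operation: a single volatile read
        if (local == null) {
            synchronized (LOCK) {         // hard operation: taken only when the light check fails
                local = value;
                if (local == null) {
                    loads++;
                    local = "loaded";     // placeholder for an expensive computation
                    value = local;
                }
            }
        }
        return local;
    }

    static int loadCount() {
        synchronized (LOCK) {
            return loads;
        }
    }

    public static void main(String[] args) {
        System.out.println(get() + " " + get() + ", hard path executed " + loadCount() + " time(s)");
    }
}
```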
top sar
USE TOOLS
try striping shared places
Queues Counters Immutability
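For counters, striping means spreading updates over several cells so that threads rarely touch the same memory location. A minimal sketch follows, assuming `java.util.concurrent.atomic.LongAdder` (available since Java 8, so newer than the JVMs this map was written for); `StripedCounter` and `countWithThreads` are hypothetical names.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical example (not from the mind map): a striped counter updated by several threads.
final class StripedCounter {

    // Starts 'threads' workers, each incrementing the shared adder 'incrementsPerThread' times,
    // waits for all of them and returns the final sum.
    static long countWithThreads(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    adder.increment(); // contended updates go to separate internal cells
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return adder.sum(); // exact here because all writers have finished
    }

    public static void main(String[] args) {
        System.out.println("total = " + countWithThreads(4, 100_000));
    }
}
```

`sum()` is only a snapshot while writers are still running, so this structure suits statistics counters rather than values that must be read exactly at every moment.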
the file/block cache is not big enough?
techniques
increase memory for caches
don't call flush() too often
try giving up interthread communication altogether
Thread Locals
Check that your threads don't share a java.util.Random object
Use java.util.concurrent.ThreadLocalRandom
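Replacing a shared `java.util.Random` with `ThreadLocalRandom` is a small change at the call site. The sketch below assumes Java 7 or later, where `java.util.concurrent.ThreadLocalRandom` is available; the class `PerThreadRandom` and the die-rolling method are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical example (not from the mind map): per-thread random numbers without a shared seed.
final class PerThreadRandom {

    // Returns a value from 1 to 6; each calling thread uses its own generator,
    // so threads do not contend on a shared seed the way they do with one java.util.Random.
    static int rollDie() {
        return ThreadLocalRandom.current().nextInt(1, 7); // upper bound is exclusive
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() ->
                    System.out.println(Thread.currentThread().getName() + " rolled " + rollDie()));
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }
}
```

`ThreadLocalRandom.current()` should be called in the thread that uses the numbers and not cached in a shared field, otherwise the sharing problem returns.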
vmstat
mpstat
jstack
add/increase parallelization in your application
switch off CMT (chip multithreading) (?)
USE TOOLS
lock profilers (jrmc, etc)
jstack (it will show only very big locks)
USE TOOLS
wait locks
the number of RUNNABLE threads is not enough?
vtune
USE TOOLS
too few GC threads? (a rare case)
USE TOOLS
solstudio
perf
hardware counters
overclocking
the frequency is not enough?
tune cpufreq
check for the "ondemand" mode and change it if it is turned on
Libraries
https://github.com/peter-lawrey/Java-Thread-Affinity
numastat
try restricting communication between cores, packages, data centers
JVM KEY: -XX:+UseNUMA
USE TOOLS
the number of threads is not enough?
check for "false sharing" (cores working with the same cache line)
use object padding
ARM (32 bytes)
x86/SPARC/ARM (64 bytes)
PowerPC (128 bytes)
@sun.misc.Contended (Java 1.8)
Break into independent objects and padding
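Manual padding against false sharing can be sketched as below, assuming a 64-byte cache line as listed for x86. This is an illustration under stated assumptions, not a guaranteed layout: the JVM is free to reorder fields, so the padding may not end up where intended, and `PaddedCounters` with its nested `PaddedCounter` class are hypothetical names.

```java
// Hypothetical example (not from the mind map): two counters padded to reduce false sharing.
final class PaddedCounters {

    static final class PaddedCounter {
        volatile long value;
        // Seven unused longs (56 bytes) meant to fill the rest of an assumed 64-byte cache line.
        // The JVM may lay fields out differently, so this is a best-effort technique.
        long p1, p2, p3, p4, p5, p6, p7;
    }

    // Two threads increment their own counter 'iterations' times; returns both final values.
    static long[] run(int iterations) {
        PaddedCounter a = new PaddedCounter();
        PaddedCounter b = new PaddedCounter();
        // Each counter has exactly one writer, so the non-atomic increment is safe here.
        Thread first = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                a.value++;
            }
        });
        Thread second = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                b.value++;
            }
        });
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return new long[] {a.value, b.value};
    }

    public static void main(String[] args) {
        long[] result = run(1_000_000);
        System.out.println("a=" + result[0] + ", b=" + result[1]);
    }
}
```

Whether the padding helps can only be confirmed by measurement on the target hardware, for example with the hardware counters mentioned in this branch.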
too big %idle (mpstat)
NUMA(NUCA) Non-Uniform Memory Access
Fractal structure
try switching on NUMA
it partly works under Windows (only interleaving)!
try to optimize locks and decrease their number
try lock-free algorithms and data structures
JVM KEY: -verbose:gc
try increasing the number of GC threads
decrease GC pauses
lock critical threads to a CPU (Thread affinity)
CPU problems?
Remember that waking up a thread in Java is an expensive operation: it takes about 50 µs!
try special code
go to native code (JNI)
write your own intrinsics for the JIT
cryptoprocessors
GPU
Java->JNI calls are faster than JNI->Java ones
too hardcore a solution!
the number of execution units is not enough?
try special equipment add more CPUs
Use the Fork/Join framework to parallelize your tasks
try decreasing the number of branches
limited ILP (instruction level parallelism)?
Since Java 1.7
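A typical Fork/Join task splits its input until the pieces are small, then combines the partial results. The following is a minimal sketch, assuming the `java.util.concurrent` Fork/Join classes from Java 7; `ParallelSum` and its `THRESHOLD` of 1000 elements are hypothetical choices for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example (not from the mind map): summing an array with the Fork/Join framework.
final class ParallelSum extends RecursiveTask<Long> {

    private static final int THRESHOLD = 1_000; // assumed cut-off below which work is done sequentially

    private final long[] data;
    private final int from; // inclusive
    private final int to;   // exclusive

    ParallelSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ParallelSum left = new ParallelSum(data, from, mid);
        ParallelSum right = new ParallelSum(data, mid, to);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half here, then wait for the left
    }

    // Sums the whole array in a dedicated pool that is shut down afterwards.
    static long sum(long[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new ParallelSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println("sum = " + sum(data));
    }
}
```

The threshold decides how much splitting overhead is paid; a value that is too small can make the parallel version slower than a plain loop.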
try rewriting the code to decrease LSD (Loop Stream Detector)
make data loosely coupled
it's a very hard approach and mostly impossible from Java