
Linux Performance Tuning

Wednesday, November 10, 2010

1
Logistics

 Tutorial runs from 9:00am to 5:00pm
 Morning break at 10:30am
 Lunch at 12:30pm-1:30pm
 Afternoon break at 3:00pm-3:30pm
 Feel free to ask me questions
 But I reserve the right to defer some answers until later in the session or to the break/end of the class
 Please fill out and return the tutorial evaluation form!
2
Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

3
Introduction to Performance
Tuning
 Complex task that requires in-depth understanding of hardware, software, and application
 If it were easy, the OS would do it automatically (and the OS does a lot automatically to begin with)
 Goals of Performance Tuning
 Speed up the time to perform a single large task (e.g., the time to perform some large matrix calculation)
 Graceful degradation of a web/application server as it is asked to service a larger and larger number of requests
4
Stress Testing

 What happens when a server is put under a large amount of stress?
 “My web server just got slashdotted!”
 Typically the server behaves well until the load increases beyond a certain critical point; then it breaks down.
 Transaction latencies go through the roof
 The server may cease functioning altogether
 Measure the system when it is functioning normally, and then when it is under stress. What changes?
5
Finding Bottlenecks

 Careful tuning of memory usage won't matter if the problem is caused by a shortage of disk bandwidth
 Performance measurement tools are hugely important to diagnose what is placing limits on the scalability or performance of your application
 Start with large areas, then narrow down
 Is your application I/O bound? CPU bound? Network bound?
6
Incremental Tuning

 Use the scientific method
 Establish a baseline
 Define testing parameters which are replicated from test to test
 Measure the performance given a starting configuration
 Change one parameter at a time
 Record everything
 Make sure you get the same results when you repeat a test!
7
Measurement overhead

 Some performance measurement tools may impact your application's behavior
 If you're not familiar with how a particular tool interacts with your workload, don't assume that a tool has zero overhead with your application!
 Enabling application performance metering or debugging may also change its baseline numbers.

8
A basic performance tuning
methodology
 Define your baseline configuration and measure its performance
 [ If appropriate, define a stress test workload, and measure it. ]
 Make a single change to the system configuration. Measure the results of that change and record it.
 Repeat as necessary
 Make sure to test single changes as well as combinations of changes. Sometimes effects are synergistic.
9
Basic Performance Tools

 free
 top
 iostat

10
The free(1) command

 Basic command which shows memory usage

11
Questions to ask yourself after
looking at free(1) output
 Will adding more memory help?
 Often the cheapest way to speed up a server
 If the system is paging or swapping, adding more physical memory may help
 Will a larger page cache help?
 More sophisticated tools will answer these questions later...
 But asking questions is the beginning of wisdom

12
The top(1) command

 Good general place to start

13
Questions to ask yourself when
looking at top(1) output
 What are the “top” tasks running; should they be there? Are they running? Waiting for disk? How much memory are they taking up?
 How is the CPU time (overall) being spent?
 User time, System time, Niced user time, I/O Wait, Hardware IRQ, Software IRQ, “Stolen” time

14
The iostat(1) command

 Part of the sysstat package; shows I/O statistics
 Use -k for kilobytes instead of 512-byte sectors

15
Advanced iostat(1)

 Many more details with the -x option
 rrqm/s, wrqm/s – read/write requests merged per second
 r/s, w/s – read/write requests per second
 rkB/s, wkB/s – number of kilobytes read/written per second
 avgrq-sz – average request size in 512-byte sectors
 avgqu-sz – average request queue length
 …
16
Advanced iostat(1), continued

 Still more details revealed with the -x option
 await – average time (in ms) between when a request is issued and when it is completed (time in queue plus time for the device to service the request)
 svctm – average service time (in ms) for I/O requests that were issued to the device
 %util – percentage of CPU time during which the device was servicing requests (100% means the device is fully saturated)

17
Example of iostat -xk 1

 Workload: “fs_mark -s 10240 -n 1000 -d /mnt”
 Creates 1000 files, each 10k, in /mnt, with an fsync after writing each file
 Result: 33.7 files/second

18
Conclusions we can draw from
the iostat results
 Utilization: 98.48%
 The system is I/O bound
 Adding memory or speeding up the CPU clock won't help
 Solution – attack the I/O bottleneck
 Add more I/O bandwidth resources (use a faster disk or use a RAID array)
 Or, do less work!

19
Speeding up fs_mark

 If we mount the (ext4) file system with -o barrier=0, files/sec becomes 358.3
 But this risks fs corruption after a power failure
 Is the fsync() really needed? Without it, files/sec goes up to 17,010.30
 Depends on application requirements
 Better: use -o journal_async_commit
 Using journal checksums, it allows ext4 to safely use only one barrier per fsync() instead of two. (Requires Linux 2.6.32)
20
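
As a sketch, these options are applied at mount time like this (the device and mount point are placeholders):

    mount -o barrier=0 /dev/sdb1 /mnt              # faster, but risks fs corruption on power failure
    mount -o journal_async_commit /dev/sdb1 /mnt   # safer alternative; requires Linux 2.6.32+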
Using -o journal_async_commit

 Using “fs_mark -s 10240 -n 1000 -d /mnt” again
 Result: 49.2 files/sec (a 46% improvement over 33.7 files/sec!)

21
Comparing the two results

Without journal_async_commit: 33.7 files/sec (2000 barrier ops)
With journal_async_commit: 49.2 files/sec (1000 barrier ops)
22
Before we leave fs_mark...

 How does fs_mark fare on other file systems?
 ext2 (no barriers) – 574.9
 ext3 (no barriers) – 348.8 (w/ barriers) – 30.8
 ext4 (no barriers) – 358.3 (w/ barriers) – 49.2
 XFS (no barriers) – 337.3 (w/ barriers) – 29.0
 reiserfs (no barriers) – 210.0 (w/ barriers) – 31.5
 Important note: these numbers are specific to this workload (small files, fsync-heavy) and are not a general figure of merit for these file systems
23
Lessons Learned So Far

 Measure, analyze, and then tweak
 Bottleneck analysis is critical
 It is very useful to understand how things work under the covers
 Adding more resources is one way to address a bottleneck
 But so is figuring out ways of doing less work!
 Sometimes you can achieve your goal by working smarter, not harder.

24
The snap script

 Handy quickie shell script which a colleague and I developed while working on the Advanced Linux Response Team at IBM
 Collects a lot of statistics: iostat, meminfo, slabinfo, sar, etc. in a low-impact fashion
 Collects system configuration information
 Especially useful when I might not have access to the system for security reasons
 Gather information for a day; then analyze for trends or patterns
25
Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

26
File system and storage tuning

 Choosing the right storage devices
 Hard Drives
 SSD
 RAID
 NFS appliances
 File System Tuning
 General Tips
 File system specific

27
Hard Drives

 Disks are probably the biggest potential bottleneck in your system
 Punched cards and paper tape having fallen out of favor...
 Critical performance specs you should examine
 Sustained data transfer rate
 Rotational speed: 5400rpm, 7200rpm, 10,000rpm
 Areal density (max capacity in that product family)
 Seek time (actually 3 numbers: average, track-to-track, full stroke)
28
Transfer Rates

 The important number is the sustained data transfer rate (aka the “disk to buffer” rate)
 Typically around 70-100 MB/s; slower for laptop drives
 Much less important: the I/O transfer rate
 At least for hard drives, whether you are using SATA I's 1.5 Gb/s or SATA II's 3.0 Gb/s won't matter except in rare cases when transferring data out of the track buffer
 SSDs might be a different story, of course...
29
Short stroking hard drives

 HDD performance is not uniform across the platter
 Up to 100% performance improvement on the “outer edge” of the disk
 Consider partitioning your disk to take this into account!
 If you don't need the full 1TB of space, partitioning your disk to only use the first 100GB or 300GB could speed things up!
 Also – when running benchmarks, use the same partitions for each file system tested.
30
What about SSD's?

 Advantages of SSDs
 Fast random access reads
 Usually fails when writing, not when reading
 Less susceptible to mechanical shock/vibration
 Most SSDs use less power than HDDs
 Disadvantages of SSDs
 Cost per GB much more expensive
 Limited number of write cycles
 Writes are slower than reads; random writes can be much slower (up to a ½ sec average, 2 sec worst case for 4k random writes for really bad SSDs!)
31
Getting the right SSD is important

 A really good website that goes into great detail about this is AnandTech
 http://www.anandtech.com/storage/
 Many of the OEM SSDs included in laptops are not the good SSDs, and you pay the OEM markup to add insult to injury.

32
Should you use SSD's?

 For laptops and desktops, absolutely!
 For servers, it depends...
 If you need fast random access reads, yes!
 If you care about power consumption, be careful
 When idle, SSDs save only 0.2 to 0.4 Watts
 When active, SSDs use roughly the same power as a 5400rpm 2.5” drive, and save 3W or so compared to high-performance 3.5” drives
 For certain workloads, the write endurance problem of SSDs may be a strong concern
33
PCIe attached flash

 Like SSDs, only more so
 Speed achieved by writing to large numbers of flash chips in parallel
 Potentially 100k to 1M 4k random reads / second
 Synchronous 4k random writes are just as slow as SSDs
 Very expensive, but the price is starting to drop
 In some cases, they can be cost effective
 1 server with PCIe-attached flash could replace several servers with HDDs/SSDs in some cases
34
RAID

 Redundant Array of Inexpensive Disks
 RAID 0 – striping
 RAID 1 – mirroring
 RAID 5 – 3 or more disks, with a rotating parity stripe
 RAID 6 – 4 or more disks, with two rotating parity stripes
 RAID 10 – mirroring + striping

35
RAID tuning considerations

 Adding more spindles improves performance
 RAID 5/6 requires some special care
 Writes smaller than N*(stripe size) will require a read/modify/write cycle in order to update the parity stripe (where N is the number of non-spare disks)
 If the RAID device is going to be broken up using LVM or partitions, make sure the LV/partition is aligned on a full stripe boundary (see the sketch below)

36
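
As an illustration of stripe-aware setup, here is a hypothetical sketch for a 4-disk RAID 5 (3 data disks) with a 64KiB chunk size and 4KiB filesystem blocks:

    # stride = chunk size / block size = 64KiB / 4KiB = 16 blocks
    # stripe-width = stride * number of data disks = 16 * 3 = 48 blocks
    mkfs.ext4 -E stride=16,stripe-width=48 /dev/md0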
Filesystem Tuning

 Most general purpose file systems work quite well for most workloads
 But some file systems are better for certain specialized workloads
 Reiserfs – small (< 4k) files
 XFS – very big RAID arrays, very large files
 Ext3 is a good general purpose filesystem that many people use by default
 Ext4 will be better at RAID and larger files, while still working well on small-to-medium sized files
37
Managing Access-time Updates

 POSIX requires that a file's last access time be updated each time its contents are accessed
 This means a disk write for every single read
 The mount options noatime and relatime can reduce this overhead (see the sketch below)
 The relatime option will only update the atime if the mtime or ctime is newer than the last atime
 Only saves approximately half the writes compared to noatime
 Some applications do depend on atime being updated
38
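
A minimal /etc/fstab sketch (the device and mount point are placeholders):

    # relatime is a reasonable compromise; use noatime if no application needs atimes
    /dev/sda2  /data  ext4  noatime  0  2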
Tuning ext3/ext4 journals

 Sometimes increasing the journal size can help, especially if your workload is very metadata-intensive (lots of small files; lots of file creates/deletes/renames)
 Journal data modes
 data=ordered (default) – data is written first, before metadata is committed
 data=journal – data is written into the journal
 data=writeback – only metadata is logged; after a crash, uninitialized data can appear in newly allocated data blocks
39
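
A sketch of how these knobs are set, assuming an ext4 filesystem on /dev/sdb1 that is currently unmounted (the device and journal size are placeholders):

    tune2fs -O ^has_journal /dev/sdb1          # remove the existing journal
    tune2fs -J size=400 /dev/sdb1              # recreate it at 400MB
    mount -o data=writeback /dev/sdb1 /mnt     # journal data mode is chosen at mount time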
Using ionice to control read/write
priorities
 Like the nice command, but affects the priority of read/write requests issued by the process
 Three scheduling classes
 Idle – served only if there are no other higher priority requests pending
 Best-effort – requests served round-robin (default)
 Real-time – highest priority request always gets access
 For the best-effort and real-time classes, there are 8 priorities, with 0 being the highest priority and 7 the lowest (see the sketch below)
40
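
For example (the PID and paths are placeholders):

    ionice -c 3 tar cf /backup/home.tar /home   # idle class: runs only when the disk is otherwise idle
    ionice -c 2 -n 0 -p 12345                   # best-effort class, highest priority, existing process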
Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

41
Network Tuning

 Before you do anything else... check the basic health of the network (see the sketch below)
 Speed, duplex, errors
 Tools: ethtool, ifconfig, ping
 Check TCP throughput: ttcp or nttcp
 Look for “weird stuff” using wireshark / tcpdump
 Network is a shared resource
 Who else is using it?
 What are the bottlenecks in the network topology?
42
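
A quick health-check sketch (the interface and host names are placeholders):

    ethtool eth0        # verify negotiated speed and duplex
    ifconfig eth0       # look for RX/TX errors, drops, and overruns
    ping -c 10 server   # baseline packet loss and round-trip latency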
Latency vs Throughput

 Latency
 When applications need maximum responsiveness
 Lockstep protocols (i.e., no sliding window optimizations)
 RPC-based protocols
 Throughput
 When transferring large data sets
 Very often tuning efforts will trade off latency for throughput or vice versa
43
Interrupt Coalescing

 This reduces CPU load by amortizing the cost of an interrupt over multiple packets; it allows us to trade off latency for throughput
 “ethtool -C ethX rx-usecs 80 rx-frames 20”
 This will delay a receive interrupt for 80 μs or until 20 packets are received, whichever comes first
 “ethtool -C ethX rx-usecs 0 rx-frames 1”
 This will cause an interrupt to be sent for every packet received
 Different NICs will have different defaults and may have additional tuning parameters
44
Enable NIC optimizations

 Some device drivers don't enable these features by default
 You can check using “ethtool -k eth0”
 TCP segmentation offload
 “ethtool -K eth0 tso on”
 Checksum offload
 “ethtool -K eth0 tx on rx on”
 Large receive offload (for throughput)
 “ethtool -K eth0 lro on”
45
The bandwidth-delay product

 Very important when optimizing for throughput, especially for high speed, long distance links
 Represents the amount of data that can be “in flight” at any particular point in time
 BDP = 2 * bandwidth * (one-way) delay
 BDP = bandwidth * Round Trip Time (RTT)
 Example: (100 Mbits/sec / 8 bits/byte) * 50 ms ping time = 625 kbytes

46
Why the BDP matters

 TCP has to be able to retransmit any dropped packets, so the kernel has to remember what data has been sent in case it needs to retransmit it
 TCP Window
 Limits on the size of the TCP window control the kernel memory consumed by the networking stack

47
Using the BDP

 The BDP in bytes, plus some overhead room, should be used as [wmax] below when setting these parameters in /etc/sysctl.conf:
 net.core.rmem_max = [wmax]
 Maximum socket receive buffer size
 net.core.wmem_max = [wmax]
 Maximum socket send buffer size
 net.core.rmem_max is also known as /proc/sys/net/core/rmem_max
 e.g., set via “echo 2097152 > /proc/sys/net/core/rmem_max”
48
Per-socket /etc/sysctl.conf
settings
 net.ipv4.tcp_rmem = [wmin] [wstd] [wmax]
 receive buffer sizing in bytes (per socket)
 net.ipv4.tcp_wmem = [wmin] [wstd] [wmax]
 memory reserved for send buffers in bytes (per socket)
 Modern kernels do automatic tuning of the receive and send buffers, and the defaults are better; still, if your BDP is very high, you may need to boost [wstd] and [wmax] (see the sketch below). Keep [wmin] small for out-of-memory situations.
49
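
As a sketch, /etc/sysctl.conf settings sized for the 625-kbyte BDP example from earlier might look like this (the values are illustrative; round up and leave headroom):

    net.core.rmem_max = 1048576
    net.core.wmem_max = 1048576
    net.ipv4.tcp_rmem = 4096 87380 1048576
    net.ipv4.tcp_wmem = 4096 65536 1048576

Run “sysctl -p” to apply the changes.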
For large numbers of TCP
connections
 net.ipv4.tcp_mem = [pmin] [pdef] [pmax]
 pages allowed to be used by TCP (for all sockets)
 For 32-bit x86 systems, kernel text & data (including TCP buffers) can only be in the low 896MB
 So on 32-bit x86 systems, do not adjust these numbers, since they are needed to balance memory usage with other lowmem users
 If this is a problem, your best bet is to switch to a 64-bit x86 system first
50
Increase transmit queue length

 The ethernet default of 100 is good for most networks, and where we need to balance interactive responsiveness with large transfers
 However, for high speed networks and bulk transfer, this needs to be increased to some value between 1000-50000
 “ifconfig eth0 txqueuelen 2000”
 Tradeoffs: more kernel memory used; interactive response may be impacted
 Experiment with ttcp to find the smallest value that works for your network/application
51
Optimizing for Low Latency TCP

 This can be very painful, because TCP is not really designed for low latency applications
 TCP is engineered to worry about congestion control on wide-area networks, and to optimize for throughput on large data streams
 If you are writing your own application from scratch, basing your own protocol on UDP is often a better bet
 Do you really need a byte-oriented service?
 Do you only need automatic retransmission to deal with lost packets?
52
Nagle Algorithm

 Goal: make networking more efficient by batching small writes into a bigger packet
 When the OS gets a small amount of data (a single keystroke in a telnet connection), it delays a very small amount of time to see if more bytes will be coming
 This naturally increases latency!
 Requires an application-level change to disable:
 int on = 1;
 setsockopt(sockfd, SOL_TCP, TCP_NODELAY, &on, sizeof(on));
53
Delayed Acknowledgements

 On the receiver end, wait a small amount of time before sending a bare acknowledgement, to see if there's more data coming (or if the program will send a response upon which you can piggy-back your acknowledgement)
 This can interact with TCP slow-start to cause longer latencies when the send window is initially small
 After congestion, or after the TCP connection has been idle, the send window (maximum bytes of unacked data) must be set down to the MSS value
54
Solving the Delayed Ack problem

 Disable the slow-start algorithm on the sender?
 Slow-start is a MUST implement (RFC 2581)
 Disable delayed acknowledgments on the receiver?
 Delayed acknowledgments are a SHOULD (RFC 2581)
 Some OS's have a way of disabling delayed acknowledgments; Linux does not
 There is a hack that works on a per-packet basis, though...
55
Enabling QUICKACK
 Linux tries to be “clever” and automatically figure out when to disable delayed acknowledgments when it believes the other side is in slow start
 Hack to force “quickack” mode:
 int on = 1;
 setsockopt(sockfd, SOL_TCP, TCP_QUICKACK, &on, sizeof(on));
 But QUICKACK mode is disabled once the other side is done with slow start, so you have to re-enable it any time the connection is idle for longer than the retransmission timeout
56
Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

57
NFS Performance tuning

 Optimize both your network and your filesystem
 In addition, various client and server specific settings that we'll discuss now
 General hint: use dedicated NFS servers
 NFS file serving uses all parts of your system: CPU time, memory, disk bandwidth, network bandwidth, PCI bus bandwidth
 Trying to run applications on your NFS servers will make both NFS and the apps run slowly

58
Tuning an NFS Server

 If you only export file system mountpoints, use the no_subtree_check option in /etc/exports
 Subtree checking can burn large amounts of CPU for metadata-intensive workloads
 Bump up the number of NFS threads to a large number (it doesn't hurt that much to have too many). Say, 128... instead of 4 or 8, which is way too little. How to do this is distro-specific (see the sketch below):
 /etc/sysconfig/nfs
 /etc/default/nfs-kernel-server
59
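
A sketch of both settings (the export path, network, and thread count are placeholders; the variable name varies by distro):

    # /etc/exports
    /export  192.168.1.0/24(rw,no_subtree_check)

    # /etc/sysconfig/nfs (Red Hat-style distros)
    RPCNFSDCOUNT=128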
PCI Bus tuning

 NFS serving puts heavy demands on both networking cards and host bus adapters
 If you have a system with multiple PCI buses, put the networking and storage cards on different buses
 Network cards tend to use lots of small DMA transfers, which tends to hog the bus

60
NFS client tuning

 Make sure you use NFSv3 and not NFSv2
 Make sure you use TCP and not UDP
 Use the largest rsize/wsize that the client/server kernels support
 Modern clients/servers can do a megabyte at a time
 Use the hard mount option, and not soft
 Use intr so you can recover if an NFS server is down
 All of these are the default except for intr
 Remove outdated fstab mount options. Just use “rw,intr” (see the sketch below)
61
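
A minimal /etc/fstab sketch (the server name and paths are placeholders):

    server:/export  /mnt/nfs  nfs  rw,intr  0  0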
Tuning your network config for
NFS
 Tune the network for bulk transfers (throughput)
 Use the largest MTU size you can
 For ethernets, consider using jumbo frames if all of the intervening switches/routers support it

62
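
For example, assuming eth0 and a path that supports jumbo frames end to end:

    ifconfig eth0 mtu 9000   # only if every switch/router on the path supports it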
Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

63
Memory Tuning

 Memory tuning problems can often look like other problems
 Unneeded I/O caused by excessive paging/swapping
 Extra CPU time caused by cache/TLB thrashing
 Extra CPU time caused by NUMA-induced memory access latencies
 These subtleties require using more sophisticated performance measurement tools
64
To measure swapping activity

 The top(1) and free(1) commands will both tell you if any swap space is in use
 To a first approximation, if there is any swap in use, the system can be made faster by adding more RAM
 To see current swap activity, use the sar(8) program
 First use of a very handy (and rather complicated) system activity recorder program; reading through the man page is strongly recommended
 Part of the sysstat package
65
Using sar to obtain swapping
information
 Use “sar -W <interval> [<num. of samples>]”
 Reports the number of pages written to (swapped out) and read from (swapped in) the swap device per second
 The first output is the average since the system was started

66
Optimizing swapping

 Use multiple swap devices
 Use fast swap devices
 Fast devices can be given a higher priority (see the sketch below)
 Add more memory to avoid swapping in the first place

67
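
A sketch of prioritized swap devices (the device names are placeholders; higher pri values are used first):

    swapon -p 10 /dev/sdb2   # fast device, preferred
    swapon -p 5  /dev/sdc2   # slower device, used as overflow
    # or equivalently in /etc/fstab:
    /dev/sdb2  none  swap  sw,pri=10  0  0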
Swapping vs. Paging

 Swap is used for anonymous pages
 i.e., pages which are not backed by a file
 Pages which are backed by a file are subject to paging
 If they have been modified, or made dirty, they are “cleaned” by being written to their backing store
 If a page has not been used recently, it is “deactivated” by removing it from processes' page tables
 Clean and inactive pages may be repurposed for other uses on an LRU basis
68
Optimizing Paging

 Unlike swapping, some amount of paging is normal – and unavoidable
 So we can't just manage the amount of paging to zero, like we can with swapping
 Goal: minimize the amount of paging in the steady-state case
 Key statistics:
 majflt/s – major faults (which result in I/O) / second
 pgsteal/s – pages reclaimed from the page and swap cache / second to satisfy memory demands
69
Using sar to obtain information
about paging
 Use “sar -B <interval> [<num. of samples>]”
 Reports many statistics
 pgpgin/s, pgpgout/s – ignore; not useful/misleading
 fault/s – # of page faults / sec.
 majflt/s – # of page faults that result in I/O / sec.
 pgfree/s – # of pages placed on the free list / sec.
 pgscank/s – # of pages scanned by kswapd / sec.
 pgscand/s – # of pages scanned directly / sec.
 pgsteal/s – # of pages reclaimed from cache / sec.
 %vmeff – pgsteal/s / (pgscank/s + pgscand/s)
70
Other ways of finding information
about memory utilization
 cat /proc/meminfo
 Something especially important on 32-bit x86 kernels: Low Memory vs. High Memory
 Documentation/filesystems/proc.txt
 cat /proc/slabinfo
 Useful for seeing how the kernel is using memory
 ALT-sysrq-m (or 'echo m > /proc/sysrq-trigger')
 Different for different kernel versions and distributions; /proc/slabinfo may not exist if CONFIG_SLUB is used instead of CONFIG_SLAB
71
/proc/meminfo

72
Interesting bits from sysrq-m

 Per-zone statistics

73
About Memory Caches

 A 2GHz processor executes 2 billion cycles per second
 Memory is much slower
 Solution: use small amounts of fast cache memory
 Typically 32KB of very fast Level 1 cache
 Maybe 4-8MB of somewhat slower Level 2 cache
 You can see how much cache you have using dmidecode and x86info
 Not much tuning that can be done, except by improving the C/C++ program code
74
TLB Caches

 The Translation Lookaside Buffer (TLB) speeds up translation from a virtual address to a physical address
 Normally translation requires 2-3 lookups in the page tables
 The TLB cache short-circuits this lookup process
 The x86info program will show the TLB cache layout
 Hugepages are a way to avoid consuming too many TLB cache entries
75
Using hugepages

 Build a kernel that avoids using modules
 The core kernel text segment uses huge pages; modules do not
 Modify an application to use hugepages (or configure an application to use them if it already has provision to use hugepages)
 “mount -t hugetlbfs none /hugepages”, then mmap files in /hugepages
 On new qemu/kvm, you can use the option “-mem-path /hugepages”
 Use shmget(2) with the flag SHM_HUGETLB
76
Configuring hugepages

 On most enterprise distros this must be done at boot time or shortly after it
 Kernel boot option “hugepages=nnn”
 /etc/sysctl.conf: “vm.nr_hugepages=nnn” (see the sketch below)
 These pages are reserved for hugepages and can not be used for anything else
 With kernels newer than 2.6.23, things are more flexible
 Kernel boot option “movablecore=nnn[KMG]”
 Memory reserved this way can be used for hugepages and other uses
77
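
A runtime sketch (the page count is a placeholder):

    echo "vm.nr_hugepages = 512" >> /etc/sysctl.conf
    sysctl -p                  # apply the reservation
    grep Huge /proc/meminfo    # HugePages_Total / HugePages_Free show the result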
Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

78
Application Tuning

 Access to the source code?
 Open source vs. proprietary
 Ability/willingness to modify the code?
 Even if it's open source, you might not want to modify the code
 Proprietary programs
 Read the documentation; find the knobs and find the application-level statistics you can gather
 … but there are still some tricks we can do to figure out what is going on when you don't have the source...
79
A quick aside: Java Performance
Tuning
 I'm not a Java programmer... but I've worked with a lot of Java performance tuning experts
 First thing to consider is garbage collection
 The GC is overhead that burns CPU time
 GC can cause unpredictable pauses in the program
 Collecting GC stats: JVM command-line option -verbose:gc
 Sizing the heap
 Larger heap means fewer GCs
 … but more time spent GC'ing when you do
80
Generational GC

 Observation: objects in Java have a high infant mortality rate
 Temporary objects, etc.
 So put them in separate arenas
 An object starts in the nursery (aka eden) space. The nursery is GC'ed more frequently.
 Objects which survive a certain number of GC passes get promoted from the nursery to a tenured space (which is GC'ed less frequently)
 Need to configure the size of the nursery and tenured space (see the sketch below)
81
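
An illustrative HotSpot invocation (the class name and sizes are placeholders; exact flags vary across JVMs):

    # fixed 2GB heap, 512MB nursery, GC events logged to stdout
    java -verbose:gc -Xms2g -Xmx2g -Xmn512m MyApp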
Reducing GC's by not creating as
much Garbage
 Requires being able to modify the code
 Very often, though, Java programmers can make extra work for the Java Runtime Environment without realizing it
 Two common examples
 Using String and Integer class variables to do calculations (instead of StringBuffer and the primitive int type)
 Using java.util.Map instead of creating a class
82
Back to C/C++ applications

 Tools for investigating applications
 strace/ltrace
 valgrind
 gprof
 oprofile
 perf
 Most of these tools work better if you have source access
 But sometimes source is not absolutely required
83
strace and ltrace

 Useful for seeing what the application is doing
 Especially useful when you don't have source
 System call tracing: strace
 Shared library tracing: ltrace
 Run a new command with tracing:
 strace /bin/ls /usr
 Attach to an already existing process:
 ltrace -p 12345
84
Valgrind

 Used for finding memory leaks and other memory access bugs
 Best used with source access (compiled with -g), but not strictly necessary
 Works by emulating x86 on x86 and adding checks to pointer references and malloc/free calls
 Other architectures are supported
 Commercial alternative: Purify (uses object code insertion)
85
C/C++ profiling using gprof

 To use, compile your code using the -pg option (see the sketch below)
 This will add code to the compiled binary to track each function call and its caller
 In addition, the program counter is sampled by the kernel at some regular interval (e.g., 100Hz or 1kHz) to find the “hot spots”
 Demo time!

86
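
The basic workflow, as a sketch (myapp is a placeholder):

    gcc -pg -O2 -o myapp myapp.c   # instrument at compile and link time
    ./myapp                        # a normal run writes gmon.out to the current directory
    gprof ./myapp gmon.out | less  # flat profile plus call graph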
System profiling using oprofile

 Basic operation very similar to gprof
 Sample the program counter at regular intervals
 Advantages over gprof
 Does not require recompiling the application with -pg
 Can profile multiple processes and the kernel all at the same time (see the sketch below)
 Demo time!

87
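
A sketch using the classic opcontrol interface (run as root; the binary name is a placeholder):

    opcontrol --no-vmlinux   # or --vmlinux=/path/to/vmlinux for kernel symbols
    opcontrol --start
    ./myapp
    opcontrol --stop
    opreport -l ./myapp      # per-symbol sample counts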
Perf: the next generation

 Originally intended to be a way to access performance counters
 Added the ability to sample kernel tracepoints
 Sampling can be restricted to a process, a process and its children, or the whole system
 With perf record / perf report / perf annotate, performance events can be tied to specific C/C++ lines of code (with source and object files compiled with -g); see the sketch below
 Demo time!
88
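
A minimal sketch (myapp is a placeholder):

    perf record -g ./myapp   # sample with call-graph information
    perf report              # ranked functions
    perf annotate            # per-source-line hot spots (binary built with -g)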
Userspace Locking

 One other application issue which can be a very big deal: userspace locking
 Rip out fancy multi-level locking (i.e., user-space spinlocks, sched_yield() calls, etc.)
 Just use pthread mutexes, and be happy
 Linux implements pthread mutexes using the futex(2) system call, which avoids a kernel context switch except in the contended case
 The fast path really is fast! (So no need for fancy/complex multi-level locking – just rip it out)
89
Processor Affinity

 Rarely a good idea... but can be used to improve response time for critical tasks
 Set CPU affinity for tasks using taskset(1)
 Set CPU affinity for interrupt handlers using /proc/irq/<nn>/smp_affinity
 Strategies (see the sketch below)
 Put producer/consumer processes on the same CPU
 Move interrupt handlers to a different CPU
 Use mpstat(1) and /proc/interrupts to get processor-related statistics
90
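
A sketch of both mechanisms (the CPU numbers, PID, and IRQ number are placeholders; smp_affinity takes a hex CPU bitmask):

    taskset -c 0 ./server                 # start a task pinned to CPU 0
    taskset -p -c 1 12345                 # move existing PID 12345 to CPU 1
    echo 2 > /proc/irq/24/smp_affinity    # route IRQ 24 to CPU 1 (bitmask 0x2)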
Conclusion

 Performance tuning is fractal
 There's always more to tweak
 “It's more addictive than pistachios!”
 Understanding when to stop is important
 Great way of learning more up and down the technology stack – from the CPU chip up through the OS to application tuning

91
