COVER STORY: Performance Tuning Toolbox

TUNING TOOLBOX

Tools and techniques for performance tuning in Linux

Tune up your systems and search out bottlenecks with these handy performance tools. BY TIM CHEN, ALEX SHI, AND YANMIN ZHANG

ISSUE 100 MARCH 2009

Over the past several years, the Linux Kernel Performance Project [1] has tracked the performance of Linux and tuned it for throughput and power efficiency on Intel platforms. This experience has given us some insights into the best tools and techniques for tuning Linux systems. In this article, we describe some of our favorite Linux performance utilities and provide a real-world example that shows how the Kernel Performance Project uses these tools to hunt down and solve a real Linux performance issue.

Finding Bottlenecks

The first task in performance tuning is to identify any bottlenecks that might be slowing down system performance. The most common bottlenecks occur in I/O, memory management, or the scheduler. Linux offers a suite of tools for examining system use and searching out bottlenecks. Some tools reveal the general health of the system, and other tools offer information about specific system components.

The vmstat utility offers a useful summary of overall system performance. Listing 1 shows vmstat data collected every two seconds for a CPU-intensive, multi-threaded Java workload. The first two columns (r, b) describe how many processes in the system can be run if a CPU is available and how many are blocked. The presence of both blocked processes and idle time in the system is usually a sign of trouble.

Listing 1: vmstat Output

    # vmstat 2
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy id wa
     7  0  34328 757464   2712  26416    0    0     0     0   12 616773 34 28 37  0

The next four columns under memory show how much memory space is used. Frequently swapping memory in and out of the disk swap space slows the system. The cache column gives the amount of memory used as a page cache. A bigger cache means more files cached in memory. The two columns under io, bi and bo, indicate the number of blocks received from and sent to block devices, respectively, which gives an idea of the level of disk activity. The two columns under system, in and cs, reveal the number of interrupts and context switches.

If the interrupt rate is too high, you can use an interrupt utility, like sar, to help uncover the cause. The command sar -I XALL 10 1000 will break down the source of the interrupts every 10 seconds for 1000 samples. A high number of context switches relative to the number of processes is undesirable because of flushing of cached data.
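The rule of thumb about blocked processes and idle time can be checked mechanically. The following is a minimal sketch, not part of the original toolset: the function name vmstat_flag is hypothetical, and the field positions assume the column order shown in Listing 1.

```shell
# Read vmstat output on stdin and flag samples in which blocked processes (b)
# coexist with idle CPU time (id). Field positions assume the column order of
# Listing 1: r=1, b=2, cs=12, id=15. Header lines are skipped.
vmstat_flag() {
    awk '$1 ~ /^[0-9]+$/ {
        if ($2 > 0 && $15 > 0)
            print "warn: blocked=" $2 " idle=" $15 "%"
        else
            print "ok: runnable=" $1 " cs=" $12
    }'
}
```

A typical call would be vmstat 2 5 | vmstat_flag. A warning line is only a hint to look closer, not a diagnosis.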


The next four columns in Listing 1, us, sy, id, and wa, indicate the percentage of time the CPU(s) has spent in userspace applications, in the kernel, being idle, or waiting for I/O, respectively. This output shows whether the CPUs are doing useful work or whether they are just idling or being blocked. A high percentage of time spent in the OS could indicate a non-optimal system call. Idle time for a fully loaded system could point to lock contention.

Disk Performance

Hdparm is a good tool for determining whether the disks are healthy and configured:

    # hdparm -tT /dev/sda
    /dev/sda:
     Timing buffered disk reads: 184 MB in 3.02 seconds = 60.88 MB/sec
     Timing cached reads: 11724 MB in 2.00 seconds = 5870.80 MB/sec

The preceding command displays the speed of reading through the buffer cache to the disk, with and without any prior caching of data. The uncached speed should be somewhat close to the raw speed of the disk. If this value is too low, you should check in your BIOS to see whether the disk mode is configured properly. Also, you could check the hard disk parameter setting for an IDE disk

    # hdparm -I /dev/hda

or for a SCSI disk:

    # sdparm /dev/sda

To study the health of a run-time workload's I/O, use iostat. For example, Listing 2 shows how to use iostat to dump a workload's I/O statistics. If %iowait is high, CPUs are idle and waiting for outstanding disk I/O requests. In that case, try modifying the workloads to use asynchronous I/O or dedicate a thread to file I/O so workload execution doesn't stop.

The other parameter to check is the number of queued I/O requests: avgqu-sz. This value should be less than 1 or disk I/O will significantly slow things down. The %util parameter also indicates the percentage of time the disk has requests and is a good indication of how busy the disk is.

Listing 2: iostat

    # iostat -x sda 1
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    2.16   20.86    0.00   76.98

    Device:    rrqm/s wrqm/s     r/s  w/s    rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
    sda      17184.16   0.00 1222.77 0.00 147271.29   0.00   120.44     3.08  2.52  0.81 99.01

CPU Cycles

One important way to identify a performance problem is to determine how the system is spending its CPU cycles. The oprofile utility can help you study the CPU to this end. Oprofile usually is enabled by default. If you compile your own kernel, then you need to make sure that the kernel configs CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y are turned on.

Figure 1: Profiling the kernel with oprofile.

Listing 3: Viewing Profile Data with oprofile

    CPU: Core 2, speed 2400 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 1200000
    samples  %        app name            symbol name
    295397   63.6911  cc1                 (no symbols)
    22861     4.9291  vmlinux-2.6.25-rc9  clear_page_c
    11382     2.4541  libc-2.5.so         memset
    10959     2.3629  genksyms            yylex
    9256      1.9957  libc-2.5.so         _int_malloc
    6076      1.3101  vmlinux-2.6.25-rc9  page_fault
    5378      1.1596  libc-2.5.so         memcpy
    5178      1.1164  vmlinux-2.6.25-rc9  handle_mm_fault
    3857      0.8316  genksyms            yyparse
    3822      0.8241  libc-2.5.so         strlen
    ...
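Whether CONFIG_OPROFILE and CONFIG_HAVE_OPROFILE are set to y can be checked against a saved kernel configuration. This is a minimal sketch under stated assumptions: the function name kconfig_builtin is hypothetical, and the default path /boot/config-<release> is only a common distribution convention, so pass the path to your own .config if it lives elsewhere.

```shell
# Report whether a kernel option is set to y in a given config file.
# $1: option name; $2: config file. The default path is an assumption; many
# distributions install the running kernel's config as /boot/config-<release>.
kconfig_builtin() {
    opt=$1
    cfg=${2:-/boot/config-$(uname -r)}
    if grep -q "^${opt}=y\$" "$cfg" 2>/dev/null; then
        echo "$opt=y"
    else
        echo "$opt is not set to y"
    fi
}
```

For example, kconfig_builtin CONFIG_OPROFILE /usr/src/linux/.config checks a source tree, where the path shown is only an illustration.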


The easiest way to invoke oprofile is with the oprofile GUI that wraps the command-line options. To do so, use oprofile 0.9.3 or later for an Intel Core 2 processor and install the oprofile-gui package. Now invoke

    # oprof_start

to bring up the Start profiler screen with Setup and Configuration tabs (Figure 1). First, select the Configuration tab. If you want to profile the kernel, enter the location of the kernel image file (that is, the uncompressed vmlinux file if you compile the kernel from source). Now return to the Setup tab.

In the Events table, select the CPU_CLK_UNHALTED event and the unit mask Unhalted core cycles. Note: Normally, you do not need to sample the system any more often than the setting listed in the Count field. A lower count means that fewer events will need to happen before a sample is taken, thus increasing the sampling frequency. Now run the application you want to profile, and start oprofile by clicking on the Start button. When the application has stopped running, click the Stop button.

To view the profile data, invoke:

    # opreport -l

The output for this command is shown in Listing 3. Listing 3 shows the percentage of CPU time spent in each application or kernel, and it also shows the functions that are being executed. This report reveals the code in which the system is spending the most time; using this data as a basis for optimization should improve performance.

If you have collected call graph information, type the command

    # opreport -c

to obtain the output shown in Listing 4. Listing 4 shows that this workload has some very heavy memory allocation activity associated with getting free memory pages and clearing them.

Listing 4: opreport Output

    CPU: Core 2, speed 2400 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 1200000
    samples  %        image name          app name            symbol name
    -------------------------------------------------------------------------------
    295397   63.6911  cc1                 cc1                 (no symbols)
    295397   100.000  cc1                 cc1                 (no symbols) [self]
    -------------------------------------------------------------------------------
    1         0.0044  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  path_walk
    2         0.0087  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  __alloc_pages
    2         0.0087  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  mntput_no_expire
    22922    99.9782  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  get_page_from_freelist
    22861     4.9291  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  clear_page_c
    22861    99.7121  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  clear_page_c [self]
    36        0.1570  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  apic_timer_interrupt
    24        0.1047  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  ret_from_intr
    3         0.0131  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  smp_apic_timer_interrupt
    2         0.0087  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  mntput_no_expire
    1         0.0044  vmlinux-2.6.25-rc9  vmlinux-2.6.25-rc9  __link_path_walk
    -------------------------------------------------------------------------------
    11382     2.4541  libc-2.5.so         libc-2.5.so         memset
    11382    100.000  libc-2.5.so         libc-2.5.so         memset [self]
    -------------------------------------------------------------------------------
    10959     2.3629  genksyms            genksyms            yylex
    10959    100.000  genksyms            genksyms            yylex [self]
    ...

Too Many Cache Misses?

The performance of the system is highly dependent on the effectiveness of the cache. Any cache miss will degrade performance and lead to a CPU stall. Sometimes a cache miss is caused by frequently used fields located in data structures that span cache lines. Oprofile can diagnose this kind of problem.

Again, using the Intel Core 2 processor as an example, choose the event LLC_MISSES to profile all the L2 cache requests that miss the L2 cache. For the exact event to use, you should invoke opcontrol --list-events to read about the details of each event type available for your CPU.

Listing 5 shows how to call up a cache miss profile.

Listing 5: Cache Miss Profile

    # opreport -l
    CPU: Core 2, speed 1801 MHz (estimated)
    Counted L2_RQSTS events (number of L2 cache requests) with a unit mask of 0x41 (multiple flags) count 90050
    samples  %        app name                 symbol name
    2803     63.4163  cc1                      (no symbols)
    190       4.2986  vmlinux-2.6.25-rc9-ltop  get_page_from_freelist
    102       2.3077  as                       (no symbols)
    60        1.3575  vmlinux-2.6.25-rc9-ltop  __lock_acquire
    53        1.1991  libc-2.7.so              strcmp
    39        0.8824  vmlinux-2.6.25-rc9-ltop  unmap_vmas
    38        0.8597  vmlinux-2.6.25-rc9-ltop  list_del

Oprofile is a very versatile tool. By carefully choosing which events to monitor, you can zero in on the CPU operation that is causing the problem.
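When a symbol-level report such as Listing 5 gets long, it can help to see first which binary accounts for most of the sampled events. The sketch below is an illustration rather than an oprofile feature: samples_by_app is a hypothetical helper, and it assumes the four-column "samples, %, app name, symbol name" layout shown in Listing 5.

```shell
# Sum opreport -l samples per application (third column) and list the
# heaviest first. Assumes the "samples % app name symbol name" layout of
# Listing 5; header lines are skipped because their first field is not a number.
samples_by_app() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ { total[$3] += $1 }
         END { for (app in total) print total[app], app }' | sort -rn
}
```

A typical call would be opreport -l | samples_by_app. Reports that add an image name column (as in Listing 4) would need the field number adjusted.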


Locking Problems

A high context switching rate, relative to the number of running processes, is undesirable and could indicate a lock contention problem. To determine the most contended locks, enable the lock statistics in the kernel, which will give you insight into what is causing the lock contention. To do so, use the lock_stat feature in 2.6.23 or later kernels. First, you'll need to recompile the kernel with the CONFIG_LOCK_STAT=y option. Then, before running the workloads, clear the statistics with:

    # echo 0 > /proc/lock_stat

After running the workload, review the lock statistics with the following command:

    # cat /proc/lock_stat

The output of the preceding command is a list of locks in the kernel sorted by the number of contentions. For each lock, you will see the number of contentions, as well as the shortest, maximum, and cumulative wait time for a contention. In addition, you will see the number of acquisitions, as well as the minimum, maximum, and cumulative hold times for a lock. The top call sites of the lock are also given to let you locate quickly where in the kernel the lock occurs.

It is worth noting that the lock statistics infrastructure incurs overhead. Once you have finished hunting for locks, you should disable this feature to maximize performance.

Excessive Latency

Program throughput that is inconsistent and sputters, applications that seem to go to sleep before coming alive, and a lot of processes under the blocked column in vmstat are often signs of latency in the system. LatencyTOP is a new tool that helps diagnose latency issues.

Starting with the 2.6.25 kernel, you can compile LatencyTOP support into the kernel by enabling the CONFIG_HAVE_LATENCYTOP_SUPPORT=y and CONFIG_LATENCYTOP=y options in the kernel configuration. After booting up the kernel with LatencyTOP capability, you can trace latency in the workload with a userspace latency tracing tool from the LatencyTOP website [2]. To start, compile the tool, do a make install of the LatencyTOP program, and run the following as root:

    # ./latencytop

The LatencyTOP program's top screen (Figure 2) provides a periodic dump of the top causes that lead to processes being blocked, sorted by the maximum blocked time for each cause. Also, you'll find information on the percentage of time a particular cause contributed to the total blocked time. The bottom screen provides similar information on a per-process basis.

Figure 2: Studying system latency with LatencyTOP.

An Example

Linux provides quick allocation and deallocation of frequently used objects in caches called slabs. To provide better performance, Christoph Lameter introduced a new slab manager called SLUB.

Listing 6: Starting with vmstat

    procs -----------memory---------- --swap--- ---io---- --system--- -----cpu-----
      r b swpd     free  buff  cache si so bi bo  in     cs us  sy id wa st
    360 0    0 15730644 17980 120336  0  0  0  0 320 140047  0 100  0  0  0
    327 0    0 15739216 17980 120336  0  0  0  0 322 256259  1  99  0  0  0
    412 0    0 15743084 17988 120336  0  0  0 16 282  74537  0 100  0  0  0
    421 0    0 15741076 17988 120336  0  0  0  0 311  51750  0 100  0  0  0
    334 0    0 15745048 17988 120332  0  0  0  0 295  95434  0 100  0  0  0
    468 0    0 15747460 17988 120336  0  0  0  0 251  94440  0 100  0  0  0
    373 0    0 15750844 17988 120336  0  0  0  0 268 104569  0 100  0  0  0
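A run of vmstat samples like Listing 6 can be condensed into averages, which makes the ratio of context switches to runnable processes easier to compare between machines. This is a minimal sketch: vmstat_avg is a hypothetical helper, and it assumes r is the first field and cs the twelfth, as in Listing 6.

```shell
# Average the run-queue length (r, field 1) and context switches (cs, field 12)
# over vmstat samples read from stdin, and report context switches per
# runnable process. Header lines are skipped.
vmstat_avg() {
    awk '$1 ~ /^[0-9]+$/ { r += $1; cs += $12; n++ }
         END {
             if (n && r > 0)
                 printf "samples=%d avg_r=%.0f avg_cs=%.0f cs_per_r=%.0f\n", n, r / n, cs / n, cs / r
         }'
}
```

Fed the seven samples of Listing 6, the averages come to 385 runnable processes and roughly 117,000 context switches per sample.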


However, we found that the scheduler performance benchmark known as hackbench reveals a big difference in run time with kernel 2.6.24/2.6.25-rc, between a system with 16 CPU cores and a system with eight CPU cores. Hackbench is expected to be faster on the 16-core system than on the 8-core system, but the testing result shows the first machine requires three times more run time than the second machine, which indicates a possible performance issue.

The vmstat utility provides the output shown in Listing 6. Notice the high context switch (cs) count and large number of running processes. In this case, hackbench simulates many chat rooms with a large number of users passing messages back and forth in each room. The lack of idle time in the system indicates that the CPU is very busy.

The next step is to use oprofile to find out where the CPU is spending its time. The oprofile data in Listing 7 shows that about 88% of the CPU time is spent in allocating slabs, adding to partially filled slabs, and freeing slabs. It shows that the benchmark generates lots of messages that are allocated and passed between processes and memory management, and that is where the program is spending the most time.

Listing 7: Studying CPU Usage with oprofile

    CPU: Core 2, speed 1602 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
    samples   %        image name        app name          symbol name
    46746994  43.3801  linux-2.6.25-rc4  linux-2.6.25-rc4  __slab_alloc
    45986635  42.6745  linux-2.6.25-rc4  linux-2.6.25-rc4  add_partial
    2577578    2.3919  linux-2.6.25-rc4  linux-2.6.25-rc4  __slab_free
    1301644    1.2079  linux-2.6.25-rc4  linux-2.6.25-rc4  sock_alloc_send_skb
    1185888    1.1005  linux-2.6.25-rc4  linux-2.6.25-rc4  copy_user_generic_string
    969847     0.9000  linux-2.6.25-rc4  linux-2.6.25-rc4  unix_stream_recvmsg
    806665     0.7486  linux-2.6.25-rc4  linux-2.6.25-rc4  kmem_cache_alloc
    731059     0.6784  linux-2.6.25-rc4  linux-2.6.25-rc4  unix_stream_sendmsg

This result indicates the need to take a closer look at what is going on with the slabs. A utility called slabinfo provides a report on slab activity. (The source code for the slabinfo utility is with the kernel source under Documentation/vm/slabinfo.c.) To obtain information about the most actively used objects, invoke the slabinfo utility (see Listing 8).

The block objects, size 192 and 512, are actively used by hackbench messages: One is for the socket buffer header and one is for the message body. Basically, the SLUB implementation keeps a per-cpu cache for each slab type. When the kernel allocates an object, it checks the per-cpu cache first without locking. Such allocation is very fast and is called a fast path. If the per-cpu cache has no free objects, the kernel allocates from shared pages with a lock, which is slow.
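The fast-path percentages that slabinfo reports can be screened for caches that fall back to the slow path often. The following is only a sketch: slow_path_caches is a hypothetical helper, the 90 percent default is an arbitrary illustration, and the field positions assume the six-column layout of Listing 8 (name, objects, allocations, frees, then the fast-path percentages for allocation and free).

```shell
# List slab caches whose fast-path share falls below a threshold (percent).
# Assumes the Listing 8 layout: name, objects, allocs, frees, then the
# fast-path percentages for allocation (field 5) and free (field 6).
# $1: threshold in percent (default 90, an arbitrary illustrative value).
slow_path_caches() {
    limit=${1:-90}
    awk -v limit="$limit" '$5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ && ($5 < limit || $6 < limit) {
        print $1 ": alloc " $5 "% fast, free " $6 "% fast"
    }'
}
```

A typical call would be slabinfo -AD | slow_path_caches 90. Other slabinfo versions may print different columns, so check the header before trusting the field numbers.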


Listing 8: slabinfo

    # slabinfo -AD
    Name            Objects     Alloc      Free  %Fast
    :0000192           3428  80093958  80090708  92   8
    :0000512            374  80016030  80015715  68   7
    vm_area_struct     2875    224524    221868  94  20
    :0000064          12408    134273    122227  98  47
    :0004096             24    127397    127395  99  98
    :0000128           4596     57837     53432  97  48
    dentry            15659     51402     35824  95  64
    :0000016           4584     29327     27161  99  76
    :0000080          12784     33674     21206  99  97
    :0000096           2998     26264     23757  99  93

A slow path means more lock contention. The free procedure also has a fast path and a slow path. Because free uses a distributed lock (page lock) and the allocation process uses more exclusive locks, allocation by fast path is more important.

For these two objects, we noted that the free operation is quite slow; however, allocation is not fast, either. For example, for objects of size 512, only 68% of allocation is by fast path, and 7% of free is by fast path.

To reduce the slow path allocation, we could ask for a bigger sized slab to increase the per-cpu object cache. To increase the default max_order of 1 and min_objects of 32, we add slub_max_order=3 slub_min_objects=32 to the kernel boot command line. This increases the number of objects that must fit into one slab for an allocation to be successful, which will reduce the chance that the kernel allocates objects by slow path.

This step improved the throughput significantly, requiring just one tenth the time needed in the previous test. By extensive testing with different slub_min_objects settings, we found the correlation between slub_min_objects and the CPU number. Mostly, we get the best result with slub_min_objects=cpu_number*2. If slub_min_objects is equal to a bigger value, the result doesn't provide much improvement.

At this point, we went back to the 8-core machine and did extensive testing to confirm our findings. After we discussed the problem with the SLUB maintainers, a patch that scales slub_min_objects, as a function of the number of CPU cores, was merged into the Linux kernel.

Conclusions

In this article, we provided a quick tour of some useful tools for diagnosing common performance issues. Of course, this brief introduction is not intended as a comprehensive description of the performance tuning craft, but it should provide you with a good starting point for discovering and fixing performance bottlenecks on your Linux systems.

INFO

[1] Linux Kernel Performance Project: http://kernel-perf.sourceforge.net
[2] LatencyTOP: http://www.latencytop.org/index.php
[3] PowerTOP: http://www.lesswatts.org/projects/powertop
[4] Less Watts: http://www.lesswatts.org/
THE AUTHORS

Tim Chen is a staff engineer of the Open Source Technology Center at Intel Corporation. His current focus is mainly on Linux performance. Before working at Intel, he worked at Trillium Digital Systems on telecommunications systems and at Hughes Space and Communications on mobile satellite systems. He graduated from UCLA in 1995 with a Ph.D. degree in Electrical Engineering.

Alex Shi joined Intel's Open Source Technology Center as a software engineer in 2005. He works on Linux performance and power tuning.

Yanmin Zhang, from Open Source Technology Center of Intel Corporation, has worked on Linux projects for five years, including processor and chipset enabling, which cover Intel i386, x86-64, and Itanium architectures and PCI-Express. He is currently working on the Linux Kernel Performance project. Before joining Intel, Yanmin worked for Bell Labs Lucent Technology on network management system development.

Power Performance

Power consumption is another aspect of system performance. Most recent processors are equipped with processor performance states (P-states) and sleep states (C-states). If the system is not fully loaded, it is better to switch to a P-state that operates the processor at a lower frequency and voltage. If the processor is idle, the system should switch to a sleep state.

To take advantage of these features, make sure the BIOS Speed Step and C-state features are enabled. To take advantage of the P-state feature in the CPU, you need to make sure that a suitable CPU frequency governor is enabled for the system. To see what governors are available, use:

    # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
    ondemand userspace performance

With the following command, you can determine the current governor:

    # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

The ondemand governor has the best power-saving characteristics and is typically recommended, whereas the performance governor will put the CPU at the maximum frequency and voltage. To switch to the ondemand governor, issue the following command:

    # echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

To take advantage of the CPU C-states, you need to enable the tickless idle feature in the kernel. The Linux kernel has a periodic timer tick that wakes up the CPU. This tick prevents the CPU from going into the sleep state. With the recent addition of the tickless idle, the Linux kernel removed this timer tick, which allows the CPU to sleep for a longer time in power-saving mode. If you compile your own kernel, you should enable the option CONFIG_NO_HZ=y.

The PowerTOP utility [3] is a useful tool for checking P-state and C-state status in the system. PowerTOP will show the current P-state and C-state, report on which applications wake up the CPU, and provide additional power-saving hints tailored to your system.

Additional power-saving tips can be found at the Less Watts website [4].
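The governor commands in the Power Performance box address cpu0 only. Assuming each CPU exposes its own cpufreq directory under the same sysfs path, a loop can apply one governor to all of them. This is a minimal sketch: set_governor is a hypothetical helper, and its optional second argument (an alternative root directory) is an addition made here so the function can be tried on a copy of the tree without root privileges.

```shell
# Apply one cpufreq governor to every CPU that exposes a cpufreq directory.
# $1: governor name; $2: sysfs CPU root (defaults to the path used in the
#     text; the override is an assumption added for dry runs on a copy).
set_governor() {
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        # Skip CPUs whose driver does not offer the requested governor.
        if grep -qw "$gov" "$dir/scaling_available_governors" 2>/dev/null; then
            echo "$gov" > "$dir/scaling_governor"
        else
            echo "skip: $dir does not offer $gov" >&2
        fi
    done
}
```

Run as root, set_governor ondemand would switch every CPU that lists ondemand among its available governors.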
