Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
David Kaeli Kai Sachs (Eds.)
Computer Performance
Evaluation
and Benchmarking
Volume Editors
David Kaeli
Northeastern University
Department of Electrical and Computer Engineering
360 Huntington Ave., Boston, MA 02115, USA
E-mail: [email protected]
Kai Sachs
Technische Universität Darmstadt
Dept. of Computer Science
Schlossgartenstr. 73, 64289 Darmstadt, Germany
E-mail: [email protected]
CR Subject Classification (1998): B.2.4, B.2.2, B.3.3, B.8, C.1, B.1, B.7.1
ISSN 0302-9743
ISBN-10 3-540-93798-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-93798-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12603886 06/3180 543210
Preface
This volume contains the set of papers presented at the SPEC Benchmark Work-
shop 2009, held on January 25, 2009 in Austin, Texas, USA. The program included eight
refereed papers, a keynote talk on virtualization technology benchmarking, an
invited paper on power benchmarking and a panel on multi-core benchmarking.
Each refereed paper was reviewed by at least four Program Committee members.
The result is a collection of high-quality papers discussing current issues in the
area of benchmarking research and technology.
A number of people contributed to the success of this workshop. Rudi Eigen-
mann served as General Chair and ably handled many of the details involved
with providing a high-quality meeting. We would like to thank the members of
the Program Committee for their time and effort in arriving at a high-quality
program. We would also like to acknowledge the guidance provided by the SPEC
Workshop Steering Committee.
We would like to thank the staff at Springer for their cooperation and support.
We want to particularly recognize Dianne Rice for her assistance and guidance,
and also Kathy Power, Cathy Sandifer and the whole SPEC office for their help.
And finally, we want to thank all SPEC members for their continued support
and sponsorship of this meeting.
Executive Committee
General Chair Rudi Eigenmann (Purdue University, USA)
Program Chair David Kaeli (Northeastern University, USA)
Publication Chair Kai Sachs (TU Darmstadt, Germany)
Program Committee
Jose Nelson Amaral University of Alberta, USA
Umesh Bellur Indian Institute of Technology Bombay, India
Anton Chernoff AMD, USA
Lieven Eeckhout University of Ghent, Belgium
Rudi Eigenmann Purdue University, USA
Jose Gonzalez Intel Barcelona, Spain
John L. Henning Sun Microsystems, USA
Lizy K. John University of Texas at Austin, USA
David Kaeli Northeastern University, USA
Helen Karatza Aristotle University of Thessaloniki, Greece
Samuel Kounev Universität Karlsruhe (TH), Germany
Tao Li University of Florida, USA
David Lilja University of Minnesota, USA
Christoph Lindemann University of Leipzig, Germany
John Mashey Consultant, USA
Jeffrey Reilly Intel Corporation, USA
Kai Sachs TU Darmstadt, Germany
Resit Sendag University of Rhode Island, USA
Erich Strohmaier Lawrence Berkeley National Laboratory, USA
Bronis Supinski Lawrence Livermore National Laboratory, USA
Petr Tůma Charles University in Prague, Czech Republic
Reinhold Weicker (formerly) Fujitsu Siemens, Germany
Benchmark Suites
SPECrate2006: Alternatives Considered, Lessons Learned . . . 1
John L. Henning
SPECjvm2008 Performance Characterization . . . 17
Kumar Shiv, Kingsum Chow, Yanping Wang, and Dmitry Petrochenko
CPU Benchmarking
Performance Characterization of Itanium® 2-Based Montecito Processor . . . 36
Darshan Desai, Gerolf F. Hoflehner, Arun Kejariwal, Daniel M. Lavery, Alexandru Nicolau, Alexander V. Veidenbaum, and Cameron McNairy
A Tale of Two Processors: Revisiting the RISC-CISC Debate . . . 57
Ciji Isen, Lizy K. John, and Eugene John
Investigating Cache Parameters of x86 Family Processors . . . 77
Vlastimil Babka and Petr Tůma
Power/Thermal Benchmarking
The Next Frontier for Power/Performance Benchmarking: Energy Efficiency of Storage Subsystems . . . 97
Klaus-Dieter Lange
Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors Using Geospatial-Based Predictive Models . . . 102
Chang-Burm Cho, Wangyuan Zhang, and Tao Li
SPECrate2006: Alternatives Considered, Lessons Learned

John L. Henning
Sun Microsystems
[email protected]
The bottom line metrics mentioned thus far are called “speed” metrics, and
are analogous to speed of travel in the real world in that higher numbers are
better, and numbers are comparable. If a sports car takes 1/4 the time to get
to Cleveland that a truck takes, we routinely say that the sports car is 4x as fast; and if
a new laptop finishes a well-defined task in 1/4 as much time as an old desktop
computer, it seems natural to call the laptop 4x as fast as the desktop.
Adding a Throughput Metric. The speed tests run only one copy of each
component benchmark at a time, leaving resources idle on multi-processor sys-
tems. SPEC addressed this problem in 1992 by adding throughput tests that
allow the tester to run multiple copies of identical benchmarks. For example, in
a 32-copy SPECint_rate2006 test, the SPEC tool set starts 32 copies of 400.perlbench,
waits for all of them to complete; records the time from start of first to
finish of last; then starts 32 copies of 401.bzip2, and so forth.
The fact that all copies are running the same workload is the reason that
SPECrate was originally known as the “Homogeneous Capacity Method” [1].
The details of the metric calculation have varied somewhat as the suites have
evolved, but in all cases a score is calculated for each benchmark which is pro-
portional to the number of copies run divided by the time required to complete
the copies. The bottom line metrics (e.g. SPECint_rate95, SPECfp_rate2000) are
the geometric means of the benchmark scores.
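To make the calculation concrete, here is a minimal Python sketch of the idea; it is not the SPEC tool set, and the per-benchmark reference times used for normalization are hypothetical placeholders.

from math import prod

def specrate_score(copies, elapsed_seconds, reference_seconds):
    # Proportional to copies completed per unit time, normalized by a
    # (hypothetical) single-copy reference time for that benchmark.
    return copies * reference_seconds / elapsed_seconds

def bottom_line(per_benchmark_scores):
    # The bottom-line metric is the geometric mean of the per-benchmark scores.
    return prod(per_benchmark_scores) ** (1.0 / len(per_benchmark_scores))

# Hypothetical 32-copy run over three benchmarks:
scores = [specrate_score(32, 1200.0, 9770.0),
          specrate_score(32, 950.0, 9650.0),
          specrate_score(32, 1400.0, 7020.0)]
print(bottom_line(scores))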
Interpretation of SPEC CPU throughput metrics is somewhat less intuitive
than the speed metrics. For example, if a laptop has a SPECint_rate2006 score
of 10, and a server has a SPECint_rate2006 score of 20, it is not immediately
obvious if the better result is achieved by running twice as many copies in the
same time, or by running the same number of copies in 1/2 the time, or by
some other method. The full reports provide the additional level of detail for the
motivated reader.
4.1 Heterogeneous
One can think of Table 1 as implicitly containing 12 phases: between each row there is a pause to wait for
all of the row to finish. No such pause occurs with Table 2. In the heterogeneous
prototype, each queue runs independently.
Difficulties with the heterogeneous method. On the assumption that such resource
stresses are useful to study, reducing their levels in a heterogeneous workload
is bad, because it makes them less apparent and harder to analyze. A hetero-
geneous workload also makes it much more difficult to reproduce performance
conditions. For example, suppose that 255.vortex runs more slowly than desired.
To reproduce its conditions from Table 1, to a first approximation, one can sim-
ply run the 4 copies of vortex. To reproduce its conditions from Table 2, it is
necessary to run the whole suite. One cannot try to just run selected “rows”,
because the rows in Table 2 do not represent separate phases.
$ specinvoke -h
-S msecs sleep between spawning copies (in milliseconds)
$ runspec --stag
Option stag is ambiguous (stagger, staggeredhomogenousrate)
The specinvoke [11] utility provides a help message that tells us that stag-
gers are expressed in milliseconds. The first runspec command tricks the switch
parser into reminding us how to spell its undocumented switches, and the sec-
ond runspec command runs 2 copies of the test workload for the benchmark
473.astar, with a delay of 6 seconds between each copy.
As a reminder, the staggered homogeneous prototype is unsupported. If
you play with it, remember that anything you learn from it
cannot be represented as an official SPEC metric. If you do decide to use it,
you will probably find it easiest to discern what it did by looking in the run
directory:
(Footnote 1: With the notable exception of hardware and OS support for the instruction stream.
For SPECrate, each copy has its own data, but all use the same program binary,
allowing the OS the opportunity to load only one copy into physical memory. In a
heterogeneous context, obviously, multiple program binaries are active.)
$ cd $SPEC/benchspec/CPU2006/473.astar/run/run*000
$ cat speccmds.out
timer ticks over every 1000 ns
running commands in speccmds.cmd 1 times
runs started at 1225226364, 29870000, Tue Oct 28 16:39:24 2008
run 1 started at 1225226364, 29876000, Tue Oct 28 16:39:24 2008
child started: 0, 1225226364, 29883000, pid=3147,
’../run_base_test_oct14a.0000/astar_base.oct14a lake.cfg’
child started: 1, 1225226370, 30218000, pid=3148,
’../run_base_test_oct14a.0000/astar_base.oct14a lake.cfg’
child finished: 0, 1225226376, 980432000, sec=12, nsec=950549000,
pid=3147, rc=0
child finished: 1, 1225226383, 556000, sec=12, nsec=970338000,
pid=3148, rc=0
run 1 finished at: 1225226383, 562000, Tue Oct 28 16:39:43 2008
run 1 elapsed time: 18, 970686000, 18.970686000
runs finished at 1225226383, 597000, Tue Oct 28 16:39:43 2008
runs elapsed time: 18, 970727000, 18.970727000
Notice above that the two copies were started 6 seconds apart (1225226364 and
1225226370 seconds after Jan. 1, 1970), each took just under 13 seconds, and
the total elapsed time was just under 19 seconds. The bottom line includes the
time for the stagger, as it is measured from start-of-first copy to finish-of-last.
One might want to consider other ways of calculating a bottom line. (Reminder:
any use of the prototype may not be represented as an official SPEC metric.)
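For readers who want to repeat the timing arithmetic, a small Python sketch along the following lines can pull the child start/finish timestamps out of a speccmds.out file of the format shown above; the field layout (epoch seconds, then nanoseconds) is assumed from the sample output, not taken from SPEC documentation.

import re

def child_events(path):
    # Second numeric field is seconds since the epoch, third is nanoseconds.
    pat = re.compile(r"child (started|finished): \d+, (\d+), (\d+)")
    starts, finishes = [], []
    with open(path) as f:
        for line in f:
            m = pat.match(line.strip())
            if m:
                t = int(m.group(2)) + int(m.group(3)) / 1e9
                (starts if m.group(1) == "started" else finishes).append(t)
    return starts, finishes

starts, finishes = child_events("speccmds.out")
print("stagger between first two copies:", starts[1] - starts[0], "seconds")
print("start-of-first to finish-of-last:", max(finishes) - min(starts), "seconds")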
Difficulties with the staggered homogeneous method. Should the metric include
the stagger time? If so, unless the staggers are very small, too much idle time
may be included. Alternatively, one might try to exclude the staggers by, for
example, calculating time from start-of-last to finish-of-first; a disadvantage of
this approach is that it could cause performance to be overstated if one copy has
more hardware resources than others (e.g. a 16-chip, 64-core system with 4 copies
on 15 of the chips, but only 1 copy on the last). Perhaps the most attractive
alternative would be to attempt to achieve a steady state of repeated execution,
with all processors busy, running staggered workloads; one would compute a
metric that sampled execution time for complete jobs during the steady state.
The primary disadvantage of this approach is that the suite is sometimes already
criticized as taking too long; running repeated workloads to ramp up to a steady
state was not viewed as attractive.
5 Applying SPECrate2006
SPECrate provides a useful window into how systems perform when stressed
by multiple requests for similar resources such as program startup, data ini-
tialization, translation lookaside buffer (TLB) requests, and memory allocation.
It is understood that in real life, an OS is unlikely to get 128 simultaneous
identical requests, so one must be careful not to over-optimize to this, or to
any other, benchmark. Nevertheless, the homogeneity may be its virtue: in
real life, systems do have to deal with intense requests, traffic jams do occur,
and SPECrate presents a compute-intensive workload that is repeatable and
analyzable.
In this section, three case studies are briefly summarized from applying
SPECrate2006 to Solaris systems.
Methods. In order to focus on the second part of the benchmark, the utility
convert_to_development [10] was applied to allow modifications to the ref
workload while still using the SPEC tools. The first workload was deleted, leav-
ing only ref.mps in the directory 450.soplex/data/ref/input. Then, 128 run di-
rectories were populated on a large server using runspec --action setup. The
actual runs were done using specinvoke -r [11]. In order to avoid unwanted file
caching effects (which would not be effective in a full reportable run), memory
was cleared between tests by running large copies of STREAM [17] and reading
a series of unrelated files. CPU and IO activity were observed using iostat 30.
Metrics. As each run began, CPU utilization was low, and disk activity high, as
128 copies of ref.mps were read. Eventually, the io kps fell to zero and the tested
processors achieved 100% utilization. Two metrics are reported: (1) Startup time
in minutes, determined by counting the 2-per-minute iostat records prior to 100%
utilization; (2) kps from the busy period (converted to MB/sec).
Baseline. When a single 10K RPM disk was used, startup required about 24
minutes, reading at about 24 MB/sec.
Software RAID. When Solaris Volume Manager was used with the default 16 KB
block size (known as an “interlace size” in the terminology of SVM) on an
A5200 Fiber Channel disk array with 6x 10K RPM disks, startup fell to about
20 minutes, reading about 30 MB/sec. With a block size of 256 KB, startup
improved to about 8 minutes and 72 MB/sec. For this read-intensive workload,
RAID-0 was not particularly faster than RAID-5. Increasing the number of disks
in the stripe set had little additional effect on performance, as the maximum
observed bandwidth for this somewhat older disk system was about 78 MB/sec.
Hardware RAID. A newer hardware RAID Array, the Sun StorageTek 2540 with
6x 15K RPM disks, did not show sensitivity to block size (called “segment size”
for this device) over the tested range of 16 KB through 512 KB. This insensitivity
may be viewed as a plus, since it may be hard to know in advance what block
size to choose. The bandwidth was about 97 MB/sec, roughly matching the
limit of the 1 Gb Host Bus Adapter (HBA) used in this test. Once again, read
performance was insensitive to use of RAID-0 vs. RAID-5. Further improvement
might be possible with a higher bandwidth HBA.
Divot summary. With hardware RAID, a performance divot of idle CPUs wait-
ing on I/O was reduced from 24 minutes to 6 minutes, which is a 4:1 improvement
over the original single-disk configuration.
Lessons for tuning other systems. Even in an allegedly CPU intensive environ-
ment, IO lurks. Hardware RAID may offload overhead from the server.
         base    peak
run #1   86.04   86.56
run #2   86.09   86.98
run #3   85.82   63.52

[Table 4 header: Metric, Peak Run 2, Peak Run 3; the per-copy rows are not reproduced here.]
Analysis: Variation by copy. Recall from the metrics discussion at the beginning
of this paper that reported benchmark scores depend on the time from start of
first copy to completion of last. Therefore, a primary goal for the tester is to
attempt to achieve consistency across all tested copies – in this case, 127 copies
on a 2-chip system. Table 4 summarizes the copy-by-copy times in the second
and third peak runs.
In Table 4, times are normalized to the median time from Peak Run 2. Notice
the consistency in Peak Run 2, with the worst of the 127 copies needing only
1.04% more time than the median time. By contrast, the slowest copy in Peak
Run 3 needed 38.37% more time than the median of Peak Run 2. The problems
in Peak Run 3 are not widespread; in fact, only 6 of the 127 copies were slow.
If these 6 copies were eliminated, as shown in the second half of the table, the
two runs would match each other. Unfortunately for the tester the metrics do
not allow post-processing to eliminate the slow copies.
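A minimal sketch of that normalization, assuming per-copy times are available as plain lists of seconds (the numbers below are placeholders, not the actual Table 4 data):

import statistics

def normalize(copy_times, reference_times):
    # Table 4 normalizes every copy time to the median time of Peak Run 2.
    ref_median = statistics.median(reference_times)
    return [t / ref_median for t in copy_times]

peak_run2 = [100.0, 100.2, 100.5, 101.0]   # placeholder per-copy times
peak_run3 = [100.1, 100.3, 138.4, 100.6]   # one artificially slow copy
print(max(normalize(peak_run2, peak_run2)))  # worst copy close to the median
print(max(normalize(peak_run3, peak_run2)))  # the slow copy stands out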
Considerable time was spent trying to trace the source of the occasional
poor copy time for 436.cactusADM, which sometimes was up to 2x worse than
the expected time. Analysis of experiment logs did not indicate any particular
pattern to the degraded performance. Sometimes, a handful of copies would be
slow; often, none would be slow. The slow performance did not appear to be tied
to system state, nor to particular virtual processors, as it would move around
from one CPU to another. Attempts to instrument the tests were often met by
a failure to reproduce the slow performance.
Smoking gun. Eventually, a bad run was caught with trapstat -T [14]:
cpu dtsb-miss %tim
7 4138331 59.8
11 4117256 60.2
14 4135205 59.9
21 4114273 60.4
23 4139823 59.5
In the trapstat output, it can be seen that various copies (on virtual pro-
cessors 7, 11, 14, 21, 23) are estimated to be spending about 60% of their time
processing TSB misses. Once this was found, the solution to the variability of
436.cactusADM was straightforward. As mentioned above, the hardware allows
TSBs to be expanded, and Solaris supports the hardware feature with a pair
of tunables: enable_tsb_rss_sizing and tsb_rss_factor [16]. The former is on by
default; the latter provides a measure of how full TSBs have to be before they
become candidates for resizing. As can be seen in SPEC CPU submissions from
early 2008, this Solaris tuning parameter has been used, and 436.cactusADM
performance has been steady. For example, in a large SPECrate submission
with 630 copies, the three runs differed from each other by less than 1% [8].
If per-copy results are analyzed (as in Table 4), the worst time across all 1890
copies differs from the median by only 1.52%.
Lessons for tuning other systems. The default TSB sizing is adequate for most
applications, especially if large pages are employed. If it is suspected that large
applications (e.g. more than 1 GB, with 4 MB pages) may be running more
slowly than desired, trapstat -T can be used to check for TSB activity, and if
it is found, tsb_rss_factor can be decreased.
[Figures 1 and 2: per-copy completion time in seconds versus processor number (0 to 144); the y-axis of Figure 1 runs to 1200 seconds and that of Figure 2 to 3000 seconds.]
Figure 1 is from a large 72-chip, 144 processor server, running 143 copies of
the benchmarks. The server has 18 system boards, each with 8 virtual CPUs. In
the graph, the vertical grid delimits system boards. Notice that most copies of
429.mcf completed in about 800 seconds, except for those on the second system
board. Attempts to trace the problem showed that generally a single system board
would be slow, but it was, at first, hard to predict which board. In Figure 2, taken
from a different large server, notice that it is the fourth-from-last board that is slower.
Graphical analysis. Edward R. Tufte suggests that graphs should be used only
if one has large amounts of data needing analysis, and they should contain only
pixels that are essential to the analysis, avoiding “chartjunk” [18]. The situation
at hand has over 14,000 benchmark observations in each 143-copy reportable
SPECfp_rate2006 run, and many more from tuning runs. To ease graphical
analysis, a perl procedure was written that extracted data from log files, drove
gnuplot with what was viewed as a minimal amount of chartjunk (as in the
above graphs), and joined them into a webpage.
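The original analysis used a Perl script driving gnuplot. As a rough, hypothetical equivalent (not the author's script), a Python sketch could extract per-copy runtimes from SPEC log lines of the form "Success 433.milc peak ref ratio=..., runtime=..." (as shown in the divot summary later in this section) and plot them with minimal decoration:

import re
import matplotlib.pyplot as plt

def copy_runtimes(log_path, benchmark):
    # Assumes one "Success ... runtime=" line per copy, in copy order.
    pat = re.compile(r"Success " + re.escape(benchmark) + r" .*runtime=([0-9.]+)")
    with open(log_path) as f:
        return [float(m.group(1)) for m in map(pat.search, f) if m]

times = copy_runtimes("CPU2006.001.log", "433.milc")   # hypothetical log file name
plt.plot(range(len(times)), times, ".", color="black")  # points only, no chartjunk
plt.xlabel("copy")
plt.ylabel("seconds")
plt.savefig("milc_per_copy.png")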
NUMA Hypothesis. Because the graphs showed that problems would tend to
occur on a single system board, and because it is known that local system board
memory access has shorter latency than remote memory, NUMA (Non Uniform
Memory Access) differences were suspected. Solaris supports NUMA using a
concept of Memory Placement Optimization (MPO) [2], which attempts to place
process resources into “latency groups”. A latency group is a set of resources
which are within some latency of each other. Systems can have multiple latency
groups, and multiple levels of groups.
Tools. NUMA activity can be seen on Solaris 10 systems with the opensolaris.org
“NUMA Observability Tools” [15]. Two useful tools are the extended pmap and
lgrpinfo. The first is easily installed from the tools binary distribution:
$ gunzip -c ptools-bin-0.1.7.tar.gz | tar xf -
$ cd ptools-bin-0.1.7/
$ ./pmap -Ls $$ | head -10
Address Bytes Pgsz Mode Lgrp Mapped File
00010000 640K 64K r-x-- 2 /usr/bin/bash
000C0000 64K 64K rwx-- 1 /usr/bin/bash
000E0000 128K 64K rwx-- 1 [ heap ]
FF0F4000 8K 8K rwxs- 1 [ anon ]
In the pmap example above, note that -L tells us the locality group for each
memory segment, and -s displays the page size. (In the interest of space, various
output is truncated in both the examples in this section.)
To install lgrpinfo requires a couple of extra steps, because a customization
is needed for the local version of perl:
$ gunzip -c Solaris-Lgrp-0.1.4.tar.gz | tar xf -
$ cd Solaris-Lgrp-0.1.4/
$ perl Makefile.PL
Writing Makefile for Solaris::Lgrp
$ make
$ make test
All tests successful.
$ su
Password:
# make install
# exit
$ bin/lgrpinfo
lgroup 0 (root):
Children: 1 2
CPUs: 0-127
Memory: installed 130848 Mb, allocated 3924 Mb, free 126924 Mb
Lgroup resources: 1 2 (CPU); 1 2 (memory)
lgroup 1 (leaf):
CPUs: 0-63
Memory: installed 65312 Mb, allocated 1675 Mb, free 63637 Mb
Lgroup resources: 1 (CPU); 1 (memory)
lgroup 2 (leaf):
CPUs: 64-127
Memory: installed 65536 Mb, allocated 2249 Mb, free 63287 Mb
Lgroup resources: 2 (CPU); 2 (memory)
$
In the lgrpinfo example, the output describes a system with 128 virtual pro-
cessors and 128 GB memory, divided into two latency groups. (For the sake of
brevity, this example is from a simpler system than the one in the graphs.)
Diagnosis. Use of pmap showed that the benchmarks running in the slower local-
ity group were receiving memory of the requested page size (4 MB) but not the
desired location. It was also noted that the slow locality group was the one where
the SPEC tool suite itself (runspec) was started. Observations with lgrpinfo
showed that during the benchmark setup phase, when runspec writes 143 run
directories for each of the benchmarks in the suite, physical memory was used
up in runspec’s locality group, apparently for file system caches.
Workarounds attempted. It was hypothesized that the setup phase may have
fragmented memory on runspec’s system board; and that the operating system
might not be able (or, might not be willing) to coalesce fragmented 8 KB pages
into 4 MB pages. Asking for smaller page sizes (such as 64 KB or 512 KB) some-
times appeared to succeed, but this compromise was not considered desirable
since the benchmarks are large enough that 4 MB pages are known to be help-
ful. The size of file system caches was reduced using system tuning parameters
such as bufhwm and segmap_percent, and memory cleanup was encouraged with
reasonably active settings for autoup and tune_t_fsflushr [16]. To improve pre-
dictability, runspec was initiated in a known location, namely the system board
that is also used by Solaris itself, and the amount of physical memory on that
board was doubled.
These workarounds were usually helpful, and memory availability usually im-
proved, but the workarounds were viewed as less than completely satisfactory
on the grounds that in real life, customers may not have the degree of control
that the benchmark tester has.
Colloquially, the problem can be simply summarized as: “Dear Operating
System: If I ask for local bigpages, and you don’t have them handy, please don’t
give me remote bigpages instead. Please try harder to create local bigpages.”
Given this simple summary, a simple suggestion arises: why not just change the
default policy to always try harder?
There are several reasons to hesitate to change the default policy: (1) Co-
alescing pages may be expensive, as it requires relocating pages for running
processes. (2) For processes that run quickly, it may be better to allocate mem-
ory quickly rather than spending extra effort. (3) It is unknown how frequently
the problem may occur in real life: how often do long-running programs ask for
large memory regions with large pages, which are then used intensely enough to
amortize any extra cost required to coalesce pages? Given insufficient data to
[Figure 3: per-copy completion time in seconds (0 to 6000) versus processor number (0 to 144), comparing Peak run 2 and Peak run 3; Peak run 2 has a slow lgroup.]
answer questions such as these, the operating system policies must be approached
with care.
Changes to Solaris. Over the course of the investigations of these issues, the
Solaris development group responded by implementing two changes. First, the
algorithm for coalescing pages was made more efficient. Second, a tunable param-
eter was introduced to allow users to increase the priority of local page allocation:
lpg_alloc_prefer [16]. If you have single-threaded, long-running, large-memory ap-
plications, then consider setting lpg_alloc_prefer=1. This causes Solaris to spend
more CPU time defragmenting memory to allocate local large pages, versus allo-
cating readily available remote large pages. The long term savings from accessing
local rather than remote memory may offset the higher allocation cost.
This tunable parameter is used in the 256 virtual processor Sun SPARC En-
terprise T5440 SPECfp_rate2006 result [9]. When the graphical analysis tools
are applied to this result, NUMA effects are not seen.
Divot summary. An early version of lpg_alloc_prefer was applied to a system in
the middle of a SPECrate run. The effect was to remove a NUMA performance
divot that would sometimes slow down a single system board. The largest effect
was on the benchmark 433.milc, as shown in Figure 3.
Because the tools report the time from start-of-first to finish-of-last, the bot-
tom line improved by 52%:
Success 433.milc peak ref ratio=226.74, runtime=5789.629458
Success 433.milc peak ref ratio=344.64, runtime=3809.035023
Lessons for tuning other systems. Systems that tend to run large single-threaded
programs may benefit from setting lpg_alloc_prefer.
6 Summary
Although it is widely understood that “all software has bugs”, it may
not be as widely understood that all systems have performance divots.
A repeatable, analyzable workload allows divots to be analyzed. Cherish
your divots.
References
1. Carlton, A.: CINT92 and CFP92 Homogeneous Capacity Method Offers Fair Measure of Processing Capacity, http://www.spec.org/cpu92/specrate.txt
2. Chew, J.: Memory Placement Optimization (MPO), http://opensolaris.org/os/community/performance/mpo_overview.pdf
3. Gove, D.: CPU2006 Working Set Size. ACM SIGARCH Computer Architecture News 35(1), 90–96 (2007), http://www.spec.org/cpu2006/publications/
4. Henning, J.L.: SPEC CPU Suite Growth: An Historical Perspective. ACM SIGARCH Computer Architecture News 35(1), 65–68 (2007), http://www.spec.org/cpu2006/publications/
5. McGhan, H.: Niagara 2 Opens the Floodgates. Microprocessor Report (November 6, 2006), http://www.sun.com/processors/niagara/M45_MPFNiagara2_reprint.pdf
6. SPEC CPU2000 published results, http://www.spec.org/osg/cpu2000/results/res2000q2/cpu2000-20000511-00104.html and http://www.spec.org/osg/cpu2000/results/res2000q2/cpu2000-20000511-00105.html
7. SPEC CPU2000 published results, http://www.spec.org/osg/cpu2000/results/res2002q2/cpu2000-20020422-01329.html and http://www.spec.org/osg/cpu2000/results/res2002q1/cpu2000-20020211-01256.html
8. SPEC CPU2006 published results, http://www.spec.org/cpu2006/results/res2008q2/cpu2006-20080408-04064.html
9. SPEC CPU2006 published results, http://www.spec.org/cpu2006/results/res2008q4/cpu2006-20080929-05409.html
10. SPEC CPU2006 Documentation, http://www.spec.org/cpu2006/docs/utility.html#convert_to_development
SPECjvm2008 Performance Characterization

Kumar Shiv, Kingsum Chow, Yanping Wang, and Dmitry Petrochenko

Intel Corporation
{kumar.shiv,kingsum.chow,yanping.wang,
dmitry.petrochenko}@intel.com
1 Introduction
The release of SPECjvm98 [6] as a client side workload stirred up a lot of interest in
performance analysis of Java workloads. Dieckmann and Holzle [1] studied the allo-
cation behavior of the SPECjvm98 Java benchmarks. Radhakrishnan [2] did an in-
depth analysis of micro-architectural techniques to enable efficient Java execution.
The benchmarks were also used to go beyond Java code, as Li and John [3] character-
ized operating system activity in the SPECjvm98 benchmarks. However, modern ma-
chines are too fast [4] for the 10-year-old benchmark, and an overhaul had long been
overdue.
Now ten years later, the release of SPECjvm2008 is expected to stir a lot of interest in
how the latest overhaul of the benchmark is going to enable and encourage Java perform-
ance analysis on modern architectures. The designers of the new SPECjvm2008 have
kept that in mind and the benchmark is intended to take advantage of multiple cores,
higher frequencies, bigger caches, and larger memory bandwidths.
In this work we have performed several experiments with SPECjvm2008.
Our analysis of the workload running on the latest modern processors is intended to
2 Description of SPECjvm2008
The latest advances in processor and Java technologies have necessitated an overhaul of
the SPECjvm98 benchmark [6]. Now, 10 years later, the Standard Performance Evalua-
tion Corporation (SPEC) has updated it with a new version, SPECjvm2008 [7]. An
overview of the comparison between SPECjvm98 and SPECjvm2008 is summarized
in Table 1.
SPECjvm2008 comprises many multithreaded workloads that represent a broad col-
lection of Java applications for both servers and clients. It can be used to evaluate the per-
formance of Java Virtual Machines (JVMs) and the underlying hardware systems. It can
stress various components inside the JVM, such as the Java Runtime Environment
(JRE), Just-in-time (JIT) code generation, the memory management system, threading
and synchronization features. SPECjvm2008 is also designed with modern multicore
processors in mind. A single JVM instance running the workload will generate
enough threads to stress the underlying hardware systems. It is expected to be useful
in the evaluation of many hardware features, such as the impact of the number of cores
and processors, the frequency of the processors, integer and floating-point operations,
cache hierarchy, and memory subsystems.
SPECjvm2008 comes with a set of analysis tools such as a plug-in analysis frame-
work that can gather run time information such as heap and power usage. It also
comes with a reporter that displays a summary graph of test runs. It is easy to config-
ure and run and provides quick feedback for performance analysis. SPECjvm2008 is
perhaps a little biased towards server performance as the minimum memory require-
ment is 512MB per hardware thread.
SPECjvm2008 can be run in 2 modes: base and peak runs. The base run simulates
environments in which users do not tune software to increase performance. No con-
figuration or hand tuning of the JVM is allowed. The base run has fixed run dura-
tions: a 120-second warm-up, followed by a 240-second measurement interval. The
peak run simulates environments in which users are allowed to tune the JVM to in-
crease performance. It also allows feedback optimizations and code caching. The
JVM can be configured to obtain the best score possible, using command line parame-
ters and property files, which must be explained in the submission. In addition, the
peak run has no restrictions on either the warm-up time or the measurement interval.
But only 1 measurement iteration is allowed for each workload. A base submission is
required for a peak submission.
SPECjvm2008 is available for free. It can be downloaded from the SPEC website.
SPECjvm2008 is composed of 11 groups of Java SE applications for both
clients and servers. Each group represents a unique area of Java applications. The
overall score is computed by nested geometric means as described by Richard M Yoo
et al [5].
Score = \sqrt[k]{\prod_{i=1}^{k} \sqrt[n_i]{\prod_{j=1}^{n_i} X_{ij}}}
The overall SPECjvm2008 score is computed by substituting k = 11 and n_1, ..., n_k by
the corresponding numbers of workloads in each group. Each of the 11 groups thus
carries equal weight (one eleventh, under the outer root) in the final composite score.
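Written out as code, the hierarchical mean looks as follows; this is a sketch of the formula above, not the SPEC reporter, and the two-group example reuses scores from Table 3 only to show the shape of the computation.

from math import prod

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

def specjvm2008_score(group_scores):
    # group_scores: one list of per-workload scores (ops/min) per group;
    # the composite is the geometric mean of the group geometric means.
    return geomean([geomean(group) for group in group_scores])

print(specjvm2008_score([[937.23, 1119.25], [614.14]]))  # illustrative two-group call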
The compositions of the 11 groups of workloads are summarized in Table 2.
Tests are run in order, i.e., starting with startup.helloworld and ending with
xml.validation. A new JVM instance is launched for each "startup" workload. After
all the “startup” workloads are run, a single JVM is launched to run the rest of the
workloads, i.e., from compiler.compiler to xml.validation. Thus, the environment left
from running each workload may impact the performance of the workloads coming
after it.
2.1 Startup
The startup group of workloads measures the JVM startup time of each workload by
starting each one of them with a new JVM instance. Each workload in this group is
2.2 Compiler
The compiler group has two workloads: compiler and sunflow. The com-
piler.compiler workload measures the compilation time for the OpenJDK compiler.
The compiler.sunflow workload measures the compilation of the sunflow benchmark.
As the goal of these workloads is to evaluate the performance of the compiler, the
impact of I/O is reduced by storing input data in memory, or file cache.
2.3 Compress
2.4 Crypto
The crypto group contains three workloads representing different impor-
tant areas of cryptography. They test vendor implementations of the protocols as well
as JVM execution. The three workloads are crypto.aes, crypto.rsa and
crypto.signverify.
The crypto.aes workload encrypts and decrypts using the AES and DES protocols,
applying CBC/PKCS5Padding and CBC/NoPadding. The input data sizes are 100
bytes and 713 KB, respectively.
The crypto.rsa workload encrypts and decrypts using the RSA protocol for input
data sizes of 100 bytes and 16 KB.
The crypto.signverify workload signs and verifies using MD5withRSA,
SHA1withRSA, SHA1withDSA and SHA256withRSA protocols for input data sizes
of 1KB, 65KB and 1MB.
Different crypto providers can be used.
2.5 Derby
The derby workload uses an open-source database, derby [8], written in pure Java.
Multiple databases are instantiated when the workload is started. Every 4 threads
share one database instance. Synchronization is exercised in this workload. This work-
load extended IBM’s telco benchmark [12] to synthesize business logic and to stress the
use of the BigDecimal operations. These BigDecimal computations are mostly longer
than 64-bit to examine not only 'simple' BigDecimal, which can be implemented
using the long type, but also BigDecimal values that have to be stored in larger data
sizes. Thus this workload exercises both database and BigDecimal operations.
2.6 Mpegaudio
As the source for the mpegaudio workload from SPECjvm98 cannot be made avail-
able, a new version of mpegaudio was created for SPECjvm2008. It uses the MP3 library
called JLayer [9], an MPEG audio decoder. This workload is floating-point computa-
tion centric. Its input data set contains six MP3 files sized from 20KB to 3MB.
2.7 Scimark
Scimark, as the name implies, is based on the well known Scimark benchmark devel-
oped by NIST [10]. This group of workloads evaluates floating-point operations and
data access patterns for intensive mathematical computations. Scimark is modified for
multi-threading with different dataset sizes in SPECjvm2008.
Scimark is actually composed of two groups in SPECjvm2008, scimark.large and
scimark.small, for large and small data sets. Each thread in the workload consumes
one data set. The “large” group runs with a 32MB data set to simulate out-of-cache
access performance, while the “small” group runs with a 512KB data set to simulate in-
cache access performance.
Each group is composed of 5 workloads: fft, lu, sor, sparse and monte_carlo.
Scimark.monte_carlo is run once but counted in both scimark.large and scimark.small,
as the workload does not work on different data set sizes.
2.7.1 Scimark.FFT
Scimark.fft performs a one-dimensional, in-place Fast Fourier Transform with bit-
reversal and N log(N) complexity for large (2MB) and small (512KB)
data sets.
2.7.2 Scimark.SOR
Scimark.sor simulates Jacobi Successive Over-relaxation for large (2048x2048 grid)
and small (250x250 grid) data sets. It exercises typical access patterns in finite differ-
ence applications. The algorithm exercises basic "grid averaging" memory patterns,
where each A(i,j) is assigned an average weighting of its four nearest neighbors.
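The "grid averaging" pattern can be illustrated with a simplified Jacobi-style update (this is a sketch of the access pattern only; SciMark's SOR also blends in the old value through an over-relaxation factor):

def relax_step(A):
    # Each interior point becomes the average of its four nearest neighbors.
    n, m = len(A), len(A[0])
    B = [row[:] for row in A]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])
    return B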
2.7.3 Scimark.Sparse
Scimark.sparse matrix multiplication uses an unstructured sparse matrix stored in
compressed-row format with a prescribed sparsity structure. It exercises indirection
addressing and non-regular memory references for large and small data sets. The large
data set contains a 200000x200000 matrix in compressed form with 4000000 non-
zeros in it. The small data set contains a 25000x25000 matrix in compressed form
with 62500 non-zeros in it.
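For readers unfamiliar with compressed-row storage, the access pattern being exercised looks roughly like the following (an illustrative sketch, not the benchmark code); the x[col_idx[k]] loads are the "indirection addressing" the text mentions.

def csr_matvec(values, col_idx, row_ptr, x):
    # Row i holds the nonzeros values[row_ptr[i]:row_ptr[i+1]];
    # col_idx[k] gives the column of the k-th stored nonzero.
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y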
2.7.4 Scimark.lu
Scimark.lu computes the in-place LU factorization of a dense matrix using partial
pivoting. It solves a linear system of equations using a prefactored matrix in LU form.
It exercises linear algebra kernels (BLAS) and dense matrix operations for large
(2048x2048) and small (100x100) data sets.
2.7.5 Monte-carlo
Scimark.monte_carlo approximates the value of Pi by computing the integral of the
quarter circle y = sqrt(1 - x^2) on [0,1]. It chooses random points within the unit square
and computes the fraction of them that fall inside the quarter circle. The
algorithm exercises random-number generators, synchronized function calls and func-
tion inlining. This workload is counted once in each of the Scimark large and Scimark
small groups.
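A minimal sketch of the same idea (not the SciMark kernel itself):

import random

def approximate_pi(samples):
    # The fraction of random points in the unit square that land inside the
    # quarter circle x^2 + y^2 <= 1, scaled by 4, approximates Pi.
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(approximate_pi(1_000_000))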
2.8 Serial
2.9 Sunflow
2.10 XML
The XML group contains two workloads: xml.transform and xml.validation. The
xml.transform workload exercises the JAXP implementation by executing XSLT
transformations with DOM, SAX, and Stream sources. It uses the XSLTC engine, which
compiles XSL style sheets into Java classes. Ten real-life use cases are implemented. The
xml.validation workload exercises the JAXP implementation by validating XML instance
documents against XML schemas. Six real-life use cases are implemented.
Both XML workloads have a high object allocation rate and a high level of contended
locks. They also heavily exercise string operations. Each use case has approximately
the same influence on the workload score.
Workload Score
compiler.compiler 937.23
compiler.sunflow 1119.25
Compress 614.14
crypto.aes 214.77
crypto.rsa 2012.82
crypto.signverify 1173.08
Derby 174
mpegaudio 350.44
scimark.fft.large 15.49
scimark.lu.large 5.14
scimark.sor.large 25.99
scimark.sparse.large 18.93
scimark.fft.small 4384.62
scimark.lu.small 4903.85
scimark.sor.small 713.61
scimark.sparse.small 509.41
scimark.monte_carlo 4903.85
Sunflow 195.7
xml.transform 1540.12
xml.validation 1117.91
The measurements reported below were collected on systems running at 2.92 GHz with
a 1066 MHz front-side bus. Each socket has a 2x4MB
last-level cache (LLC). The system had 16 GB of memory. We also used a platform
using pre-release i7 processors for some additional experiments. As the i7 processors
have not yet been released, we are not able to share raw performance numbers at this
time. Nevertheless, we are able to show some interesting data. Also, unless specified
otherwise, the data were collected on the core-2 duo based platform.
On the software side, we used Sun’s Hotspot JVM for Java 6, jre-6u4-perf build X64,
and ran the benchmark with a heap size of 14 GB. The garbage collector was genera-
tional, stop-the-world and parallel, and the JVM allocated data and code into large pages.
The XML components used the Xerces parser from Apache. The Operating System was
Linux, RHEL5.
Table 3 presents one set of baseline data. We have observed a fair amount, occasion-
ally more than 5%, of run-to-run variation, and this will be the cause of some differences
in data in subsequent tables. The score of each component is reported in operations per
minute, and can be seen to vary from a low of 5 ops/min for scimark.lu.large to a high of
4904 ops/min for scimark.lu.small and scimark.monte_carlo. This wide range motivated
the benchmark developers to choose a geometric mean; an arithmetic
mean would clearly have been skewed by the higher-scoring components.
Table 4 looks at the effect of optimizing the benchmark performance through tun-
ing and configuring of the system. SPECjvm2008 facilitates this by requiring the
reporting of two scores, base and peak, whenever peak scores are reported. We see
that the benchmark’s performance is boosted by almost 7%. The performance in-
crease is higher for some components, and is almost 40% for compiler.compiler. A
few components, however, actually lose performance. The same set of configuration
parameters needs to be used for all the components, and the options that work best for
the benchmark as a whole may not be optimal for a few of the components. Bringing
up the workloads with all of the configuration parameters, specifically an option re-
ferred to in Hotspot as AggressiveOpt (which turns on more sophisticated compiler
optimizations in the JIT), now takes longer, thus hurting the performance of the startup group.
Performance degradation is highest for scimark.fft.large. This workload computes
the FFT of a large set of data and has a 2^N stride through the data; the data access
pattern is such that the performance is actually hurt by the use of large pages due to
ineffective cache utilization, and suffers to the extent of 30%. Most of the other com-
ponents do enjoy the use of large pages.
In Table 5, we look at some basic metrics. As points of comparison, the correspond-
ing data for SPECjbb2005 and SPECjAppServer2004 are also included. It can be seen
SPECjAppServer2004 for comparison. Table 6 presents this data, and we can immediately
see that apart from two components, compiler.compiler and compiler.sunflow, the rest place
very little demand on the garbage collection infrastructure of the JVM. Two of the compo-
nents in fact make no demands on the garbage collector at all, and eleven others spend less
than 0.1% of time in garbage collection. SPECjbb2005 on the other hand spends more than
2% of time in GC and SPECjAppServer2004 spends 7.5% of time in GC.
Not surprisingly, the object allocation data shows the same pattern, with com-
piler.compiler and compiler.sunflow having allocation rates lying between the alloca-
tion rates of SPECjbb2005 and SPECjAppServer2004, while most other components
have relatively low allocation rates. Four components, though, diverge from this pat-
tern, and show high allocation rates and low garbage collection usage. Since the rate
at which GC is invoked is directly related to the allocation rate, it follows that these
components are spending less time in GC, because each garbage collection goes faster
for these components. We can theorize that these components have far fewer live
objects when GC is invoked, but we have not yet fully tested this theory. Derby, for
instance, has a high object allocation rate due to the frequent allocation of immutable
long BigDecimal objects, which do not stay alive very long.
Turning our attention to hardware performance metrics, we look at the CPI and Path-
length (Instructions Retired per Operation) for each component of SPECjvm2008, and
once again provide the data for SPECjbb2005 and SPECjAppServer2004 as well. The
CPI data shows a very wide range all the way from 0.35 for a couple of the scimark
workloads, sparse.small and monte_carlo, to 37 for scimark.fft.large. While the range is
large, only a few of the components have CPI values that are close to values seen for the
established benchmarks, SPECjbb2005 and SPECjAppServer2004.
Pathlength, the number of instructions executed per benchmark operation, shows a
similarly wide range from 860 million instructions for scimark.fft.small to 35 billion
instructions for scimark.lu.large. It is interesting to note that the SPECjvm2008 com-
ponent pathlengths are much larger than the pathlengths of SPECjAppServer2004 and
SPECjbb2005. The developers of the new benchmark have defined each compo-
nent’s operation to be a bundle of the underlying component transactions thus leading
to significantly higher pathlengths.
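The two metrics combine in the standard way; stated here for orientation (this is the usual first-order model, not a quote from the paper), the time per benchmark operation is approximately

t_{op} \approx \frac{\text{Pathlength} \times \text{CPI}}{f_{clock}}

so, at a given clock frequency, throughput in operations per minute is inversely proportional to the product of Pathlength and CPI.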
We next look deeper at the CPI data. Since there is a wide range of CPI, and high
values of CPI for workloads are frequently due to a strong memory dependency, we
compared the memory requirements for each component, and we present that data in
Table 8. Interestingly, while there is indeed a correlation that can be seen, and several
of the workloads with lower memory bandwidth requirements have lower CPIs,
the workloads with the highest CPIs display very low bandwidth demand. Specifi-
cally, scimark.fft.large has a CPI of 37 and memory bandwidth requirement of 9
MB/s. SPECjbb2005, as a point of comparison, has a CPI of 1.22 and a memory
bandwidth requirement of 6 GB/s. This is true to a somewhat lesser extent for
scimark.sor.large, scimark.lu.large, and scimark.sparse.large. These workloads do not
have a high CPI because of excessive memory bandwidth demands. However, this
does not necessarily rule out memory latency as a cause of their high CPI.
Intel processors provide a range of performance event counters. In Table 9 we ex-
amine some key metrics, last-level cache (LLC) MPI (Misses per Instruction), ITLB
and DTLB misses, number of floating-point instructions, and the HITM metric to
understand the sharing behavior of each component.
The LLC MPI data gives us a clear pointer to the cause of the very high CPIs suf-
fered by four of the scimark components. Scimark.fft.large, the workload with the
highest CPI, has an MPI of 0.05, or 1 cache miss every 20 instructions, a rate that is
approximately 20 times the rate of cache misses in SPECjbb2005 and
SPECjAppServer2004. The memory latency seen by these cache misses causes the
high CPI. The high CPI restricts the performance of the workload strongly, and the
resulting low throughput creates an appearance of low memory bandwidth require-
ment. The performance of these four workloads is therefore strongly dependent on
memory latency. Of the remaining components, several have negligible cache misses,
while the few (compiler.*, xml.*, sunflow, derby) with moderate CPI have MPIs of
the same order of magnitude seen in SPECjbb2005 and SPECjAppServer2004. It is
not surprising that Derby, with its high allocation rate of immutable BigDecimal ob-
jects, has a significant MPI of 0.0057.
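The link between MPI and CPI can be made explicit with the usual first-order memory-stall model (an approximation added for orientation, not a measurement from the paper):

\text{CPI} \approx \text{CPI}_{core} + \text{MPI}_{LLC} \times L_{mem}

where L_mem is the average miss penalty in core cycles. With an MPI of 0.05, a miss penalty of a few hundred cycles by itself contributes tens of cycles per instruction, which is consistent with the CPI of 37 reported for scimark.fft.large.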
One criticism that can perhaps be leveled at SPECjbb2005 and
SPECjAppServer2004 is the low usage of floating-point. Some of the components of
SPECjvm2008 on the other hand can be seen to have significant levels of floating-
point usage. Derby, especially, has a floating-point instruction usage rate of 0.01, or 1
out of every 100 instructions.
Most of the components have small code footprints. Once again, Derby stands out
as the exception, suffering an ITLB miss every 2500 instructions. None of the work-
loads face much DTLB pressure. Some of the DTLB-miss metrics look high until we
recall the high pathlengths of these workloads.
Both SPECjbb2005 and SPECjAppServer2004 have negligible HITM rates indicat-
ing low sharing of data between the LLCs on the sockets. While these benchmarks
inherently have low sharing, the HITM metric is also lowered by the benchmarks
being run with multiple JVMs, 1 JVM per LLC. SPECjvm2008 run-rules preclude
the use of multiple JVMs which allows us to see the level of sharing amongst the
threads. The more significant cases are the components which have both higher MPI
and high HITM rates. Derby, the xml workloads, and some of the scimark compo-
nents, all exhibit high levels of cache-to-cache memory traffic.
As the number of cores in a chip continues to increase, the number of cores in even
clients and small servers is increasing rapidly. Therefore the scaling of these work-
loads with the number of processors is of some interest.
Table 10 presents the scaling data, by showing the relative performance of each
component to its performance on 1 processor. Since our system has 16 processors,
we have tested the scaling from 2 to 16 processors.
The start-up time of the workloads is unaffected by the number of processor cores
available, since much of the JVM initialization code is single threaded. Several other
workloads, such as compress, crypto.* and mpegaudio exhibit excellent scaling, while
a few show super-linear behavior. Since there is sufficient run-to-run variation, this
data should be treated with caution. While these data points may well be noisy, it
must also be noted that this kind of scaling is not theoretically impossible; the avail-
ability of more cores allows the JVM to use more threads for compilation and optimi-
zation, and this can allow the generation of better code. This, we must emphasize, is
just a theory. This benchmark is still new, and it will take some time and additional
experiments to filter out the noisy data.
Intel recently announced that it would release a new Xeon micro-architecture, the
i7. We performed a few experiments on a pre-release platform and present those re-
sults next. Table 11 shows that the ratio between Peak and Base for the i7 is similar to
that seen with the Core 2 Duo in most respects. One notable exception is
scimark.fft.small, which now suffers a 14% degradation whereas our earlier results
showed a 7% gain. This workload is sensitive to data layout, and data layout changes
have different effects because the two processors have very different cache architectures.
The Core2 has a two-level cache system while the i7 has a three-level cache. The
second level cache on the i7 is much smaller (only 256K) relying on the large (8M)
third level cache to reduce accesses to memory. However, as a result, the cost of
accessing a line from the last-level third level cache in i7 is more than the cost of
accessing a line from the second level cache. For most workloads, the bigger third
Peak/Base
compiler.compiler 1.448
compiler.sunflow 1.096
compress 1.036
crypto.aes 0.997
crypto.rsa 1.000
crypto.signverify 1.003
derby 1.103
mpegaudio 1.013
scimark.fft.large 0.628
scimark.lu.large 1.029
scimark.sor.large 1.033
scimark.sparse.large 1.257
scimark.fft.small 0.859
scimark.lu.small 1.005
scimark.sor.small 1.001
scimark.sparse.small 0.997
scimark.monte_carlo 1.661
serial 1.112
sunflow 1.087
xml.transform 1.015
xml.validation 1.155
Composite Score 1.073
SMT Gain
compiler.compiler 1.161
compiler.sunflow 1.173
Compress 1.254
crypto.aes 1.387
crypto.rsa 1.189
crypto.signverify 1.059
Derby 1.398
Mpegaudio 1.205
Scimark.fft.large 1.061
Scimark.lu.large 1.043
Scimark.sor.large 1.508
Scimark.sparse.large 1.085
Scimark.fft.small 1.018
Scimark.lu.small 0.890
Scimark.sor.small 1.925
Scimark.sparse.small 1.039
Scimark.monte_carlo 1.011
Serial 1.184
Sunflow 1.254
xml.transform 1.199
xml.validation 1.219
Composite Score 1.216
Freq Gain
compiler.compiler 1.044
compiler.sunflow 1.041
compress 1.045
crypto.aes 1.055
crypto.rsa 1.054
crypto.signverify 1.048
derby 1.045
mpegaudio 1.046
scimark.fft.large 1.214
scimark.lu.large 1.043
scimark.sor.large 1.040
scimark.sparse.large 1.055
scimark.fft.small 1.000
scimark.lu.small 1.009
scimark.sor.small 1.048
scimark.sparse.small 1.048
scimark.monte_carlo 0.880
serial 1.047
sunflow 1.052
xml.transform 1.050
xml.validation 1.042
Composite Score 1.042
memory-dependent than the other components benefits fully from the frequency increase.
The i7 platform uses QPI (QuickPath Interconnect) and has lower memory access laten-
cies. Effective memory latency is also reduced by improved hardware prefetchers.
change the reported SPECjvm2008 performance by just 6%. We expect therefore that keen
interest will be focused on individual component scores as much as the reported score.
The workloads in SPECjvm2008 present many opportunities for the JVM to im-
prove code generation, threading, memory management, and lock algorithm tuning.
Many such changes could impact all components though to different degrees. For
example, improvements in object allocation will benefit all, but some components like
Derby will benefit more. Other changes will benefit in a more localized manner. Of
particular interest is floating-point behavior. Previous benchmarks had not stressed
floating-point, and there was no generally-accepted way of studying the platform and
JVM improvements in this regard. Workloads like Derby should mitigate that. Similar
comments can be made about string and XML behavior due to the XML components.
Many workloads in SPECjvm2008 can also be used to evaluate current and future
hardware features especially on memory subsystem and lock optimization. Our con-
clusion based on our first analysis of this new benchmark is that it appears to be a
valuable addition to our toolkit. While it cannot replace SPECjbb2005 or
SPECjAppServer2004, and it may never be as important and as representative as
those two, it provides behavior that is different enough to make it attractive to the
performance analyst.
References
1. Dieckmann, S., Holzle, U.: The allocation behavior of the SPECjvm98 Java benchmarks.
In: Performance evaluation and benchmarking with realistic applications, pp. 77–108. MIT
Press, Cambridge (2001)
2. Radhakrishnan, R.: Microarchitectural Techniques to Enable Efficient Java Execution, Ph.
D. Dissertation, University of Texas at Austin (2000)
3. Li, T., John, L.K.: Characterizing Operating System Activity in SPECjvm98 Benchmarks.
In: John, L.K., Maynard, A.M.G. (eds.) Characterization of Contemporary Workloads, pp.
53–82. Kluwer Academic Publishers, Dordrecht (2001)
4. Excelsior JET Benchmarks,
http://web.archive.org/web/20071217043141,
http://www.excelsior-usa.com/jetbenchspecjvm.html
5. Yoo, R.M., Lee, H.-H.S., Lee, H., Chow, K.: Hierarchical Means: Single Number Bench-
marking with Workload Cluster Analysis. In: IEEE International Symposium on Workload
Characterization (IISWC 2007), Boston, MA, USA, September 27-29 (2007)
6. SPECjvm98 Benchmarks, http://www.spec.org/jvm98/
7. SPECjvm2008 Benchmarks, http://www.spec.org/jvm2008
8. Apache derby, http://db.apache.org/derby/
9. JLayer, http://www.javazoom.net/javalayer/javalayer.html
10. Scimark 2.0 Benchmark, http://math.nist.gov/scimark2/
11. Sunflow, http://sunflow.sourceforge.net/
12. IBM Telco Benchmark,
http://www2.hursley.ibm.com/decimal/telco.html
13. SPECjAppServer2004 Benchmark, http://www.spec.org/jAppServer2004
14. SPECjbb2005 Benchmark, http://www.spec.org/jbb2005
Performance Characterization of Itanium® 2-Based Montecito Processor
1 Introduction
The Itanium 2 family of processors, including the Itanium 2-based dual core
(also known as Montecito), provides a fast, wide, in-order execution core
coupled to a fast, wide, out-of-order memory sub-system and system interface
[1]. The processor has two dual-threaded cores integrated on die with more than
26.5MB of cache in a 90nm process with 7 layers of copper interconnect. Other
improvements over its predecessor include the integration of 2 cores on-die, each
with a dedicated 12MB L3 cache, a 1MB L2I cache and dual-threading [2]. In
this paper we analyze the key features of Montecito’s microarchitecture which
yield better performance than its predecessor (Madison) on both integer and
floating-point applications.
The main contributions of the paper are as follows:
given in Table 1. The OS was Red Hat Enterprise Linux AS release 4 (Nahant Update 3),
kernel 2.6.9-36.EL #1 SMP. The benchmarks were compiled using the Intel
Fortran/C++ optimizing compiler (version 9.1). The compiler supports a
wide variety of optimizations such as software pipelining, predication, software
prefetching and whole-program optimizations.
The events monitored for each metric such as IPC (instructions per cycle) are
listed at the start of the corresponding sections. The event monitoring process is
non-intrusive, as it is built into the hardware and does not require any special setup.
The data collected provides valuable insights into system behavior, especially the
role played by buses, I/O and disk, which are typically not modeled in simulators.
The rest of the paper is organized as follows: Section 2 presents an overview
of the Montecito microarchitecture. Sections 3–10 provide in-depth performance
characterization results for Montecito and compare it with the previous-generation
Madison processor. Finally, we conclude in Section 11.
2 Processor Description
In the following subsections we briefly introduce the core and then the memory
sub-system of Intel's Montecito processor. A high-level block diagram of
Montecito is shown in Figure 1.
The front-end, with two levels of branch prediction, two levels of translation
look-aside buffers (TLBs) and a zero-cycle branch predictor, feeds two bundles
(with 3 instructions each) into the 8 bundle deep instruction buffer every cy-
cle. Instruction fetch and branch prediction require only two pipe-stages (the
Montecito pipeline is shown in Figure 2) — the IPG and ROT stages.
The instruction buffer allows the front-end to continue to deliver instructions to the back-end even when the back end is stalled, and can be completely bypassed, adding no pipe stages to execution. The instruction buffer delivers two bundles of any alignment to the remaining six pipeline stages. The dispersal logic determines issue groups from the two oldest bundles in the instruction buffer and allocates up to six instructions to the 11 available functional units (two integer, four memory, two floating point, and three branch) in the EXP stage. These instructions form an issue group and travel down the back-end pipeline and experience stall conditions in unison.
The register renaming logic maps virtual registers in the instruction to physical registers in the REN stage to support software pipelining and stacked registers, which are managed by the Register Stack Engine (RSE), which provides a seemingly unlimited register stack.
Fig. 1. Block diagram of a single core of Montecito
Fig. 2. Montecito main pipeline stages: IPG ROT EXP REN REG EXE DET WRB (floating-point stages FP1–FP4)
Some instructions may fault or trap, while branch instructions may be mis-predicted.
Fig. 3. IPC
On a TLB miss, the hardware page walker (HPW) searches the cache and the memory to obtain the page entry. If the HPW does not find the page, it will generate a page fault.
3 IPC
The IPC (instructions per cycle) value signifies the amount of instruction level
parallelism (ILP) that can be achieved using a given compiler and processor. The
IPC was computed by taking the ratio of the counts of the following hardware
performance counters (a short sketch of this computation follows their descriptions):
IA64 INST RETIRED: This event counts the number of retired Itanium instructions.
This also includes NOP instructions and instructions that were squashed because their
predicate was off. We subtract both (measured using the counters NOPS RETIRED and
PREDICATE SQUASHED RETIRED) to compute the effective IPC.
CPU OP CYCLES: This event counts the number of CPU operating cycles.
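A minimal sketch of the effective-IPC computation described above; the function and variable names are ours, and the counter values are placeholders rather than measured data:

    def effective_ipc(inst_retired, nops_retired, pred_squashed_retired, cpu_op_cycles):
        # Effective IPC: retired instructions minus NOPs and predicated-off
        # (squashed) instructions, divided by CPU operating cycles.
        useful = inst_retired - nops_retired - pred_squashed_retired
        return useful / cpu_op_cycles

    # Hypothetical counter readings:
    print(effective_ipc(2_000_000, 300_000, 50_000, 900_000))  # ~1.83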
From Figure 3 we observe that Montecito achieves higher IPC than its prede-
cessor Madison, across the entire CPU2006 suite. To compare the performance
of Montecito and Madison, we first compute the ratio of IPC on Montecito and
Madison for each benchmark and then compute the geometric mean of the ra-
tios. Our analysis shows that Montecito achieves 1.14× and 1.16× higher IPC on
CINT2006 and CFP2006 respectively. The higher IPC value can be attributed
to a number of reasons: larger caches and other cache-related microarchitectural
enhancements, discussed further in Section 4, and better TLB performance, dis-
cussed further in Section 6. The low IPC value of applications such as 429.mcf,
471.omnetpp, 450.soplex and 459.GemsFDTD can, in part, be ascribed to the
large number of L3 cache misses (see Figure 4). Also, note that in applications
such as 456.hmmer and 444.namd, an IPC of more than 4 is achieved.
4 Cache Performance
Fig. 4. Number of L1D, L2D and L3 data misses per 1000 retired instructions
Recall that Montecito has a unified L3 cache. Therefore, while evaluating the L3
performance with respect to the data stream, it is critical to measure only the
L3 misses which occur due to data reads and writes. We measured
the performance of the data cache using the following hardware performance
counters:
❶ L1D READ MISSES.ALL: This event counts the number of L1D read misses.
❷ L2D MISSES: This event counts the number of L2D misses (in terms of the
L2D cache line requests sent to L3).
❸ L3 READS.DATA READ.MISS: This event counts the number of L3 load misses.
❹ L3 WRITES.DATA WRITE.MISS: This event counts the number of L3 store
misses (excludes L2D write backs, includes L3 read for ownership requests
that satisfy stores).
The total number of L3 data misses is computed as the sum of ❸ and ❹. This
does not include the L3 instruction misses. From Figure 4 we observe that, on
average, Montecito incurs fewer data cache misses than Madison at any level of
the cache hierarchy. This can be attributed to the larger caches on Montecito.
For example, a reduction of up to 1.38× (429.mcf) in L1D misses and up to
1.76× (470.lbm) in L2D misses is achieved. In general, we note that the reduction
in data cache misses is higher in CINT2006 than in CFP2006. Even then, the L1D
miss rate is higher in CINT2006 than CFP2006.
1 L3D in the figure refers to (unified) L3 data read and write misses. It should not be
interpreted as misses corresponding to a separate L3 data cache.
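To make the miss metric in Figure 4 concrete, the following sketch computes misses per 1,000 retired instructions from the four counters above; the names are illustrative, not actual tool output:

    def misses_per_1k_inst(l1d_read_misses, l2d_misses,
                           l3_data_read_misses, l3_data_write_misses,
                           retired_insts):
        # Total L3 data misses = L3 read misses + L3 write misses
        # (instruction-side L3 misses are excluded, as in the text).
        l3_data_misses = l3_data_read_misses + l3_data_write_misses
        per_1k = lambda m: 1000.0 * m / retired_insts
        return {"L1D": per_1k(l1d_read_misses),
                "L2D": per_1k(l2d_misses),
                "L3D": per_1k(l3_data_misses)}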
the issue logic should look for new operations to send down the L2 pipeline.
A consequence of this head and tail organization is that holes may appear in
the OzQ from operations that have issued (OzQ entries between head and tail
that are no longer valid). The OzQ is not compressed when these holes develop.
Without compression, these holes are not available to new L1D requests. Thus,
there may be instances where the OzQ control logic indicates that there is no
more room for new L1D requests, despite the fact that only a few OzQ entries
are valid. Every cycle the L2 OzQ searches 16 requests, starting at head, for
requests to issue to the L2 data array (L2 hits), the system bus/L3 (L2 misses),
or back to the L1D for another L2 tag lookup (recirculate).
The L2 OzQ control logic allocates up to four contiguous entries per cycle
starting from the last entry allocated in the previous cycle (the tail). If there
are too few entries available (between 4 and 12), the L1D pipeline is stalled to
prohibit any additional operations being passed to the L2. Requests are removed
from the L2 OzQ when they complete at the L2 - that is when a store updates
the data array, or when a load returns correct data to the core, or when an L2
miss request is accepted by the system bus/L3.
Whenever the OzQ is full, there is an increased L2D back pressure which
results in back-end stalls. Figure 5 reports the time (as percentage of the total
execution time) for which OzQ was full. We measured the number of times the
L2D OzQ was full using the L2D OZQ FULL hardware performance counter. From
the figure we see that the OzQ is rarely (less than 2% of the time, on average) full in the case of
CINT2006. On the contrary, in the case of applications such as 410.bwaves and
433.milc of CFP2006, the OzQ is full for more than 50% of the total execu-
tion time. This results in a high percentage of data stalls, see Figure 12, which
adversely affects the overall performance. Support for elimination/minimization
of the number of holes in the OzQ can potentially reduce the number of data
stalls. Alternatively, a larger OzQ may yield better performance.
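The effect of uncompressed holes can be illustrated with a toy model of the allocation policy described above; the queue size and stall threshold below are illustrative parameters, not the actual hardware values:

    class ToyOzQ:
        # Toy model of a non-compressing queue: requests are allocated
        # contiguously at the tail; entries that complete out of order become
        # holes whose slots cannot be reused until the head advances past them.
        def __init__(self, size=32, stall_threshold=8):
            self.size = size
            self.stall_threshold = stall_threshold
            self.valid = []            # entries between head and tail (True = live)

        def free_slots(self):
            # Holes inside [head, tail) do not count as free space.
            return self.size - len(self.valid)

        def would_stall(self):
            # The L1D pipeline stalls when too few slots remain for new requests.
            return self.free_slots() < self.stall_threshold

        def allocate(self, n):
            assert n <= self.free_slots()
            self.valid.extend([True] * n)

        def complete(self, index):
            # Completion leaves a hole; space is reclaimed only when the
            # oldest (head) entries are complete.
            self.valid[index] = False
            while self.valid and not self.valid[0]:
                self.valid.pop(0)      # head advances over completed entries

    q = ToyOzQ()
    q.allocate(28)
    for i in range(1, 28):             # everything but the oldest entry completes
        q.complete(i)
    print(q.free_slots())              # 4: the 27 holes are unusable
    print(q.would_stall())             # True, although only one entry is still valid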
Stores that miss in the L2 record data in the 24 entry L2 Oz Data buffer and their
address in the OzQ. The data needs to be merged with the 128 bytes delivered
from the L3/system interface.2 When the buffer is full for a missing store request,
the processor stalls until entries can be freed. We measured the number of times
the L2D Oz data buffer was full using the L2D OZD FULL hardware performance
counter. From Figure 5 we note that the OzD buffer is rarely (less than 1% of the
time, on average) full in both CINT2006 and CFP2006. This suggests that the OzD
buffer is not a performance bottleneck for CPU2006.
Fig. 7. Number of L1I, L2I and L3 data misses per 1000 retired instructions
The victim buffer holds L2 dirty victim data until it can be issued to the
L3/system interface. Operations are issued, up to four at a time, to access the L2
data array when the conflicts are resolved and resources are available. The buffer
can hold up to 16 entries. If the buffer is full for a request that misses the L2, the
request will recirculate. This in turn increases the L2D back pressure and can
cause back-end stalls. We measured the number of times the L2D victim buffer
was full using the L2D VICTIMB FULL hardware performance counter. From Fig-
ure 5 we note that the victim buffer is rarely (less than 1% of the time) full in both
CINT2006 and CFP2006. From this we conclude that the victim buffer becoming full
does not impact the overall performance in a significant manner.
Recall that Montecito has a unified L3 cache. Therefore, while evaluating the
L3 performance w.r.t. the instruction stream, it is critical to measure the L3
misses which occur due to instruction reads only. We measured the performance
of the instruction cache using the following hardware performance counters:
➀ L2I DEMAND READS: This event counts the number of L1I and ISB (instruc-
tion stream buffer) misses regardless of whether they hit or miss in the RAB
(Request Address Buffer).
2 Assume there is an L2 miss and an L3 hit. The L3 cache line size is 128 bytes. The
memory system reads 128 bytes out of the L3, merges the data from the L2 Oz data
buffer, and writes it back to the L3.
➁ L2I PREFETCHES: This event counts the number of prefetch requests issued
to the L2I.
➂ L2I READS.MISS.ALL: This event counts the fetches which miss the L2I-cache.
➃ L3 READS.ALL.MISS: This event counts the L3 read misses.
➄ L3 READS.DATA READ.MISS: This event counts the number of L3 load misses.
The L1I misses are computed as the sum of ➀ and ➁, whereas the L3 instruction
misses are computed as the difference of ➃ and ➄. From Figure 7 we see that
the integer programs incur a higher number of L1I misses than the floating-point
programs. This is due to the fact that integer codes are very control-flow intensive
and thus very irregular in nature, which results in higher instruction cache misses.
Except for 483.xalancbmk, the number of L2I misses is negligible in both CINT2006
and CFP2006. This is primarily due to the presence of the large L2I cache (1MB,
see Table 1 for the detailed configuration).
5 Data Speculation
Itanium supports data speculation for scheduling a load in advance of one or more
stores. The advanced load records information including memory address, size and
target register number into a hardware structure, the Advanced Load Address Table
(ALAT). Consider the example below, assuming the compiler does not have enough
information about whether or not the addresses of g and h overlap. In this case one
can use data speculation to hoist the load above the store.

Source:
    int *g;
    int *h;
    foo() {
      int t;
      ...
      *g = 1;
      t = *h + 1;
      ...
    }

Assembly:
    ld4.a  rx = [ra] ;;        // advanced (hoisted) load of *h
    add    ry = rx, 1          // t = *h + 1
    ...
    st4    [rb] = 1            // *g = 1
    chk.a  rx, rec_code        // check the advanced load against the ALAT
    resume: ...

    rec_code:                  // recovery code: redo the load and the add
    ld4    rx = [ra] ;;
    add    ry = rx, 1
    br     resume ;;

We measured the data misspeculation rate using the following counters:
INST FAILED CHKA LDC ALAT.ALL: This provides information on the number of
failed advanced check load (chk.a) and check load (ld.c) instructions that reach
retirement.
INST CHKA LDC ALAT.ALL: This provides information on the number of all advanced
check load (chk.a) and check load (ld.c) instructions that reach retirement.
Figure 8 shows the data misspeculation percentage for CPU2006 on Montecito.
From the figure we see that only two applications, viz., 435.gromacs and
454.calculix, incur a data misspeculation rate of more than 5%. On average,
CINT2006 and CFP2006 incur data misspeculation rates of 0.65% and 2.62%
respectively. Since chk.a and ld.c constitute only 0.28% and 0.46% of the to-
tal number of retired instructions in CINT2006 and CFP2006 respectively, data
misspeculation does not play a key role in determining the overall performance.
6 TLB Performance
Akin to other processor parameters, the TLB performance is also directly de-
pendent on the nature of the applications [5]. In this section we report the data
and instruction TLB performance using CPU2006.
6.1 DTLB
This subsection compares the DTLB performance of Montecito and Madison.
Both processors have a 2-level DTLB. We measured the performance of each
DTLB level using the following hardware performance counters (a short sketch
combining the two follows their descriptions):
L1DTLB TRANSFER: This event counts the number of times an L1DTLB miss hits
in the L2DTLB for an access counted in L1D READS.
L2DTLB MISSES: This event counts the number of L2DTLB misses (which is the
same as references to HPW (hardware page walker); DTLB HIT=0) for demand
requests [6].
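A hedged sketch of one plausible way to combine the two counters above into per-level DTLB misses per 1,000 retired instructions (the names are ours):

    def dtlb_misses_per_1k(l1dtlb_transfer, l2dtlb_misses, retired_insts):
        # L1DTLB TRANSFER counts L1DTLB misses that hit in the L2DTLB;
        # L2DTLB MISSES counts misses that go to the hardware page walker,
        # so their sum approximates all L1DTLB misses for these accesses.
        l1dtlb_misses = l1dtlb_transfer + l2dtlb_misses
        scale = 1000.0 / retired_insts
        return {"L1DTLB": l1dtlb_misses * scale,
                "L2DTLB": l2dtlb_misses * scale}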
6.2 ITLB
This subsection compares the ITLB performance of Montecito and Madison. The
Itanium 2-based Montecito has a 2-level ITLB. We measured the performance
of each ITLB level using the following hardware performance counters:
ITLB MISSES FETCH.L1ITLB: This event counts the number of misses in L1ITLB,
even if L1ITLB is not updated for an access (Uncacheable/nat page/not present
page/faulting/some flushed).
ITLB MISSES FETCH.L2ITLB: This event counts the total number of misses in
L1ITLB which also missed in L2ITLB.
Unlike the DTLB, from Figure 10 we note that the ITLB performance of Montecito
is the same as that of Madison for both CINT2006 and CFP2006. Akin to the
DTLB behavior, we observe that integer applications incur a higher number of
ITLB misses than floating-point applications. This suggests that integer
8 Stalls
In this section we analyze the relative impact of the various resource and data
stalls.
❐ Data stalls correspond to full pipe bubbles in the main pipe caused by the
L1D or the execution unit (discussed further in the next subsection).
❐ RSE stalls correspond to full pipe bubbles in the main pipe caused by the
Register Stack Engine. We measured this using the BE RSE BUBBLE.ALL hard-
ware performance counter.
❐ Branch misprediction stalls correspond to full pipe bubbles in the main pipe
due to flushes. We measure this using the BE FLUSH BUBBLE.ALL hardware
performance counter.
❐ Front end stalls in the figure correspond to full pipe bubbles in the main
pipe due to the front end. The front end can in turn be stalled due to the
following reasons: FEFLUSH, TLBMISS, IMISS, branch, FILL-RECIRC,
BUBBLE, IBFULL (listed in priority from high to low). We measured this
using BACK END BUBBLE.FE hardware performance counter.
❐ Scoreboarding corresponds to full pipe bubbles in the main pipe due to the
FPU. We measured the above using the BE L1D FPU BUBBLE.ALL hardware
performance counter.
From the figure we see that data stalls are most prominent amongst the different
types of stalls mentioned above. More importantly, note that in applications
such as 429.mcf, 471.omnetpp and 433.milc, data stalls account for more than
50%(!) of the total execution time. This can, in part, be attributed to their high
L3 cache miss rate (refer to Figure 4). This highlights the high sensitivity of the
performance of the emerging applications, represented by CPU2006, w.r.t. the
cache sub-system. Further, we note that stalls due to branch mispredictions are
second to data stalls. Specifically, the stalls due to branch mispredictions account
for 5% and 1.6% of the total execution time, on an average, in CINT2006 and
CFP2006. On the other hand, front end stalls account for 4.6% and 1.4% of the
total execution time, on an average, in CINT2006 and CFP2006.
Such stalls can potentially occur for several reasons: the store buffer being full, a
recirculate, a hardware page walker, a store in conflict with a returning fill, L2D
back pressure, or an L2DTLB to L1DTLB transfer.
Register load stalls were measured using the following hardware performance
counters and are computed as ①-②+③ (a short sketch of this combination follows
the counter descriptions):
① BE EXE BUBBLE.GRALL: This corresponds to the case when the back-end was
stalled by EXE due to GR/GR or GR/load dependency.
② BE EXE BUBBLE.GRGR: This corresponds to the case when the back-end was
stalled by EXE due to GR/GR dependency.
③ BE EXE BUBBLE.FRALL: This corresponds to the case when the back-end was
stalled by EXE due to FR/FR or FR/load dependency.
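A minimal sketch of the ①-②+③ combination above (names are ours):

    def register_load_stall_cycles(grall, grgr, frall):
        # BE EXE BUBBLE.GRALL - BE EXE BUBBLE.GRGR leaves only GR/load
        # dependency stalls; BE EXE BUBBLE.FRALL (FR/FR or FR/load
        # dependencies) is then added on top.
        return grall - grgr + frall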
Other data stall components were measured using the following hardware performance
counters:
BE L1D FPU BUBBLE.L1D HPW: This measures the back-end stalls due to Hardware
Page Walker.
BE L1D FPU BUBBLE.L1D PIPE RECIRC: This measures the back-end stalls due to
a recirculate. The most predictable reason for a request to recirculate is that the
request misses a line that is already being serviced by the system bus/L3, but has
not yet returned to the L2. The L2 only retires L2 hits and primary L2 misses to
an L2 line. It does not retire multiple L2 miss requests; additional misses remain
in the L2 OzQ and recirculate until the tag lookup returns a hit. The request
then issues from the L2 OzQ and returns data (for a load) or updates the array
(for a store) as a normal L2 hit request.
BE L1D FPU BUBBLE.L1D L2BPRESS: This measures the back-end stalls due to
L2D Back Pressure (L2BP).
BE L1D FPU BUBBLE.L1D TLB: This measures the back-end stalls due to L2DTLB
to L1DTLB transfer.
Note that the various components of data stalls are not mutually exclusive.
In other words, there may be overlap between the different components. From
Figure 13 we note that the register load stalls dominate CINT2006, except
for 462.libquantum and 456.hmmer, in which recirculates dominate the data
stalls. On the other hand, in CFP2006, 11 out of 17 benchmarks are dominated
by register load stalls while others such as 433.milc and 459.GemsFDTD are
dominated by either the L2BP and/or L2 recirculates. The latter stems from a
high number of L3 data cache misses (see Figure 4). From the data we conjecture
that applications in which register-load stalls are not the dominating component
are memory bandwidth bound.
9 Branch Prediction
The Itanium 2 processor's branch prediction relies on a two-level
prediction algorithm and two levels of branch history storage. The first level
of branch prediction storage is tightly coupled to the L1I cache. This coupling
allows a branch's taken/not-taken history and a predicted target to be delivered
with every L1I demand access in one cycle. The branch prediction logic uses
the history to access a pattern history table and determine a branch's final
taken/not-taken prediction, or trigger, according to the Yeh-Patt algorithm [8].
The L2 branch cache saves the histories and triggers of branches evicted from
the L1I so that they are available when the branch is revisited, providing the
second storage level.
We measured the branch misprediction rate using the following hardware per-
formance counters:
BR MISPRED DETAIL.ALL.ALL PRED: This event counts the number of retired
branches, regardless of the prediction result. We denote this by ➀.
BR MISPRED DETAIL.ALL.CORRECT PRED: This event counts the number of cor-
rectly predicted (both outcome and target) retired branches. We denote this
by ➁.
The branch misprediction percentage is computed as follows:
    Branch Misprediction % = (➀ − ➁) / ➀ × 100
Figure 14 shows the branch misprediction percentage on Montecito and Madi-
son for the applications in CPU2006. From the figure we see that, as expected,
CINT2006 incurs a higher branch misprediction rate than CFP2006. This ex-
plains the higher number of stalls caused due to branch misprediction for integer
codes as compared to floating-point codes (refer to Figure 12). Overall, the performance
of the branch predictor on the two machines is almost the same. In the rest of
the section, we present the classification of branches and report results for the
prediction accuracy for each type of branch.
For better readability, we only show the percentage of the latter two in Fig-
ure 16. From the figure, we note that a high prediction accuracy is achieved
for the IP relative branches. Specifically, an accuracy of 95.7% and 98.39% is
achieved, on an average, for CINT2006 and CFP2006 applications respectively.
Improving the prediction accuracy for IP relative branches can potentially boost
the performance of integer codes, albeit by a small amount.
Indirect Branches. Indirect branches are predicted on the basis of the current
value in the referenced branch register. There is always a 2-cycle penalty for a
correctly predicted indirect branch. We measured the misprediction rate of indirect
branches in CPU2006 on Montecito using the following hardware performance
counters:
BR MISPRED DETAIL.NRETIND.CORRECT PRED: This event counts the number of
correctly predicted (outcome and target) non-return indirect branches.
BR MISPRED DETAIL.NRETIND.WRONG PATH: This event counts the number of mis-
predicted non-return indirect branches due to wrong branch direction.
BR MISPRED DETAIL.NRETIND.WRONG TARGET: This event counts the number of
mispredicted non-return indirect branches due to wrong target for taken branches.
For better readability, we only show the percentage of the latter two in Fig-
ure 18. From the figure we note that indirect branches incur a large misprediction
rate. Specifically, 50.79% and 54.2% of the total indirect branches are mispre-
dicted in CINT2006 and CFP2006 respectively. In each case, the misprediction
occurs due to the wrong target. On the other hand, from Figures 15 and 19, we
note that indirect branches constitute a small (less than 1%) percentage of the total
number of branches. From this, we conclude that improving the prediction ac-
curacy of indirect branches is unlikely to benefit the overall performance in a
significant fashion.
Return Branches. All predictions for return branches come from an eight-
entry return stack buffer (RSB). A branch call pushes both the caller's IP and
its current function state onto the RSB. A return pops off this information.
There is always a 1-cycle penalty for a correctly predicted return. We measured
the misprediction rate of return branches in CPU2006 on Montecito using the
following hardware performance counters (a toy model of the RSB follows their
descriptions):
BR MISPRED DETAIL.RETURN.CORRECT PRED: This event counts the number of
correctly predicted (outcome and target) return type branches. Mispredictions occur
when the return address popped from the RSB does not match the actual return
address; since the RSB has only 8 entries, such mispredictions are likely in
applications with call stacks deeper than 8.
BR MISPRED DETAIL.RETURN.WRONG PATH: This event counts the number of mis-
predicted return type branches due to wrong branch direction.
BR MISPRED DETAIL.RETURN.WRONG TARGET: This event counts the mispredicted
return type branches due to wrong target for taken branches. This can happen
in two cases. First, for predicated returns [(qp) br.ret], e.g., the return is
predicted taken although the qualifying predicate (qp) is clear. Second, when
the return branch inherits a “wrong” predication hint from another branch that
has been issued within a 2 bundle window of the return.
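As a toy illustration of why call stacks deeper than the RSB cause return mispredictions, the sketch below models an 8-entry return stack buffer; this is a simplified model, not the actual hardware logic:

    from collections import deque

    class ToyRSB:
        # Toy 8-entry return stack buffer: calls push the return address,
        # returns pop the prediction; on overflow the oldest entries are
        # silently discarded, so returns to them will mispredict.
        def __init__(self, entries=8):
            self.stack = deque(maxlen=entries)

        def call(self, return_address):
            self.stack.append(return_address)

        def predict_return(self):
            return self.stack.pop() if self.stack else None

    rsb = ToyRSB()
    call_sites = ["ret_%d" % i for i in range(10)]   # 10 nested calls, deeper than 8
    for addr in call_sites:
        rsb.call(addr)

    mispredicts = sum(1 for addr in reversed(call_sites)
                      if rsb.predict_return() != addr)
    print(mispredicts)   # 2: the two oldest return addresses were lost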
For better readability, we only show the percentage of the latter two in Fig-
ure 18. From the figure we observe that, on average, return branches incur a
misprediction rate of less than 1% in both CINT2006 and CFP2006. From this and
Table 2 we conclude that
reduction in mispredictions due to RETs will not yield significant performance
gains.
10 Instruction Breakdown
Figure 19 presents the instruction breakdown for both CINT2006 and CFP2006.
We measured this using the following hardware performance counters:
LOADS RETIRED: The event counts the number of retired loads, excluding pred-
icated off loads. The count includes integer, floating-point, RSE, semaphores,
VHPT, uncacheable loads and check loads (ld.c) which missed in ALAT and
L1D (because this is the only time this looks like any other load). Also included
are loads generated by squashed HPW walks.
STORES RETIRED: The event counts the number of retired stores, excluding those
that were predicated off. The count includes integer, floating-point, semaphore,
RSE, VHPT, uncacheable stores.
NOPS RETIRED: This event provides information on the number of retired nop.i,
nop.m, nop.b, and nop.f instructions, excluding nop instructions that were
predicated off.
BR MISPRED DETAIL.ALL.ALL PRED: This event counts the number of branches
retired of all types, regardless of the prediction result.
PREDICATE SQUASHED RETIRED: This event provides information on number of
instructions squashed due to a false qualifying predicate. Includes all non-B-
syllable instructions which reached retirement with a false predicate.
FP OPS RETIRED: This event provides information on number of retired floating-
point operations, excluding all predicated off instructions.
From the figure we see that loads and stores constitute, on an average, 18%
and 17% of the total number of retired instructions in CINT2006 and CFP2006
respectively. Also, we note that the percentage of NOPs is higher, on an average,
in CFP2006 (32%) than in CINT2006 (18.2%). This is due to the longer latency
of the floating-point instructions, e.g., the floating-point multiply add (fma) has
a 5 cycle latency [6].
11 Conclusion
This paper presented a detailed performance characterization, using the built-in
hardware performance counters, of the dual-core dual-threaded Itanium
Montecito processor. To the best of our knowledge, this is the first work which
uses the SPEC CPU2006 benchmark suite for evaluation of an IA-64 architec-
ture. It also compared the performance of Montecito with the previous generation
Madison processor.
Based on our analysis we make the following conclusions:
❐ First, Montecito achieves, on a geometric mean basis, 14% and 16% higher
IPC for the integer and floating-point applications respectively. These gains
are primarily due to the better cache design on Montecito as compared to
Madison.
❐ Second, a relatively low IPC value is achieved for the C++ benchmarks and
429.mcf in CINT2006 and 5 applications in CFP2006. This is primarily due
to a high cache miss rate and/or a high DTLB miss rate.
❐ Third, the performance gain achievable using an oracle branch predictor
on Itanium is only 5% and 1.5%, on an average, for integer and floating-
point applications respectively. From this, we conclude that the performance
potential for a “better” branch predictor on an Itanium-based platform is
relatively low for the SPEC CPU2006 benchmarks.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable
feedback.
References
1. Naffziger, S., Stackhouse, B., Grutkowski, T., Josephson, D., Desai, J., Alon, E.,
Horowitz, M.: The implementation of a 2-core multi-threaded Itanium®-family
processor. IEEE Journal of Solid-State Circuits 41(1), 197–209 (2006)
2. McNairy, C., Bhatia, R.: Montecito: A dual-core, dual-thread Itanium processor.
IEEE Micro. 25(2), 10–20 (2005)
3. SPEC CPU (2006), http://www.spec.org/cpu2006
4. Caliper, http://ieeexplore.ieee.org/iel5/4434/19364/00895108.pdf
5. Kandiraju, G.B., Sivasubramaniam, A.: Characterizing the d-TLB behavior of
SPEC CPU 2000 benchmarks. In: Proceedings of the 2002 ACM SIGMETRICS
International Conference on Measurement and Modeling of Computer Systems, Ma-
rina Del Rey, CA, pp. 129–139 (2002)
6. Dual-Core Update to the Intel® Itanium® 2 Processor Reference Manual, Revision
0.9 (January 2006),
http://download.intel.com/design/Itanium2/manuals/30806501.pdf
7. Cvetanovic, Z., Bhandarkar, D.: Performance characterization of the Alpha 21164
microprocessor using TP and SPEC workloads. In: Proceedings of the 2nd Interna-
tional Symposium on High-Performance Computer Architecture, San Jose, CA, pp.
270–280 (February 1996)
8. Yeh, T.-Y., Patt, Y.N.: Alternative implementations of two-level adaptive branch
prediction. In: Proceedings of the 19th International Symposium on Computer Ar-
chitecture, Queensland, Australia, pp. 124–134 (1992)
A Tale of Two Processors: Revisiting the RISC-CISC
Debate
Abstract. The contentious debates between RISC and CISC have died down,
and a CISC ISA, the x86, continues to be popular. Nowadays, processors with
CISC ISAs translate the CISC instructions into RISC-style micro-operations
(e.g., uops of Intel and ROPs of AMD). The use of the uops (or ROPs) allows
the use of RISC-style execution cores, and use of various micro-architectural
techniques that can be easily implemented in RISC cores. This can easily allow
CISC processors to approach RISC performance. However, CISC ISAs do have
the additional burden of translating instructions to micro-operations. In a 1991
study between VAX and MIPS, Bhandarkar and Clark showed that after cancel-
ing out the code size advantage of CISC and the CPI advantage of RISC, the
MIPS processor had an average 2.7x advantage over the studied CISC proces-
sor (VAX). A 1997 study on Alpha 21064 and the Intel Pentium Pro still
showed 5% to 200% advantage for RISC for various SPEC CPU95 programs. A
decade later and after introduction of interesting techniques such as fusion of
micro-operations in the x86, we set off to compare a recent RISC and a recent
CISC processor, the IBM POWER5+ and the Intel Woodcrest. We find that the
SPEC CPU2006 programs are divided between those showing an advantage on
POWER5+ or Woodcrest, narrowing down the 2.7x advantage to nearly 1.0.
Our study points to the fact that if aggressive micro-architectural techniques for
ILP and high performance can be carefully applied, a CISC ISA can be imple-
mented to yield similar performance as RISC processors. Another interesting
observation is that approximately 40% of all work done on the Woodcrest is
wasteful execution in the mispredicted path.
1 Introduction
Interesting debates on CISC and RISC instruction set architecture styles were fought
over the years, e.g.: the Hennessy-Gelsinger debate at the Microprocessor Forum [8]
and Bhandarkar publications [3, 4]. In the Bhandarkar and Clark study of 1991 [3],
the comparison was between Digital's VAX and an early RISC processor, the MIPS.
As expected, MIPS had larger instruction counts (expected disadvantage for RISC)
and VAX had larger CPIs (expected disadvantage for CISC). Bhandarkar et al. pre-
sented a metric to indicate the advantage of RISC called the RISC factor. The average
RISC factor on SPEC89 benchmarks was shown to be approximately 2.7. Not even
one of the SPEC89 programs showed an advantage on the CISC.
The Microprocessor forum debate between John Hennessy and Pat Gelsinger in-
cluded the following 2 quotes:
"Over the last five years, the performance gap has been steadily diminishing. It
is an unfounded myth that the gap between RISC and CISC, or between x86 and
everyone else, is large. It's not large today. Furthermore, it is getting smaller."
- Pat Gelsinger, Intel
"At the time that the CISC machines were able to do 32-bit microprocessors,
the RISC machines were able to build pipelined 32-bit microprocessors. At the time
you could do a basic pipelining in CISC machine, in a RISC machine you could do
superscalar designs, like the RS/6000, or superpipelined designs like the R4000. I
think that will continue. At the time you can do multiple instruction issue with rea-
sonable efficiency on an x86, I believe you will be able to put second-level caches, or
perhaps even two processors on the same piece of silicon, with a RISC machine."
- John Hennessy, Stanford
Many things have changed since the early RISC comparisons such as the VAX-
MIPS comparison in 1991 [3]. The debates have died down in the last decade, and
most of the new ISAs conceived during the last 2 decades have been mainly RISC.
However, a CISC ISA, the x86 continues to be popular. It translates the x86 macro-
instructions into micro-operations (uops of Intel and ROPS of AMD). The use of the
uops (or ROPS) allows the use of RISC-style execution cores, and use of various mi-
cro-architectural techniques that can be easily implemented in RISC cores. A 1997
study of the Alpha and the Pentium Pro [4] showed that the performance gap was nar-
rowing, however the RISC Alpha still showed significant performance advantage.
Many see CISC performance approaching RISC performance, but exceeding it is
probably unlikely. The hardware for translating the CISC instructions to RISC-style is
expected to consume area, power and delay. Uniform-width RISC ISAs do have an
advantage for decoding, and the runtime translations that are required in CISC are
definitely not an advantage for CISC.
Fifteen years after the heated debates and comparisons, and at a time when all the
architectural ideas in Hennessy's quote (on chip second level caches, multiple proc-
essors) have been put into practice, we set out to compare a modern CISC and RISC
processor. The processors are Intel's Woodcrest (Xeon 5160) and IBM's POWER5+
[11, 16]. A quick comparison of key processor features can be found in Table 1.
Though the processors do not have identical micro-architectures, there is a signifi-
cant similarity. They were released around the same time frame and have similar
transistor counts (276 million for P5+ and 291 million for x86). The main differ-
ence between the processors is in the memory hierarchy. The Woodcrest has larger
L2 cache while the POWER5+ includes a large L3 cache. The SPEC CPU2006 re-
sults of Woodcrest (18.9 for INT/17.1 for FP) are significantly higher than that of
POWER5+ (10.5 for INT/12.9 for FP). The Woodcrest has a 3 GHz frequency
while the POWER5+ has a 2.2 GHz frequency. Even if one were to scale up the
POWER5+ results and compare the score for CPU2006 integer programs, it is clear
that even ignoring the frequency advantage, the CISC processor is exhibiting an ad-
vantage over the RISC processor. In this paper, we set out to investigate the per-
formance differences of these 2 processors.
Table 1. Key Features of the IBM POWER5+ and Intel Woodcrest [13]
2.1 POWER5+
The IBM POWER5+ is an out of order superscalar processor. The core contains one
instruction fetch unit, one decode unit, two load/store pipelines, two fixed-point exe-
cution pipelines, two floating-point execution pipelines, and two branch execution
pipelines. It has the ability to fetch up to 8 instructions per cycle and dispatch and re-
tire 5 instructions per cycle. POWER5+ is a multi-core chip with two processor cores
per chip. The core has a 64KB L1 instruction cache and a 32KB L1 data cache. The
chip has a 1.9MB unified L2 cache shared by the two cores. An additional 36MB L3
cache is available off-chip with its controller and directory on the chip.
The POWER5+ memory management unit has 3 types of caches to help address
translation: a translation look-aside buffer (TLB), a segment look-aside buffer (SLB)
and an effective-to-real address table (ERAT). The translation process starts its
search with the ERAT; only if that fails does it search the SLB and TLB. This
processor supports simultaneous multithreading.
2.2 Woodcrest
3 Methodology
In this study we use the 12 integer and 17 floating-point programs of the SPEC
CPU2006 [18] benchmark suite and measure performance using the on chip perform-
ance counters. Both POWER5+ and Woodcrest microprocessors provide on-chip
logic to monitor processor related performance events. The POWER5+ Performance
Monitor Unit contains two dedicated registers that count instructions completed and
total cycles as well as four programmable registers, which can count more than 300
hardware events occurring in the processor or memory system. The Woodcrest archi-
tecture has a similar set of registers, two dedicated and two programmable registers.
These registers can count various performance events such as, cache misses, TLB
misses, instruction types, branch misprediction and so forth. The perfex utility from
the Perfctr tool is used to perform the counter measurements on Woodcrest. A tool
from IBM was used for making the measurements on POWER5+.
The Intel Woodcrest processor supports both 32-bit as well as 64-bit binaries. The
data we present for Woodcrest corresponds to the best runtime for each benchmark
(hence is a mix of 64-bit and 32-bit applications). Except for gcc, gobmk, omnetpp,
xalancbmk and soplex, all other programs were in the 64-bit mode. The benchmarks
for POWER5+ were compiled using Compilers: XL Fortran Enterprise Edition 10.01
for AIX and XL C/C++ Enterprise Edition 8.0 for AIX. The POWER5+ binaries were
compiled using the flags:
C/C++: -O5 -qlargepage -qipa=noobject -D_ILS_MACROS -qalias=noansi
-qalloca + PDF (-qpdf1/-qpdf2)
FP: -O5 -qlargepage -qsmallstack=dynlenonheap -qalias=nostd + PDF
(-qpdf1/-qpdf2).
The OS used was AIX 5L V5.3 TL05. The benchmarks on Woodcrest were com-
piled using Intel’s compilers - Intel(R) C Compiler for 32-bit applications/ EM64T-
based applications Version 9.1 and Intel(R) Fortran Compiler for 32-bit applications/
EM64T-based applications, Version 9.1. The binaries were compiled using the flag:
-xP -O3 -ipo -no-prec-div / -prof-gen -prof-use.
Woodcrest was configured to run using SUSE LINUX 10.1 (X86-64).
According to the traditional RISC vs. CISC tradeoff, we expect POWER5+ to have a
larger instruction count and a lower CPI compared to Intel Woodcrest, but we observe
that this distinction is blurred. Figure 3 shows the path length (dynamic instruction
count) of the two systems for SPEC CPU2006. As expected, the instruction counts on
the RISC POWER5+ are higher in most cases; however, the POWER5+ has better in-
struction counts than the Woodcrest in 5 out of 12 integer programs and 7 out of 17
floating-point programs (indicated with * in Figure 3). The path length ratio is de-
fined as the ratio of the instructions retired by POWER5+ to the number of instruc-
tions retired by Woodcrest. The path length ratio (instruction count ratio) ranges
from 0.7 to 1.23 for integer programs and 0.73 to 1.83 for floating-point programs.
The lack of bias is evident since the geometric mean is about 1 for both integer and
floating-point applications. Figure 4 presents the CPIs of the two systems for SPEC
CPU2006. As expected, the POWER5+ has better CPIs than the Woodcrest in most
cases. However, in 5 out of 12 integer programs and 7 out of 17 floating-point pro-
grams, the Woodcrest CPI is better (indicated with * in Figure 4). The CPI ratio is the
ratio of the CPI of Woodcrest to that of POWER5+. The CPI ratio ranges from 0.78
to 4.3 for integer programs and 0.75 to 4.4 for floating-point applications. This data is
a sharp contrast to what was observed in the Bhandarkar-Clark study. They obtained
an instruction count ratio in the range of 1 to 4 and a CPI ratio ranging from 3 to 10.5.
In their study, the RISC instruction count was always higher than CISC and the CISC
CPI was always higher than the RISC CPI.
Figure 5 illustrates an interesting metric, the RISC factor and its change from the
Bhandarkar-Clark study to our study. Bhandarkar–Clark defined RISC factor as the ratio
of CPI ratio to path length (instruction count) ratio. The x-axis indicates the CPI ratio
(CISC to RISC) and the y-axis indicates the instruction count ratio (RISC to CISC).
The SPEC 89 data-points from the Bhandarkar-Clark study are clustered to the
right side of the figure, whereas most of the SPEC CPU2006 points are located closer
to the line representing RISC factor=1 (i.e. no advantage for RISC or CISC). This line
represents the situation where the CPI advantage for RISC is cancelled out by the path
length advantage for CISC. The shift highlights the sharp contrast between the results
observed in the early days of RISC and the current results.
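As a small illustration of how the ratios discussed above relate, the following sketch computes the path-length ratio, CPI ratio and RISC factor from retired-instruction and cycle counts; the numbers are placeholders, not measured values:

    def risc_factor(p5_insts, p5_cycles, wc_insts, wc_cycles):
        # Path-length ratio: POWER5+ retired instructions / Woodcrest retired
        # instructions.  CPI ratio: Woodcrest CPI / POWER5+ CPI.  The RISC
        # factor is the CPI ratio divided by the path-length ratio; a value of
        # 1.0 means neither ISA has a net advantage.
        path_length_ratio = p5_insts / wc_insts
        cpi_ratio = (wc_cycles / wc_insts) / (p5_cycles / p5_insts)
        return cpi_ratio / path_length_ratio

    # Hypothetical program where the CPI advantage exactly cancels the
    # path-length advantage:
    print(risc_factor(p5_insts=1.2e9, p5_cycles=1.0e9,
                      wc_insts=1.0e9, wc_cycles=1.0e9))   # 1.0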
4.2 Micro-operations Per Instruction (uops/inst)
In this section, we present the instruction mix to help the reader better understand the
later sections on branch predictor performance, and cache performance. The instruc-
tion mix can give us an indication of the difference between the benchmarks. It is far
from a clear indicator of bottlenecks but it can still provide some useful information.
Table 3 contains the instruction mix for the integer programs while Table 4
Table 3. Instruction mix (percentages): POWER5+ (left four columns) and Woodcrest (right four columns)
BENCHMARK Branches Stores Loads Others Branches Stores Loads Others
400.perlbench 18% 15% 25% 41% 23% 11% 24% 41%
401.bzip2 15% 8% 23% 54% 15% 9% 26% 49%
403.gcc 19% 17% 18% 46% 22% 13% 26% 39%
429.mcf 17% 9% 26% 48% 19% 9% 31% 42%
445.gobmk 16% 11% 20% 53% 21% 14% 28% 37%
456.hmmer 14% 11% 28% 47% 8% 16% 41% 35%
458.sjeng 18% 6% 20% 56% 21% 8% 21% 50%
462.libquantum 21% 8% 21% 50% 27% 5% 14% 53%
464.h264ref 7% 16% 35% 42% 8% 12% 35% 45%
471.omnetpp 19% 17% 26% 38% 21% 18% 34% 27%
473.astar 13% 8% 27% 52% 17% 5% 27% 52%
483.xalancbmk 20% 9% 23% 47% 26% 9% 32% 33%
contains the same information for floating-point benchmarks. In comparing the com-
position of instructions in the binaries of POWER5+ and Woodcrest, the instruction
mix seems to be largely similar for both architectures. We do observe that some
Woodcrest binaries have a larger fraction of load instructions compared to their
POWER5+ counterparts. For example, the execution of hmmer on POWER5+ has
28% load instruction while the Woodcrest version has 41% loads. Among integer
programs, gcc, gobmk and xalancbmk are other programs where the percentage of
loads in Woodcrest is higher than that of POWER5+.
Table 4. Instruction mix (percentages): POWER5+ (left four columns) and Woodcrest (right four columns)
BENCHMARK Branches Stores Loads Others Branches Stores Loads Others
410.bwaves 1% 7% 46% 46% 1% 8% 47% 44%
416.gamess 8% 8% 31% 53% 8% 9% 35% 48%
433.milc 3% 18% 34% 46% 2% 11% 37% 50%
434.zeusmp 2% 11% 26% 61% 4% 8% 29% 59%
435.gromacs 4% 14% 28% 54% 3% 14% 29% 53%
436.cactusADM 0% 14% 38% 48% 0% 13% 46% 40%
437.leslie3d 1% 12% 28% 59% 3% 11% 45% 41%
444.namd 5% 6% 28% 61% 5% 6% 23% 66%
447.dealII 15% 9% 32% 45% 17% 7% 35% 41%
450.soplex 15% 6% 26% 53% 16% 8% 39% 37%
453.povray 12% 14% 31% 44% 14% 9% 30% 47%
454.calculix 4% 6% 25% 65% 5% 3% 32% 60%
459.GemsFDTD 2% 10% 31% 57% 1% 10% 45% 43%
465.tonto 6% 13% 29% 52% 6% 11% 35% 49%
470.lbm 1% 9% 18% 72% 1% 9% 26% 64%
481.wrf 4% 11% 31% 54% 6% 8% 31% 56%
482.sphinx3 8% 3% 31% 59% 10% 3% 30% 56%
We also find a difference in the fraction of branch instructions, though not as sig-
nificant as the differences observed for load instructions. For example, xalancbmk has
20% branches in a POWER5+ execution and 26% branches in the case of Woodcrest.
A similar difference exists for gobmk and libquantum. In the case of hmmer, unlike
the previous cases, the number of branches is lower for Woodcrest (14% for
POWER5+ and only 8% for Woodcrest). Similar examples for difference in the frac-
tion of load and branch instructions can be found in the floating-point programs. A
few examples are cactusADM, leslie3d, soplex, gemsFDTD and lbm. FP programs
have traditionally had a lower fraction of branch instructions, but three of the pro-
grams exhibit more than 12% branches. This observation holds for both POWER5+
and Woodcrest. Interestingly these three programs (dealII, soplex and povray) are
C++ programs.
Branch prediction is a key feature in modern processors allowing out of order execu-
tion. Branch misprediction rate and misprediction penalty significantly influence the
stalls in the pipeline, and the amount of instructions that will be executed specula-
tively and wastefully in the misprediction path. In Figure 6 we present the branch
misprediction statistics for both architectures. We find that Woodcrest outperforms
POWER5+ in this aspect. The misprediction rate for Woodcrest among integer
benchmarks ranges from a low 1% for xalancbmk to a high 14% for astar. Only
gobmk and astar have a misprediction rate higher than 10% for Woodcrest. On the
other hand, the misprediction rate for POWER5+ ranges from 1.74% for xalancbmk
and 15% for astar. On average the misprediction for integer benchmarks is 7% for
POWER5+ and 5.5% for Woodcrest. In the case of floating-point benchmarks this is
5% for POWER5+ and 2% for Woodcrest. We see that, in the case of the floating-
point programs, POWER5+ branch prediction performs poorly relative to Woodcrest.
This is particularly noticeable in programs like gamess, dealII, tonto and sphinx.
[Figure 6. Branch misprediction rates for the CINT2006 and CFP2006 benchmarks on POWER5+ and Woodcrest; y-axis: misprediction percentage]
The cache hierarchy is one of the important micro-architectural features that differ
between the systems. POWER5+ has a smaller L2 cache (1.9M instead of 4M in
Woodcrest), but it has a large shared L3 cache. This makes the performance of the
cache hierarchies of the two processors of particular interest. Figure 7 shows the L1
data cache misses per thousand instructions for both integer and floating-point bench-
marks. Among integer programs mcf stands out, while there are no floating-point pro-
grams with a similar behavior. POWER5+ has a higher L1 D cache miss rate for gcc,
milc and lbm even though both processors have the same L1 D cache size. In general,
the L1 data cache miss rates are under 40 misses per 1k instructions. In spite of the
small L2 cache, the L2 miss ratio on POWER5+ is lower than that on Woodcrest.
[Figure 7. L1 data cache misses per 1,000 retired instructions for the CINT2006 and CFP2006 benchmarks on POWER5+ and Woodcrest]
[Figure 8. L2 cache misses per 1,000 retired instructions for the CINT2006 and CFP2006 benchmarks; legend: P5 L2 miss/1k inst, WC L2 miss/1k inst]
While no data is available to further analyze this, we suspect that differences in the
amount of loads in the instruction mix (as discussed earlier), differences in the
instruction cache misses (POWER5+ has a bigger I-cache) etc. can lead to this.
Over the years out-of-order processors have achieved significant performance gains
from various speculation techniques. The techniques have primarily focused on con-
trol flow prediction and memory disambiguation. In Figure 9 we present speculation
percentage, a measure of the amount of wasteful execution, for different benchmarks.
We define the speculation % as the ratio of instructions that are executed specula-
tively but not retired to the number of instructions retired (i.e. (dispatched_inst_cnt /
retired_inst_cnt) - 1).
[Figure 9. Speculation % for the CINT2006 and CFP2006 benchmarks; legend: P5+ (inst disp/compl), WC (UOPS disp/retired)]
We find the amount of speculation in integer benchmarks to be
higher than floating-point benchmarks, not surprising considering the higher percent-
age of branches and branch mispredictions in integer programs.
In general, the Woodcrest micro-architecture speculates much more aggressively
compared to POWER5+. On an average, an excess of 40% of instructions in Wood-
crest and 29% of instructions in POWER5+ are speculative for integer benchmarks.
The amount of speculations for FP programs on average is 20% for Woodcrest and
9% for POWER5+. Despite concerns on power consumption, the fraction of instruc-
tions spent in mispredicted path has increased from the average of 20% (25% for INT
and 15% for FP) seen in the 1997 Pentium Pro study. Among the floating-point pro-
grams, POWER5+ speculates more than Woodcrest in four of the benchmarks: dealII,
soplex, povray and sphinx. It is interesting to note that 3 of these benchmarks are C++
programs. With limits on power and energy consumption, wasted execution on the
speculative path is of great concern.
Hypothetically, not having fusion would increase the uops/inst for floating-point
programs from 1.07 uops/inst to 1.23 uops/inst and for integer programs from 1.03
uops/inst to 1.23 uops/inst. It is clear that this micro-architectural technique has
played a significant part in blunting the advantage of RISC by reducing the number of
uops that are executed per instruction.
The cost of memory access has been accentuated by the higher performance of the
logic unit of the processor (the memory wall). The Woodcrest architecture is said to
perform an optimization aimed at reducing the load latencies of operations with re-
gards to the stack pointer [2]. The work by Bekerman et al. [2] proposes tracking the
ESP register and simple operations on it of the form reg±immediate, to enable quick
resolutions of the load address at decode time. The ESP register in IA32 holds the
stack pointer and is almost never used for any other purpose. Instructions such as
CALL/RET, PUSH/POP, and ENTER/LEAVE can implicitly modify the stack
pointer. There can also be general-purpose instructions that modify the ESP in the
fashion ESP←ESP±immediate. These instructions are heavily used for procedure
calls and are translated into uops as given below in Table 7. The value of the immedi-
ate operand is provided explicitly in the uop.
These ESP modifications can be tracked easily after decode. Once the initial ESP
value is known later values can be computed after each instruction decode. In essence
this method caches a copy of the ESP value in the decode unit. Whenever a simple
modification to the ESP value is detected the cached value is used to compute the ESP
value without waiting for the uops to reach execution stage. The cached copy is also
updated with the newly computed value. In some cases the uops cause operations that
are not easy to track and compute; for example loads from memory into the ESP or
computations that involve other registers. In these cases the cached value of ESP is
flagged and it is not used for computations until the uop passes the execution stage
and the new ESP value is obtained. In the meanwhile, if any other instruction that
follows attempts to modify the ESP value, the decoder tracks the change operation
and the delta value it causes. Once the new ESP value is obtained from the uop that
passed the execution stage, the observed delta is applied to it to bring the ESP
register up to date. Having the ESP value at hand allows quick resolution of the load
addresses, thereby avoiding any related stall. This technique is expected to bear
fruit in workloads where there is a significant use of the stack, most likely for func-
tion calls. Further details on this optimization can be found in Bekerman et al. [2].
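A much-simplified sketch of the decode-time ESP tracking described above; the class and method names are ours, and real hardware performs this in the decoders rather than in software:

    class DecodeTimeESPTracker:
        # Toy model of early stack-pointer resolution at decode time: simple
        # ESP +/- immediate updates are applied to a cached copy; anything the
        # decoder cannot compute flags the copy as unknown, and later deltas
        # are accumulated until the executed value arrives.
        def __init__(self, esp=0):
            self.cached_esp = esp
            self.known = True
            self.pending_delta = 0

        def simple_update(self, delta):
            # e.g. PUSH/POP/CALL/RET or ESP <- ESP +/- immediate
            if self.known:
                self.cached_esp += delta
            else:
                self.pending_delta += delta    # remember the change for later

        def complex_update(self):
            # e.g. a load from memory into ESP: value unknown until execution
            self.known = False
            self.pending_delta = 0

        def execution_result(self, esp_value):
            # The uop producing the unknown ESP has executed; re-sync the copy.
            self.cached_esp = esp_value + self.pending_delta
            self.known = True
            self.pending_delta = 0

        def resolve_load_address(self, displacement):
            # Early load-address resolution is possible only while the cached
            # copy is valid.
            return self.cached_esp + displacement if self.known else None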
Table 8. Percentage of instructions on which early load address resolutions were applied
BENCHMARK % ESP SYNCH % ESP ADDITIONS BENCHMARK % ESP SYNCH % ESP ADDITIONS
400.perlbench 0.90% 6.88% 433.milc 0.00% 0.04%
401.bzip2 0.30% 1.41% 434.zeusmp 0.00% 0.00%
403.gcc 1.80% 7.99% 435.gromacs 0.03% 0.14%
429.mcf 0.17% 0.24% 436.cactusADM 0.00% 0.00%
445.gobmk 1.81% 8.45% 437.leslie3d 0.00% 0.00%
456.hmmer 0.00% 0.11% 444.namd 0.00% 0.01%
458.sjeng 0.41% 3.19% 447.dealII 0.20% 3.05%
462.libquantum 0.12% 0.13% 450.soplex 0.11% 0.54%
464.h264ref 0.12% 1.44% 453.povray 0.67% 2.77%
471.omnetpp 3.06% 7.60% 454.calculix 0.03% 0.09%
473.astar 0.01% 0.14% 459.GemsFDTD 0.08% 0.33%
483.xalancbmk 3.76% 11.30% 465.tonto 0.26% 0.77%
470.lbm 0.00% 0.00%
481.wrf 0.19% 0.35%
482.sphinx3 0.17% 0.90%
410.bwaves 0.03% 0.04%
416.gamess 0.15% 0.76%
INT - geomean 1.04% 4.07% FP - geomean 0.12% 0.60%
On average the benefit from ESP based optimization is 4% for integer programs
and 0.6% for FP programs. Each ESP based addition that is avoided amounts to
avoiding execution of one uop. Although the average benefit is low, some of the ap-
plications benefit significantly from the reduction in unnecessary computation, thereby
helping the performance of those applications relative to their POWER5+ counterparts.
6 Conclusion
Using the SPEC CPU2006 benchmarks, we analyze the performance of a recent CISC
processor, the Intel Woodcrest (Xeon 5160) with a recent RISC processor, the IBM
POWER5+. In a CISC RISC comparison in 1991, the RISC processor showed an ad-
vantage of 2.7x and in a 1997 study of the Alpha 21064 and the Pentium Pro, the
RISC Alpha showed 5% to 200% advantage on the SPEC CPU92 benchmarks. Our
study shows that the performance difference between RISC and CISC has further nar-
rowed down. In contrast to the earlier studies where the RISC processors showed
dominance on all SPEC CPU programs, neither the RISC nor CISC dominates in this
study. In our experiments, the Woodcrest shows advantage on several of the SPEC
CPU2006 programs and the POWER5+ shows advantage on several other programs.
Various factors have helped the Woodcrest to obtain its RISC-like performance.
Splitting the x86 instruction into micro-operations of uniform complexity has helped,
however, interestingly the Woodcrest also combines (fuses) some micro-operations to
a single macro-operation. In some programs, up to a third of all micro-operations are
seen to benefit from fusion, resulting in chained operations that are executed in a
single step by the relevant functional unit. Fusion also reduces the demand on reserva-
tion station and reorder buffer entries. Additionally, it reduces the net uops per in-
struction. The average uop per instruction for Woodcrest in 2007 is 1.03 for integer
programs and 1.07 for floating-point programs, while in Bhandarkar and Ding’s 1997
study [5] using SPEC CPU95 programs, the average was around 1.35 uops/inst. Al-
though the POWER5+ has a smaller L2 cache than the Woodcrest, it is seen to achieve
equal or better L2 cache performance than the Woodcrest. The Woodcrest has better
branch prediction performance than the POWER5+. Approximately 40%/20% (int/fp)
of instructions in Woodcrest and 29%/9% (int/fp) of instructions in the POWER5+
are seen to be in the speculative path.
Our study points out that with aggressive micro-architectural techniques for ILP,
CISC and RISC ISAs can be implemented to yield very similar performance.
Acknowledgement
We would like to acknowledge Alex Mericas, Venkat R. Indukuru and Lorena
Pesantez at IBM Austin for their guidance. The authors are supported in part by NSF
grant 0702694, and an IBM Faculty award. Any opinions, findings and conclusions
expressed in this paper are those of the authors and do not necessarily reflect the
views of the National Science Foundation (NSF) or other research sponsors.
References
1. Agerwala, T., Cocke, J.: High-performance reduced instruction set processors. Technical
report, IBM Computer Science (1987)
2. Bekerman, M., Yoaz, A., Gabbay, F., Jourdan, S., Kalaev, M., Ronen, R.: Early load ad-
dress resolution via register tracking. In: Proceedings of the 27th Annual international
Symposium on Computer Architecture, pp. 306–315
3. Bhandarkar, D., Clark, D.W.: Performance from architecture: comparing a RISC and a
CISC with similar hardware organization. In: Proceedings of ASPLOS 1991, pp. 310–319
(1991)
4. Bhandarkar, D.: A Tale of two Chips. ACM SIGARCH Computer Architecture
News 25(1), 1–12 (1997)
5. Bhandarkar, D., Ding, J.: Performance Characterization of the Pentium® Pro Processor.
In: Proceedings of the 3rd IEEE Symposium on High Performance Computer Architecture,
February 01-05, 1997, pp. 288–297 (1997)
6. Chow, F., Correll, S., Himelstein, M., Killian, E., Weber, L.: How many addressing modes
are enough. In: Proceedings of ASPLOS-2, pp. 117–121 (1987)
7. Cmelik, et al.: An analysis of MIPS and SPARC instruction set utilization on the SPEC
benchmarks. In: ASPLOS 1991, pp. 290–302 (1991)
8. Hennessy, J., Gelsinger, P.: Can the 386 Architecture Keep Up? John Hennessy and Pat
Gelsinger Debate the Future of RISC vs. CISC. Microprocessor Report
9. Hennessy, J.: VLSI Processor Architecture. IEEE Transactions on Computers C-33(11),
1221–1246 (1984)
10. Hennessy, J.: VLSI RISC Processors. VLSI Systems Design, VI:10, pp. 22–32 (October
1985)
11. Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient
Performance,
http://www.intel.com/technology/architecture-silicon/core/
12. Smith, J.E., Weiss, S.: PowerPC 601 and Alpha 21064: A Tale of Two RISCs. IEEE Computer
13. Microprocessor Report – Chart Watch - Server Processors. Data as of (October 2007)
http://www.mdronline.com/mpr/cw/cw_wks.html
14. Patterson, D.A., Ditzel, D.R.: The case for the reduced instruction set computer. Computer
architecture News 8(6), 25–33 (1980)
15. Patterson, D.: Reduced Instruction Set Computers. Communications of the ACM 28(1), 8–
21 (1985)
16. Kanter, D.: Fall Processor Forum 2006: IBM’s POWER6,
http://www.realworldtech.com/
17. Kanter, D.: Intel’s Next Generation Microarchitecture Unveiled. Real World Technologies
(March 2006), http://www.realworldtech.com
18. SPEC Benchmarks, http://www.spec.org
19. Wong, M.: C++ benchmarks in SPEC CPU 2006. SIGARCH Computer Architecture
News 35(1), 77–83 (2007)
Investigating Cache Parameters of x86 Family
Processors
1 Introduction
The memory architecture of the x86 processor family has evolved over more than
a quarter of a century – by all standards, an ample time to achieve consider-
able complexity. Equipped with advanced features such as translation buffers
and memory caches, the architecture represents an essential contribution to the
overall performance of the contemporary x86 family processors. As such, it is a
natural target of performance engineering efforts, ranging from software perfor-
mance modeling to computing kernel optimizations.
Among such efforts is the investigation of the performance related effects
caused by sharing of the memory architecture among multiple software com-
ponents, carried out within the framework of the Q-ImPrESS project¹. The
Q-ImPrESS project aims to deliver a comprehensive framework for multicrite-
rial quality of service modeling in the context of software service development.
The investigation, necessary to achieve a reasonable modeling precision, is based
on evaluating a series of experiments that subject the memory architecture to
various workloads.
In order to design and evaluate the experiments, detailed information about
the memory architecture exercised by the workloads is required. Lack of infor-
mation about features such as hardware prefetching, associativity or inclusivity
¹ This work is supported by the European Union under the ICT priority of the Seventh
Research Framework Program contract FP7-215013 and by the Czech Academy of
Sciences project 1ET400300504.
could result in naive experiment designs, where the workload behavior does not
really target the intended part of the memory architecture, or in naive exper-
iment evaluations, where incidental interference between various parts of the
memory architecture is interpreted as the workload performance.
Within the Q-ImPrESS project, we have carried out multiple experiments on
both AMD and Intel processors. Surprisingly, the documentation provided by
both vendors for their processors has turned out to be somewhat less complete
and correct than necessary – some features of the memory architecture are only
presented in a general manner applicable to an entire family of processors, other
details are buried among hundreds of pages of assorted optimization guidelines.
To overcome the lack of detailed information, we have constructed additional
experiments intended specifically to investigate the parameters of the memory
architecture. These experiments are the topic of this paper.
We believe that the experiments investigating the parameters of the memory
architecture can prove useful to other researchers – some performance relevant
aspects of the memory architecture are extremely sensitive to minute details,
which makes the investigation tedious and error prone. We present both an
overview of some of the more interesting experiments and an overview of the
framework used to execute the experiments – Section 2 focuses on the parameters
of the translation buffers, Section 3 focuses on the parameters of the memory
caches, Section 4 presents the framework.
After a careful consideration, we have decided against providing an overview of
the memory architecture of the x86 processor family. In the following, we assume
familiarity with the x86 processor family on the level of the vendor supplied user
guides [1,2], or at least on the general programmer level [3].
For the experiments, we have chosen two platforms that represent common
servers with both Intel and AMD processors, further referred to as Intel Server
and AMD Server.
Intel Server. A server configuration with an Intel processor is represented by
the Dell PowerEdge 1955 machine, equipped with two Quad-Core Intel Xeon
CPU E5345 2.33 GHz (Family 6 Model 15 Stepping 11) processors with inter-
nal 32 KB L1 caches and 4 MB L2 caches, and 8 GB Hynix FBD DDR2-667
synchronous memory connected via Intel 5000P memory controller.
AMD Server. A server configuration with an AMD processor is represented
by the Dell PowerEdge SC1435 machine, equipped with two Quad-Core AMD
Opteron 2356 2.3 GHz (Family 16 model 2 stepping 3) processors with internal
64 KB L1 caches, 512 KB L2 caches and 2 MB L3 caches, integrated memory
controller with 16 GB DDR2-667 unbuffered, ECC, synchronous memory.
To collect the timing information, the RDTSC processor instruction is used.
In addition to the timing information, we collect the values of the performance
counters for events related to the experiments using the PAPI library [4] running
on top of perfctr [5]. The performance events supported by the platforms are
described in [1, Appendix A.3] and [6, Section 3.14]. For overhead incurred by
the measurement framework, see [7].
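As a minimal sketch of cycle timing with RDTSC (illustrative only; this is not the measurement framework's actual code, and careful measurements additionally need serializing instructions and repeated runs), the time-stamp counter can be read through the __rdtsc() intrinsic:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang for x86 */

int main(void)
{
    volatile uint64_t sink = 0;
    uint64_t start = __rdtsc();          /* read TSC before the region */
    for (int i = 0; i < 1000; i++)
        sink += (uint64_t) i;            /* workload being timed       */
    uint64_t end = __rdtsc();            /* read TSC after the region  */
    printf("approx. %llu cycles for 1000 iterations\n",
           (unsigned long long)(end - start));
    return 0;
}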
Although mostly irrelevant, both platforms are running Fedora Linux 8 with
kernel 2.6.25.4-10.fc8.x86_64, gcc-4.1.2-33.x86_64, glibc-2.7-2.x86_64. Only 4-level
paging with 4 KB pages is investigated.
(L1 DTLB, L2 DTLB), a cache of the third level paging structures (PDE cache),
a cache of the second level paging structures (PDPTE cache), and a cache of the
first level paging structures (PML4TE cache). The following table summarizes
the basic parameters of the translation buffers on the two platforms, with the
parameters not available in vendor documentation emphasized.
We begin our translation buffers investigation by describing experiments tar-
geted at the translation miss penalties, which are not available in vendor
documentation.
(*ptr) = next;                 /* store the address of the next element in the current one */
ptr = (uintptr_t **) next;     /* advance to the next element of the pointer chain */
}
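Only the lines above survived from the original listing. As a rough, hypothetical reconstruction of what such a set-collision pointer walk typically looks like (our own sketch, not the authors' Listings 1.1/1.3; pageStride and numAccesses mirror the parameters discussed below):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u

/* Build a cyclic pointer chain over numAccesses pages spaced pageStride
 * pages apart, then traverse it with dependent loads. */
static void *pointer_walk(char *buffer, size_t pageStride,
                          size_t numAccesses, size_t steps)
{
    uintptr_t **ptr = (uintptr_t **) buffer;

    /* Initialization: store in each element the address of the next one. */
    for (size_t i = 0; i < numAccesses; i++) {
        char *next = buffer
            + ((i + 1) % numAccesses) * pageStride * PAGE_SIZE;
        (*ptr) = (uintptr_t *) next;
        ptr = (uintptr_t **) next;
    }

    /* Measured walk: each step is a load that depends on the previous one. */
    ptr = (uintptr_t **) buffer;
    for (size_t i = 0; i < steps; i++)
        ptr = (uintptr_t **) *ptr;
    return ptr;   /* returned so the compiler cannot drop the loop */
}

int main(void)
{
    size_t pageStride = 4, numAccesses = 8;
    char *buf = aligned_alloc(PAGE_SIZE,
                              numAccesses * pageStride * PAGE_SIZE);
    if (!buf)
        return 1;
    printf("end of walk: %p\n",
           pointer_walk(buf, pageStride, numAccesses, 1000));
    free(buf);
    return 0;
}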
Fig. 1. DTLB0 miss penalty and related performance events on Intel Server
all accesses should hit; afterwards, the accesses should start missing, depend-
ing on the replacement policy. For ITLBs, we analogously use a jump emitting
version of code from Listing 1.3 with the code from Listing 1.2.
Since the plots that illustrate the results for each TLB are similar in shape,
we include only representative examples and comment on the results in writing. All
plots are available in [7].
Starting with an example of a well documented result, we choose the experi-
ment with DTLB0 on Platform Intel Server, which requires pageStride set to 4
and numAccesses varying from 1 to 32. The results on Fig. 1 contain both the
average access duration and the counts of the related performance events. We
see that the access duration increases from 3 to 5 cycles at 5 accessed pages. At
the same time, the number of misses in DTLB0 (DTLB MISSES.L0 MISS LD
events) increases from 0 to 1, but there are no DTLB1 misses (DTLB MISSES-
:ANY events). The experiment therefore confirms the well documented parame-
ters of DTLB0 such as the 4-way associativity and the miss penalty of 2 cycles [1,
page A-9]. It also suggests that the replacement policy behavior approximates
LRU for our access pattern.
Experimenting with DTLB1 on Platform Intel Server requires changing the
pageStride parameter to 64 and yields an increase in the average access du-
ration from 3 to 12 cycles at 5 accessed pages. Figure 2 shows the counts of
the related performance events, attributing the increase to DTLB1 misses and
confirming the 4-way associativity. Since there are no DTLB0 misses that would
hit in the DTLB1, the figure also suggests non-exclusive policy between DTLB0
and DTLB1. The experiment therefore estimates the miss penalty, which is not
available in vendor documentation, at 7 cycles. Interestingly, the counter of cy-
cles spent in page walks (PAGE WALKS:CYCLES events) reports only 5 cycles
per access and therefore does not fully capture this penalty.
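For clarity, the 7-cycle estimate follows directly from the measured durations (our reading of the arithmetic):
a hit costs 3 cycles; a DTLB0 miss that hits in DTLB1 costs 5 cycles (penalty 2 cycles);
a DTLB1 miss costs 12 cycles, i.e. 12 − 5 = 7 cycles on top of a DTLB0 miss.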
As additional information not available in vendor documentation, we can
see that exceeding the DTLB1 capacity increases the number of L1 data cache
references (L1D ALL REF events) from 1 to 2. This suggests that page tables
are cached in the L1 data cache, and that the PDE cache is present and the page
table accesses hit there, since only the last level page walk step is needed.
Experimenting with L1 DTLB on Platform AMD Server requires changing
pageStride to 1 for full associativity. The results show a change from 3 to 8
Fig. 2. Performance event counters related to L1 DTLB misses on Intel Server (left)
and L2 DTLB misses on AMD Server (right)
cycles at 49 accessed pages, which confirms the full associativity and 48 entries
in the L1 DTLB; the replacement policy behavior approximates LRU for our
access pattern. The performance counters show a change from 0 to 1 in the
L1 DTLB miss and L2 DTLB hit events, the L2 DTLB miss event does not
occur. The experiment therefore estimates the miss penalty, which is not avail-
able in vendor documentation, at 5 cycles. Note that the value of L1 DTLB
hit counter (L1 DTLB HIT:L1 4K TLB HIT) is always 1, indicating a possible
problem with this counter on the particular experiment platform.
For L2 DTLB on Platform AMD Server, pageStride is set to 128. The results
show an increase from 3 to 43 cycles at 49 accessed pages, which means that we
observe L2 DTLB misses and also indicates a non-exclusive policy between L1
DTLB and L2 DTLB. The L2 associativity, however, is difficult to confirm due
to full L1 associativity. The event counters on Fig. 2 show a change from 0 to 1 in
the L2 miss event (L1 DTLB AND L2 DTLB MISS:4K TLB RELOAD event).
The penalty of the L2 DTLB miss is thus estimated at 35 cycles in addition to
the L1 DTLB miss penalty, or 40 cycles in total.
On Platform AMD Server, the paging structures are not cached in the L1
cache. The value of the REQUESTS TO L2:TLB WALK event counter shows
that each L2 DTLB miss in this experiment results in one page walk step that
accesses the L2 cache. This means that a PDE cache is present, as is further
examined in the next experiment. Note that the problem with the value of the
L1 DTLB HIT:L1 4K TLB HIT event counter persists; it is always 1 even in the
presence of L2 DTLB misses.
Our experiments targeted at the translation miss penalties indicate that a TLB
miss can be resolved with only one additional memory access, rather than as
many accesses as there are levels in the paging structures. This means
that a cache of the third level paging structures is present on both investigated
platforms, and since the presence of such additional translation caches is
only discussed in general terms in vendor documentation [8], we investigate these
caches next.
[Figure data omitted: access duration (cycles) and L1 data cache references vs. number of accessed pages, for page strides of 512, 4 K, 8 K, 64 K, 128 K and 256 K]
Fig. 3. Extra translation caches miss penalty (left) and related L1 data cache reference
events (right) on Intel Server
With the presence of the third level paging structure cache (PDE cache) already
confirmed, we focus on determining the presence of caches for the second level
(PDPTE cache) and the first level (PML4TE cache).
The experiments use the set collision pointer walk from Listing 1.1 and 1.3.
The numAccesses and pageStride parameters are initially set to values that
make each access miss in the last level of DTLB and hit in the PDE cache. By
repeatedly doubling pageStride, we should eventually reach a point where only
a single associativity set in the PDE cache is accessed, triggering misses when
numAccesses exceeds the associativity. This should be observed as an increase of
the average access duration and an increase of the data cache access count during
page walks. Eventually, the accessed memory range pageStride × numPages
exceeds the 512 × 512 pages translated by a single third level paging structure,
making the accesses map to different entries in the second level paging structure
and thus different entries in the PDPTE cache, if present. Further increase of
pageStride extends the scenario analogically to the PML4TE cache.
The change of the average access durations and the corresponding change
in the data cache access count for different values of pageStride on Platform
Intel Server are illustrated in Fig. 3. Only those values of pageStride that lead
to different results are displayed; the results for the values that are not displayed
are the same as the results for the previous value.
For the 512 pages stride, the average access duration changes from 3 to 12 cycles at 5 ac-
cessed pages, which means we hit the PDE cache as in the previous experiment. We
also observe an increase of the access duration from 12 to 23 cycles and a change in
the L1 cache miss (L1D REPL event) counts from 0 to 1 at 9 accessed pages. These
misses are not caused by the accessed data but by the page walks, since with this
particular stride and alignment, we always read the first entry of a page table and
therefore the same cache set. We see that the penalty of this miss is 11 cycles, also
reflected in the value of the PAGE WALKS:CYCLES event counter, which changes
from 5 to 16. Later experiments will show that an L1 data cache miss penalty for
data load on this platform is indeed 11 cycles, which means that the L1 data cache
miss penalty simply adds up with the DTLB miss penalty.
Fig. 4. Extra translation caches miss penalty (left) and related page walk requests to
L2 cache (right) on AMD Server
As we increase the stride, we start to trigger misses in the PDE cache. With
the stride of 8192 pages, which spans 16 PDE entries, and 5 or more accessed
pages, the PDE cache misses on each access. The L1 data cache references event
counter indicates that there are three L1 data cache references per memory
access; two of them are therefore caused by the page walk. This means that a
PDP cache is also present and the PDE miss penalty is 4 cycles.
Further increasing the stride results in a gradual increase of the PDP cache
misses. With the 512 × 512 pages stride, each access maps to a different PDP
entry. At 5 accessed pages, the L1D ALL REF event counter increases to 5
L1 data cache references per access. This indicates that there is no PML4TE
cache, since all four levels of the paging structures are traversed, and that the
PDP cache has at most 4 entries. Compared to the 8192 pages stride, the PDP
miss adds approximately 19 cycles per access. Out of those, 11 cycles are added
by an extra L1 data cache miss, as both PDE and PTE entries miss the L1
data cache due to being mapped to the same set. The remaining 8 cycles is
the cost of walking two additional levels of page tables due to the PDPTE
miss.
The standard deviation of the results exceeds the limit of 0.5 cycles only when
the L1 cache associativity is about to be exceeded – up to 3.5 cycles, and when
the translation cache level is about to be exhausted – up to 8 cycles.
The observed access durations and the corresponding change in the data cache
access count from an analogous experiment on Platform AMD Server are shown
in Fig. 4. We can see that for a stride of 128 pages, we still hit the PDE cache
as in the previous experiment. Strides of 512 pages and more need 2 page walk
steps and thus hit the PDPTE cache. Strides of 256 K pages need 3 steps and
thus hit the PML4TE cache. Finally, strides of 128 M pages need all 4 steps.
The access duration increases by 21 cycles for each additional page walk step.
With a 128 M stride, we see an additional penalty due to page walks triggering
L2 cache misses.
The standard deviation of the results exceeds the limit of 0.5 cycles only when
the L2 cache capacity is exceeded – up to 18 cycles, and when the translation
cache level is about to be exhausted – up to 10 cycles.
In order to determine the cache line size, the experiment executes a measured
workload that randomly accesses half of the cache lines, interleaved with an inter-
fering workload that randomly accesses all the cache lines. For data caches, both
workloads use a pointer emitting version of code from Listing 1.4 to initialize the
access pattern and code from Listing 1.1 to traverse the pattern. For instruction
caches, both workloads use a jump emitting version of code from Listing 1.4 to
initialize the access pattern and code from Listing 1.2 to traverse the pattern.
The measured workload uses the smallest possible access stride, which is 8 B for
64 bit aligned pointer variables and 16 B for jump instructions. The interfering
workload varies its access stride. When the stride exceeds the cache line size,
the interfering workload should no longer access all cache lines, which should
be observed as a decrease in the measured workload duration, compared to the
situation when the interfering workload accesses all cache lines.
The results from both platforms and all cache levels and types, except the L2
cache on Platform Intel Server, show a decrease in the access duration when the
access stride of the interfering workload increases from 64 B to 128 B. The counts
of the related cache miss events confirm that the decrease in access duration is
caused by the decrease in cache misses. Except for the L2 cache on Platform
Fig. 5. The effect of interfering workload access stride on the L2 cache eviction (left);
streamer prefetches triggered by the interfering workload during the L2 cache eviction
on Intel Server (right)
Intel Server, we can therefore conclude that the line size is 64 B for all cache
levels, as stated in the vendor documentation.
Figure 5 shows the results for the L2 cache on Platform Intel Server. These
results are peculiar in that they would indicate the cache line size of the L2
cache is 128 B rather than 64 B, a result that was already reported in [10]. The
reason behind the observed results is the behavior of the streamer prefetcher
[11, page 3-73], which causes the interfering workload to fetch two adjacent lines
to the L2 cache on every miss, even though the second line is never accessed.
The interfering workload with a 128 B stride thus evicts two 64 B cache lines.
Figure 5 contains values of the L2 prefetch miss (L2 LINES IN:PREFETCH)
event counter collected from the interfering workload rather than the measured
workload, and confirms that L2 cache misses triggered by prefetches occur.
Because the vendor documentation does not explain the exact behavior of
the streamer prefetcher when fetching two adjacent lines, we have performed a
slightly modified experiment to determine which two lines are fetched together.
Both workloads of the experiment access 4 MB with 256 B stride, the measured
workload with offset 0 B, the interfering workload with offsets 0, 64, 128 and
192 B. The offset therefore determines whether both workloads access the same
cache associativity sets or not. The offset of 0 B should always evict lines accessed
by the measured code; the offset of 128 B should always avoid them. If the
streamer prefetcher fetches a 128 B aligned pair of cache lines, using the 64 B
offset should also evict the lines of the measured workload, while the 192 B offset
should avoid them. If the streamer prefetcher fetches any pair of consecutive
cache lines, using both the 64 B offset and the 192 B offset should avoid the lines
of the measured workload.
The results on Fig. 6 indicate that the streamer prefetcher always fetches
128 B aligned pair of cache lines, rather than any pair of consecutive cache lines.
Additional experiments also show that the streamer prefetcher does not
prefetch the second line of a pair when the L2 cache is saturated with another
workload. Running two workloads on cores that share the cache therefore results
in fewer prefetches than running the same two workloads on cores that do not
share the cache.
Fig. 6. Access duration (left) and L2 cache misses by accesses only (right) investigating
streamer prefetch on Intel Server
We measure the average access time in a set collision pointer walk from List-
ing 1.1 and 1.3, with the buffer allocated using either the standard allocation or
the colored allocation. The number of accessed pages is selected to exceed the
cache associativity. If a particular cache is virtually indexed, the results should
show an increase in access duration when the number of accesses exceeds asso-
ciativity for both modes of allocation. If the cache is physically indexed, there
should be no increase in access duration with the standard allocation, because
the stride in virtual addresses does not imply the same stride in physical ad-
dresses.
The results from Platform Intel Server show that colored allocation is needed
to trigger L2 cache misses, as illustrated in Fig. 7. The L2 cache is therefore
physically indexed. Without colored allocation, the standard deviation of the
results grows when the L1 cache misses start occurring, staying below 3.2 cycles
for 8 accessed pages and below 1 cycle for 9 and more accessed pages. Similarly
with colored allocation, the standard deviation stays below 5.5 cycles for 7 and
8 accessed pages when the L1 cache starts missing, and below 10.5 cycles for 16
and 17 accessed pages when the L2 cache starts missing.
The results from Platform AMD Server on Fig. 8 also show that colored allo-
cation is needed to trigger L2 cache misses with 19 and more accesses. Colored
allocation also seems to make a difference for the L1 data cache, but values of
the event counters on Fig. 8 show that the L1 data cache misses occur with both
modes of allocation, the difference in the observed duration therefore should not
be attributed to indexing. The standard deviation of the results exceeds the limit
of 0.5 cycles for small numbers of accesses, with a maximum standard deviation
of 2.1 cycles at 3 accesses.
Finally, we measure the memory cache miss penalties, which appear to include
effects not described in vendor documentation.
Fig. 9. L2 cache miss penalty when accessing single cache line set (left); dependency
on cache line set selection in pages of color 0 (right) on Intel Server
The experiment determines the penalties of misses in all levels of the cache
hierarchy and their possible dependency on the offset of accesses triggering the
misses. We rely again on the set collision access pattern from Listing 1.1 and 1.3,
increasing the number of repeatedly accessed addresses and varying the offset
within a cache line to determine its influence on the access duration. The results
are summarized in Table 2, more can be found in [7].
On Platform Intel Server, we observe an unexpected increase in the average
access duration when about 80 different addresses are mapped to the same cache
line set. The increase, visible on Fig. 9, is not reflected by any of the relevant
event counters. Further experiments, also illustrated on Fig. 9, reveal a difference
between accessing odd and even cache line sets within a page. We see that the
difference varies with the number of accessed addresses, with accesses to the even
cache lines faster than odd cache lines for 32 and 64 addresses, and the other
way around for 128 addresses. The standard deviation in these results is under
3 clocks.
On Platform AMD Server, we observe an unusually high penalty for the L1
data cache miss, with an even higher peak when the number of accessed addresses
just exceeds the associativity, as illustrated in Fig. 10. Determined this way, the
Fig. 10. L1 data cache miss penalty when accessing a single cache line set (left) and
random sets (right) on AMD Server
Fig. 11. Dependency of L2 cache miss penalty on access offset in a cache line when
accessing random cache line sets (left) and 20 cache lines in the same set (right) on
AMD Server
penalty would be 27 cycles, 40 cycles for the peak, which is significantly more
than the stated L2 access latency of 9 cycles [9, page 223]. Without additional
experiments, we speculate that the peak is caused by the workload attempting
to access data that is still in transit from the L1 data cache to the L2 cache.
More light is shed on the unusually high penalty by another experiment,
one which uses the random access pattern from Listing 1.4 rather than the set
collision pattern from Listing 1.3. The workload allocates a memory range twice
the cache size and varies the portion that is actually accessed. Accessing the full
range triggers cache misses on each access; the misses are randomly distributed
to all cache sets. With this approach, we observe a penalty of approximately
12 cycles per miss, as illustrated on Fig. 10. We have extended this experiment
to cover all caches on Platform AMD Server, the differences in penalties when
accessing a single cache line set and when accessing multiple cache line sets are
summarized in Table 2.
For the L2 cache, we have also observed a small dependency of the access
duration on the access offset within the cache line when accessing random cache
sets, as illustrated on Fig. 11. The access duration increases with each 16 B of the
offset and can add almost 3 cycles to the L2 miss penalty. A similar dependency
was also observed when accessing multiple addresses mapped to the same
cache line set, as illustrated on Fig. 11.
Again, we believe that illustrating the many variables that determine the
cache miss penalties is preferable to the incomplete information available in
vendor documentation, especially when results of more complex experiments
which include such effects are to be analyzed.
4 Experimental Framework
Besides providing the execution environment for the benchmarks, the frame-
work bundles utility functions, such as the colored allocation used in experiments
with physically indexed caches in Section 3.
The colored allocation is based on page coloring [13], where the bits deter-
mining the associativity set are the same in virtual and physical address. The
number of the associativity set is called a color. As an example, the L2 cache on
Platform Intel Server has a size of 4 MB and 16-way associativity, which means
that addresses with a stride of 256 KB will be mapped to the same cache line set
[11, page 3-61]. With 4 KB page size, this yields 64 different colors, determined
by the 6 least significant bits of the page address.
Although the operating system on our experimental platforms does not sup-
port page allocation with coloring, it does provide a way for the executed
program to determine its current mapping. Our colored allocation uses this in-
formation together with the mremap function to allocate a continuous virtual
memory area, determine its mapping and remap the allocated pages one by one
to a different virtual memory area with the target virtual addresses matching
the color of the physical addresses. This way, the allocator can construct a con-
tinuous virtual memory area with virtual pages having the same color as the
physical frames that the pages are mapped to.
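As a rough illustration of the color computation described above (our own sketch, not the allocator used by the framework), the color of a page can be derived from its address as follows; the constants are the Intel Server values quoted in the text.

#include <stdint.h>
#include <stdio.h>

/* Intel Server values quoted above: 4 MB, 16-way, 4 KB pages. */
#define CACHE_SIZE    (4u * 1024 * 1024)
#define ASSOCIATIVITY 16u
#define PAGE_SIZE     4096u

/* Addresses that are WAY_SIZE apart map to the same cache line set. */
#define WAY_SIZE      (CACHE_SIZE / ASSOCIATIVITY)   /* 256 KB    */
#define NUM_COLORS    (WAY_SIZE / PAGE_SIZE)         /* 64 colors */

/* The color is given by the low bits of the page frame number. */
static unsigned page_color(uintptr_t addr)
{
    return (unsigned)((addr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    /* Two addresses 256 KB apart share a color and thus a cache set. */
    printf("colors: %u %u (out of %u)\n",
           page_color(0x12340000u),
           page_color(0x12340000u + WAY_SIZE),
           NUM_COLORS);
    return 0;
}

Remapping with mremap then only has to ensure that the color of each virtual page equals the color of the physical frame backing it.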
5 Conclusion
We have described a series of experiments designed to investigate some of the
detailed parameters of the memory architecture of the x86 processor family. Al-
though the knowledge of the detailed parameters is of limited practical use in
general software development, where it is simply too involved and too specialized,
we believe it is of significant importance in designing and evaluating research
experiments that exercise the memory architecture. Without this knowledge, it
is difficult to design experiments that target the intended part of the memory
References
1. Intel Corporation: Intel 64 and IA-32 Architectures Software Developer Manual,
Volume 3: System Programming, Order Nr. 253668-027 and 253669-027 (July 2008)
2. Advanced Micro Devices, Inc.: AMD64 Architecture Programmer’s Manual Volume
2: System Programming, Publication Number 24593, Revision 3.14. (September
2007)
3. Drepper, U.: What every programmer should know about memory (2007),
http://people.redhat.com/drepper/cpumemory.pdf
4. PAPI: Performance application programming interface,
http://icl.cs.utk.edu/papi
5. Pettersson, M.: Perfctr, http://user.it.uu.se/~mikpe/linux/perfctr/
6. Advanced Micro Devices, Inc.: AMD BIOS and Kernel Developer’s Guide For
AMD Family 10h Processors, Publication Number 31116, Revision 3.06 (March
2008)
7. Babka, V., Bulej, L., Děcký, M., Kraft, J., Libič, P., Marek, L., Seceleanu, C.,
Tůma, P.: Resource usage modeling, Q-ImPrESS deliverable 3.3 (September 2008),
http://www.q-impress.eu
8. Intel Corporation: Intel 64 and IA-32 Architectures Application Note: TLBs,
Paging-Structure Caches, and Their Invalidation, Order Nr. 317080-002 (April
2008)
9. Advanced Micro Devices, Inc.: AMD Software Optimization Guide for AMD Family
10h Processors, Publication Number 40546, Revision 3.06 (April 2008)
10. Yotov, K., Pingali, K., Stodghill, P.: Automatic measurement of memory hierarchy
parameters. In: Proceedings of the 2005 ACM SIGMETRICS International Con-
ference on Measurement and Modeling of Computer Systems, pp. 181–192. ACM,
New York (2005)
11. Intel Corporation: Intel 64 and IA-32 Architectures Optimization Reference Man-
ual, Order Nr. 248966-016 (November 2007)
12. R: The R Project for Statistical Computing, http://www.r-project.org/
13. Kessler, R.E., Hill, M.D.: Page placement algorithms for large real-indexed caches.
ACM Trans. Comput. Syst. 10(4), 338–359 (1992)
14. Yotov, K., Jackson, S., Steele, T., Pingali, K.K., Stodghill, P.: Automatic measure-
ment of instruction cache capacity. In: Ayguadé, E., Baumgartner, G., Ramanujam,
J., Sadayappan, P. (eds.) LCPC 2005. LNCS, vol. 4339, pp. 230–243. Springer, Hei-
delberg (2006)
The Next Frontier for Power/Performance
Benchmarking: Energy Efficiency of Storage Subsystems
Klaus-Dieter Lange
1 Introduction
Today’s challenge for datacenters is their high energy consumption [1]. The demand
for efficient datacenter real estate has shifted the focus toward more power-efficient datacenters.
This growing concern about energy usage in datacenters has drastically changed how
the IT industry evaluates servers. In response, the Standard Performance Evaluation
Corporation (SPEC) [2] has developed and released SPECpower_ssj2008 [3], the first
industry-standard benchmark that evaluates the power and performance characteris-
tics of server class computers. The need for this type of measurement was so pressing
that the US Environmental Protection Agency (US EPA) included it in
their ENERGY STAR® Program Requirements for Computer Servers [4]. The
SPECpower_ssj2008 results [5] are also already being utilized for energy conscious
purchase decisions. With the competitive marketplace driving server innovation even
further, the next logical phase is adopting an energy conscious evaluation of storage
subsystems.
In order to show the significant impact on the power consumption of the storage sub-
system we configured a server with external storage, similar to the publicly released
SPECweb2005 result [6]. Two AC power analyzers were connected to separately
measure the power consumption of the server and the external storage. The config-
ured system was then benchmarked with the SPECweb2005 (Banking) workload from
idle to 100% in 10% increments and the power measurements were automatically
recorded in 1s intervals. The server power consumption ranged from ~286W at idle to
~312W at 100% performance; while the external storage ranged from ~305W at idle
to ~400W at 100% performance. Figure 1 represents a graphical view of these data.
This test configuration shows that the power consumption of the external storage
subsystem can be significantly higher than the server itself; most of the current public
SPECweb2005 results exhibit similar tendencies. Another recent study [7] on the
energy cost of datacenters shows that in database setups, 63% of power is consumed
by the storage systems. For at least these application areas (web serving and database)
an industry standard method to measure the energy usage for storage subsystems is
necessary.
Another interesting discovery was the range in power consumption between idle
and 100% performance. For our baseline benchmark configuration this equated to
~9% range for the server and ~30% range for the storage. For comparison, in only one
year after its release, SPECpower_ssj2008 results show that companies pushed the
server range as far as 50%.
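For clarity, the quoted ranges are consistent with taking the idle power as the reference (our reading of the computation):
server: (312 W − 286 W) / 286 W ≈ 0.09, i.e. the ~9% range;
storage: (400 W − 305 W) / 305 W ≈ 0.31, i.e. roughly the ~30% range quoted above.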
[Figure 1: power consumption (W) of the server and the external storage vs. SPECweb2005 (Banking) load level, from idle to 100%]
To demonstrate the energy savings when using the latest technology, two generations
of storage enclosures, both with a standard rack form factor of 3U, were compared.
The older generation storage enclosure holds 14 large form factor (LFF) 3.5” SCSI
drives and the current generation storage enclosure holds 25 small form factor (SFF)
2.5” SAS drives. A drive capacity of 32GB was chosen for each drive. Each empty
enclosure was attached to a server and then loaded with drives, one drive at a time
every 66 seconds. The idle power of the empty SAS enclosure (72W) was slightly
[Figure data omitted: idle power consumption (W) vs. drive count, 1 to 25 drives]
Fig. 2. Large Form Factor SCSI vs. Small Form Factor SAS
higher than the SCSI enclosure (58W). Nevertheless, with an idle power of ~12W for
an individual LFF SCSI drive and ~6.25W for an individual SFF SAS drive, this advan-
tage was surpassed after the third drive was added. When reaching the maximum
drive capacity, the 14 LFF drive enclosure used ~227W; the SFF SAS enclosure with
14 drives used only ~156W, approximately 71W power savings. The SFF SAS en-
closure needed to be fully equipped with 25 drives before it would reach the power
consumption of the LFF SCSI drive enclosure.
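These figures are consistent with a simple linear model of enclosure idle power (our arithmetic, assuming the per-drive values quoted above):
SCSI: 58 W + n × 12 W; SAS: 72 W + n × 6.25 W.
n = 3: 94 W (SCSI) vs. ~91 W (SAS), so the SAS enclosure is already ahead after the third drive;
n = 14: 226 W (SCSI, ~227 W measured) vs. ~160 W (SAS, ~156 W measured);
n = 25 (SAS only): ~228 W, i.e. roughly the fully populated LFF SCSI enclosure.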
The power consumption of a server depends on the CPU stress pattern. Different
stress patterns result in different levels of power consumption (Figure 3).
To demonstrate a similar behavior on storage subsystems, the same hardware
configuration was utilized as in section 3 and different workloads were applied.
Five different workloads were selected for this experiment: 100% random write,
75% random write with 25% random read, 100% random read, 100% sequential
write and 100% sequential read. The resulting power consumptions are shown in
Figure 4.
The findings indicate that random access causes higher power consumption than
sequential access – this could be caused by the additional head movement of the
drives.
For these workloads the SFF SAS enclosure needed to be fully equipped with 25
drives before it would reach the power consumption of the LFF SCSI drive
enclosure.
[Figure data omitted: power consumption (W) traces over time (0:00 to 3:40) for the individual SPEC CPU2000 benchmarks perlbmk, equake, sixtrack, parser, facerec, vortex, wupwise, ammp, fma3d, galgel, crafty, mgrid, mesa, bzip2, applu, lucas, swim, twolf, apsi, gzip, eon, gap, mcf, gcc, vpr and art]
Fig. 3. CPU – power consumption for various workloads
[Figure 4: storage power consumption (W) for the workloads 100% random write, 75% random write / 25% random read, 100% random read, 100% sequential write and 100% sequential read]
5 Conclusion
The power consumption of the external storage subsystem has been identified to be
significantly higher than that of the server itself in the application areas of web serving and
database.
The experiments in sections 3 and 4 show that modern storage subsystems save signifi-
cantly more energy than their predecessors; however, as of December 2008 there
is no industry standard benchmark available that can demonstrate these or similar real
energy savings.
There will be many challenges along the way to create benchmarks that measure the
power/performance of server storage subsystems. As in the development of
SPECpower_ssj2008, I am convinced that SPEC will again step up to these challenges
and convene the best talents from the industry to lead the exploration in this next frontier.
6 Future Work
Preliminary measurements of the power/performance characteristics of solid-state
drives (SSD) show very promising results which warrant further investigation. An-
other area of interest is to analyze the impact of energy preserving storage enclosures
and advanced power supplies. Once we have studied these measurements, we will
provide the results to SPEC to support their benchmark development.
Actively supporting the addition of a power component to all applicable
SPEC benchmarks will be in the industry’s best interest, since it will enable the fair
evaluation of servers and their subsystems under a wide variety of workloads.
Acknowledgement
The author would like to acknowledge Richard Tomaszewski and Steve Fairchild for
their guidance; Kris Langenfeld, Jonathan Koomey, Roger Tipley, Mark Thompson
and Raghunath Nambiar for their comments and feedback; Bryon Georgson, David
Rogers, Daniel Ames and David Schmidt for their support conducting the power and
performance measurements; Dwight Barron, Mike Nikolaiev and Tracey Stewart for
their continuous support.
SPEC and the benchmark names SPECpower_ssj2008 and SPECweb2005 are reg-
istered trademarks of the Standard Performance Evaluation Corporation.
References
1. Koomey, J.: Worldwide electricity used in data centers. Environmental Research Let-
ters 3(034008) (September 23, 2008),
http://www.iop.org/EJ/abstract/1748-9326/3/3/034008/
2. Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
3. SPECpower_ssj2008, http://www.spec.org/power_ssj2008
4. US EPA’s Energy Star for Enterprise Servers,
http://www.energystar.gov/
index.cfm?c=new_specs.enterprise_servers
5. SPECpower_ssj2008 results,
http://www.spec.org/power_ssj2008/results/power_ssj2008.html
6. SPECweb2005 result,
http://www.spec.org/web2005/results/res2006q4/
web2005-20061019-00048.html
7. Poess, M., Nambiar, R.: Energy Cost, The Key Challenge of Today’s Data Centers: A
Power Consumption Analysis of TPC-C Results,
http://www.vldb.org/pvldb/1/1454162.pdf
Thermal Design Space Exploration of 3D Die Stacked
Multi-core Processors Using Geospatial-Based Predictive
Models
1 Introduction
Three-dimensional (3D) integrated circuit design [1] is an emerging technology that
greatly improves transistor integration density and reduces on-chip wire communica-
tion latency. It places planar circuit layers in the vertical dimension and connects
these layers with a high density and low-latency interface. In addition, 3D offers the
opportunity of bonding dies that are implemented with different technologies, thus en-
abling the integration of heterogeneous active layers for new system architectures. Leveraging
3D die stacking technologies to build uni-/multi-core processors has drawn in-
creased attention from both the chip design industry and the research community [2-8].
The realization of 3D chips faces many challenges. One of the most daunting of
these challenges is the problem of inefficient heat dissipation. In conventional 2D
chips, the generated heat is dissipated through an external heat sink. In 3D chips, all
of the layers contribute to the generation of heat. Stacking multiple dies vertically
increases power density, and dissipating heat from the layers far away from the heat
sink is more challenging due to the distance between the heat source and the external heat sink.
Therefore, 3D technologies not only exacerbate existing on-chip hotspots but also
create new thermal hotspots. High die temperature leads to thermal-induced perform-
ance degradation and reduced chip lifetime, which threatens the reliability of the whole
system, making modeling and analyzing thermal characteristics crucial in effective
3D microprocessor design.
Fig. 1. 2D within-die and cross-die thermal variation in 3D die stacked multi-core processors (panels: CPU, MEM and MIX workloads)
Fig. 2. 2D thermal variation on die 4 under different microarchitecture and floor-plan configu-
rations
Previous studies [5, 6] show that 3D chip temperature is affected by factors such as
configuration and floor-plan of microarchitectural components. For example, instead
of putting hot components together, thermal-aware floor-planning places hot com-
ponents next to cooler components, reducing the global temperature. Thermal-aware floor-
planning [5] uses intensive and iterative simulations to estimate the thermal effect of
microarchitecture components at an early architectural design stage. However, using
detailed yet slow cycle-level simulations to explore thermal effects across the large de-
sign space of 3D multi-core processors is very expensive in terms of time and cost.
To achieve thermally efficient 3D multi-core processor design, architects and chip
designers need models with low computation overhead, which allow them to quickly
explore the design space and compare different design options. One challenge in
modeling the thermal behavior of 3D die stacked multi-core architecture is that the
manifested thermal patterns show significant variation within each die and across
different dies (as shown in Fig. 1). The results were obtained by simulating a 3D die
stacked quad-core processor running multi-programmed CPU (bzip2, eon,
gcc, perlbmk), MEM (mcf, equake, vpr, swim) and MIX (gcc, mcf, vpr, perlbmk)
workloads. Each program within a multi-programmed workload was assigned to a die
that contains a processor core and caches. More details on our experimental method-
ologies can be found in Section 4.
Figure 2 shows the 2D thermal variation on die 4 under different microarchitecture
and floor-plan configurations. On the given die, the 2-dimensional thermal spatial
characteristics vary widely with different design choices. As the number of architec-
tural parameters in the design space increases, the complex thermal variation and
characteristics cannot be captured without using slow and detailed simulations. As
shown in Figs. 1 and 2, to explore the thermal-aware design space accurately and
informatively, we need computationally effective methods that not only predict ag-
gregate thermal behavior but also identify both size and geographic distribution of
thermal hotspots. In this work, we aim to develop fast and accurate predictive models
to achieve this goal.
Prior work has proposed various predictive models [9, 10, 11, 12, 13, 14, 15] to
cost-effectively reason processor performance and power characteristics at the design
exploration stage. A common weakness of existing analytical models is that they
assume centralized and monolithic hardware structures and therefore lack the ability
to forecast the complex and heterogeneous thermal behavior across large and distrib-
uted 3D multi-core architecture substrates. In this paper, we address this important
and urgent research task by developing novel, 2D multi-scale predictive models,
which can efficiently reason the geo-spatial thermal characteristics within die and
across different dies during the design space exploration stage without using detailed
cycle-level simulations. Instead of quantifying the complex geo-spatial thermal char-
acteristics using a single number or a simple statistical distribution, our proposed
techniques employ 2D wavelet multiresolution analysis and neural network non-linear
regression modeling. With our schemes, the thermal spatial characteristics are de-
composed into a series of wavelet coefficients. In the transform domain, each individ-
ual wavelet coefficient is modeled by a separate neural network. By predicting only a
small set of wavelet coefficients, our models can accurately reconstruct 2D spatial
thermal behavior across the design space.
The rest of the paper is organized as follows: In Section 2, we briefly describe
the wavelet transform, especially for 2D wavelet transform and the principles of
neural networks are also presented. Section 3 provides our wavelet based neural
networks for 2D thermal behavior prediction and system details. Section 4 intro-
duces our experimental setup. Section 5 highlights our experimental results on 2D
thermal behavior prediction and analyzes the tradeoff between model complexity,
configuration, and prediction accuracy. Section 6 discusses related work. Section 7
concludes the paper.
2 Background
To familiarize the reader with the general methods used in this paper, we provide a
brief overview of wavelet multiresolution analysis and neural network regression
prediction in this section. To learn more details about wavelets and neural networks,
the reader is encouraged to read [16, 17].
Wavelets are mathematical tools that use a simple, fixed prototype function (called
the analyzing or mother wavelet) to transform data of interest into different frequency
components and study each component with a resolution that matches its scale. A
wavelet transform, which decomposes data of interest by wavelets, provides a com-
pact and effective mathematical representation of the original data. In contrast to
Fourier transforms, which only offer frequency representations, wavelets are capable
of providing time and frequency localizations simultaneously. Wavelet analysis em-
ploys two functions, often referred to as the scaling filter (H) and the wavelet filter
(G), to generate a family of functions that break down the original data. The scaling
filter is similar in concept to an approximation function, while the wavelet filter quan-
tifies the differences between the original data and the approximation generated by
the scaling function. Wavelet analysis allows one to choose the pair of scaling and
wavelet filters from numerous functions. In this section, we provide a quick primer on
wavelet analysis using the Haar wavelet, which is the simplest form of wavelets [18].
Equation (1) shows the scaling and wavelet filters for Haar wavelets, respectively.
H = (1/\sqrt{2},\; 1/\sqrt{2}), \qquad G = (-1/\sqrt{2},\; 1/\sqrt{2})    (1)
The Haar discrete wavelet transform (DWT) works by averaging two adjacent val-
ues on a series of data at a given scale to form smoothed, lower-dimensional data (i.e.
approximations), and the resulting coefficients (i.e. details), which are the differences
between the values and their averages. By recursively repeating the decomposition
process on the averaged sequence, we achieve multi-resolution decomposition. The
process continues by decomposing the scaling coefficient (approximation) vector
repeating the same steps, and completes when only one coefficient remains. As a
result, wavelet decomposition is the collection of average and detail coefficients at all
scales.
H^{*} = (1/\sqrt{2},\; 1/\sqrt{2}), \qquad G^{*} = (1/\sqrt{2},\; -1/\sqrt{2})    (2)
The original data can be reconstructed from wavelet coefficients using a pair of
wavelet synthetic filters (H* and G*), as shown in (2). With the Haar wavelets, this
inverse wavelet transform can be achieved by adding difference values back or sub-
tracting differences from the averages. This process can be performed recursively
until the finest scale is reached. The original data can be perfectly recovered if all
wavelet coefficients are involved. Alternatively, an approximation of the data can be
reconstructed using a subset of wavelet coefficients. Using a wavelet transform gives
time-frequency localization of the original data. As a result, the original data can be
accurately approximated using only a few wavelet coefficients since they capture
most of the energy of the input data. Thus, keeping only the most significant coeffi-
cients enables us to represent the original data in a lower dimension. Note that in (1)
and (2) we use \sqrt{2} instead of 2 as a scaling factor since just averaging cannot preserve
Euclidean distance in the transformed data.
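As a concrete illustration of the analysis and synthesis steps in (1) and (2), the following minimal C sketch (ours, not from the paper) performs one level of the Haar DWT and its inverse on a short vector:

#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* One level of the Haar DWT: scaled averages (approximations) and details. */
static void haar_forward(const double *x, size_t n, double *approx, double *detail)
{
    for (size_t i = 0; i < n / 2; i++) {
        approx[i] = (x[2 * i] + x[2 * i + 1]) / sqrt(2.0);   /* filter H */
        detail[i] = (x[2 * i + 1] - x[2 * i]) / sqrt(2.0);   /* filter G */
    }
}

/* Inverse step: perfectly reconstructs the original pairs. */
static void haar_inverse(const double *approx, const double *detail,
                         size_t n, double *x)
{
    for (size_t i = 0; i < n / 2; i++) {
        x[2 * i]     = (approx[i] - detail[i]) / sqrt(2.0);
        x[2 * i + 1] = (approx[i] + detail[i]) / sqrt(2.0);
    }
}

int main(void)
{
    double x[4] = {4, 2, 6, 8}, a[2], d[2], y[4];
    haar_forward(x, 4, a, d);
    haar_inverse(a, d, 4, y);
    printf("approx: %.3f %.3f  details: %.3f %.3f\n", a[0], a[1], d[0], d[1]);
    printf("reconstructed: %.1f %.1f %.1f %.1f\n", y[0], y[1], y[2], y[3]);
    return 0;
}

The reconstructed values match the input exactly, and the 1/\sqrt{2} scaling preserves the Euclidean norm, which is precisely the point of the note above.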
As shown in Fig. 3, the 1D analysis filter bank is first applied to the rows (horizon-
tal filtering) of the data and then applied to the columns (vertical filtering). This kind
of 2D DWT leads to a decomposition of the approximation coefficients at level j into four
components: the approximation (LL) at level j+1, and the details in three orientations,
e.g., horizontal (LH), vertical (HL), and diagonal (HH).
[Figure data omitted: two-level 2D wavelet decomposition of a thermal map into LL2, LH2, HL2, HH2, LH1, HL1 and HH1 subbands]
Since a small set of wavelet coefficients provides concise yet insightful in-
formation on 2D thermal spatial characteristics, we use predictive models (i.e. neural
networks) to relate them individually to various design parameters. Through inverse
2D wavelet transform, we use the small set of predicted wavelet coefficients to syn-
thesize 2D thermal spatial characteristics across the design space. Compared with a
simulation-based method, predicting a small set of wavelet coefficients using analyti-
cal models is computationally efficient and is scalable to explore the large thermal
design space of 3D multi-core architecture.
The most common type of neural network (shown in Fig. 5) consists of three layers
of units: a layer of input units is connected to a layer of hidden units, which is con-
nected to a layer of output units. The input is fed into the network through the input units.
Each hidden unit receives the entire input vector and generates a response. The output
of a hidden unit is determined by the input-output transfer function that is specified
for that unit. Commonly used transfer functions include the sigmoid, linear threshold
function and radial basis function (RBF) [19]. The RBF is a special class of function
with response decreasing monotonically with distance from a central point. The cen-
ter, the distance scale, and the precise shape of the radial function are parameters of
the model. A typical radial function is the Gaussian which, in the case of a scalar
input, is
h(x) = \exp\left( -\frac{(x - c)^{2}}{r^{2}} \right)    (3)
Its parameters are its center c and its radius r. A neural network that uses RBF can
be expressed as
f(x) = \sum_{j=1}^{n} w_{j} h_{j}(x)    (4)
where \vec{w} \in \mathbb{R}^{n} is the adaptable (trainable) weight vector and \{h_{j}(\cdot)\}_{j=1}^{n} are the radial basis
functions of the hidden units. As shown in (4), the ANN output, which is determined
by the output unit, is computed using the responses of the hidden units and the
weights between the hidden and output units. Neural networks outperform linear
models in capturing complex, non-linear relations between input and output, which
makes them a promising technique for tracking and forecasting complex thermal be-
havior.
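Equations (3) and (4) translate directly into code; the sketch below is illustrative only, with made-up weights, centers and radii:

#include <math.h>
#include <stdio.h>

/* Gaussian radial basis function, Eq. (3): h(x) = exp(-(x-c)^2 / r^2). */
static double rbf(double x, double c, double r)
{
    return exp(-((x - c) * (x - c)) / (r * r));
}

/* RBF network output, Eq. (4): f(x) = sum_j w_j * h_j(x). */
static double rbf_network(double x, const double *w, const double *c,
                          const double *r, int n)
{
    double f = 0.0;
    for (int j = 0; j < n; j++)
        f += w[j] * rbf(x, c[j], r[j]);
    return f;
}

int main(void)
{
    /* Three hidden units with made-up weights, centers and radii. */
    double w[] = {0.5, -1.2, 2.0};
    double c[] = {0.0,  1.0, 2.0};
    double r[] = {0.5,  0.5, 1.0};
    printf("f(1.5) = %f\n", rbf_network(1.5, w, c, r, 3));
    return 0;
}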
Previous work [9, 10, 11, 12] shows that neural networks can accurately predict the
aggregated workload behavior across varied architecture configurations. Nevertheless,
monolithic global neural network models lack the ability to reveal complex thermal
behavior on a large scale. To overcome this disadvantage, we propose combining 2D
wavelet transforms and neural networks that incorporate multiresolution analysis into
a set of neural networks for spatial thermal characteristics prediction of 3D die
stacked multi-core design.
The 2D wavelet transform is a very powerful tool for characterizing spatial behav-
ior since it captures both global trend and local variation of large data sets using a
small set of wavelet coefficients. The local characteristics are decomposed into lower
scales of wavelet coefficients (high frequencies) which are utilized for detailed analy-
sis and prediction of individual or subsets of components, while the global trend is
decomposed into higher scales of wavelet coefficients (low frequencies) that are used
for the analysis and prediction of slow trends across each die. Collectively, these
wavelet coefficients provide an accurate interpretation of the spatial trend and details
of complex thermal behavior at a large scale. Our wavelet neural networks use a sepa-
rate RBF neural network to predict individual wavelet coefficients. The separate
predictions of wavelet coefficients proceed independently. Predicting each wavelet
coefficient by a separate neural network simplifies the training task (which can be
performed concurrently) of each sub-network. The prediction results for the wavelet
coefficients can be combined directly by the inverse wavelet transforms to synthesize
the 2D spatial thermal patterns across each die. Fig. 6 shows our hybrid neuro-wavelet
scheme for 2D spatial thermal characteristics prediction. Given the observed spatial
thermal behavior on training data, our aim is to predict the 2D thermal behavior of
each die in 3D die stacked multi-core processors under different design configura-
tions. The hybrid scheme involves three stages. In the first stage, the observed spatial
thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In
the second stage, each wavelet coefficient is predicted by a separate ANN. In the third
stage, the approximated 2D thermal characteristics are recovered from the predicted
wavelet coefficients. Each RBF neural network receives the entire architecture design
space vector and predicts a wavelet coefficient. The training of an RBF network in-
volves determining the center point and a radius for each RBF, and the weights of
each RBF, which determine the wavelet coefficients.
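The three stages above can be condensed into the following sketch. It is a simplification of ours: it assumes the PyWavelets package (`pywt`) for the 2D transform, keeps only the coarsest approximation coefficients as the predicted "small set", and substitutes a plain ridge regressor for each per-coefficient RBF network.

```python
import numpy as np
import pywt  # PyWavelets, assumed available

LEVEL, WAVELET = 3, "haar"

def decompose(thermal_map):
    """Stage 1: 2D multiresolution analysis; retain the low-frequency approximation."""
    coeffs = pywt.wavedec2(thermal_map, WAVELET, level=LEVEL)
    approx = coeffs[0]
    return approx.ravel(), approx.shape, coeffs

def fit_per_coefficient_models(designs, thermal_maps, lam=1e-3):
    """Stage 2: one independent linear model per retained wavelet coefficient."""
    targets = np.stack([decompose(t)[0] for t in thermal_maps])   # (runs, n_coeff)
    X = np.hstack([designs, np.ones((len(designs), 1))])          # bias column
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ targets)

def predict_map(design, W, template_map):
    """Stage 3: predict the coefficients, then synthesize via the inverse transform."""
    _, shape, coeffs = decompose(template_map)      # template only supplies shapes
    pred = (np.append(design, 1.0) @ W).reshape(shape)
    rebuilt = [pred] + [tuple(np.zeros_like(c) for c in lvl) for lvl in coeffs[1:]]
    return pywt.waverec2(rebuilt, WAVELET)

# Toy usage: 20 training configurations and synthetic 64x64 thermal maps.
rng = np.random.default_rng(1)
designs = rng.uniform(size=(20, 6))
maps = [np.outer(np.linspace(40, 90, 64), np.ones(64)) + 5 * d[0] for d in designs]
W = fit_per_coefficient_models(designs, maps)
print(predict_map(designs[0], W, maps[0]).shape)    # (64, 64)
```

In the paper each retained coefficient is predicted by its own RBF network trained concurrently; the shared ridge solve above is only a compact stand-in for that step.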
4 Experimental Methodology
In this study, we model four floor-plans that involve processor core and cache struc-
tures as illustrated in Fig. 7.
As can be seen, the processor core is placed at different locations across the differ-
ent floor-plans. Each floor-plan can be chosen for a layer in the studied 3D die-stacked quad-core processor. The size and adjacency of blocks are critical parameters for
deriving the thermal model. The baseline core architecture and floorplan we modeled
is an Alpha processor, closely resembling the Alpha 21264.
[Fig. 7. The four modeled floor-plans, each placing the core blocks (RF, Window (ROB+IQ), ALU, BPRED, LSQ, il1, dl1) and the L2 cache at different locations]
Table 1 lists the detailed processor core and cache configurations. We use Hotspot-4.0 [20] to simulate the thermal behavior of a 3D quad-core chip, shown in Fig. 9. The Hotspot tool can specify the multiple layers of silicon and metal required to model a three-dimensional IC. We choose the grid-like thermal modeling mode by specifying a set of 64 × 64 thermal grid cells per die; the average temperature of each cell (32 µm × 32 µm) is represented by a single value. Hotspot takes the power consumption data for each component block, the layer parameters and the floor-plans as inputs and generates the steady-state temperature for each active layer.
We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite
(e.g. bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf,
swim, vortex and vpr) to compose our experimental multiprogrammed workloads (see
Table 2). We categorize all benchmarks into two classes: CPU-bound and MEM-bound applications. We design three types of experimental workloads: CPU, MEM
and MIX. The CPU and MEM workloads consist of programs from only the CPU
intensive and memory intensive categories respectively. MIX workloads are the com-
bination of two benchmarks from the CPU intensive group and two from the memory
intensive group.
These multi-programmed workloads were simulated on our multi-core simulator
configured as 3D quad-core processors. We use the Simpoint tool [23] to obtain a
representative slice for each benchmark (with full reference input set) and each
Chip parameters: Frequency 3 GHz; Voltage 1.2 V; Process Technology 65 nm; Die Size 21 mm × 21 mm

Table 2. Experimental multiprogrammed workloads
  CPU1: bzip2, eon, gcc, perlbmk
  CPU2: perlbmk, mesa, facerec, lucas
  CPU3: gap, parser, eon, mesa
  MIX1: gcc, mcf, vpr, perlbmk
  MIX2: perlbmk, mesa, twolf, applu
  MIX3: eon, gap, mcf, vpr
  MEM1: mcf, equake, vpr, swim
  MEM2: twolf, galgel, applu, lucas
  MEM3: mcf, twolf, swim, vpr
In this study, we consider a design space that consists of 23 parameters (see Table 3)
spanning from floor-planning to packaging technologies. These design parameters have
been shown to have a large impact on processor thermal behavior. The ranges for these
parameters were set to include both typical and feasible design points within the explored
design space. Using detailed cycle-accurate simulations, we measure processor power
and thermal characteristics on all design points within both training and testing data sets.
We build a separate model for each benchmark domain and use the model to predict
thermal behavior at unexplored points in the design space. The training data set is used to
build the wavelet-based neural network models. An estimate of the model’s accuracy is
obtained by using the design points in the testing data set.
To train an accurate and prompt neural network prediction model, one needs to ensure that the sample data set disperses points throughout the design space but remains small enough to keep the model-building cost low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [24] as our sampling strategy since it provides better coverage compared to a naive random sampling scheme. We generate multiple LHS matrices and use a space-filling metric called the L2-star discrepancy [25]. The L2-star discrepancy is applied to each LHS matrix to find the representative design space that has the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this work, we used 200 training and 50 test data points to reach a high accuracy for thermal behavior prediction, since our study
shows that it offers a good tradeoff between simulation time and prediction accuracy
for the design space we considered. In our study, the thermal characteristics across
each die are represented by 64×64 samples.
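The sampling step can be sketched as follows. This is our own minimal version; the candidate count, the 23-parameter unit hypercube and the use of Warnock's closed form for the L2-star discrepancy are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng):
    """One LHS matrix in [0,1]^d: every dimension is stratified into n_samples bins."""
    u = rng.uniform(size=(n_samples, n_dims))
    strata = np.stack([rng.permutation(n_samples) for _ in range(n_dims)], axis=1)
    return (strata + u) / n_samples

def l2_star_discrepancy(x):
    """Warnock's closed form of the L2-star discrepancy (lower = more uniform)."""
    n, d = x.shape
    term1 = 3.0 ** (-d)
    term2 = (2.0 / n) * np.sum(np.prod((1.0 - x ** 2) / 2.0, axis=1))
    pairwise = np.prod(1.0 - np.maximum(x[:, None, :], x[None, :, :]), axis=2)
    return np.sqrt(term1 - term2 + pairwise.sum() / n ** 2)

# Generate several candidate LHS matrices for a 23-parameter space with
# 200 training points and keep the most uniform one.
rng = np.random.default_rng(2)
candidates = [latin_hypercube(200, 23, rng) for _ in range(10)]
best = min(candidates, key=l2_star_discrepancy)
print(round(l2_star_discrepancy(best), 6))
```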
5 Experimental Results
In this section, we present detailed experimental results using 2D wavelet neural net-
works to forecast thermal behaviors of large scale 3D multi-core structures running
various CPU/MIX/MEM workloads without using detailed simulation.
We quantify prediction accuracy using the mean error (ME):

ME = (1/N) Σ_{k=1}^{N} |x̃(k) − x(k)| / x(k)                       (5)

where x(k) is the actual value generated by the Hotspot thermal model, x̃(k) is the
predicted value and N is the total number of samples (a set of 64 × 64 temperature
samples per layer, detailed in section 4.1). As prediction accuracy increases, the ME
becomes smaller.
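Equation (5) maps directly onto a few lines of code; the helper below is ours (with the absolute value written out explicitly) and evaluates the ME over one layer's 64 × 64 temperature grid.

```python
import numpy as np

def mean_error(simulated, predicted):
    """ME of eqn (5): average relative deviation over all temperature samples."""
    simulated = np.asarray(simulated, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return np.mean(np.abs(predicted - simulated) / simulated)

# Example: two 64x64 steady-state temperature grids (Kelvin), prediction off by ~1%.
actual = np.full((64, 64), 350.0)
approx = actual * 1.01
print(mean_error(actual, approx))   # ~0.01, i.e. a 1% mean error
```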
We present boxplots to observe the average prediction errors and their deviations
for the 50 test configurations against Hotspot simulation results. Boxplots are graphi-
cal displays that measure location (median) and dispersion (interquartile range), iden-
tify possible outliers, and indicate the symmetry or skewness of the distribution. The
central box shows the data between “hinges” which are approximately the first and
third quartiles of the ME values. Thus, about 50% of the data are located within the
box and its height is equal to the interquartile range. The horizontal line in the interior
of the box is located at the median of the data; it shows the center of the distribution
for the ME values. The whiskers (the dotted lines extending from the top and bottom
of the box) extend to the extreme values of the data or a distance 1.5 times the inter-
quartile range from the median, whichever is less. The outliers are marked as circles.
In Fig. 10, the blue line with diamond-shaped markers indicates the average ME across all benchmarks.
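For reference, the boxplot elements described above can be computed as in this small helper of ours; it follows the whisker rule as stated in the text (extreme value or 1.5 × IQR from the median, whichever is less).

```python
import numpy as np

def boxplot_stats(me_values):
    """Hinges, median, whiskers and outliers for a set of ME values."""
    v = np.sort(np.asarray(me_values, dtype=float))
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lo = max(v.min(), med - 1.5 * iqr)      # lower whisker
    hi = min(v.max(), med + 1.5 * iqr)      # upper whisker
    outliers = v[(v < lo) | (v > hi)]       # plotted as circles
    return {"q1": q1, "median": med, "q3": q3, "whiskers": (lo, hi), "outliers": outliers}

print(boxplot_stats([0.03, 0.04, 0.05, 0.06, 0.07, 0.20]))
```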
[Fig. 10. ME boxplots (Error %) for the CPU1–CPU3, MEM1–MEM3 and MIX1–MIX3 workloads]
Fig. 10 shows that using 16 wavelet coefficients, the predictive models achieve
median errors ranging from 2.8% (CPU1) to 15.5% (MEM1) with an overall median
error of 6.9% across all experimented workloads. As can be seen, the maximum error
at any design point for any benchmark is 17.5% (MEM1), and most benchmarks show
an error of less than 9%. This indicates that our hybrid neuro-wavelet framework can predict 2D spatial thermal behavior across large and sophisticated 3D multi-core architectures with high accuracy. Fig. 10 also indicates that the CPU workloads (average 4.4%) have smaller error rates than the MEM (average 9.4%) and MIX (average 6.7%) workloads. This is because the CPU workloads usually have higher temperatures in the small core area than in the large L2 cache area. These small and sharp hotspots can be easily captured using just a few wavelet coefficients. On MEM and MIX workloads, the complex thermal pattern can spread across the entire die area, resulting in higher prediction error.
[Fig. 11. Simulated and predicted 2D thermal maps (rows labeled Simulation and Prediction)]
Fig. 11 illustrates the simulated and predicted 2D thermal spatial behavior of die 4
(for one configuration) on CPU1, MEM1 and MIX1 workloads. The results show that
our predictive models can track both the size and location of thermal hotspots. We further examined the accuracy of predicting the locations and area of the hottest spots, and the results are similar to those presented in Fig. 10.
Fig. 12 shows the prediction accuracies with different number of wavelet coeffi-
cients on multi-programmed workloads CPU1, MEM1 and MIX1. In general, the 2D
thermal spatial pattern prediction accuracy is increased when more wavelet coeffi-
cients are involved. However, the complexity of the predictive models is proportional
to the number of wavelet coefficients. A cost-effective model should provide high prediction accuracy while maintaining low complexity. The trend of prediction accuracy shown in Fig. 12 suggests that, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces the error at a lower rate, except on the MEM1 workload. Thus, we select 16 wavelet coefficients in this
work to minimize the complexity of prediction models while achieving good
accuracy.
Fig. 12. ME boxplots of prediction accuracies with different number of wavelet coefficients
We further compare the accuracy of our proposed scheme with that of approximat-
ing 3D stacked die spatial thermal patterns via predicting the temperature of 16 evenly
distributed locations across the 2D plane. The results shown in Fig. 13 indicate that, using the same number of neural networks, our scheme yields significantly higher accuracy than conventional predictive models. This is because wavelets provide good time and locality characterization, and most of the energy is captured by a limited set of important wavelet coefficients. Together, these wavelet coefficients provide a superior interpretation of the spatial patterns across time and frequency scales.
[Fig. 13. Error (%) when predicting the wavelet coefficients vs. predicting the raw data, for the CPU, MEM and MIX workloads]
Our RBF neural networks were built using a regression tree based method. In the
regression tree algorithm, all input parameters (refer to Table 3) were ranked based on
split frequency. The input parameters which cause the most output variation tend to
be split frequently in the constructed regression tree. Therefore, the input parameters
that largely determine the values of a wavelet coefficient have a larger number of
splits. We present in Fig. 14 (shown as a star plot) the most frequent splits within the regression tree that models the most significant wavelet coefficient.
[Fig. 14. Star plot of the most frequent regression-tree splits over the input parameters (ly0–ly3 _th, _fl, _bench; HS_th; HP_side; HP_th; am_temp; Iss_size); clockwise: CPU1, MEM1, MIX1]
A star plot [15] is a graphical data analysis method for representing the relative be-
havior of all variables in a multivariate data set. The size of each parameter's sector is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: Which variables are dominant for a given data set? Which
observations show similar behavior? As can be seen, floor-planning of each layer and
core configuration largely affect thermal spatial behavior of the studied workloads.
6 Related Work
There have been several attempts to build thermal aware microarchitecture [3, 20, 27,
28]. [27, 28] propose invoking energy saving techniques when the temperature ex-
ceeds a predefined threshold. [5] proposes a performance and thermal aware floor-
planning algorithm to estimate power and thermal effects for 2D and 3D architectures
using an automated floor-planner with iterative simulations. To our knowledge, little
research has been completed so far in developing accurate and informative analytical
methods to forecast complex thermal spatial behavior of emerging 3D multi-core
processors at an early architecture design stage.
Researchers have successfully applied wavelet techniques in many fields, including
image and video compression, financial data analysis, and various fields in computer
science and engineering [29, 30]. In [31], Joseph and Martonosi used wavelets to
analyze and predict the change of processor voltage over time. In [32], wavelets were
used to improve accuracy, scalability, and robustness in program phase analysis. In
[33], the multiresolution analysis capability of wavelets was exploited to analyze
phase complexity. These studies, however, made no attempt to link architecture wave-
let domain behavior to various design parameters.
In [13] Joseph et al. developed linear models using D-optimal designs to identify
significant parameters and their interactions. Lee and Brooks [14, 15] proposed re-
gression on cubic splines for predicting the performance and power of applications
executing on microprocessor configurations in a large microarchitectural design
space. Neural networks have been used in [9, 10, 11, 12] to construct predictive mod-
els that correlate processor performance characteristics with the design parameters.
The above studies all focus on analyzing and predicting aggregated architecture char-
acteristics and assume monolithic architecture designs while our work aims to model
heterogeneous 2D thermal behavior. Our work significantly extends the scope of
these existing studies and is distinct in its use of 2D multiscale analysis to character-
ize the spatial thermal behavior of large-scale 3D multi-core architecture substrate.
7 Conclusions
Leveraging 3D die stacking technologies in multi-core processor design has received
increased momentum in both the chip design industry and research community. One
of the major roadblocks to realizing 3D multi-core design is its inefficient heat dissi-
pation. To ensure thermal efficiency, processor architects and chip designers rely on
detailed yet slow simulations to model thermal characteristics and analyze various
design tradeoffs. However, due to the sheer size of the design space, such techniques
are very expensive in terms of time and cost.
In this work, we aim to develop computationally efficient methods and models
which allow architects and designers to rapidly yet informatively explore the large
thermal design space of 3D multi-core architecture. Our models achieve several or-
ders of magnitude speedup compared to simulation based methods. Meanwhile, our
model significantly improves prediction accuracy compared to conventional predic-
tive models of the same complexity. More attractively, our models have the capability
of capturing complex 2D thermal spatial patterns and can be used to forecast both the
location and the area of thermal hotspots during thermal-aware design exploration. In
light of the emerging 3D multi-core design era, we believe that the proposed thermal
predictive models will be valuable for architects to quickly and informatively examine
a rich set of thermal-aware design alternatives and thermal-oriented optimizations for
large and sophisticated architecture substrates at an early design stage.
References
[1] Banerjee, K., Souri, S., Kapur, P., Saraswat, K.: 3-D ICs: A Novel Chip Design for Im-
proving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integra-
tion. Proceedings of the IEEE 89, 602–633 (2001)
[2] Tsai, Y.F., Wang, F., Xie, Y., Vijaykrishnan, N., Irwin, M.J.: Design Space Exploration
for 3-D Cache. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 16(4)
(April 2008)
[3] Black, B., Nelson, D., Webb, C., Samra, N.: 3D Processing Technology and its Impact on
IA32 Microprocessors. In: Proc. of the 22nd International Conference on Computer De-
sign, pp. 316–318 (2004)
[4] Reed, P., Yeung, G., Black, B.: Design Aspects of a Microprocessor Data Cache using 3D
Die Interconnect Technology. In: Proc. of the International Conference on Integrated Cir-
cuit Design and Technology, pp. 15–18 (2005)
[5] Healy, M., Vittes, M., Ekpanyapong, M., Ballapuram, C.S., Lim, S.K., Lee, H.S., Loh,
G.H.: Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs. IEEE Trans.
on Computer Aided Design of IC and Systems 26(1), 38–52 (2007)
[6] Lim, S.K.: Physical design for 3D system on package. IEEE Design & Test of Com-
puters 22(6), 532–539 (2005)
[7] Puttaswamy, K., Loh, G.H.: Thermal Herding: Microarchitecture Techniques for Control-
ling Hotspots in High-Performance 3D-Integrated Processors. In: HPCA (2007)
[8] Wu, Y., Chang, Y.: Joint Exploration of Architectural and Physical Design Spaces with
Thermal Consideration. In: ISLPED (2005)
[9] Joseph, P.J., Vaswani, K., Thazhuthaveetil, M.J.: A Predictive Performance Model for
Superscalar Processors. In: MICRO (2006)
[10] Ipek, E., McKee, S.A., Supinski, B.R., Schulz, M., Caruana, R.: Efficiently Exploring Ar-
chitectural Design Spaces via Predictive Modeling. In: ASPLOS (2006)
[11] Yoo, R.M., Lee, H., Chow, K., Lee, H.H.S.: Constructing a Non-Linear Model with Neu-
ral Networks For Workload Characterization. In: IISWC (2006)
[12] Lee, B., Brooks, D., Supinski, B., Schulz, M., Singh, K., McKee, S.: Methods of Infer-
ence and Learning for Performance Modeling of Parallel Applications. In: PPoPP 2007
(2007)
[13] Joseph, P.J., Vaswani, K., Thazhuthaveetil, M.J.: Construction and Use of Linear Regres-
sion Models for Processor Performance Analysis. In: HPCA (2006)
[14] Lee, B., Brooks, D.: Accurate and Efficient Regression Modeling for Microarchitectural
Performance and Power Prediction. In: ASPLOS (2006)
[15] Lee, B., Brooks, D.: Illustrative Design Space Studies with Microarchitectural Regression
Models. In: HPCA (2007)
[16] Daubechies, I.: Ten Lectures on Wavelets. Capital City Press, Montpelier (1992)
[17] Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Englewood
Cliffs (1999)
Generation, Validation and Analysis of SPEC CPU2006 Simulation Points

K. Ganesan, D. Panwar, and L.K. John
Abstract. The SPEC CPU2006 suite, released in August 2006, is the current industry-standard, CPU-intensive benchmark suite, created from a collection of popular modern workloads. However, these workloads take machine-weeks to months of time when fed to cycle-accurate simulators and have widely varying behavior even over large scales of time [1]. It is to be noted that we do not see simulation-based papers using SPEC CPU2006 even 1.5 years after its release. A well-known technique to address this problem is the use of simulation points [2]. We have generated the simulation points for SPEC CPU2006 and made them available at [3]. We also report the accuracies of these simulation points based on the CPI, branch mispredictions, and cache and TLB miss ratios by comparing with the full runs for a subset of the benchmarks. Simulation points had previously been used only for cache, branch and CPI studies; this is the first attempt towards validating them for TLB studies. They are found to be equally representative in depicting the TLB characteristics. Using the generated simulation points, we provide an analysis of the behavior of the workloads in the suite for different branch predictor and cache configurations and report the optimally performing configurations. The simulations for the different TLB configurations revealed that the use of large page sizes significantly reduces translation misses and helps improve the overall CPI of modern workloads.
1 Introduction
Understanding program behaviors through simulations is the foundation for com-
puter architecture research and program optimization. These cycle accurate sim-
ulations take machine weeks of time on most modern realistic benchmarks like
the SPEC [4] [5] [6] suites incurring a prohibitively large time cost. This problem
is further aggravated by the need to simulate different micro-architectures to test the efficacy of a proposed enhancement. This motivates techniques [7] [8] that can facilitate faster simulation of large workloads such as the SPEC suites. One such well-known technique is Simulation Points.
While there are Simulation Points for the SPEC CPU2000 suite widely available
and used, the simulation points are not available for the SPEC CPU2006 suite.
We used the SimPoint [9] [10] [11] tool to generate these simulation points for the SPEC CPU2006 benchmark suite and provide them for use at [3].
The contributions of this paper are two-fold. The first contribution is the creation of the simulation points, which we make available at [3] to the rest of the architecture research community. We also provide the accuracy of these simulation points by comparing the results with full runs of select benchmarks. It must be noted that, 1.5 years after the release of SPEC CPU2006, simulation-based papers using CPU2006 are still not appearing in architecture conferences.
The availability of simulation points for CPU2006 will change this situation.
The second contribution is the use of CPU2006 simulation points for branch
predictor, cache & TLB studies. Our ultimate goal was to find the optimal branch
predictor, the cache and the TLB configurations which provide the best perfor-
mance on most of the benchmarks. For this, we analyzed the benchmark results
for different sets of static and dynamic branch predictors [12] and tried to come
up with the ones that perform reasonably well on most of the benchmarks. We
then varied the size of one of these branch predictors to come up with the best
possible size for a hardware budget. A similar exercise was performed to come
up with the optimum instruction and data cache design parameters. We varied
both the associativity and size of caches to get an insight into the best perform-
ing cache designs for the modern SPEC CPU workloads. The performance for
different TLB configurations was also studied to infer the effect of different TLB
parameters like the TLB size, page size and associativity.
It should be noted that such a study without simulation points would take several machine-weeks. Since the accuracy of the simulation points was verified with several full runs, we are fairly confident of the usefulness of the results.
2 Background
Considerable work has been done in investigating the dynamic behavior of present-day programs. It has been seen that the dynamic behavior varies over time in a way that is not random but rather structured [1] [13], as sequences of a number of short recurring behaviors. The SimPoint [2] tool tries to intelligently choose and cluster these representative samples together, so that they represent the entire execution of the program. This small set of samples is called simulation points; when simulated and weighted appropriately, they provide an accurate picture of the complete execution of the program with a large reduction in simulation time.
Using the Basic Block Vectors [14], the SimPoint tool [9][10][11] employs the K-means clustering algorithm to group intervals of execution such that the intervals in one cluster are similar to each other and the intervals in different clusters are different from one another. The Manhattan distance between the Basic Block Vectors serves as the metric for the extent of similarity between two intervals. The SimPoint tool takes the maximum number of clusters as the input and
generates a representative simulation point for each cluster. The representative
simulation point is chosen as the one which has the minimum distance from the
centroid of the cluster. Each of the simulation points is assigned a weight based
on the number of intervals grouped into its corresponding cluster. These weights
are normalized such that they sum up to unity.
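The selection logic sketched above can be written compactly as follows. This is our own simplification: it uses scikit-learn's Euclidean k-means in place of SimPoint's clustering over projected BBVs, and all names and sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available

def pick_simulation_points(bbvs, max_clusters=30, seed=0):
    """Cluster per-interval Basic Block Vectors; return (interval_index, weight) pairs."""
    bbvs = np.asarray(bbvs, dtype=float)
    k = min(max_clusters, len(bbvs))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(bbvs)
    points = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        # Representative interval = the one closest to the cluster centroid.
        dists = np.linalg.norm(bbvs[members] - km.cluster_centers_[c], axis=1)
        rep = int(members[np.argmin(dists)])
        points.append((rep, len(members) / len(bbvs)))   # weights sum to one
    return points

# Toy usage: 500 intervals of 100M instructions, 64-dimensional BBVs.
rng = np.random.default_rng(3)
for interval, weight in pick_simulation_points(rng.random((500, 64)), max_clusters=5):
    print(interval, round(weight, 3))
```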
3 Methodology
In this paper we used the sim-fast and sim-outorder simulators of the SimpleScalar toolset [6] along with the SimPoint tool to generate the simulation points for the SPEC CPU2006 suite. Figure 1 shows a flowchart representation of the methodology. We used the sim-fast simulator to identify the different basic blocks in the static code of the benchmark and generate a Basic Block Vector for every fixed dynamic interval of execution of the program. We chose the interval size to be 100 million
instructions. Further, these basic block vectors are fed as input to the clustering
algorithm of the SimPoint tool, which generates the different simulation points
(collection of Basic Block Vectors) and their corresponding weights. Having ob-
tained the simulation points and their corresponding weights, the simulation
points are tested by fast-forwarding (i.e., executing the program without per-
forming any cycle accurate simulation, as described in [3]) up to the simulation
point, and then running a cycle accurate simulation for 100 million instructions.
The sim-outorder tool provides a convenient method of fast-forwarding, to simu-
late programs in the manner described above. Fast-forwarding a program implies
only a functional simulation and avoids any time consuming detailed cycle ac-
curate measurements. The statistics like CPI (Cycles Per Instruction), cache
misses, branch mispredictions etc. are recorded for each simulation point. The
metrics for the overall program were computed based on the weight of each simu-
lation point. Each of the individual simulation point is simulated in parallel and
their results were aggregated based on their corresponding normalized weight.
For example, the CPI was computed by multiplying the CPI of each individual
simulation point with its corresponding weights as in eqn (1).
CPI = Σ_{i=0}^{n} (CPI_i × weight_i)                              (1)

On the other hand, ratio-based metrics like the branch misprediction rate and the cache miss ratio were computed by weighting the numerator and the denominator correspondingly, as in eqn (2).

MissRatio = Σ_{i=0}^{n} (misses_i × weight_i) / Σ_{i=0}^{n} (lookups_i × weight_i)        (2)
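Equations (1) and (2) amount to the following aggregation step (a sketch of ours; the per-point statistics would come from the individual sim-outorder runs).

```python
def aggregate_cpi(cpis, weights):
    """Eqn (1): overall CPI as the weighted sum of per-simulation-point CPIs."""
    return sum(c * w for c, w in zip(cpis, weights))

def aggregate_miss_ratio(misses, lookups, weights):
    """Eqn (2): weight the numerator and denominator separately, then divide."""
    num = sum(m * w for m, w in zip(misses, weights))
    den = sum(l * w for l, w in zip(lookups, weights))
    return num / den

# Three simulation points with normalized weights summing to one.
weights = [0.5, 0.3, 0.2]
print(aggregate_cpi([1.2, 0.9, 2.1], weights))                              # 1.29
print(aggregate_miss_ratio([4e5, 1e5, 9e5], [2e7, 1.5e7, 2.2e7], weights))
```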
The accuracy of the generated simulation points was studied by performing the full program simulation using the sim-outorder simulator and comparing metrics like CPI, cache miss ratios and branch mispredictions. This validation was performed to assess the effectiveness of the SimPoint methodology on the SPEC CPU2006 [15] suite in depicting the true behavior of the program. Since sim-outorder runs on SPEC CPU2006 take machine-weeks of time, we restricted ourselves to running only a few selected benchmarks for this purpose.
[Fig. 1. Flowchart of the methodology: benchmark → sim-fast → Basic Block Vectors → SimPoint engine → simulation points and weights → parallel sim-outorder runs (1 … n) → aggregated data, compared against a full sim-outorder run to obtain error % and speedup]
For studying the branch behavior of the suite we once again used the sim-
outorder simulator available in SimpleScalar [6]. This tool has in-built implementations of most of the common static and dynamic branch predictors, namely Always Taken, Always Not-Taken, Bimodal, Gshare and other two-way adaptive predictors. We studied the influence of the above predictors on the program behavior in terms of common metrics like execution time, CPI and branch misprediction rate.
One of the best performing predictors was chosen and the Pattern History Table
(PHT) size was varied and the results were analyzed to come up with an optimal
size for the PHT.
To get an insight into the memory and TLB behavior of the suite, the same sim-outorder simulator was employed to specify the configurations for the different levels of the cache hierarchy and the TLB. We obtained the corresponding hit and miss rates for various configurations along with their respective CPIs.
Fig. 3. Speedup obtained by using the simulation points. The simulation point runs were done on the Texas Advanced Computing Center and the full runs on a quad-core 2 GHz Xeon processor
generated for each of the benchmarks along with their instruction count and
simulation time on a 2 GHz Xeon machine. The interval of execution given
to the sim-fast simulator was 100 million instructions. Also, the maximum number of clusters given to the SimPoint tool was 30. These simulation points were launched as parallel jobs on the Texas Advanced Computing Center (TACC) using the sim-outorder simulator. A node on TACC could have been 2x to 3x faster than the other Xeon machine to which the execution times are compared. But still, the speedup numbers here are so high that this discrepancy in machine speeds can be safely ignored. The final aggregated metrics for the simulation
point runs were calculated using the formulae mentioned in the previous section.
The full run simulations were also carried out for a few integer and floating point
Fig. 4. CPI comparison between full runs and simulation point runs
Fig. 5. Branch misprediction rate comparison between full runs and simulation point
runs
benchmarks, and the accuracy of the generated simulation points was obtained by comparing the results.
To verify the accuracy of the simulation points, we further compared the CPIs and cache miss ratios of the simulation point runs to those of the full runs and analyzed the speedup obtained due to the usage of simulation points. The configuration
that we used to simulate the various full and the simulation point runs is with a
RUU size of 128, LSQ size of 64, decode, issue and commit widths of 8, L1 data
and instruction cache size of 256 sets, 64B block size, an associativity of 2, L2
data and instruction cache size of 4096 sets, 64B block size, and an associativity of 4.

[Fig. 6. Instruction cache miss ratio comparison between full runs and simulation point runs]

The ITLB size used was 32 sets with a 4K block size and an associativity of 4.
The DTLB size used was 64 sets, 4K block size and an associativity of 4. The
number of Integer ALUs was set to 4 and the number of Floating Point ALUs was set to 2. A combined branch predictor with a meta table size of 2048 was used. The error percentage in CPI and the speedup obtained due to the use of simulation points are given in Figures 3 and 4. Clearly, performing the simulation using the generated simulation points results in considerable speedup without much loss in accuracy, reducing machine-weeks of time to a few hours. The CPI values obtained using simulation points were within 5 percent of the full-run CPI values for all the benchmarks except 401.bzip2, where the value was off by around 8 percent. Even the errors in data and instruction cache miss rates, DTLB miss rates and branch misprediction ratios were within a limit of 5 percent for most of the benchmarks, except bzip2 and libquantum, which have errors of 11% and 13% in the branch misprediction rates. Figures 4, 5, 6 and 7 show the errors in the values of CPI, branch mispredictions, data cache, instruction cache and DTLB miss rates for a set of benchmarks. Though the concept of simulation points has been widely used in various studies about caches, branch predictors etc., this is the first attempt towards validating and studying the TLB characteristics based on simulation points. It is quite evident from the results that these simulation points are representative of the whole benchmark even in terms of the TLB characteristics. Though the methodology used by SimPoint is microarchitecture independent, this validation is performed by taking one specific platform (Alpha) as a case study and the error rates may vary for other platforms.
Fig. 7. Data cache and DTLB miss rate comparison between full runs and simulation
point runs
We hope that these simulation points that are provided [3] will serve as a
powerful tool aiding in carrying out faster simulations using the large and repre-
sentative benchmarks of the SPEC CPU2006 Suite. The reference provided has
the simulation points for 21 benchmarks and we are in the process of generating
the remaining simulation points, which will also be added to the same reference.
Static branch predictors are attractive because of their simplicity and low power requirements. They are also employed
in designing simple cores in the case of single-chip multiprocessors like Niagara [15], where there exist strict bounds on area and power consumption for each core. Static predictors are also commonly used as backup predictors in superscalar processors that require an early rough prediction during training time and when there are misses in the Branch Target Buffer. On the other hand, dynamic predictors give superior
performance compared to the static ones but at the cost of increased power and
area, as implemented in the modern complex x86 processors.
Fig. 8 shows the CPI results for two common types of static branch predictors, viz. Always Taken and Always Not-Taken. As expected, it is clear from Fig. 8 and Fig. 10 that the performance of static predictors is quite poor compared to the perfect predictor. Always Taken has the overhead of branch target calculation, but most of the branches in loops are taken.
Fig. 9 shows the CPI results for some common dynamic branch predictors. In
this paper, we have studied the performance of the following dynamic predictors
viz., Bimodal, Combined, Gshare, PAg and GAp. The configurations that were
used for these predictors respectively are,
– Bimodal - 2048
– Combined - 2048 (Meta table size)
– Gshare - 1:8192:13:1
– PAg - 256:4096:12:0
– GAp - 1:8192:10:0
Gshare, PAg and GAp are two-level predictors and their configurations are given in the format {l1size:l2size:hist size:xor}. Clearly, the CPI values obtained using the dynamic predictors are much closer to the values obtained from the perfect predictor. Also, among these predictors, the Gshare and Combined branch predictors perform much better than the others. Taking a closer look at the graphs, we see that the Gshare predictor is ideal in the case of the FP benchmarks while the combined predictor fares better for the integer benchmarks. Also, PAg performs better than the GAp predictor, which indicates that a predictor with a global Pattern History Table (PHT) performs better than one with a private PHT.
This clearly shows that constructive interference in a global PHT is helping the
modern workloads and results in an improved CPI.
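To make the configuration format concrete, here is a minimal gshare model of our own (2-bit saturating counters indexed by the global history XORed with the branch address); it is illustrative only and not the sim-outorder implementation. With hist_bits=13 it corresponds to the 1:8192:13:1 configuration listed above.

```python
class Gshare:
    """Gshare: global history XOR branch address indexes a table of 2-bit counters."""
    def __init__(self, hist_bits=13):
        self.hist_bits = hist_bits
        self.pht = [2] * (1 << hist_bits)   # counters 0..3, start weakly taken
        self.ghr = 0                        # global branch history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & ((1 << self.hist_bits) - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2          # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.hist_bits) - 1)

# Toy usage: a loop branch taken 9 times out of 10 is learned quickly.
bp, correct = Gshare(hist_bits=13), 0
for n in range(1000):
    outcome = (n % 10) != 9
    correct += bp.predict(0x400123) == outcome
    bp.update(0x400123, outcome)
print(correct / 1000)
```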
Looking at the performance of the private and the global configurations of the Branch History Shift Register (BHSR), it is evident that each of them performs well on specific benchmarks. Fig. 11 shows the misprediction rates for the different dynamic predictors. The performance improvement in CPI and misprediction rate from using a dynamic predictor instead of a static predictor is drastic for the cases of 471.omnetpp and 416.gamess. Both of these benchmarks are pretty small workloads, so their branch behavior is easily captured by these history-based branch predictors. 462.libquantum and 450.soplex also show a significant improvement in CPI compared to their static counterparts, which can be attributed to the fact that the dynamic predictors are able to efficiently capture the
branch behavior of these benchmarks.
[Fig. 8. CPI for the static predictors (Always Not-Taken, Always Taken) and a perfect predictor across the SPECint and SPECfp benchmarks]
[Fig. 9. CPI for the dynamic predictors (Bimodal, Combined, Gshare, PAg, GAp) across the SPECint and SPECfp benchmarks]
[Fig. 10. Misprediction ratio for the static predictors (Not-Taken, Taken) across the SPECint and SPECfp benchmarks]
For the purpose of analyzing the effect of the PHT size on the behavior of the programs, we chose one of the best performing predictors obtained in the previous analysis, i.e. Gshare, and varied the size of its PHT. We used PHTs with indices of 12, 13 and 14 bits and observed the improvement in both CPI and branch misprediction rate (Figs. 12 and 13). Different benchmarks responded differently to the increase in the PHT size. It can be observed that the integer benchmarks respond more to the increase in the PHT size than the floating point benchmarks, which show the least effect on CPI. This is because the floating
[Fig. 11. Misprediction ratio for the dynamic predictors (Bimodal, Combined, Gshare, PAg, GAp) across the SPECint and SPECfp benchmarks]
Fig. 12. Misprediction rate for Gshare configurations given as L1size:L2size:hist size
& xor
Fig. 13. CPI for Gshare configurations given as L1size:L2size:hist size & xor
point benchmarks have a smaller number of branches and thus their behavior can be captured with a smaller PHT.
For instance, considering 435.gromacs, although there is a significant reduc-
tion in the misprediction rate with an increase in the PHT size, there is not
much improvement observed in the CPI. After analyzing this benchmark, we
found that 435.gromacs has only 2 percent of its instructions as branches. So, improving the accuracy of the branch predictor does not have much effect on the CPI of the FP benchmarks. On the other hand, for the case of 445.gobmk, which is an integer benchmark, the improvement in misprediction rate shows a proportional change in the CPI. This is expected since 445.gobmk has a higher percentage of branches (15 percent of the total instructions).
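A rough back-of-the-envelope check of this argument, using an assumed misprediction penalty of 10 cycles (our own illustrative number, not a figure from the paper):

```python
def cpi_saving(branch_fraction, mispred_improvement, penalty_cycles=10):
    """Approximate CPI reduction = branches/instr x reduction in misprediction rate x penalty."""
    return branch_fraction * mispred_improvement * penalty_cycles

# 435.gromacs-like: ~2% branches; 445.gobmk-like: ~15% branches.
print(cpi_saving(0.02, 0.02))   # ~0.004 CPI saved - barely visible
print(cpi_saving(0.15, 0.02))   # ~0.03 CPI saved - clearly noticeable
```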
[Fig. 14. CPI for the L1 data cache configurations DL1:256:64:1:1, DL1:512:64:1:1, DL1:1024:64:1:1, DL1:256:64:2:1 and DL1:128:64:4:1 across the SPECint and SPECfp benchmarks]
[Fig. 15. Miss ratio for the same DL1 configurations across the SPECint and SPECfp benchmarks]
Fig. 16. CPI for IL1 configs in format name:no.sets:blk size:associativity&repl. policy
[Fig. 17. Miss ratio for the IL1 configurations IL1:1:256:64:1:1, IL1:1:512:64:1:1, IL1:1:1024:64:1:1, IL1:1:256:64:2:1 and IL1:1:128:64:4:1 across the SPECint and SPECfp benchmarks]
Fig. 18. CPI for varying associativity with 16KB page sizes
For the purpose of analyzing the L1 caches, we varied both the cache size and
the associativity and compared the values of CPI and the miss ratios. We used
the LRU replacement policy for all our experiments, which is given as "1" in the cache configurations in the figures. From the graphs in Fig. 14 and 15, it is evident that increasing the associativity has a more prominent effect on performance than just increasing the size of the data cache. For some benchmarks like 445.gobmk, increasing the associativity to 2 results in a colossal reduction in the miss ratio, which can be attributed to the smaller footprints of these benchmarks. Other benchmarks where associativity provided significant benefit are 456.hmmer, 458.sjeng and 482.sphinx3, in which case increasing the associativity to 2 resulted in more than a 50 percent reduction in miss ratio. However, some benchmarks like 473.astar and 450.soplex responded more to the size than to the associativity. It can be concluded that 473.astar and 450.soplex have a lot of sequential data and hence we cannot extract much benefit by increasing the associativity. The CPIs of the benchmarks 462.libquantum and 433.milc respond neither to the increase in the cache size nor to that in associativity. This may be due to a smaller memory footprint of these benchmarks, which can be captured completely by just a small direct-mapped cache.
Fig. 19. TLB miss ratios for varying associativity with 16KB page sizes
Fig. 20. CPI for varying page sizes with 2-way associative TLB
The CPI and the miss ratios for different Level 1 instruction cache configurations are shown in Fig. 16 and 17. As expected, the miss ratios of the instruction cache are much lower than those of the data cache because of the uniformity in the pattern of access to the instruction cache. For some of the benchmarks like 473.astar, 456.hmmer and 435.gromacs, the miss ratio is almost negligible and hence a further increase in the cache size or associativity does not have any effect on performance. The performance benefit from increasing associativity, compared to increasing the cache size, is not as large for the instruction cache as for the data cache. This is because the instruction cache responds more to an increase in cache size than to associativity, owing to the high spatial locality of instruction references. Considering the tradeoff between performance and complexity, an associativity of two at the instruction cache level seems to be optimal.
Fig. 21. TLB miss ratio for varying page sizes with 2-way associative TLB
6 Conclusion
The simulation points have proved to be an effective technique in reducing the
simulation time to a large extent without much loss of accuracy in the SPEC
CPU2006 Suite. Using simulation points not only reduces the number of dynamic instructions to be simulated but also makes the workload parallel, making it ideal for present-day parallel computers.
Further, simulating the different benchmarks with the different branch predictors gave an insight into the branch behavior of modern workloads, which helped in coming up with the best performing predictor configurations. We observed Gshare and the combined (Bimodal & 2-level) predictors to be the ideal ones, predicting most of the branches to near perfection. Looking at the effect of different cache parameters, it is observed that the design of the level-1 data cache parameters proves to be more important in affecting the CPI than that of the instruction cache parameters. Instruction accesses, due to their inherent uniformity, tend to miss less frequently, which makes the task of designing the instruction cache much easier. The line size of the instruction cache seems to be the most important parameter, while for the data cache, both the line size and the associativity need to be tailored appropriately to get the best performance. The simulations for the different TLB configurations revealed that the usage of large page sizes significantly reduces translation misses and aids in improving the overall CPI of modern workloads.
Acknowledgement
We would like to thank the Texas Advanced Computing Center (TACC) for the
excellent simulation environment provided for performing all the time consuming
simulations of SPEC CPU2006 with enough parallelism. Our thanks to Lieven Eeckhout and Kenneth Hoste of Ghent University, Belgium, for providing us the Alpha binaries for the SPEC suite. This work is also supported in part
through the NSF award 0702694. Any opinions, findings and conclusions ex-
pressed in this paper are those of the authors and do not necessarily reflect the
views of the National Science Foundation (NSF).
References
1. Sherwood, T., Calder, B.: Time varying behavior of programs. Technical Report
UCSD-CS99-630, UC San Diego, (August 1999)
2. Sherwood, T., Perelman, E., Hamerly, G., Calder, B.: Automatically characterizing
large scale program behavior. In: ASPLOS (October 2002)
3. http://www.freewebs.com/gkofwarf/simpoints.htm
4. SPEC. Standard performance evaluation corporation, http://www.spec.org
5. Henning, J.L.: SPEC CPU 2000: Measuring cpu performance in the new millen-
nium. IEEE Computer 33(7), 28–35 (2000)
6. Charney, M.J., Puzak, T.R.: Prefetching and memory system behavior of the
SPEC95 benchmark suite. IBM Journal of Research and Development 41(3) (May
1997)
7. Haskins, J., Skadron, K.: Minimal subset evaluation: warmup for simulated hard-
ware state. In: Proceedings of the 2001 International Conference on Computer
Design (September 2000)
Generation, Validation and Analysis of SPEC CPU2006 Simulation Points 137
8. Phansalkar, A., Joshi, A., John, L.K.: Analysis of redundancy and application
balance in the SPEC CPU 2006 benchmark suite. In: The 34th International Sym-
posium on Computer Architecture (ISCA) (June 2007)
9. Hamerly, G., Perelman, E., Lau, J., Calder, B.: Simpoint 3.0: Faster and more flex-
ible program analysis. In: Workshop on Modeling, Benchmarking and Simulation
(June 2005)
10. Hamerly, G., Perelman, E., Calder, B.: How to use simpoint to pick simulation
points. ACM SIGMETRICS Performance Evaluation Review (March 2004)
11. Perelman, E., Hamerly, G., Calder, B.: Picking statistically valid and early simula-
tion points. In: International Conference on Parallel Architectures and Compilation
Techniques (September 2003)
12. Yeh, T.-Y., Patt, Y.N.: Alternative implementations of two-level adaptive branch
prediction. In: 19th Annual International Symposium on Computer Architecture
(May 1992)
13. Lau, J., Sampson, J., Perelman, E., Hamerly, G., Calder, B.: The strong correlation
between code signatures and performance. In: IEEE International Symposium on
Performance Analysis of Systems and Software (March 2005)
14. Perelman, E., Sherwood, T., Calder, B.: Basic block distribution analysis to find
periodic behavior and simulation points in applications. In: International Confer-
ence on Parallel Architectures and Compilation Techniques (September 2001)
15. Kongetira, P., Aingaran, K., Olukotun, K.: Niagara: A 32-way multithreaded sparc
processor. MICRO 25(2), 21–29 (2005)
16. Korn, W., Chang, M.S.: SPEC CPU 2006 sensitivity to memory page sizes. ACM
SIGARCH Computer Architecture News (March 2007)
A Note on the Effects of Service Time Distribution in the M/G/1 Queue

A. Brandwajn and T. Begin
Abstract. The M/G/1 queue is a classical model used to represent a large num-
ber of real-life computer and networking applications. In this note, we show
that, for coefficients of variation of the service time in excess of one, higher-
order properties of the service time distribution may have an important effect on
the steady-state probability distribution for the number of customers in the
M/G/1 queue. As a result, markedly different state probabilities can be observed
even though the mean numbers of customers remain the same. This should be
kept in mind when sizing buffers based on the mean number of customers in the
queue. The influence of higher-order distributional properties can also be important in the M/G/1/K queue, where it extends to the mean number of customers itself.
Our results have potential implications for the design of benchmarks, as well as
the interpretation of their results.
1 Introduction
The M/G/1 queue is a classical model used to represent a large number of real-life
computer and networking applications. For example, M/G/1 queues have been applied
to evaluate the performance of devices such as volumes in a storage subsystem [1],
Web servers [13], or nodes in an optical ring network [3]. In many applications re-
lated to networking, the service times may exhibit significant variability, and it may
be important to account for the fact that the buffer space is finite. It is well known
that, in the steady state, the mean number of users in the unrestricted M/G/1 queue
depends only on the first two moments of the service time distribution [11]. It is also
known [4] that the first three (respectively, the first four) moments of the service time
distribution enter into the expression for the second (respectively, the third) moment
of the waiting time. In this note our goal is to illustrate the effect of properties of the
service time distribution beyond its mean and coefficient of variation on the shape of
the stationary distribution of the number of customers in the M/G/1 queue. In particu-
lar, we point out the risk involved in dimensioning buffers based on the mean number
of users in the system.
2 M/G/1 Queue
Assuming a Poisson arrival process, a quick approach to assess the required capacity
for buffers in a system is to evaluate it as some multiplier (e.g. three or six) times the
mean number of customers in an open M/G/1 queue (e.g. [12]). From the Pollaczek-
Khintchine formula [11], this amounts to dimensioning the buffers based on only the
first two moments of service time distribution. Unfortunately, the steady-state distri-
bution of the number of customers in the M/G/1 queue can exhibit a strong depend-
ence on higher-order properties of the service time distribution.
This is illustrated in Figure 1, which compares the distribution of the number of
customers for two different Cox-2 service time distributions with the same first two
moments, and thus yielding the same mean number of customers in the system. The
parameters of these distributions are given in Table 1. Note that both distributions I
and II correspond to a coefficient of variation of 3 but have different higher-order
properties such as skewness and kurtosis [14]. Similarly, distributions III and IV both
correspond to a coefficient of variation of 5 but again different higher-order proper-
ties. The stationary distribution of the number of customers in this M/G/1 queue was
computed using a recently published recurrence method [2]. We observe that, per-
haps not surprisingly, the effects of the distribution tend to be more significant as the
server utilization and the coefficient of variation of the service time distribution in-
crease. It is quite instructive to note, for instance, that with a coefficient of variation
of 3 and server utilization of 0.5, the probability of exceeding 20 users in the queue (a
little over 6 times the mean) is about 0.1% in one case while it is an order of magni-
tude larger for another service time distribution with the same first two moments.
Table 1. Parameters and properties of the service time distributions used in Figure 1
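Rather than the specific parameters of Table 1, the sketch below uses hypothetical Cox-2
parameter sets fitted to a mean of 1 and a coefficient of variation of 3; it merely illustrates
how two Cox-2 laws can share the first two moments while differing in skewness, the free choice
of the first-phase rate being what leaves the higher-order properties open.

# A Cox-2 service time: an Exp(mu1) phase, followed with probability p by an Exp(mu2) phase.
def fit_cox2(mu1, mean=1.0, cv=3.0):
    """Given the first-phase rate mu1, choose p and mu2 so that the Cox-2 law has the
    requested mean and coefficient of variation (hypothetical fit, for illustration)."""
    m2 = (cv * cv + 1.0) * mean * mean                 # target second moment E[S^2]
    a = mean - 1.0 / mu1                               # a = p / mu2
    b = (m2 - 2.0 / mu1**2 - 2.0 * a / mu1) / 2.0      # b = p / mu2^2
    mu2 = a / b
    p = a * mu2
    assert 0.0 < p <= 1.0 and mu2 > 0.0, "mu1 outside the feasible range"
    return mu1, mu2, p

def cox2_moments(mu1, mu2, p):
    """Mean, coefficient of variation and skewness of the Cox-2 law."""
    m1 = 1.0 / mu1 + p / mu2
    m2 = 2.0 / mu1**2 + 2.0 * p / (mu1 * mu2) + 2.0 * p / mu2**2
    m3 = (6.0 / mu1**3 + 6.0 * p / (mu1**2 * mu2)
          + 6.0 * p / (mu1 * mu2**2) + 6.0 * p / mu2**3)
    var = m2 - m1 * m1
    skew = (m3 - 3.0 * m1 * m2 + 2.0 * m1**3) / var**1.5
    return m1, var**0.5 / m1, skew

for mu1 in (1.5, 4.0):                                 # two hypothetical choices of mu1
    print(cox2_moments(*fit_cox2(mu1)))                # same mean and CV, different skewness

Running this for the two admissible choices of mu1 yields an identical mean and coefficient of
variation but clearly different skewness, mirroring the contrast between distributions I and II.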
3 M/G/1/K Queue
Clearly, using the M/G/1/K, i.e., the M/G/1 queue with a finite queueing room, would
be a more direct way to dimension buffers. There seem to be fewer theoretical results
for the M/G/1/K queue than for the unrestricted M/G/1 queue, but it is well known
that the steady-state distribution for the M/G/1/K queue can be obtained from that for
the unrestricted M/G/1 queue after appropriate transformations [10, 7, 4]. Clearly,
this approach can only work if the arrival rate does not exceed the service rate since
otherwise the unrestricted M/G/1 would not be stable.
[Figure 1: four panels plotting the probability of each number of customers (x-axis: number of
customers), comparing distributions I and II and distributions III and IV of Table 1]

Fig. 1. Effect of service time distributions on the number of customers in the M/G/1 queue
While the steady-state distribution for the M/G/1/K queue can be derived from the
one for the unrestricted M/G/1 queue, and the mean number of users in the latter de-
pends only on the first two moments of the service time distribution, this is not the
case for the M/G/1/K queue. Table 2 shows that even the first three moments of the
service time distribution do not generally suffice to determine the mean number of
customers in the M/G/1/K queue. Here we illustrate the results obtained for two
Cox-3 distributions sharing the first three moments but with different higher-order
properties.
Since the mean number of customers in the unrestricted M/G/1 queue depends only
on the first two moments of the service time distribution, and in the M/G/1/K for K=1
there is no distributional dependence at all (since there is no queueing), it is interest-
ing to see how the dependence on higher-order properties varies with K, the size of
the queueing room. This is the objective of Figure 2, where we have represented the
relative difference in the probabilities of having exactly one customer in the system,
as well as in the probabilities of the buffer being full, for distribution I and II of Table
1. We observe that, although the first two moments of the service time distribution
are the same for both distributions, higher-order properties lead to drastically different
values for the selected probabilities. Interestingly, for the probability of the buffer
being full, although the relative difference between the distributions considered de-
creases as the size of the queueing room, K, increases, it remains significant even for
large values of K.
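The probabilities behind Figure 2 were computed with the recurrence method of [2]; as a rough
substitute, the sketch below estimates the same two quantities, P(N = 1) and the probability of a
full buffer, by plain discrete-event simulation of the M/G/1/K queue, again with hypothetical
Cox-2 parameters (for instance those produced by fit_cox2 above) rather than the actual
distributions I and II.

import random

def simulate_mg1k(service_sampler, lam=0.5, K=30, horizon=2_000_000.0, seed=1):
    """Time-averaged distribution of the number of customers in an M/G/1/K queue
    (arrivals that find K customers in the system are lost)."""
    rng = random.Random(seed)
    t, n = 0.0, 0                               # simulation clock and number in system
    next_arrival = rng.expovariate(lam)
    next_departure = float("inf")
    time_in_state = [0.0] * (K + 1)
    while t < horizon:
        t_next = min(next_arrival, next_departure, horizon)
        time_in_state[n] += t_next - t
        t = t_next
        if t >= horizon:
            break
        if next_arrival <= next_departure:      # arrival event
            next_arrival = t + rng.expovariate(lam)
            if n < K:
                n += 1
                if n == 1:                      # server was idle: start a service
                    next_departure = t + service_sampler(rng)
        else:                                   # departure event
            n -= 1
            next_departure = (t + service_sampler(rng)) if n > 0 else float("inf")
    total = sum(time_in_state)
    return [x / total for x in time_in_state]

def cox2_sampler(mu1, mu2, p):
    """Cox-2 service time: Exp(mu1) phase, then Exp(mu2) phase with probability p."""
    def sample(rng):
        s = rng.expovariate(mu1)
        if rng.random() < p:
            s += rng.expovariate(mu2)
        return s
    return sample

for params in (fit_cox2(1.5), fit_cox2(4.0)):   # same first two moments, different skewness
    dist = simulate_mg1k(cox2_sampler(*params))
    print("P(N = 1) = %.4f   P(buffer full) = %.6f" % (dist[1], dist[-1]))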
Table 2. Effect of properties beyond the third moment on the mean number in the M/G/1/K queue

                               First Cox-3     Second Cox-3     Relative difference
  Rate of arrivals             1               1
  Size of queueing room        30              30
  Mean service time            1               1
  Coefficient of variation     6.40            6.40
  Skewness                     2331.54         2331.54
  Kurtosis                     7.43 × 10^6     1.44 × 10^7
  Mean number in the M/G/1/K   3.98            5.07             27.4 %

Fig. 2. Relative difference in selected probabilities for distributions I and II as a function of
the queueing room in the M/G/1/K queue

To further illustrate the dependence on higher-order properties of the service time
distribution, we consider read performance for two simple cached storage devices.
When the information requested is found in the cache, a hit occurs and the service
time is viewed as a constant (assuming a fixed record length). When the information
is not in the cache, it must be fetched from the underlying physical storage device. In
Table 3 we show simulation results [8] obtained for two different storage systems
with the same first two moments of the service time (resulting from the combination
of hit and miss service times) and a queueing room limited to 10. In one case the ser-
vice time of the underlying physical device (i.e. miss service time) is represented by a
uniform distribution, and in the other by a truncated exponential [9]. We are inter-
ested in the achievable I/O rate such that the mean response time does not exceed 5
ms. We observe that the corresponding I/O rates differ by over 20% in this example
(the coefficient of variation of the service time being a little over 1.6).
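The first two moments quoted in Table 3 follow directly from the hit/miss mixture; as a quick
check, the sketch below uses only the figures given for the uniform-miss system.

# First two moments of a hit/miss service-time mixture
# S = hit service time with probability q, miss service time with probability 1 - q.
# The values are those quoted in Table 3 for the uniform-miss system.
q, hit_time = 0.9, 1.0                           # hit probability and constant hit service time
lo, hi = 2.0, 18.0                               # miss service time ~ Uniform[2, 18]

miss_m1 = (lo + hi) / 2.0                        # E[miss] = 10
miss_m2 = (hi - lo) ** 2 / 12.0 + miss_m1 ** 2   # E[miss^2] = variance + mean^2

m1 = q * hit_time + (1 - q) * miss_m1            # E[S]   = 1.9
m2 = q * hit_time ** 2 + (1 - q) * miss_m2       # E[S^2]
cv = (m2 - m1 ** 2) ** 0.5 / m1                  # coefficient of variation, about 1.62

print(m1, cv)

The truncated-exponential system is calibrated to reproduce the same mean and coefficient of
variation, so the 21.4 % gap in attainable I/O rate reflects higher-order differences between
the two miss service time distributions.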
It has been our experience that the influence of higher-order properties tends to in-
crease as the coefficient of variation and the skewness of the service time increase. It
is interesting to note that this is precisely the case when one considers instruction
execution times in programs running on modern processors, where the most frequent
instructions complete quickly while comparatively rare operations take much longer.
Table 3. I/O rate for same mean I/O time in two storage subsystems

                                 Uniform miss       Truncated exponential    Relative
                                 service time       miss service time        difference
  Mean service time              1.9                1.9
  Coefficient of variation       1.62               1.62
  Hit probability                0.9                0.985
  Hit service time               1                  1.64
  Miss service time              Uniform [2, 18]    Truncated exponential,
                                                    mean: 20, max: 100
  Attainable I/O rate for a
  mean I/O time of 5 ms          0.257              0.312                    21.4 %
4 Conclusion
In conclusion, we have shown that, for coefficients of variation of the service time in
excess of one, higher-order properties of the service time distribution may have an
important effect on the steady-state probability distribution for the number of custom-
ers in the M/G/1 queue. As a result, markedly different state probabilities can be ob-
served even though the mean numbers of customers remain the same. This should be
kept in mind when sizing buffers based on the mean number of customers in the
queue. The influence of higher-order distributional properties can also be important in the
M/G/1/K queue, where it extends to the mean number of customers itself. The poten-
tially significant impact of higher-order distributional properties of the service times
should also be kept in mind when interpreting benchmark results for systems that may
exhibit significant variability in their service times.
Acknowledgments. The authors wish to thank colleagues for their constructive re-
marks on an earlier version of this note.
References
1. Brandwajn, A.: Models of DASD Subsystems with Multiple Access Paths: A Throughput-
Driven Approach. IEEE Transactions on Computers C-32(5), 451–463 (1983)
2. Brandwajn, A., Wang, H.: Conditional Probability Approach to M/G/1-like Queues. Per-
formance Evaluation 65(5), 366–381 (2008)
3. Bouabdallah, N., Beylot, A.-L., Dotaro, E., Pujolle, G.: Resolving the Fairness Issues in
Bus-Based Optical Access Networks. IEEE Journal on Selected Areas in Communica-
tions 23(8), 1444–1457 (2005)
4. Cohen, J.W.: On Regenerative Processes in Queueing Theory. Lecture Notes in Economics
and Mathematical Systems. Springer, Berlin (1976)
5. Cohen, J.W.: The Single Server Queue, 2nd edn. North-Holland, Amsterdam (1982)
6. Ferrari, D.: On the foundations of artificial workload design. SIGMETRICS Perform. Eval.
Rev. 12(3), 8–14 (1984)
7. Glasserman, P., Gong, W.: Time-changing and truncating K-capacity queues from one K
to another. Journal of Applied Probability 28(3), 647–655 (1991)
8. Gross, D., Juttijudata, M.: Sensitivity of Output Performance Measures to Input Distribu-
tions in Queueing Simulation Modeling. In: Proceedings of the 1997 Winter Simulation
Conference, pp. 296–302 (1997)
9. Jawitz, J.W.: Moments of truncated continuous univariate distributions. Advances in Water
Resources 27(3), 269–281 (2004)
10. Keilson, J.: The Ergodic Queue Length Distribution for Queueing Systems with Finite Ca-
pacity. Journal of the Royal Statistical Society 28(1), 190–201 (1966)
11. Kleinrock, L.: Queueing Systems, vol. I: Theory. Wiley, Chichester (1974)
12. Mitrou, N.M., Kavidopoulos, K.: Traffic engineering using a class of M/G/1 models. Jour-
nal of Network and Computer Applications 21, 239–271 (1998)
13. Molina, M., Castelli, P., Foddis, G.: Web traffic modeling exploiting TCP connections'
temporal clustering through HTML-REDUCE. IEEE Network 14(3), 46–55 (2000)
14. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall /
CRC, Boca Raton (1986)
Author Index