
Lecture Notes in Computer Science 5419

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
David Kaeli · Kai Sachs (Eds.)

Computer Performance
Evaluation
and Benchmarking

SPEC Benchmark Workshop 2009


Austin, TX, USA, January 25, 2009
Proceedings

Volume Editors

David Kaeli
Northeastern University
Department of Electrical and Computer Engineering
360 Huntington Ave., Boston, MA 02115, USA
E-mail: [email protected]

Kai Sachs
Technische Universität Darmstadt
Dept. of Computer Science
Schlossgartenstr. 73, 64289 Darmstadt, Germany
E-mail: [email protected]

Library of Congress Control Number: Applied for

CR Subject Classification (1998): B.2.4, B.2.2, B.3.3, B.8, C.1, B.1, B.7.1

LNCS Sublibrary: SL 2 – Programming and Software Engineering

ISSN 0302-9743
ISBN-10 3-540-93798-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-93798-2 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12603886 06/3180 543210
Preface

This volume contains the set of papers presented at the SPEC Benchmark Work-
shop 2009 held January 25 in Austin, Texas, USA. The program included eight
refereed papers, a keynote talk on virtualization technology benchmarking, an
invited paper on power benchmarking and a panel on multi-core benchmarking.
Each refereed paper was reviewed by at least four Program Committee members.
The result is a collection of high-quality papers discussing current issues in the
area of benchmarking research and technology.
A number of people contributed to the success of this workshop. Rudi Eigen-
mann served as General Chair and ably handled many of the details involved
with providing a high-quality meeting. We would like to thank the members of
the Program Committee for their time and effort in arriving at a high-quality
program. We would also like to acknowledge the guidance provided by the SPEC
Workshop Steering Committee.
We would like to thank the staff at Springer for their cooperation and support.
We want to particularly recognize Dianne Rice for her assistance and guidance,
and also Kathy Power, Cathy Sandifer and the whole SPEC office for their help.
And finally, we want to thank all SPEC members for their continued support
and sponsorship of this meeting.

January 2009 David Kaeli


Kai Sachs
Organization

SPEC Benchmark Workshop 2009 was sponsored by the Standard Performance
Evaluation Corporation in cooperation with the IEEE Technical Committee on
Computer Architecture (TCCA).

Executive Committee
General Chair Rudi Eigenmann (Purdue University, USA)
Program Chair David Kaeli (Northeastern University, USA)
Publication Chair Kai Sachs (TU Darmstadt, Germany)

Program Committee
Jose Nelson Amaral University of Alberta, USA
Umesh Bellur Indian Institute of Technology Bombay, India
Anton Chernoff AMD, USA
Lieven Eeckhout University of Ghent, Belgium
Rudi Eigenmann Purdue University, USA
Jose Gonzalez Intel Barcelona, Spain
John L. Henning Sun Microsystems, USA
Lizy K. John University of Texas at Austin, USA
David Kaeli Northeastern University, USA
Helen Karatza Aristotle University of Thessaloniki, Greece
Samuel Kounev Universität Karlsruhe (TH), Germany
Tao Li University of Florida, USA
David Lilja University of Minnesota, USA
Christoph Lindemann University of Leipzig, Germany
John Mashey Consultant, USA
Jeffrey Reilly Intel Corporation, USA
Kai Sachs TU Darmstadt, Germany
Resit Sendag University of Rhode Island, USA
Erich Strohmaier Lawrence Berkeley National Laboratory, USA
Bronis Supinski Lawrence Livermore National Laboratory, USA
Petr Tůma Charles University in Prague, Czech Republic
Reinhold Weicker (formerly) Fujitsu Siemens, Germany

Workshop Steering Committee


Alan Adamson IBM, Canada
Jose Nelson Amaral University of Alberta, USA
David Bader Georgia Tech, USA
Rudi Eigenmann Purdue University, USA
Rema Hariharan AMD, USA
John L. Henning Sun Microsystems, USA
Lizy K. John University of Texas at Austin, USA
David Kaeli Northeastern University, USA
Samuel Kounev Universität Karlsruhe (TH), Germany
David Morse Dell, USA
Kai Sachs TU Darmstadt, Germany
Table of Contents

Benchmark Suites
SPECrate2006: Alternatives Considered, Lessons Learned . . . . . . . . . . . . . 1
John L. Henning
SPECjvm2008 Performance Characterization . . . . . . . . . . . . . . . . . . . . . . . . 17
Kumar Shiv, Kingsum Chow, Yanping Wang, and
Dmitry Petrochenko

CPU Benchmarking
Performance Characterization of Itanium® 2-Based Montecito
Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Darshan Desai, Gerolf F. Hoflehner, Arun Kejariwal,
Daniel M. Lavery, Alexandru Nicolau,
Alexander V. Veidenbaum, and Cameron McNairy
A Tale of Two Processors: Revisiting the RISC-CISC Debate . . . . . . . . . . 57
Ciji Isen, Lizy K. John, and Eugene John
Investigating Cache Parameters of x86 Family Processors . . . . . . . . . . . . . 77
Vlastimil Babka and Petr Tůma

Power/Thermal Benchmarking
The Next Frontier for Power/Performance Benchmarking: Energy
Efficiency of Storage Subsystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Klaus-Dieter Lange
Thermal Design Space Exploration of 3D Die Stacked Multi-core
Processors Using Geospatial-Based Predictive Models . . . . . . . . . . . . . . . . . 102
Chang-Burm Cho, Wangyuan Zhang, and Tao Li

Modeling and Sampling Techniques


Generation, Validation and Analysis of SPEC CPU2006 Simulation
Points Based on Branch, Memory and TLB Characteristics . . . . . . . . . . . . 121
Karthik Ganesan, Deepak Panwar, and Lizy K. John
A Note on the Effects of Service Time Distribution in the M/G/1
Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Alexandre Brandwajn and Thomas Begin

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145


SPECrate2006: Alternatives Considered, Lessons
Learned

John L. Henning

Sun Microsystems
[email protected]

Abstract. Since 1992, SPEC has used multiple identical benchmarks
to measure multi-processor performance. This “Homogeneous Capacity
Method” (aka “SPECrate”) has been criticized on the grounds that real
workloads are not homogeneous. Nevertheless, SPECrate provides a use-
ful window into how systems perform when stressed by multiple requests
for similar resources. This paper reviews SPECrate’s history, and sev-
eral performance lessons learned using it: (1) a 4:1 performance gain for
startup of a benchmark when I/O was reconfigured; (2) a benchmark
that improved up to 2:1 when a TLB data structure was re-sized; and
(3) a benchmark that improved by 52% after a change to NUMA page
allocation. The SPEC CPU workloads usefully exposed several opportu-
nities for performance improvement.

1 Introduction: A Philosophy of Divots


When systems do not perform as expected, performance anomalies are some-
times called “divots”: an unexpected hole where performance sinks. Is a divot
something to be ashamed of? Or an opportunity? This tester suggests:
Although it is widely understood that “all software has bugs”, it may
not be as widely understood that all systems have performance divots.
A repeatable, analyzable workload allows divots to be analyzed. Cherish
your divots.

2 Background: About SPECrate


The Original Metric: Speed. The SPEC CPU suites are made up of compo-
nent benchmarks. The original SPECmark (now referenced as “CPU89”) con-
tained 10 benchmarks, such as gcc, spice, and a lisp interpreter; the most recent
suite, CPU2006, contains 29 benchmarks, such as bzip2, GNU Go, GAMESS,
POV-Ray, and perl. The SPEC-supplied tool set runs each component bench-
mark individually, and the time in seconds is reported. For each benchmark,
a “SPECratio” is computed by dividing the time on a reference system by
the time seen on the system under test. Finally, a bottom line metric (such
as SPECint95, SPECfp2000, SPECfp_base2006) is computed as the geometric
mean of the benchmark SPECratios.
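As a sketch of the speed-metric arithmetic just described (the benchmark names echo the suites, but the times below are invented for illustration, not SPEC reference data):

```python
from math import prod

# Hypothetical per-benchmark times in seconds:
# reference machine vs. system under test.
ref_times  = {"bzip2": 9650.0, "gcc": 8050.0, "mcf": 9120.0}
test_times = {"bzip2":  482.5, "gcc":  503.125, "mcf":  456.0}

# A SPECratio divides the reference time by the measured time,
# so higher means faster.
spec_ratios = {b: ref_times[b] / test_times[b] for b in ref_times}

# The bottom-line speed metric is the geometric mean of the SPECratios.
geo_mean = prod(spec_ratios.values()) ** (1.0 / len(spec_ratios))
print({b: round(r, 1) for b, r in spec_ratios.items()}, round(geo_mean, 2))
```

The geometric mean (rather than the arithmetic mean) keeps the bottom line insensitive to which system is chosen as the reference.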

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 1–16, 2009.

© Springer-Verlag Berlin Heidelberg 2009

The bottom line metrics mentioned thus far are called “speed” metrics, and
are analogous to speed of travel in the real world in that higher numbers are
better, and numbers are comparable. If a sports car takes 1/4 the time to get
to Cleveland as a truck, we routinely say that the sports car is 4x as fast; and if
a new laptop finishes a well defined task in 1/4 as much time as an old desktop
computer, it seems natural to call the laptop 4x as fast as the desktop.

Adding a Throughput Metric. The speed tests run only one copy of each
component benchmark at a time, leaving resources idle on multi-processor sys-
tems. SPEC addressed this problem in 1992 by adding throughput tests that
allow the tester to run multiple copies of identical benchmarks. For example, in
a 32-copy SPECint_rate2006 test, the SPEC tool set starts 32 copies of 400.perl-
bench, waits for all of them to complete; records the time from start of first to
finish of last; then starts 32 copies of 401.bzip2, and so forth.
The fact that all copies are running the same workload is the reason that
SPECrate was originally known as the “Homogeneous Capacity Method” [1].
The details of the metric calculation have varied somewhat as the suites have
evolved, but in all cases a score is calculated for each benchmark which is pro-
portional to the number of copies run divided by the time required to complete
the copies. The bottom line metrics (e.g. SPECint_rate95, SPECfp_rate2000) are
the geometric means of the benchmark scores.
Interpretation of SPEC CPU throughput metrics is somewhat less intuitive
than the speed metrics. For example, if a laptop has a SPECint_rate2006 score
of 10, and a server has a SPECint_rate2006 score of 20, it is not immediately
obvious if the better result is achieved by running twice as many copies in the
same time, or by running the same number of copies in 1/2 the time, or by
some other method. The full reports provide the additional level of detail for the
motivated reader.
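A little arithmetic shows why the bottom line cannot distinguish the two scenarios. Assuming, for illustration only (this is not SPEC's exact published formula), a score proportional to copies completed divided by elapsed time:

```python
def rate_score(copies, elapsed_seconds, ref_seconds):
    # Illustrative formula only: the score grows with copies completed
    # per unit time, scaled by a per-benchmark reference time.
    return copies * ref_seconds / elapsed_seconds

REF = 9650.0  # invented reference time for one benchmark

laptop   = rate_score(copies=2, elapsed_seconds=1930.0, ref_seconds=REF)
server_a = rate_score(copies=4, elapsed_seconds=1930.0, ref_seconds=REF)  # twice the copies
server_b = rate_score(copies=2, elapsed_seconds=965.0,  ref_seconds=REF)  # half the time
print(laptop, server_a, server_b)  # both servers earn the same score
```

Doubling the copy count at constant time and halving the time at constant copy count produce identical scores, which is why the full reports are needed to tell them apart.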

Positioning: A Component Benchmark. Although the throughput metrics
exercise more of the system than a single processing unit, while using the
compute-intensive portion of real applications [4], it should be understood that
SPECrate is not positioned as a whole-system benchmark. It is a design goal to
reduce disk I/O, remove network I/O, and eliminate GUIs from the SPEC CPU
suites. Use of system services and libraries are also minimized [19].

3 Perceived Weaknesses of SPECrate


Scaling. SPECrate has been criticized because the scaling can sometimes ap-
pear to be less than credible. For example, on the metric SPECint_rate_base2000,
an Alpha 21264A with 16 chips vs. 32 scales at .975 [6]; an SGI R14000 with 8
chips vs. 128 scales at .976 [7].
Explanations for the excellent scaling include: (1) SPECrate jobs are inde-
pendent – there are no stalls for cross-job communication. (2) As mentioned
above, they do little IO and use few system services. (3) Although one of the
design goals is to exercise the memory hierarchy, it has been shown that the
benchmarks usually exercise only a relatively small part of main memory at a
time [3], and therefore caches are usually effective. (4) Even for the benchmarks
that exercise main memory the most, there are often access patterns that com-
pilers and hardware can usefully prefetch. (5) Although anyone can publish a
rule-compliant SPEC CPU result, in fact, most publications are done by ven-
dors, who are motivated to ensure that systems are properly set up to ensure
good scaling. (6) If good scaling is not possible on a particular system, there is
typically no particular motivation for the vendor to publish such a result.
In short, a concern with the scaling is that it may appear to be “too good”.
Customer applications that depend on interprocess coordination, system ser-
vices, I/O, or other components not measured by the SPEC CPU benchmarks
are unlikely to scale as well as does SPECrate. This is not to say that SPECrate
is a dishonest measurement; rather that it is a component benchmark, and that
it needs to be understood as such.

Homogeneity and Convoys. A second perceived weakness of SPECrate is its
homogeneity. In real life, servers often host a variety of applications. Even in
an environment where everyone has a common interest (say, all are molecular
chemists, running a common tool set) jobs do not all start at the same instant,
nor do they all run identical workloads.
It has been hypothesized that SPEC’s implementation of SPECrate may lead
to “convoy effects”: for example, 128 memory-intensive programs all hit their
most intense memory bandwidth demand at the same time; or 128 programs
all try to read their startup data at the same time, thrashing the disk; or 128
programs all try to acquire an OS lock on the filesystem at the same instant.

4 Alternatives Considered by SPEC

In response to concerns about SPECrate, the SPEC CPU Subcommittee has
considered various alternatives over the years. Two alternatives are described in
this section: “heterogeneous” and “staggered homogeneous”.

4.1 Heterogeneous

During the development of SPEC CPU2006, a prototype was implemented that
ran the CPU2000 jobs in a heterogeneous fashion. Tables 1 and 2 show the
difference in run order on a system running 4 queues (which would, typically,
use 4 processors).
For the homogeneous method, each processor runs the same program and
workload.
For the heterogeneous prototype, each processor starts off running a different
job than the other processors (provided that the number of processors is less
than the number of benchmarks in the suite).
It is important to note that for homogeneous SPECrate, all copies of a bench-
mark finish, and then the next benchmark is started. Thus one may read Table 1

Table 1. Homogeneous run order. All processors run identical jobs.

Queue 0 Queue 1 Queue 2 Queue 3


164.gzip 164.gzip 164.gzip 164.gzip
175.vpr 175.vpr 175.vpr 175.vpr
176.gcc 176.gcc 176.gcc 176.gcc
181.mcf 181.mcf 181.mcf 181.mcf
186.crafty 186.crafty 186.crafty 186.crafty
197.parser 197.parser 197.parser 197.parser
252.eon 252.eon 252.eon 252.eon
254.gap 254.gap 254.gap 254.gap
253.perlbmk 253.perlbmk 253.perlbmk 253.perlbmk
255.vortex 255.vortex 255.vortex 255.vortex
256.bzip2 256.bzip2 256.bzip2 256.bzip2
300.twolf 300.twolf 300.twolf 300.twolf

Table 2. Heterogeneous run order. Different processors run different jobs.

Queue 0 Queue 1 Queue 2 Queue 3


164.gzip 175.vpr 176.gcc 181.mcf
175.vpr 176.gcc 181.mcf 186.crafty
176.gcc 181.mcf 186.crafty 197.parser
181.mcf 186.crafty 197.parser 252.eon
186.crafty 197.parser 252.eon 254.gap
197.parser 252.eon 254.gap 253.perlbmk
252.eon 254.gap 253.perlbmk 255.vortex
254.gap 253.perlbmk 255.vortex 256.bzip2
253.perlbmk 255.vortex 256.bzip2 300.twolf
255.vortex 256.bzip2 300.twolf 164.gzip
256.bzip2 300.twolf 164.gzip 175.vpr
300.twolf 164.gzip 175.vpr 176.gcc

as implicitly containing 12 phases: between each row there is a pause to wait for
all of the row to finish. No such pause occurs with Table 2. In the heterogeneous
prototype, each queue runs independently.
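The rotation in Table 2 can be generated by offsetting each queue's starting position in the benchmark list. A sketch of the scheduling pattern (not SPEC's implementation):

```python
benchmarks = ["164.gzip", "175.vpr", "176.gcc", "181.mcf", "186.crafty",
              "197.parser", "252.eon", "254.gap", "253.perlbmk",
              "255.vortex", "256.bzip2", "300.twolf"]

def heterogeneous_schedule(queues):
    """Row r of queue q runs benchmark (r + q) mod n, so no two queues
    start on the same job, and every queue eventually runs every job."""
    n = len(benchmarks)
    return [[benchmarks[(row + q) % n] for q in range(queues)]
            for row in range(n)]

for row in heterogeneous_schedule(4):
    print("  ".join(row))
```

With 4 queues and 12 benchmarks, the first printed row is gzip, vpr, gcc, mcf and the last is twolf, gzip, vpr, gcc, matching Table 2; note again that the rows are not synchronization points.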

Results. Informal (mostly non-quantitative) reports of results with the hetero-
geneous prototype fell into two categories. Some reports indicated only minor
differences in observed run times, within the usual range for run-to-run vari-
ation. Others said that benchmarks with noticeable main memory traffic ran
noticeably faster, presumably because they tended to compete with less intense
jobs, rather than with equally intense copies of themselves. Therefore, the likely
bottom line with a heterogeneous method would be slightly better scaling than
with the homogeneous method. The possibility that scaling would improve may
be seen as a negative aspect of the heterogeneous method, if one is concerned
that homogeneous scaling already appears to be “too good”.
It seems intuitive that any particular resource stressed by a homogeneous
workload would be less stressed by the above heterogeneous method.1

Difficulties with the heterogeneous method. On the assumption that such resource
stresses are useful to study, reducing their levels in a heterogeneous workload
is bad, because it makes them less apparent and harder to analyze. A hetero-
geneous workload also makes it much more difficult to reproduce performance
conditions. For example, suppose that 255.vortex runs more slowly than desired.
To reproduce its conditions from Table 1, to a first approximation, one can sim-
ply run the 4 copies of vortex. To reproduce its conditions from Table 2, it is
necessary to run the whole suite. One cannot try to just run selected “rows”,
because the rows in Table 2 do not represent separate phases.

4.2 Staggered Homogeneous


Another alternative prototyped by SPEC delays the start of each job by a small
amount (a “stagger”), while running the same job on all processors. The intent
of the staggered homogeneous method is to avoid the hypothesized convoy ef-
fects mentioned above. The prototype still exists, latent and unsupported, in
SPEC CPU2006. The excerpts below are taken from an unmodified copy of the
suite:

$ specinvoke -h
-S msecs sleep between spawning copies (in milliseconds)

$ runspec --stag
Option stag is ambiguous (stagger, staggeredhomogenousrate)

$ runspec --config oct14a --size test \
    --copies 2 --staggeredhomogen --stagger 6000 473.astar

The specinvoke [11] utility provides a help message that tells us that stag-
gers are expressed in milliseconds. The first runspec command tricks the switch
parser into reminding us how to spell its undocumented switches, and the sec-
ond runspec command runs 2 copies of the test workload for the benchmark
473.astar, with a delay of 6 seconds between each copy.
As a reminder, the staggered homogeneous prototype is unsupported. If
you experiment with it, remember that anything you learn from it
cannot be represented as an official SPEC metric. If you do decide to use it,
you will probably find it easiest to discern what it did by looking in the run
directory:

1 With the notable exception of hardware and OS support for the instruction stream.
For SPECrate, each copy has its own data, but all use the same program binary,
allowing the OS the opportunity to load only one copy into physical memory. In a
heterogeneous context, obviously, multiple program binaries are active.

$ cd $SPEC/benchspec/CPU2006/473.astar/run/run*000
$ cat speccmds.out
timer ticks over every 1000 ns
running commands in speccmds.cmd 1 times
runs started at 1225226364, 29870000, Tue Oct 28 16:39:24 2008
run 1 started at 1225226364, 29876000, Tue Oct 28 16:39:24 2008
child started: 0, 1225226364, 29883000, pid=3147,
’../run_base_test_oct14a.0000/astar_base.oct14a lake.cfg’
child started: 1, 1225226370, 30218000, pid=3148,
’../run_base_test_oct14a.0000/astar_base.oct14a lake.cfg’
child finished: 0, 1225226376, 980432000, sec=12, nsec=950549000,
pid=3147, rc=0
child finished: 1, 1225226383, 556000, sec=12, nsec=970338000,
pid=3148, rc=0
run 1 finished at: 1225226383, 562000, Tue Oct 28 16:39:43 2008
run 1 elapsed time: 18, 970686000, 18.970686000
runs finished at 1225226383, 597000, Tue Oct 28 16:39:43 2008
runs elapsed time: 18, 970727000, 18.970727000

Notice above that the two copies were started 6 seconds apart (1225226364 and
1225226370 seconds after Jan. 1, 1970), each took just under 13 seconds, and
the total elapsed time was just under 19 seconds. The bottom line includes the
time for the stagger, as it is measured from start-of-first copy to finish-of-last.
One might want to consider other ways of calculating a bottom line. (Reminder:
any use of the prototype may not be represented as an official SPEC metric.)
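The stagger mechanics — spawn identical copies a fixed delay apart, then time from start of first to finish of last — can be mimicked with a small script. This is an illustration of the timing arithmetic only, not the SPEC tools; `sleep 1` stands in for a benchmark copy:

```python
import subprocess
import time

def staggered_run(command, copies, stagger_seconds):
    """Start `copies` identical processes `stagger_seconds` apart and
    report elapsed time from start-of-first to finish-of-last."""
    start = time.monotonic()
    procs = []
    for i in range(copies):
        if i:  # delay every copy after the first
            time.sleep(stagger_seconds)
        procs.append(subprocess.Popen(command))
    for p in procs:
        p.wait()
    return time.monotonic() - start

# Two 1-second dummy "benchmarks" started 0.5 s apart: the elapsed time
# is roughly stagger + per-copy time, because the stagger is included.
elapsed = staggered_run(["sleep", "1"], copies=2, stagger_seconds=0.5)
print(round(elapsed, 1))
```

As in the speccmds.out excerpt above, the stagger inflates the bottom line, which is exactly the measurement question discussed next.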

Results. As SPEC experimented with the prototype, the hypothesized con-
voy effect was not observed. That is, the expectation had been that the nor-
mal SPECrate causes unrealistic resource overloads when, for example, 128
copies all try simultaneously to acquire a lock on a filesystem; and that a small
stagger (on the order of 10s of milliseconds) would avoid the overloading and
actually cause faster overall execution time. Instead, small staggers were ob-
served to make no particular difference to overall time (indistinguishable from
noise).

Difficulties with the staggered homogeneous method. Should the metric include
the stagger time? If so, unless the staggers are very small, too much idle time
may be included. Alternatively, one might try to exclude the staggers by, for
example, calculating time from start-of-last to finish-of-first; a disadvantage of
this approach is that it could cause performance to be overstated if one copy has
more hardware resources than others (e.g. a 16-chip, 64-core system with 4 copies
on 15 of the chips, but only 1 copy on the last). Perhaps the most attractive
alternative would be to attempt to achieve a steady state of repeated execution,
with all processors busy, running staggered workloads; one would compute a
metric that sampled execution time for complete jobs during the steady state.
The primary disadvantage of this approach is that the suite is sometimes already
criticized as taking too long; running repeated workloads to ramp up to a steady
state was not viewed as attractive.
SPEC’s Decision. After discussion, neither of the prototyped alternatives was
adopted for CPU2006, and SPECrate remains essentially unchanged since 1992.

5 Applying SPECrate2006

SPECrate provides a useful window into how systems perform when stressed
by multiple requests for similar resources such as program startup, data ini-
tialization, translation lookaside buffer (TLB) requests, and memory allocation.
It is understood that in real life, an OS is unlikely to get 128 simultaneous
identical requests, so one must be careful not to over-optimize to this, or to
any other, benchmark. Nevertheless, the homogeneity may be its virtue: in
real life, systems do have to deal with intense requests, traffic jams do occur,
and SPECrate presents a compute-intensive workload that is repeatable and
analyzable.
In this section, three case studies are briefly summarized from applying
SPECrate2006 to Solaris systems.

5.1 A 4:1 Performance Gain for Startup of a SPEC CPU2006
Benchmark When I/O Was Properly Configured

Although the intent of the SPEC CPU benchmarks is to be compute-intensive,
some I/O inevitably remains. When multiple copies are run for SPECrate, I/O
is magnified. With each suite, it seems that one or two benchmarks stick out
as being especially in need of I/O tuning. For CPU95, a benchmark of concern
was 126.gcc: each copy compiles 56 input files and writes 112 output files with a
total of 8 MB of output data. For CPU2000, the benchmark 200.sixtrack writes
42 files, with a total of 5.3 MB, per copy. For both CPU95 and CPU2000, testers
learned that on large systems, it is useful to have striped disks, preferably with
journaling file systems that do not stall waiting for writes.

Problem. For CPU2006, a benchmark of concern in large SPECrate runs is
450.soplex, a Simplex Linear Program (LP) Solver. The program is invoked
twice, and I/O becomes a problem in startup of part 2, when each benchmark
copy needs to read its copy of the 267 MB input file ref.mps.

Methods. In order to focus on the second part of the benchmark, the utility
convert_to_development [10] was applied to allow modifications to the ref
workload while still using the SPEC tools. The first workload was deleted, leav-
ing only ref.mps in the directory 450.soplex/data/ref/input. Then, 128 run di-
rectories were populated on a large server using runspec --action setup. The
actual runs were done using specinvoke -r [11]. In order to avoid unwanted file
caching effects (which would not be effective in a full reportable run), memory
was cleared between tests by running large copies of STREAM [17] and reading
a series of unrelated files. CPU and IO activity were observed using iostat 30.

Metrics. As each run began, CPU utilization was low, and disk activity high, as
128 copies of ref.mps were read. Eventually, the io kps fell to zero and the tested
processors achieved 100% utilization. Two metrics are reported: (1) Startup time
in minutes, determined by counting the 2-per-minute iostat records prior to 100%
utilization; (2) kps from the busy period (converted to MB/sec).
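The derivation of these two metrics from interval samples can be sketched as follows (the sample values are invented; real input would come from iostat 30 records):

```python
# Hypothetical 30-second iostat samples taken during a run:
# (cpu_utilization_percent, disk_kilobytes_per_second).
samples = [(5, 24600), (6, 24800), (7, 24500), (100, 0), (100, 0)]

# Startup lasts until the processors reach 100% utilization; iostat 30
# emits two records per minute, so startup minutes = records / 2.
startup = [kps for util, kps in samples if util < 100]
startup_minutes = len(startup) / 2

# Average read bandwidth during startup, converted from KB/s to MB/s.
mb_per_sec = sum(startup) / len(startup) / 1024
print(startup_minutes, round(mb_per_sec, 1))
```
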

Baseline. When a single 10K RPM disk was used, startup required about 24
minutes, reading at about 24 MB/sec.

Software RAID. When Solaris Volume Manager was used with the default 16 KB
block size (known as an “interlace size” in the terminology of SVM) on an
A5200 Fiber Channel disk array with 6x 10K RPM disks, startup fell to about
20 minutes, reading about 30 MB/sec. With a block size of 256 KB, startup
improved to about 8 minutes and 72 MB/sec. For this read-intensive workload,
RAID-0 was not particularly faster than RAID-5. Increasing the number of disks
in the stripe set had little additional effect on performance, as the maximum
observed bandwidth for this somewhat older disk system was about 78 MB/sec.

Hardware RAID. A newer hardware RAID Array, the Sun StorageTek 2540 with
6x 15K RPM disks, did not show sensitivity to block size (called “segment size”
for this device) over the tested range of 16 KB through 512 KB. This insensitivity
may be viewed as a plus, since it may be hard to know in advance what block
size to choose. The bandwidth was about 97 MB/sec, roughly matching the
limit of the 1 Gb Host Bus Adapter (HBA) used in this test. Once again, read
performance was insensitive to use of RAID-0 vs. RAID-5. Further improvement
might be possible with a higher bandwidth HBA.

Divot summary. With hardware RAID, a performance divot of idle CPUs wait-
ing on I/O was reduced from 24 minutes to 6 minutes, which is a 4:1 improvement
over the original single-disk configuration.

Lessons for tuning other systems. Even in an allegedly CPU-intensive environ-
ment, I/O lurks. Hardware RAID may offload overhead from the server.

5.2 An Improvement of up to 2:1 for a CPU2006 Benchmark When
a TLB Data Structure Was Re-sized

UltraSPARC T2. The UltraSPARC T2 (aka “Niagara2”) and UltraSPARC T2
Plus processors [12] are multi-threaded processors with eight SPARC processor
cores. Each core runs 8 hardware threads and has 2 integer units, one floating
point unit, an 8 KB L1 data cache, and a 16 KB L1 instruction cache. All cores
share a single 4 MB L2 cache. Each core does virtual address translation using
a 64-entry instruction Translation Lookaside Buffer (TLB) and a 128-entry data
TLB [5]. When TLB misses occur, software-managed direct-mapped Translation
Storage Buffers (TSBs) are consulted by a Hardware Table Walker. TSBs are
allocated per-process, for up to 4 page sizes (8 KB, 64 KB, 4 MB, 256 MB). By
default, each TSB holds 512 entries, but the hardware allows much larger TSBs
to be allocated if the operating system so chooses [13].

Table 3. 436.cactusADM SPECratios (higher is better)

base peak
run #1 86.04 86.56
run #2 86.09 86.98
run #3 85.82 63.52

Table 4. Normalized per-copy times (lower is better) for 436.cactusADM

Metric            Peak Run 2  Peak Run 3

Including all 127 copies:
Median 1.0000 .9947
Arithmetic Mean .9916 .9990
Std. Deviation .0232 .0844
Max 1.0104 1.3837
If the worst 6 copies are dropped:
Median .9987 .9933
Arithmetic Mean .9907 .9813
Std. Deviation .0235 .0284
Max 1.0072 1.0086

Problem. During testing of CPU2006 on UltraSPARC T2 and UltraSPARC T2
Plus processors, unexplained variability was sometimes seen for the benchmark
436.cactusADM. For example, a single reportable run of the floating point suite
from December 2007 with 6 runs of the benchmark (3x base tuning and 3x peak
tuning) showed inconsistent performance, as detailed in Table 3. Notice that
although the median performance for peak was 86.56, the slowest run was off by
more than 1/4.

Analysis: Variation by copy. Recall from the metrics discussion at the beginning
of this paper that reported benchmark scores depend on the time from start of
first copy to completion of last. Therefore, a primary goal for the tester is to
attempt to achieve consistency across all tested copies – in this case, 127 copies
on a 2-chip system. Table 4 summarizes the copy-by-copy times in the second
and third peak runs.
In Table 4, times are normalized to the median time from Peak Run 2. Notice
the consistency in Peak Run 2, with the worst of the 127 copies needing only
1.04% more time than the median time. By contrast, the slowest copy in Peak
Run 3 needed 38.37% more time than the median of Peak Run 2. The problems
in Peak Run 3 are not widespread; in fact, only 6 of the 127 copies were slow.
If these 6 copies were eliminated, as shown in the second half of the table, the
two runs would match each other. Unfortunately for the tester the metrics do
not allow post-processing to eliminate the slow copies.
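The style of analysis behind Table 4 — normalize each copy's time to a baseline median, then summarize with and without the slowest copies — can be sketched as follows (the per-copy times here are invented, not measurements from the runs above):

```python
import statistics

def copy_stats(times, baseline_median, drop_worst=0):
    """Normalize per-copy times to a baseline median and summarize.
    `drop_worst` discards the N slowest copies first; this is a
    diagnostic aid only, since reported metrics include every copy."""
    norm = sorted(t / baseline_median for t in times)
    if drop_worst:
        norm = norm[:-drop_worst]
    return {
        "median": statistics.median(norm),
        "mean": statistics.fmean(norm),
        "stdev": statistics.stdev(norm),
        "max": max(norm),
    }

# Hypothetical per-copy times: most copies near 100 s, two stragglers.
times = [100.0] * 10 + [104.0, 138.0]
stats_all = copy_stats(times, baseline_median=100.0)
stats_trim = copy_stats(times, baseline_median=100.0, drop_worst=2)
print(stats_all["max"], stats_trim["max"])
```

Comparing the trimmed and untrimmed maxima makes a handful of stragglers stand out immediately, just as dropping the worst 6 copies did in Table 4.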

Considerable time was spent trying to trace the source of the occasional
poor copy time for 436.cactusADM, which sometimes was up to 2x worse than
the expected time. Analysis of experiment logs did not indicate any particular
pattern to the degraded performance. Sometimes, a handful of copies would be
slow; often, none would be slow. The slow performance did not appear to be tied
to system state, nor to particular virtual processors, as it would move around
from one CPU to another. Attempts to instrument the tests were often met by
a failure to reproduce the slow performance.
Smoking gun. Eventually, a bad run was caught with trapstat -T [14]:
cpu dtsb-miss %tim
7 4138331 59.8
11 4117256 60.2
14 4135205 59.9
21 4114273 60.4
23 4139823 59.5
In the trapstat output, it can be seen that various copies (on virtual pro-
cessors 7, 11, 14, 21, 23) are estimated to be spending about 60% of their time
processing TSB misses. Once this was found, the solution to the variability of
436.cactusADM was straightforward. As mentioned above, the hardware allows
TSBs to be expanded, and Solaris supports the hardware feature with a pair
of tunables: enable_tsb_rss_sizing and tsb_rss_factor [16]. The former is on by
default; the latter provides a measure of how full TSBs have to be before they
become candidates for resizing. As can be seen in SPEC CPU submissions from
early 2008, this Solaris tuning parameter has been used, and 436.cactusADM
performance has been steady. For example, in a large SPECrate submission
with 630 copies, the three runs differed from each other by less than 1% [8].
If per-copy results are analyzed (as in Table 4), the worst time across all 1890
copies differs from the median by only 1.52%.

Divot summary. SPECrate was useful for uncovering a hard-to-predict, hard-to-reproduce
performance divot of up to 2:1. It was resolved by encouraging the
operating system to be more willing to expand the size of the data TSBs.

Lessons for tuning other systems. The default TSB sizing is adequate for most
applications, especially if large pages are employed. If it is suspected that large
applications (e.g. more than 1 GB, with 4 MB pages) may be running more
slowly than desired, trapstat -T can be used to check for TSB activity, and if
it is found, tsb_rss_factor can be decreased.
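As a rough illustration, the trapstat check can be automated. The snippet below parses output lines like those shown earlier; the three-column layout, the `hot_cpus` helper, and the 50% threshold are assumptions for this sketch, not part of the SPEC or Solaris tools:

```python
# Sample mirroring the trapstat -T output shown above: cpu, dtsb-miss, %tim.
SAMPLE = """\
7 4138331 59.8
11 4117256 60.2
14 4135205 59.9
"""

def hot_cpus(text, threshold=50.0):
    """Return CPU ids whose dtsb-miss %tim exceeds the threshold."""
    hot = []
    for line in text.strip().splitlines():
        cpu, _misses, pct = line.split()
        if float(pct) > threshold:
            hot.append(int(cpu))
    return hot

print(hot_cpus(SAMPLE))   # prints [7, 11, 14]
```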

5.3 A Gain of 52% for a CPU2006 Benchmark after a Change to the
Operating System Policy for NUMA Page Allocation
Problem. When testing large SPECrate runs, variability was sometimes ob-
served, and, as in the previous section, effort was spent to try to trace it. Unlike
the previous case, there appeared to be a pattern, as shown in Figure 1.
SPECrate2006: Alternatives Considered, Lessons Learned 11

[Plot: per-copy runtime in seconds (y-axis, 0 to 1200) vs. processor number (x-axis, 0 to 144)]

Fig. 1. 429.mcf variability by processor number

[Plot: per-copy runtime in seconds (y-axis, 0 to 3000) vs. processor number (x-axis, 0 to 144)]

Fig. 2. 434.zeusmp variability by processor number

Figure 1 is from a large 72-chip, 144 processor server, running 143 copies of
the benchmarks. The server has 18 system boards, each with 8 virtual CPUs. In
the graph, the vertical grid delimits system boards. Notice that most copies of
429.mcf completed in about 800 seconds, except for those on the second system
board. Attempts to trace the problem showed that generally a single system board
would be slow, but it was, at first, hard to predict which board. In Figure 2, taken
from a different large server, notice that it is the fourth-from-last system board that is slower.

Graphical analysis. Edward R. Tufte suggests that graphs should be used only
if one has large amounts of data needing analysis, and they should contain only
pixels that are essential to the analysis, avoiding “chartjunk” [18]. The situation
at hand has over 14,000 benchmark observations in each 143-copy reportable
SPECfp_rate2006 run, and many more from tuning runs. To ease graphical
analysis, a perl procedure was written that extracted data from log files, drove
gnuplot with what was viewed as a minimal amount of chartjunk (as in the
above graphs), and joined them into a webpage.

NUMA Hypothesis. Because the graphs showed that problems would tend to
occur on a single system board, and because it is known that local system board
memory access has shorter latency than remote memory, NUMA (Non Uniform
Memory Access) differences were suspected. Solaris supports NUMA using a
concept of Memory Placement Optimization (MPO) [2], which attempts to place
process resources into “latency groups”. A latency group is a set of resources
which are within some latency of each other. Systems can have multiple latency
groups, and multiple levels of groups.
Tools. NUMA activity can be seen on Solaris 10 systems with the opensolaris.org
“NUMA Observability Tools” [15]. Two useful tools are the extended pmap and
lgrpinfo. The first is easily installed from the tools binary distribution:
$ gunzip -c ptools-bin-0.1.7.tar.gz | tar xf -
$ cd ptools-bin-0.1.7/
$ ./pmap -Ls $$ | head -10
Address Bytes Pgsz Mode Lgrp Mapped File
00010000 640K 64K r-x-- 2 /usr/bin/bash
000C0000 64K 64K rwx-- 1 /usr/bin/bash
000E0000 128K 64K rwx-- 1 [ heap ]
FF0F4000 8K 8K rwxs- 1 [ anon ]

In the pmap example above, note that -L tells us the locality group for each
memory segment, and -s displays the page size. (In the interest of space, various
output is truncated in both the examples in this section.)
To install lgrpinfo requires a couple of extra steps, because a customization
is needed for the local version of perl:
$ gunzip -c Solaris-Lgrp-0.1.4.tar.gz | tar xf -
$ cd Solaris-Lgrp-0.1.4/
$ perl Makefile.PL
Writing Makefile for Solaris::Lgrp
$ make
$ make test
All tests successful.
$ su
Password:
# make install
# exit

$ bin/lgrpinfo
lgroup 0 (root):
Children: 1 2
CPUs: 0-127
Memory: installed 130848 Mb, allocated 3924 Mb, free 126924 Mb
Lgroup resources: 1 2 (CPU); 1 2 (memory)
lgroup 1 (leaf):
CPUs: 0-63
Memory: installed 65312 Mb, allocated 1675 Mb, free 63637 Mb
Lgroup resources: 1 (CPU); 1 (memory)

lgroup 2 (leaf):
CPUs: 64-127
Memory: installed 65536 Mb, allocated 2249 Mb, free 63287 Mb
Lgroup resources: 2 (CPU); 2 (memory)
$

In the lgrpinfo example, the output describes a system with 128 virtual pro-
cessors and 128 GB memory, divided into two latency groups. (For the sake of
brevity, this example is from a simpler system than the one in the graphs.)

Diagnosis. Use of pmap showed that the benchmarks running in the slower local-
ity group were receiving memory of the requested page size (4 MB) but not the
desired location. It was also noted that the slow locality group was the one where
the SPEC tool suite itself (runspec) was started. Observations with lgrpinfo
showed that during the benchmark setup phase, when runspec writes 143 run
directories for each of the benchmarks in the suite, physical memory was used
up in runspec’s locality group, apparently for file system caches.

Workarounds attempted. It was hypothesized that the setup phase may have
fragmented memory on runspec’s system board; and that the operating system
might not be able (or, might not be willing) to coalesce fragmented 8 KB pages
into 4 MB pages. Asking for smaller page sizes (such as 64 KB or 512 KB) some-
times appeared to succeed, but this compromise was not considered desirable
since the benchmarks are large enough that 4 MB pages are known to be help-
ful. The size of file system caches was reduced using system tuning parameters
such as bufhwm and segmap_percent, and memory cleanup was encouraged with
reasonably active settings for autoup and tune_t_fsflushr [16]. To improve
predictability, runspec was initiated in a known location, namely the system board
that is also used by Solaris itself, and the amount of physical memory on that
board was doubled.
These workarounds were usually helpful, and memory availability usually im-
proved, but the workarounds were viewed as less than completely satisfactory
on the grounds that in real life, customers may not have the degree of control
that the benchmark tester has.
Colloquially, the problem can be simply summarized as: “Dear Operating
System: If I ask for local bigpages, and you don’t have them handy, please don’t
give me remote bigpages instead. Please try harder to create local bigpages.”
Given this simple summary, a simple suggestion arises: why not just change the
default policy to always try harder?
There are several reasons to hesitate to change the default policy: (1) Co-
alescing pages may be expensive, as it requires relocating pages for running
processes. (2) For processes that run quickly, it may be better to allocate mem-
ory quickly rather than spending extra effort. (3) It is unknown how frequently
the problem may occur in real life: how often do long-running programs ask for
large memory regions with large pages, which are then used intensely enough to
amortize any extra cost required to coalesce pages? Given insufficient data to

[Plot: per-copy runtime in seconds (y-axis, 0 to 6000) vs. processor number (x-axis, 0 to 144), comparing Peak run 2, which has one slow lgroup, with Peak run 3, which is consistent across the lgroups]

Fig. 3. 433.milc before and after lpg_alloc_prefer

answer questions such as these, the operating system policies must be approached
with care.

Changes to Solaris. Over the course of the investigations of these issues, the
Solaris development group responded by implementing two changes. First, the
algorithm for coalescing pages was made more efficient. Second, a tunable param-
eter was introduced to allow users to increase the priority of local page allocation:
lpg_alloc_prefer [16]. If you have single-threaded, long-running, large-memory
applications, consider setting lpg_alloc_prefer=1. This causes Solaris to spend
more CPU time defragmenting memory to allocate local large pages, versus
allocating readily available remote large pages. The long-term savings from accessing
local rather than remote memory may offset the higher allocation cost.
This tunable parameter is used in the 256 virtual processor Sun SPARC Enterprise
T5440 SPECfp_rate2006 result [9]. When the graphical analysis tools
are applied to this result, NUMA effects are not seen.

Divot summary. An early version of lpg_alloc_prefer was applied to a system in
the middle of a SPECrate run. The effect was to remove a NUMA performance
divot that would sometimes slow down a single system board. The largest effect
was on the benchmark 433.milc, as shown in Figure 3.
Because the tools report the time from start-of-first to finish-of-last, the bot-
tom line improved by 52%:
Success 433.milc peak ref ratio=226.74, runtime=5789.629458
Success 433.milc peak ref ratio=344.64, runtime=3809.035023
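The 52% figure follows directly from the two log lines above. A small Python sketch, assuming only that the `runtime=` field format shown here is stable:

```python
import re

# The two "Success" lines quoted above, verbatim.
LOG = """\
Success 433.milc peak ref ratio=226.74, runtime=5789.629458
Success 433.milc peak ref ratio=344.64, runtime=3809.035023
"""

before, after = (float(m.group(1))
                 for m in re.finditer(r"runtime=([0-9.]+)", LOG))
improvement = (before / after - 1) * 100    # old runtime relative to new runtime
print(f"runtime improved by {improvement:.0f}%")   # prints "runtime improved by 52%"
```

The same 52% shows up in the ratio column: 344.64 / 226.74 is also about 1.52.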

SPECrate was useful as a generator of a repeatable, intense workload on
NUMA systems, allowing careful study of the divot.

Lessons for tuning other systems. Systems that tend to run large single-threaded
programs may benefit from setting lpg_alloc_prefer.

6 Summary
Although it is widely understood that “all software has bugs”, it may
not be as widely understood that all systems have performance divots.
A repeatable, analyzable workload allows divots to be analyzed. Cherish
your divots.

(Repetition is a form of emphasis.)

Acknowledgments. Thank you to the SPEC CPU subcommittee for permission
to summarize the investigation of SPECrate alternatives, and especially
to Cloyce Spradling for implementation of the alternatives. Numerous colleagues
within Sun have assisted with the technical investigations summarized in this
paper, including Miriam Blatt, Jonathan Chew, Michael Corcoran, Darryl Gove,
Aleksandr Guzovskiy, Alexander Kolbasov, Eric Saxe, Steve Sistare, Geetha
Vallabhaneni, and Brian Whitney. Karsten Guthridge was the first to catch
436.cactusADM in trapstat.

References
1. Carlton, A.: CINT92 and CFP92 Homogeneous Capacity Method Offers Fair Measure of Processing Capacity, http://www.spec.org/cpu92/specrate.txt
2. Chew, J.: Memory Placement Optimization (MPO), http://opensolaris.org/os/community/performance/mpo_overview.pdf
3. Gove, D.: CPU2006 Working Set Size. ACM SIGARCH Computer Architecture News 35(1), 90–96 (2007), http://www.spec.org/cpu2006/publications/
4. Henning, J.L.: SPEC CPU Suite Growth: An Historical Perspective. ACM SIGARCH Computer Architecture News 35(1), 65–68 (2007), http://www.spec.org/cpu2006/publications/
5. McGhan, H.: Niagara 2 Opens the Floodgates. Microprocessor Report (November 6, 2006), http://www.sun.com/processors/niagara/M45_MPFNiagara2_reprint.pdf
6. SPEC CPU2000 published results, http://www.spec.org/osg/cpu2000/results/res2000q2/cpu2000-20000511-00104.html, http://www.spec.org/osg/cpu2000/results/res2000q2/cpu2000-20000511-00105.html
7. SPEC CPU2000 published results, http://www.spec.org/osg/cpu2000/results/res2002q2/cpu2000-20020422-01329.html, http://www.spec.org/osg/cpu2000/results/res2002q1/cpu2000-20020211-01256.html
8. SPEC CPU2006 published results, http://www.spec.org/cpu2006/results/res2008q2/cpu2006-20080408-04064.html
9. SPEC CPU2006 published results, http://www.spec.org/cpu2006/results/res2008q4/cpu2006-20080929-05409.html
10. SPEC CPU2006 Documentation, http://www.spec.org/cpu2006/docs/utility.html#convert_to_development
11. SPEC CPU2006 Documentation, http://www.spec.org/cpu2006/docs/utility.html#specinvoke
12. Sun Microsystems, UltraSPARC T2 Processor, http://www.sun.com/processors/UltraSPARC-T2/datasheet.pdf
13. Sun Microsystems, UltraSPARC T2 Supplement to the UltraSPARC Architecture, section 12.2 (2007), http://opensparc-t2.sunsource.net/specs/UST2-UASuppl-current-draft-P-EXT.pdf
14. Sun Microsystems, Solaris 10 Reference Manual Collection, http://docs.sun.com/app/docs/doc/816-5166/trapstat-1m?a=view
15. Sun Microsystems, NUMA Observability, http://www.opensolaris.org/os/community/performance/numa/observability/
16. Sun Microsystems, Solaris Tunable Parameters Reference Manual, http://docs.sun.com/app/docs/doc/817-0404
17. STREAM: Sustainable Memory Bandwidth in High Performance Computers, http://www.cs.virginia.edu/stream/
18. Tufte, E.R.: The Visual Display of Quantitative Information, pp. 107–121. Graphics Press, Cheshire (1983)
19. Weicker, R.P., Henning, J.L.: Subroutine Profiling Results for the CPU2006 Benchmarks. ACM SIGARCH Computer Architecture News 35(1), 102–111 (2007), http://www.spec.org/cpu2006/publications/
SPECjvm2008 Performance Characterization

Kumar Shiv, Kingsum Chow, Yanping Wang, and Dmitry Petrochenko

Intel Corporation
{kumar.shiv,kingsum.chow,yanping.wang,
dmitry.petrochenko}@intel.com

Abstract. SPECjvm2008 is a new multi-threaded Java benchmark from SPEC
that replaces the aging single-threaded SPECjvm98. The benchmark is intended
to address several shortcomings of the earlier SPECjvm98 workloads by
replacing DB, Chart, and Javac; removing Jess; and adding XML, Serial, Crypto,
and in-cache and out-of-cache versions of the Scimark workloads. It is targeted
at measuring the performance of both JVMs and hardware systems. In this paper
we describe the salient features of SPECjvm2008. We then take a first look at
the performance of this benchmark on current multi-core platforms and study
the sensitivity of the components of the workload to basic architectural aspects
such as the number of processor cores, the processor frequency, and the cache
and memory subsystem. We focus our study on understanding how the behavior
of this workload compares with other standard Java benchmarks, SPECjbb2005
and SPECjAppServer2004, both in the components of the software stack that
the workloads touch and in the aspects of the platform that they exercise, and
we draw conclusions on the usefulness of SPECjvm2008 for practitioners of
JVM and hardware performance analysis.

Keywords: SPEC, Java Performance, Workload Characterization.

1 Introduction
The release of SPECjvm98 [6] as a client-side workload stirred up a lot of interest in
performance analysis of Java workloads. Dieckmann and Holzle [1] studied the
allocation behavior of the SPECjvm98 Java benchmarks. Radhakrishna [2] did an
in-depth analysis of micro-architectural techniques to enable efficient Java execution.
The benchmarks were also used to go beyond Java code, as Li and John [3]
characterized operating system activity in the SPECjvm98 benchmarks. However,
modern machines are too fast [4] for the 10-year-old benchmark, and an overhaul
was long overdue.
Now, ten years later, the release of SPECjvm2008 is expected to stir a lot of interest
in how the latest overhaul of the benchmark will enable and encourage Java
performance analysis on modern architectures. The designers of SPECjvm2008
kept this in mind, and the benchmark is intended to take advantage of multiple
cores, higher frequencies, bigger caches, and larger memory bandwidths.
In this work we have performed several experiments with SPECjvm2008.
Our analysis of the workload running on the latest modern processors is intended to

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 17–35, 2009.
© Springer-Verlag Berlin Heidelberg 2009

understand the sensitivity of benchmark performance to many cores, higher
frequencies, and different cache and memory hierarchies. In addition, we also looked at the
effectiveness of Java runtime systems including just-in-time compilation, dynamic
optimizations, synchronizations, object allocations and other Java technologies. We
then compared the characteristics with recently released SPEC Java benchmarks such
as SPECjAppServer2004 [13] and SPECjbb2005 [14].
By studying the SPECjvm2008 and comparing with SPECjAppServer2004 and
SPECjbb2005, we hope to establish a pattern of behavior of Java workloads on mod-
ern architectures and to enable a distinction between the new benchmark and the two
more established bigger benchmarks.

2 Description of SPECjvm2008
Latest advances in processor and Java technologies have necessitated an overhaul of
the SPECjvm98 benchmark [6]. Now, 10 years later, Standard Performance Evalua-
tion Corporation (SPEC) has updated it with a new version, SPECjvm2008 [7]. An
overview of the comparison between SPECjvm98 and SPECjvm2008 is summarized
in Table 1.
SPECjvm2008 comprises many multithreaded workloads that represent a broad col-
lection of Java applications for both servers and clients. It can be used to evaluate per-
formance of Java Virtual Machines (JVM) and the underlying hardware systems. It can

Table 1. Comparison between SPECjvm98 and SPECjvm2008

Features                         SPECjvm98         SPECjvm2008
Target                           Client            Client and server
Multi-threading                  No                Yes
All code is available            No                Yes
Number of sub-groups             7                 11
Freely downloadable              No                Yes
Includes Base and Peak scores    No                Yes
Fixed run duration               Yes               No
Measurement unit                 Time              Ops/min
Benchmark output verification    Yes               Yes
Single tier                      Yes               Yes
JDK                              JDK 1.1 or later  JDK 5.0 or later
Only 1 JVM instance is allowed   Yes               Yes



stress various components inside the JVM, such as the Java Runtime Environment
(JRE), Just-in-time (JIT) code generation, the memory management system, threading
and synchronization features. SPECjvm2008 is also designed with modern multicore
processors in mind. A single JVM instance running the workload will generate
enough threads to stress the underlying hardware systems. It is expected to be useful
in the evaluation of many hardware features such as the impact of the number of cores
and processors, the frequency of the processors, integer and floating point operations,
cache hierarchy and memory sub systems.
SPECjvm2008 comes with a set of analysis tools such as a plug-in analysis frame-
work that can gather run time information such as heap and power usage. It also
comes with a reporter that displays a summary graph of test runs. It is easy to config-
ure and run and provides quick feedback for performance analysis. SPECjvm2008 is
perhaps a little biased towards server performance as the minimum memory require-
ment is 512MB per hardware thread.
SPECjvm2008 can be run in 2 modes: base and peak runs. The base run simulates
environments in which users do not tune software to increase performance. No
configuration or hand tuning of the JVM is allowed. The base run has fixed run
durations: a 120-second warm-up, followed by a 240-second measurement interval. The
peak run simulates environments in which users are allowed to tune the JVM to
increase performance. It also allows feedback optimizations and code caching. The
JVM can be configured to obtain the best score possible, using command line parame-
ters and property files, which must be explained in the submission. In addition, the
peak run has no restrictions on either the warm-up time or the measurement interval.
But only 1 measurement iteration is allowed for each workload. A base submission is
required for a peak submission.
SPECjvm2008 is available for free. It can be downloaded from the SPEC website.
SPECjvm2008 is composed of 11 groups of Java SE applications for both
clients and servers. Each group represents a unique area of Java applications. The
overall score is computed by nested geometric means, as described by Richard M.
Yoo et al. [5].

Score = ( Π_{i=1}^{k} ( X_{i1} · X_{i2} · … · X_{i,n_i} )^{1/n_i} )^{1/k}
The overall SPECjvm2008 score is computed by substituting k by 11 and n1..k by
the corresponding numbers of workloads in each group. Each of 11 groups has an
equal weight of the 11th root of the final composite score.
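The nested geometric mean can be sketched in a few lines of Python; the `nested_score` helper and the toy scores are illustrative, not part of the benchmark kit:

```python
from math import prod

def geomean(xs):
    """Geometric mean of a list of positive scores."""
    return prod(xs) ** (1.0 / len(xs))

def nested_score(groups):
    """Nested geometric mean: the geomean, over groups, of each group's
    geomean. SPECjvm2008 uses k = 11 groups; any k works here."""
    return geomean([geomean(scores) for scores in groups.values()])

# Toy example with two groups (not real SPECjvm2008 data):
demo = {"crypto": [100.0, 400.0], "xml": [200.0]}
print(nested_score(demo))   # both group geomeans are 200.0, so the score is 200.0
```

Note how each group contributes one geomean regardless of how many workloads it contains, which is what gives every group the same weight in the composite.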
The compositions of the 11 groups of workloads are summarized in Table 2.
Tests are run in order, i.e., starting with startup.helloworld and ending with
xml.validation. A new JVM instance is launched for each “startup” workload. After
all the “startup” workloads are run, a single JVM is launched to run the rest of the
workloads, i.e., from compiler.compiler to xml.validation. Thus, the environment left
from running each workload may impact the performance of the workloads coming
after it.

Table 2. SPECjvm2008 Benchmark Composition

Group name      Number of workloads    Workloads
Startup 17 startup.helloworld,
startup.compiler.compiler,
startup.compiler.sunflow,
startup.compress,
startup.crypto.aes,
startup.crypto.rsa,
startup.crypto.signverify,
startup.mpegaudio,
startup.scimark.fft,
startup.scimark.lu,
startup.scimark.monte_carlo,
startup.scimark.sor,
startup.scimark.sparse,
startup.serial,
startup.sunflow,
startup.xml.transform,
startup.xml.validation
Compiler 2 compiler.compiler,
compiler.sunflow
Compress 1 Compress
Crypto 3 crypto.aes, crypto.rsa,
crypto.signverify
Derby 1 Derby
Mpegaudio 1 Mpegaudio
Scimark Large 5 scimark.fft.large,
scimark.lu.large,
scimark.sor.large,
scimark.sparse.large,
scimark.monte_carlo
Scimark Small 5 scimark.fft.small,
scimark.lu.small,
scimark.sor.small,
scimark.sparse.small,
scimark.monte_carlo
Serial 1 Serial
Sunflow 1 Sunflow
Xml 2 xml.transform,
xml.validation

2.1 Startup

The startup group of workloads measures the JVM startup time of each workload by
starting each one of them with a new instance of the JVM. Each workload in this group is

single threaded. Each of them is a startup run of the corresponding throughput test
in the suite. The only exception is helloworld, which does not have a corresponding
throughput test.
When a new JVM is launched for each workload within this group, only default
JVM parameters are used. Each test measures the time for a JVM to complete one
loop of corresponding throughput test. A single group score is computed by taking the
geometric mean of the 17 individual startup scores.

2.2 Compiler

The compiler group has two workloads: compiler and sunflow. The com-
piler.compiler workload measures the compilation time for the OpenJDK compiler.
The compiler.sunflow workload measures the compilation of the sunflow benchmark.
As the goal of these workloads is to evaluate the performance of the compiler, the
impact of I/O is reduced by storing input data in memory, or file cache.

2.3 Compress

The compress workload is taken from SPECjvm98. It compresses and decompresses
data using a modified Lempel-Ziv method. The input data is extended from 90KB to
3.36MB. To minimize the impact of I/O, data is buffered. Its algorithm uses internal
tables (~67KB in size) and pseudo-random access based on input data. This workload
tables (~67KB in size) and pseudo-random access based on input data. This workload
exercises Just-in-time compiling, inlining, array access and cache performance as the
JVM generates and handles mixed length data accesses.
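As a rough stand-in for the workload's modified Lempel-Ziv code (which is Java and not reproduced here), the same compress/decompress round trip can be illustrated with zlib's DEFLATE, another LZ-family codec:

```python
import zlib

# Round-trip a buffer through an LZ-family codec, as the compress workload
# does; zlib's DEFLATE stands in for the benchmark's modified Lempel-Ziv code.
data = b"SPECjvm2008 compress round trip " * 1024
packed = zlib.compress(data, level=6)
unpacked = zlib.decompress(packed)

assert unpacked == data
print(f"{len(data)} bytes -> {len(packed)} bytes compressed")
```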

2.4 Crypto

The crypto group contains three different workloads that represent different important
areas of cryptography. They test vendor implementations of the protocols as well
as JVM execution. The three workloads are crypto.aes, crypto.rsa and
crypto.signverify.
The crypto.aes workload encrypts and decrypts using the AES and DES protocols,
applying CBC/PKCS5Padding and CBC/NoPadding. The input data sizes are 100
bytes and 713 KB, respectively.
The crypto.rsa workload encrypts and decrypts using the RSA protocol for input
data sizes of 100 bytes and 16 KB.
The crypto.signverify workload signs and verifies using MD5withRSA,
SHA1withRSA, SHA1withDSA and SHA256withRSA protocols for input data sizes
of 1KB, 65KB and 1MB.
Different crypto providers can be used.

2.5 Derby

The derby workload uses an open-source database, derby [8], written in pure Java.
Multiple databases are instantiated when the workload is started. Every 4 threads
share one database instance. Synchronization is exercised in this workload. This work-
load extended IBM’s telco benchmark [12] to synthesize business logic and to stress the
use of the BigDecimal operations. These BigDecimal computations are mostly longer

than 64-bit to examine not only 'simple' BigDecimal, which can be implemented
using the long type, but also BigDecimal values that have to be stored in larger data
sizes. Thus this workload exercises both database and BigDecimal operations.

2.6 Mpegaudio

As the source for the mpegaudio workload from SPECjvm98 cannot be made avail-
able, a new version of mpegaudio is created in SPECjvm2008. It uses the MP3 library
called JLayer [9], an Mpeg Audio decoder. This workload is floating-point computa-
tion centric. Its input data set contains six MP3 files sized from 20KB to 3MB.

2.7 Scimark

Scimark, as the name implies, is based on the well known Scimark benchmark devel-
oped by NIST [10]. This group of workloads evaluates floating-point operations and
data access patterns for intensive mathematical computations. Scimark is modified for
multi-threading with different dataset sizes in SPECjvm2008.
Scimark is actually composed of two groups in SPECjvm2008, scimark.large and
scimark.small, for large and small data sets. Each thread in the workload consumes
one data set. The “large” group runs with a 32MB data set to simulate out-of-cache
access performance, while the “small” group runs with a 512KB data set to simulate
in-cache access performance.
Each group is composed of 5 workloads: fft, lu, sor, sparse and monte_carlo.
Scimark.monte_carlo is run once but its score is counted in both the scimark large
and scimark small groups, as the workload does not use different data set sizes.

2.7.1 Scimark.FFT
Scimark.fft performs a one-dimensional, in-place Fast Fourier Transform using a
bit-reversal algorithm with N·log(N) complexity for large (2MB) and small (512KB)
data sets.

2.7.2 Scimark.SOR
Scimark.sor simulates Jacobi Successive Over-relaxation for large (2048x2048 grid)
and small (250x250 grid) data sets. It exercises typical access patterns in finite differ-
ence applications. The algorithm exercises basic "grid averaging" memory patterns,
where each A(i,j) is assigned an average weighting of its four nearest neighbors.
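The “grid averaging” pattern can be sketched as follows; this is a simplified Jacobi-style step (true SOR updates in place with a relaxation factor), not the benchmark's actual kernel:

```python
def jacobi_step(grid):
    """One 'grid averaging' sweep: each interior A[i][j] is assigned the
    average of its four nearest neighbors (a simplified, Jacobi-style
    version of the SOR kernel; true SOR adds a relaxation factor)."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new

# 3x3 toy grid: boundary fixed at 1.0, interior starts at 0.0.
g = [[1.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]
print(jacobi_step(g)[1][1])   # prints 1.0: the average of four 1.0 neighbors
```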

2.7.3 Scimark.Sparse
Scimark.sparse matrix multiplication uses an unstructured sparse matrix stored in
compressed-row format with a prescribed sparsity structure. It exercises indirection
addressing and non-regular memory references for large and small data sets. The large
data set contains a 200000x200000 matrix in compressed form with 4000000 non-
zeros in it. The small data set contains a 25000x25000 matrix in compressed form
with 62500 non-zeros in it.
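The compressed-row storage and its indirect addressing can be illustrated with a small sketch (illustrative only; the benchmark's Java kernel operates on far larger matrices):

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A*x for a matrix in compressed-row (CSR) storage, exercising the
    indirect addressing pattern the workload is built around."""
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

# The 2x3 matrix [[10, 0, 2], [0, 3, 0]] in CSR form:
values  = [10.0, 2.0, 3.0]
col_idx = [0, 2, 1]
row_ptr = [0, 2, 3]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))   # prints [12.0, 3.0]
```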
SPECjvm2008 Performance Characterization 23

2.7.4 Scimark.lu
Scimark.lu computes the LU factorization of a dense, in-place, matrix using partial
pivoting. It solves a linear system of equations using a prefactored matrix in LU form.
It exercises linear algebra kernels (BLAS) and dense matrix operations for large
(2048x2048) and small (100x100) data sets.

2.7.5 Monte-carlo
Scimark.monte_carlo approximates the value of Pi by computing the integral of the
quarter circle y = sqrt(1 - x^2) on [0,1]. It chooses random points within the unit square
and computes the ratio of those within the circle versus those outside the circle. The
algorithm exercises random-number generators, synchronized function calls and func-
tion inlining. This workload is counted once in each of the Scimark large and Scimark
small groups.
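The quarter-circle estimate can be sketched in a few lines; this is a sequential toy version (the benchmark's implementation is multi-threaded and exercises synchronized calls):

```python
import random

def estimate_pi(samples, seed=12345):
    """Approximate Pi from the fraction of random points in the unit square
    that fall under the quarter circle y = sqrt(1 - x^2)."""
    rng = random.Random(seed)     # fixed seed keeps the sketch deterministic
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(100_000))   # close to 3.14159 for large sample counts
```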

2.8 Serial

The serial workload tests the performance of serialization and deserialization of
primitives and objects using byte arrays in memory, taking a data set from a JBoss
benchmark. The workload has a producer-consumer scenario where the objects are
serialized by the producer threads and deserialized by the consumer threads on the
same system. It exercises the java.lang.reflect package.

2.9 Sunflow

The sunflow workload is a multi-threaded benchmark simulating graphics and
visualization using ray tracing. It runs several bundles of dependent threads, with 4
threads per bundle. The number of bundles, by default, is equal to the number of
hardware threads, and it is re-configurable. This workload is floating-point heavy.
Its high object allocation rate puts pressure on the memory bandwidth.

2.10 XML

The XML group contains two workloads: xml.transform and xml.validation. The
xml.transform workload exercises the JAXP implementation by executing XSLT
transformations with DOM, SAX, and Stream sources. It uses the XSLTC engine,
which compiles XSL style sheets into Java classes. 10 real-life use cases are
implemented. The xml.validation workload exercises the JAXP implementation by
validating XML instance documents against XML schemas. 6 real-life use cases are
implemented.
Both XML workloads have high object allocation rates and high levels of lock
contention. They also heavily exercise string operations. Each use case has
approximately the same influence on the workload score.

3 Performance Characterization of SPECjvm2008

In this section we present an initial performance characterization of SPECjvm2008. We
look at data regarding various aspects of the workload from both software and hardware
perspectives. For our baseline we have chosen an Intel Core 2 Duo-based platform
with 4 quad-core sockets, thus giving us a 16-core system. Most of the data was collected

Table 3. SPECjvm2008 Benchmark Component Scores

Workload Score
compiler.compiler 937.23
compiler.sunflow 1119.25
Compress 614.14
crypto.aes 214.77
crypto.rsa 2012.82
crypto.signverify 1173.08
Derby 174
mpegaudio 350.44
scimark.fft.large 15.49
scimark.lu.large 5.14
scimark.sor.large 25.99
scimark.sparse.large 18.93
scimark.fft.small 4384.62
scimark.lu.small 4903.85
scimark.sor.small 713.61
scimark.sparse.small 509.41
scimark.monte_carlo 4903.85
Sunflow 195.7
xml.transform 1540.12
xml.validation 1117.91

on systems running at 2.92 GHz with a 1066 MHz front-side bus. Each socket has 2x4 MB
of last-level cache (LLC), and the system had 16 GB of memory. We also used a platform
with pre-release i7 processors for some additional experiments. As the i7 processors
have not yet been released, we are not able to share raw performance numbers at this
time; nevertheless, we are able to show some interesting data. Unless specified
otherwise, the data were collected on the Core 2-based platform.
On the software side, we used Sun's HotSpot JVM for Java 6, jre-6u4-perf build x64,
and ran the benchmark with a heap size of 14 GB. The garbage collector was generational,
stop-the-world, and parallel, and the JVM allocated data and code into large pages.
The XML components used the Xerces parser from Apache. The operating system was
Red Hat Enterprise Linux 5.
Table 3 presents one set of baseline data. We have observed a fair amount of run-to-run
variation, occasionally more than 5%, which accounts for some differences in the data
in subsequent tables. Each component's score is measured in operations per minute, and
varies from a low of 5 ops/min for scimark.lu.large to a high of 4904 ops/min for
scimark.lu.small and scimark.monte_carlo. This wide range motivated the benchmark
developers to use a geometric mean; an arithmetic mean would clearly have been skewed
by the higher-scoring components.
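The sensitivity of the two means to these outliers is easy to demonstrate. The following sketch (scores taken from Table 3; the helper functions are ours) compares them:

```python
import math

# Component scores from Table 3 (ops/min).
scores = [937.23, 1119.25, 614.14, 214.77, 2012.82, 1173.08, 174.0,
          350.44, 15.49, 5.14, 25.99, 18.93, 4384.62, 4903.85, 713.61,
          509.41, 4903.85, 195.7, 1540.12, 1117.91]

def geometric_mean(xs):
    # mean of the logs, exponentiated: a few huge scores cannot dominate
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

geo = geometric_mean(scores)
arith = arithmetic_mean(scores)
print(f"geometric {geo:.0f} ops/min, arithmetic {arith:.0f} ops/min")
```

On the Table 3 data the arithmetic mean comes out roughly three times the geometric mean, pulled up by the ~4900 ops/min scimark.small entries.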
Table 4 looks at the effect of optimizing the benchmark performance through tuning
and configuration of the system. SPECjvm2008 facilitates this by requiring the
SPECjvm2008 Performance Characterization 25

reporting of two scores, base and peak, whenever peak scores are reported. We see
that the benchmark's performance is boosted by almost 7%. The performance increase
is higher for some components, reaching almost 40% for compiler.compiler. A
few components, however, actually lose performance. The same set of configuration
parameters must be used for all the components, and the options that work best for
the benchmark as a whole may not be optimal for some of them. Starting the workloads
with all of the configuration parameters, specifically a HotSpot option called
AggressiveOpts, which enables more sophisticated compiler optimizations in the JIT,
now takes longer, which hurts the performance of Startup.
Performance degradation is highest for scimark.fft.large. This workload computes
the FFT of a large data set with a 2^N stride; the data access pattern is such that
large pages actually hurt performance through ineffective cache utilization, to the
extent of 30%. Most of the other components do benefit from large pages.
In Table 5, we look at some basic metrics. As points of comparison, the correspond-
ing data for SPECjbb2005 and SPECjAppServer2004 are also included. It can be seen

Table 4. SPECjvm2008 Base versus Peak Comparisons

Workload Peak Base Peak/Base


Startup 30.43 32.38 0.940
compiler.compiler 821.71 588.23 1.397
compiler.sunflow 1103.77 871.19 1.267
compress 611.66 630.47 0.970
crypto.aes 217.43 189.08 1.150
crypto.rsa 1961.54 1940.39 1.011
crypto.signverify 1154.41 1139.08 1.013
derby 170.58 165.84 1.029
mpegaudio 350.42 346.02 1.013
scimark.fft.large 15.47 22.21 0.697
scimark.lu.large 5.14 5.18 0.992
scimark.sor.large 26.16 26.97 0.970
scimark.sparse.large 19.09 18.05 1.058
scimark.fft.small 4384.62 4102.56 1.069
scimark.lu.small 4557.69 4725.42 0.965
scimark.sor.small 715.38 695.54 1.029
scimark.sparse.small 481.98 464.83 1.037
scimark.monte_carlo 5288.46 4905.82 1.078
serial 265.07 249.31 1.063
sunflow 257.23 190.55 1.350
xml.transform 1556.09 1399.12 1.112
xml.validation 1110.92 995.87 1.116
Composite Score 340.40 318.32 1.069

that, like SPECjbb2005 but unlike SPECjAppServer2004, all components of
SPECjvm2008 require very little kernel time; the processor is almost always (>99%)
in user mode. SPECjAppServer2004 has a high level of network traffic requiring
considerable OS support; SPECjvm2008 has no corresponding I/O requirement. At first blush,
many of the SPECjvm2008 components have a remarkably high level of context
switches. Scimark.lu.large, for one example, has more than 5000 context switches per
operation, and four other components suffer more than a thousand per operation.
On digging deeper, however, we see that the rate, while somewhat high, is not
excessively so. Scimark.lu.large only delivers about 5 operations per minute, which
means we only observe about 26,000 context switches in a minute. Compare this to
SPECjAppServer2004, which suffers only about 40 context switches per transaction (JOP)
but delivers about 2000 transactions per second (JOPS): normalized to time, the context
switch rate in SPECjAppServer2004 is far higher than in scimark.lu.large. The
SPECjvm2008 component with the highest context switch rate is Derby, and its rate is
about 30% of the SPECjAppServer2004 rate.
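Since each component's operation takes a very different amount of wall-clock time, cswch/op figures are only comparable after converting them to a common time base. A small sketch of that normalization (cswch/op from Table 5, ops/min from Table 3; the function name is ours):

```python
def cswch_per_sec(cswch_per_op, ops_per_min):
    # context switches per second = (cswch/op) * (ops/min) / 60
    return cswch_per_op * ops_per_min / 60.0

# scimark.lu.large: very high cswch/op, but very few operations per minute
lu_large = cswch_per_sec(5409.42, 5.14)   # ~463 cswch/sec
# derby: lower cswch/op at far higher throughput
derby = cswch_per_sec(2065.01, 174.0)     # ~5,990 cswch/sec
print(f"scimark.lu.large {lu_large:.0f}/s, derby {derby:.0f}/s")
```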
We next look at the object allocation rate and garbage collection rate for the various
SPECjvm2008 components, and again we present the data for SPECjbb2005 and

Table 5. Some Fundamental Metrics

Workload %user %system cswch/op


compiler.compiler 99.57 0.27 33.92
compiler.sunflow 99.44 0.41 30.00
compress 99.93 0.02 50.78
crypto.aes 99.94 0.01 141.20
crypto.rsa 98.81 0.87 137.75
crypto.signverify 99.91 0.05 27.60
derby 97.80 1.06 2065.01
mpegaudio 99.79 0.19 87.01
scimark.fft.large 98.83 0.22 1895.77
scimark.lu.large 93.72 0.08 5409.42
scimark.sor.large 97.67 0.09 1073.89
scimark.sparse.large 92.70 0.35 1674.87
scimark.fft.small 99.97 0.01 6.94
scimark.lu.small 99.92 0.02 6.37
scimark.sor.small 99.97 0.01 40.97
scimark.sparse.small 99.82 0.03 59.13
scimark.monte_carlo 99.97 0.03 5.71
sunflow 99.45 0.08 414.96
xml.transform 88.75 0.71 417.58
xml.validation 99.93 0.02 30.47
SPECjbb2005 99.79 0.20 2.00
SPECjAppServer2004 78.63 19.50 37.00

SPECjAppServer2004 for comparison. Table 6 presents this data, and we can immediately
see that apart from two components, compiler.compiler and compiler.sunflow, the rest place
very little demand on the garbage collection infrastructure of the JVM. Two of the components
in fact make no demands on the garbage collector at all, and eleven others spend less
than 0.1% of their time in garbage collection. SPECjbb2005, on the other hand, spends more
than 2% of its time in GC, and SPECjAppServer2004 spends 7.5%.
Not surprisingly, the object allocation data shows the same pattern, with
compiler.compiler and compiler.sunflow having allocation rates between those of
SPECjbb2005 and SPECjAppServer2004, while most other components have relatively
low allocation rates. Four components, though, diverge from this pattern and show
high allocation rates with low garbage collection usage. Since the rate at which GC
is invoked is directly related to the allocation rate, these components must be
spending less time in GC because each garbage collection completes faster. We can
theorize that they have far fewer live objects when GC is invoked, but we have not
yet fully tested this theory. Derby, for instance, has a high object allocation rate
due to the frequent allocation of immutable BigDecimal objects, which do not stay
alive very long.
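The two allocation columns of Table 6 are linked through each component's throughput, which offers a quick consistency check (the throughput figure comes from Table 3; the helper is ours):

```python
def alloc_mb_per_sec(mb_per_op, ops_per_min):
    # MB/s = (MB/op) * (ops/min) / 60
    return mb_per_op * ops_per_min / 60.0

# compiler.compiler: 155.10 MB/op (Table 6) at 937.23 ops/min (Table 3)
rate = alloc_mb_per_sec(155.10, 937.23)
print(f"{rate:.0f} MB/s")  # close to the ~2422 MB/s reported in Table 6
```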

Table 6. Allocation and Garbage Collection Rate

Workload Alloc (MB/s) Alloc (MB/op) GC %


compiler.compiler 2422.45 155.10 3.97
compiler.sunflow 2590.88 143.27 2.31
compress 141.04 13.72 0.07
crypto.aes 868.22 237.53 0.16
crypto.rsa 298.96 9.14 0.02
crypto.signverify 875.96 44.80 0.05
derby 3041.56 1058.30 0.46
mpegaudio 317.34 54.73 0.01
scimark.fft.large 11.45 44.42 0.05
scimark.lu.large 8.31 97.04 0.06
scimark.sor.large No GC during measurement
scimark.sparse.large 18.27 57.48 0.01
scimark.fft.small 604.26 8.27 0.04
scimark.lu.small 1277.90 15.95 0.09
scimark.sor.small No GC during measurement
scimark.sparse.small 160.64 18.88 0.01
scimark.monte_carlo 9.95 0.12 0.03
sunflow 3405.07 1097.76 0.32
xml.transform 2832.37 109.14 0.27
xml.validation 2343.78 126.39 0.38
SPECjbb2005 3655.00 0.01 2.20
SPECjAppServer2004 1100.00 0.55 7.50

Turning our attention to hardware performance metrics, we look in Table 7 at the CPI
and pathlength (instructions retired per operation) for each component of SPECjvm2008,
once again providing the data for SPECjbb2005 and SPECjAppServer2004 as well. The
CPI data shows a very wide range, all the way from 0.35 for a couple of the scimark
workloads, sparse.small and monte_carlo, to 37 for scimark.fft.large. While the range is
large, only a few of the components have CPI values close to those of the
established benchmarks, SPECjbb2005 and SPECjAppServer2004.
Pathlength, the number of instructions executed per benchmark operation, shows a
similarly wide range, from 860 million instructions for scimark.fft.small to 35 billion
instructions for scimark.lu.large. It is interesting to note that the SPECjvm2008
component pathlengths are much larger than those of SPECjAppServer2004 and
SPECjbb2005: the developers of the new benchmark defined each component's operation
to be a bundle of the underlying component transactions, leading to significantly
higher pathlengths.
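CPI and pathlength together determine a component's score on a given machine. As a rough consistency check, one can reconstruct a score from them, assuming the 16 cores at 2.92 GHz are kept fully busy and ignoring run-to-run variation (the helper name is ours):

```python
def ops_per_min(freq_hz, cores, cpi, pathlength):
    # cycles available per minute / cycles needed per operation
    return (freq_hz * cores * 60.0) / (cpi * pathlength)

# scimark.fft.small: CPI 0.74, pathlength ~862M instructions (Table 7)
score = ops_per_min(2.92e9, 16, 0.74, 862_334_186.5)
print(f"{score:.0f} ops/min")  # close to the 4384.62 ops/min of Table 3
```

That the reconstruction lands within 1% of the measured score suggests the CPI and pathlength figures in Table 7 are mutually consistent for the compute-bound components.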
We next look deeper at the CPI data. Since there is a wide range of CPI, and high
CPI values are frequently due to a strong memory dependency, we compared
the memory requirements of each component; we present that data in
Table 8. Interestingly, while there is indeed a correlation, and several

Table 7. CPI and Pathlength

Workload CPI Pathlength


compiler.compiler 2.24 1,342,140,681.48
compiler.sunflow 1.78 1,411,940,351.17
compress 0.66 6,935,664,016.67
crypto.aes 0.43 30,706,203,731.99
crypto.rsa 0.42 3,365,033,523.97
crypto.signverify 0.46 5,213,088,932.49
derby 3.36 4,778,948,803.51
mpegaudio 0.65 12,402,705,156.31
scimark.fft.large 37.36 4,857,444,571.78
scimark.lu.large 15.49 35,367,955,150.80
scimark.sor.large 8.88 12,196,916,787.61
scimark.sparse.large 6.69 22,270,815,826.80
scimark.fft.small 0.74 862,334,186.50
scimark.lu.small 0.51 1,131,124,031.40
scimark.sor.small 1.31 3,008,152,045.14
scimark.sparse.small 0.35 15,584,203,597.27
scimark.monte_carlo 0.35 1,655,077,938.70
sunflow 0.85 16,871,976,705.39
xml.transform 1.09 1,498,012,657.15
xml.validation 1.26 1,991,052,637.11
SPECjbb2005 1.21 67,322.00
SPECjAppServer2004 2.22 5,263,947.00

Table 8. Memory Bandwidth Requirements

Workload MB/op MB/sec


compiler.compiler 155.10 2422
compiler.sunflow 143.27 2591
compress 13.72 141
crypto.aes 237.53 868
crypto.rsa 9.14 299
crypto.signverify 44.80 876
derby 1058.30 3042
mpegaudio 54.73 317
scimark.fft.large 44.42 11
scimark.lu.large 97.04 8
scimark.sor.large No GC during measurement
scimark.sparse.large 57.48 18
scimark.fft.small 8.27 604
scimark.lu.small 15.95 1278
scimark.sor.small No GC during measurement
scimark.sparse.small 18.88 161
scimark.monte_carlo 0.12 10
sunflow 1097.76 3405
xml.transform 109.14 2832
xml.validation 126.39 2344
SPECjbb2005 0.02 5946
SPECjAppServer2004 2.00 4091

of the workloads with lower memory bandwidth requirements have lower CPIs,
the workloads with the highest CPIs display very low bandwidth demand. Specifically,
scimark.fft.large has a CPI of 37 and a memory bandwidth requirement of about 9
MB/s; SPECjbb2005, as a point of comparison, has a CPI of 1.22 and a memory
bandwidth requirement of 6 GB/s. This is true to a somewhat lesser extent for
scimark.sor.large, scimark.lu.large, and scimark.sparse.large. These workloads do not
have a high CPI because of excessive memory bandwidth demands; however, this
does not necessarily rule out memory latency as the cause of their high CPI.
Intel processors provide a range of performance event counters. In Table 9 we
examine some key metrics: last-level cache (LLC) misses per instruction (MPI), ITLB
and DTLB misses, the number of floating-point instructions, and the HITM metric,
which characterizes the sharing behavior of each component.
The LLC MPI data gives us a clear pointer to the cause of the very high CPIs suffered
by four of the scimark components. Scimark.fft.large, the workload with the
highest CPI, has an MPI of 0.05, or one cache miss every 20 instructions, a rate
approximately 20 times that of SPECjbb2005 and
SPECjAppServer2004. The memory latency incurred by these cache misses causes the

high CPI. The high CPI strongly restricts the performance of the workload, and the
resulting low throughput creates the appearance of a low memory bandwidth requirement.
The performance of these four workloads is therefore strongly dependent on
memory latency. Of the remaining components, several have negligible cache misses,
while the few with moderate CPI (compiler.*, xml.*, sunflow, derby) have MPIs of
the same order of magnitude as SPECjbb2005 and SPECjAppServer2004. It is
not surprising that Derby, with its high allocation rate of immutable BigDecimal
objects, has a significant MPI of 0.0057.
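The MPI figures are easier to interpret when inverted into instructions per miss, or combined with the pathlengths of Table 7 into misses per operation; a brief sketch (helper names ours, inputs from Tables 7 and 9):

```python
def instructions_per_miss(mpi):
    return 1.0 / mpi

def misses_per_op(mpi, pathlength):
    # LLC misses incurred over one benchmark operation
    return mpi * pathlength

# scimark.fft.large: MPI 0.053, roughly one LLC miss every 19 instructions
ipm = instructions_per_miss(0.053)
# derby: MPI 0.0057 over a ~4.78G-instruction operation
derby_misses = misses_per_op(0.0057, 4_778_948_803.51)
print(f"{ipm:.0f} instr/miss, {derby_misses:.2e} LLC misses/op")
```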
One criticism that can perhaps be leveled at SPECjbb2005 and
SPECjAppServer2004 is the low usage of floating-point. Some of the components of
SPECjvm2008 on the other hand can be seen to have significant levels of floating-
point usage. Derby, especially, has a floating-point instruction usage rate of 0.01, or 1
out of every 100 instructions.

Table 9. Some Processor Metrics

Workload MPI FP Ops Retired ITLB Misses Retired DTLB Misses Retired HITM / L2 Data Request Miss
compiler.compiler 0.0049 16,725 1,136,892 2,341,261 0.126
compiler.sunflow 0.0035 14,693 889,787 1,700,167 0.138
compress 0.0003 21,575 472 59,699 0.015
crypto.aes 0.0002 68,220 2,927 36,211 0.146
crypto.rsa 0.0001 207,767 38,639 1,039,284 0.206
crypto.signverify 0.0002 105,163 7,138 415,907 0.080
derby 0.0057 52,579,290 2,085,620 4,583,880 0.245
mpegaudio 0.0000 41,137 52,516 926,838 0.662
scimark.fft.large 0.0530 855,695 6,684 270,140 0.336
scimark.lu.large 0.0216 17,839,925 15,442 871,064 0.333
scimark.sor.large 0.0111 506,323 4,967 146,314 0.594
scimark.sparse.large 0.0208 695,015 5,156 1,378,819 0.007
scimark.fft.small 0.0002 39,424 168 1,139 0.196
scimark.lu.small 0.0003 3,948 223 17,911 0.124
scimark.sor.small 0.0000 19,519 299 5,283 0.212
scimark.sparse.small 0.0000 75,351 442 7,483 0.616
scimark.monte_carlo 0.0000 2,740 82 962 0.573
sunflow 0.0011 1,210,310 18,397 4,326,069 0.169
xml.transform 0.0015 125,201 349,122 357,000 0.253
xml.validation 0.0016 159,639 1,967,790 656,078 0.303
SPECjbb2005 0.0028 110 7 90 0.000
SPECjAppServer2004 0.0029 7,904 791 27,657 0.050

Most of the components have small code footprints. Once again, Derby stands out
as the exception, suffering an ITLB miss every 2500 instructions. None of the workloads
faces much DTLB pressure; some of the DTLB-miss figures look high only until we
recall the high pathlengths of these workloads.
Both SPECjbb2005 and SPECjAppServer2004 have negligible HITM rates, indicating
low sharing of data between the LLCs on the sockets. While these benchmarks
inherently have low sharing, the HITM metric is further lowered by running the
benchmarks with multiple JVMs, one JVM per LLC. The SPECjvm2008 run rules preclude
the use of multiple JVMs, which allows us to see the level of sharing among the
threads. The more significant cases are the components with both higher MPI
and high HITM rates: Derby, the xml workloads, and some of the scimark components
all exhibit high levels of cache-to-cache memory traffic.
As the number of cores per chip continues to increase, even clients and small
servers are rapidly gaining cores, so the scaling of these workloads with the
number of processors is of some interest.
Table 10 presents the scaling data, showing the performance of each
component relative to its performance on one processor. Since our system has 16 processors,
we tested scaling from 2 to 16 processors.
The start-up time of the workloads is unaffected by the number of processor cores
available, since much of the JVM initialization code is single-threaded. Several other
workloads, such as compress, crypto.*, and mpegaudio, exhibit excellent scaling, while
a few show super-linear behavior. Since there is sufficient run-to-run variation, this

Table 10. Thread Scaling

Workload 1T 2T 4T 6T 8T 10T 12T 14T 16T


Startup 1 0.98 1.01 1.02 1.01 0.99 1.03 0.99 0.96
compiler.compiler 1 2 3.83 5.75 7.26 8.16 8.58 8.86 7.83
compiler.sunflow 1 2 4 6.02 7.21 8.28 8.94 9.17 9.57
compress 1 2 4.14 6.22 8.37 10.37 12.37 14.43 15.39
crypto.aes 1 1.99 3.98 5.93 7.92 9.67 11.94 12.83 15.64
crypto.rsa 1 1.8 4 5.69 7.8 9.4 11.69 13.3 15.3
crypto.signverify 1 2 4.07 6.07 8.08 10.01 12.21 14.02 16.03
derby 1 1.66 3.15 4.69 5.61 5.84 5.8 5.74 5.47
mpegaudio 1 2.04 4.11 6.13 8.25 10.07 12.32 14.34 16.4
scimark.fft.large 1 1.71 2.54 2.97 3.25 3.4 3.43 3.8 3.79
scimark.lu.large 1 1.89 2.62 2.63 2.57 2.52 2.58 2.44 2.42
scimark.sor.large 1 1.97 3 2.87 2.9 2.85 2.83 2.8 2.79
scimark.sparse.large 1 1.83 2.62 2.8 3.47 3.31 3.33 3.35 3.33
scimark.fft.small 1 2 4 5.8 7.8 9.8 11.2 13.2 15.2
scimark.lu.small 1 1.74 3.75 5.25 6 8 9.26 11.25 11.85
scimark.sor.small 1 2 4 5.79 8.05 10.09 12.09 14.17 16.13
scimark.sparse.small 1 2 4.03 5.76 7.93 9.78 11.85 13.96 15.88
scimark.monte_carlo 1 2.01 4.35 6.03 8.37 10.72 12.73 14.07 18.42
serial 1 2.11 4.08 6.25 8.33 10.31 12.22 14.61 24.36
sunflow 1 0.99 3.95 8.49 13.31 15.95 14.67 15.47 18.3
xml.transform 1 2 3.87 5.72 7.37 8.26 8.63 8.89 9.44
xml.validation 1 1.92 3.43 5.16 7.12 8.35 9.88 10.98 11.55

data should be treated with caution. While these data points may well be noisy, it
must also be noted that this kind of scaling is not theoretically impossible: the
availability of more cores allows the JVM to use more threads for compilation and
optimization, which can produce better code. This, we must emphasize, is
just a theory. The benchmark is still new, and it will take some time and additional
experiments to filter out the noisy data.
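A convenient way to compare rows of Table 10 is scaling efficiency, speedup divided by thread count; a small sketch over three representative 16-thread entries (function name ours):

```python
def scaling_efficiency(speedup, threads):
    # 1.0 is perfect linear scaling; above 1.0 is super-linear
    return speedup / threads

# 16-thread speedups from Table 10
eff_compress = scaling_efficiency(15.39, 16)   # ~0.96: near-linear
eff_lu_large = scaling_efficiency(2.42, 16)    # ~0.15: memory-bound
eff_monte = scaling_efficiency(18.42, 16)      # ~1.15: super-linear
print(eff_compress, eff_lu_large, eff_monte)
```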
Intel recently announced that it would release a new Xeon microarchitecture, the
i7. We performed a few experiments on a pre-release platform and present those
results next. Table 11 shows that the ratio between peak and base for the i7 is similar to
that seen with the Core 2 in most respects. One notable exception is
scimark.fft.small, which now suffers a 14% degradation whereas our earlier results
showed a 7% gain. This workload is sensitive to data layout, and a data layout change
has different effects on the two processors because their cache architectures are very
different. The Core 2 has a two-level cache hierarchy while the i7 has three levels: the
second-level cache on the i7 is much smaller (only 256 KB), relying on the large (8 MB)
third-level cache to reduce accesses to memory. As a result, however, the cost of
accessing a line from the i7's last-level, third-level cache is higher than the cost of
accessing a line from the second-level cache. For most workloads, the bigger third

Table 11. i7 Processor Peak/Base Ratios

Peak/Base
compiler.compiler 1.448
compiler.sunflow 1.096
compress 1.036
crypto.aes 0.997
crypto.rsa 1.000
crypto.signverify 1.003
derby 1.103
mpegaudio 1.013
scimark.fft.large 0.628
scimark.lu.large 1.029
scimark.sor.large 1.033
scimark.sparse.large 1.257
scimark.fft.small 0.859
scimark.lu.small 1.005
scimark.sor.small 1.001
scimark.sparse.small 0.997
scimark.monte_carlo 1.661
serial 1.112
sunflow 1.087
xml.transform 1.015
xml.validation 1.155
Composite Score 1.073

level cache provides a performance boost. Here, however, scimark.fft.small has a
cache footprint that is too large to fit in the 256 KB second-level cache on the i7
platform but fits easily into the last-level cache on both systems. As a result of
these different latencies at different levels of the cache hierarchy, the use of large
pages now penalizes scimark.fft.small as well.
The i7 processor core includes SMT (Simultaneous Multi-Threading), which allows
two software threads to execute simultaneously on one core. In Table 12 we look at the
effect of SMT on SPECjvm2008 performance, and note that the benchmark as a
whole enjoys a 22% boost. While almost all the components benefit to some extent,
scimark.sor.small almost doubles. The SMT benefit can often be limited by cache
availability; the data footprint of scimark.sor.small is small enough that this does not
happen. On the other hand, scimark.lu.small suffers from contention for the 256 KB
second-level cache, which is critical because the performance of this workload is
limited by L2 cache throughput.
Finally, in Table 13 we look at the extent to which SPECjvm2008 benefits from
faster processor core frequencies. We ran several components of the benchmark at
two frequencies, 2.8 GHz and 2.93 GHz. The 4.6% increase in frequency leads to
performance gains of 4-5% in most of these cases. Even Derby, which is more

Table 12. Effect of SMT

SMT Gain
compiler.compiler 1.161
compiler.sunflow 1.173
compress 1.254
crypto.aes 1.387
crypto.rsa 1.189
crypto.signverify 1.059
derby 1.398
mpegaudio 1.205
scimark.fft.large 1.061
scimark.lu.large 1.043
scimark.sor.large 1.508
scimark.sparse.large 1.085
scimark.fft.small 1.018
scimark.lu.small 0.890
scimark.sor.small 1.925
scimark.sparse.small 1.039
scimark.monte_carlo 1.011
serial 1.184
Sunflow 1.254
xml.transform 1.199
xml.validation 1.219
Composite Score 1.216

Table 13. Frequency Scaling of i7 Platforms

Freq Gain
compiler.compiler 1.044
compiler.sunflow 1.041
compress 1.045
crypto.aes 1.055
crypto.rsa 1.054
crypto.signverify 1.048
derby 1.045
mpegaudio 1.046
scimark.fft.large 1.214
scimark.lu.large 1.043
scimark.sor.large 1.040
scimark.sparse.large 1.055
scimark.fft.small 1.000
scimark.lu.small 1.009
scimark.sor.small 1.048
scimark.sparse.small 1.048
scimark.monte_carlo 0.880
serial 1.047
sunflow 1.052
xml.transform 1.050
xml.validation 1.042
Composite Score 1.042

memory-dependent than the other components, benefits fully from the frequency increase.
The i7 platform uses QPI (QuickPath Interconnect) and has lower memory access
latencies. Memory latency is also reduced by improved hardware prefetchers.
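The expected gain from pure frequency scaling is simply the frequency ratio; a component that reaches it is core-bound, while a shortfall points to memory dependence. A sketch of the comparison (frequencies from the text, Derby's gain from Table 13; helper name ours):

```python
def frequency_ratio(f_new_ghz, f_old_ghz):
    return f_new_ghz / f_old_ghz

expected = frequency_ratio(2.93, 2.8)  # ~1.046, i.e. the 4.6% increase
measured_derby = 1.045                 # derby's measured gain, Table 13
# a gain close to the frequency ratio means the core, not memory,
# is the limiter on this platform
print(f"expected {expected:.3f}, derby measured {measured_derby:.3f}")
```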

4 Analysis and Conclusions


In the early days of studying Java performance, it was very popular to use SPECjvm98.
The primary reasons were that it was the first SPEC Java benchmark, that it was simple
to use, and that by providing several components it allowed the performance analyst to
study several different aspects of the workload's effect on the platform. The new
benchmark continues to provide the latter two benefits. Moreover, SPECjvm2008 contains
a wide range of Java tests with significantly different system characteristics, and
presents a great challenge to software optimizations and to the system under test.
The fact that the reported metric is the geometric mean of 11 components, several of
which have sub-components, implies that platform improvements affecting only one
component are unlikely to move the reported benchmark metric much. For example,
doubling the performance of Derby, while leaving other components unchanged, will

change the reported SPECjvm2008 score by just 6%. We expect, therefore, that keen
interest will be focused on individual component scores as much as on the reported
composite score.
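The 6% figure follows directly from the properties of the geometric mean: improving one of n equally weighted components by a factor s scales the composite by s^(1/n). A short sketch of the calculation:

```python
def composite_change(component_speedup, n_components):
    # geometric mean of n components: speeding one up by factor s
    # multiplies the composite by s ** (1/n)
    return component_speedup ** (1.0 / n_components)

# doubling Derby, one of the 11 top-level components
change = composite_change(2.0, 11)
print(f"composite grows by {100 * (change - 1):.1f}%")  # ~6.5%
```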
The workloads in SPECjvm2008 present many opportunities for the JVM to improve
code generation, threading, memory management, and lock algorithm tuning.
Many such changes could impact all components, though to different degrees; for
example, improvements in object allocation will benefit all components, but some, like
Derby, will benefit more. Other changes will help in a more localized manner. Of
particular interest is floating-point behavior: previous benchmarks did not stress
floating-point, and there was no generally accepted way of studying platform and
JVM improvements in this regard. Workloads like Derby should mitigate that. Similar
comments apply to string and XML behavior, thanks to the XML components.
Many workloads in SPECjvm2008 can also be used to evaluate current and future
hardware features, especially in the memory subsystem and in lock optimization. Our
conclusion, based on this first analysis of the new benchmark, is that it is a
valuable addition to our toolkit. While it cannot replace SPECjbb2005 or
SPECjAppServer2004, and may never be as important or as representative as
those two, its behavior is different enough to make it attractive to the
performance analyst.

References
1. Dieckmann, S., Hölzle, U.: The allocation behavior of the SPECjvm98 Java benchmarks.
In: Performance Evaluation and Benchmarking with Realistic Applications, pp. 77–108. MIT
Press, Cambridge (2001)
2. Radhakrishnan, R.: Microarchitectural Techniques to Enable Efficient Java Execution.
Ph.D. Dissertation, University of Texas at Austin (2000)
3. Li, T., John, L.K.: Characterizing Operating System Activity in SPECjvm98 Benchmarks.
In: John, L.K., Maynard, A.M.G. (eds.) Characterization of Contemporary Workloads, pp.
53–82. Kluwer Academic Publishers, Dordrecht (2001)
4. Excelsior JET Benchmarks, http://web.archive.org/web/20071217043141,
http://www.excelsior-usa.com/jetbenchspecjvm.html
5. Yoo, R.M., Lee, H.-H.S., Lee, H., Chow, K.: Hierarchical Means: Single Number Bench-
marking with Workload Cluster Analysis. In: IEEE International Symposium on Workload
Characterization (IISWC 2007), Boston, MA, USA, September 27-29 (2007)
6. SPECjvm98 Benchmarks, http://www.spec.org/jvm98/
7. SPECjvm2008 Benchmarks, http://www.spec.org/jvm2008
8. Apache Derby, http://db.apache.org/derby/
9. JLayer, http://www.javazoom.net/javalayer/javalayer.html
10. SciMark 2.0 Benchmark, http://math.nist.gov/scimark2/
11. Sunflow, http://sunflow.sourceforge.net/
12. IBM Telco Benchmark, http://www2.hursley.ibm.com/decimal/telco.html
13. SPECjAppServer2004 Benchmark, http://www.spec.org/jAppServer2004
14. SPECjbb2005 Benchmark, http://www.spec.org/jbb2005
Performance Characterization of Itanium® 2-Based Montecito Processor

Darshan Desai², Gerolf F. Hoflehner², Arun Kejariwal¹, Daniel M. Lavery²,
Alexandru Nicolau¹, Alexander V. Veidenbaum¹, and Cameron McNairy²

¹ Center for Embedded Computer Systems, University of California, Irvine, Irvine, CA 92697, USA
² Intel Compiler Lab, Intel Corporation, Santa Clara, CA 95050, USA

Abstract. This paper presents the performance characteristics of the
Intel® Itanium® 2-based Montecito processor and compares its performance
to the previous-generation Madison processor. Measurements on
both are done using the industry-standard SPEC CPU2006 benchmarks.
The benchmarks were compiled using the Intel Fortran/C++ optimizing
compiler and run using the reference data sets. We analyze a large set of
processor parameters such as cache misses, TLB misses, branch prediction,
bus transactions, resource and data stalls, and instruction frequencies.
Montecito achieves 1.14× and 1.16× higher (geometric mean) IPC
on integer and floating-point applications, respectively. We believe that the
results and analysis presented in this paper can potentially guide future IA-64
compiler and architectural research.

1 Introduction
The Itanium 2 family of processors, including the Itanium 2-based dual-core
processor (also known as Montecito), provides a fast, wide, in-order execution core
coupled to a fast, wide, out-of-order memory sub-system and system interface
[1]. The processor has two dual-threaded cores integrated on die with more than
26.5 MB of cache, in a 90 nm process with 7 layers of copper interconnect. Other
improvements over its predecessor include the integration of two cores on-die, each
with a dedicated 12 MB L3 cache and a 1 MB L2I cache, and dual-threading [2]. In
this paper we analyze the key features of Montecito's microarchitecture that
yield better performance than its predecessor (Madison) on both integer and
floating-point applications.
The main contributions of the paper are as follows:

❚ First, we present a description of Montecito's microarchitecture, and discuss
and analyze the key enhancements that result in better performance on
Montecito than on its predecessor. For this, we present detailed profiling data

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 36–56, 2009.

© Springer-Verlag Berlin Heidelberg 2009
Performance Characterization of Itanium® 2-Based Montecito Processor 37

corresponding to a large set of performance metrics such as cache misses,


branch prediction, resource and data stalls.
❚ Second, we present a detailed characterization of the released SPEC CPU2006
suite [3]. To the best of our knowledge, this is the first work to present the
behavior of CPU2006 on two generations of the IA-64 architecture.
❚ Third, we present the relative impact of various performance bottlenecks.
Architects and compiler designers can use these results to accurately
identify and target the key areas for future performance improvements.

1.1 Data Collection

We obtained the performance data on a 1.6 GHz Montecito processor using the
Caliper [4] performance monitoring tool. The detailed configuration is given in
Table 1. The benchmarks were compiled using the Intel Fortran/C++ optimizing
compiler (version 9.1). The compiler supports a wide variety of optimizations such
as software pipelining, predication, software prefetching and whole-program
optimizations.

Table 1. Experimental Setup

Processor Intel® Itanium® 2-based (Montecito) processor, 1.6 GHz
Memory 4 GB
L1 D-Cache 16 KB (4-way, line size: 64 bytes)
L1 I-Cache 16 KB (4-way, line size: 64 bytes)
L2 D-Cache 256 KB (8-way, line size: 128 bytes)
L2 I-Cache 1 MB (8-way, line size: 128 bytes)
L3 Cache 12 MB (12-way, line size: 128 bytes)
Compiler Intel® Fortran/C++ compiler (version 9.1)
OS Red Hat Enterprise Linux AS release 4 (Nahant Update 3), kernel 2.6.9-36.EL #1 SMP
The events monitored for each metric, such as IPC (instructions per cycle), are
listed at the start of the corresponding sections. The event monitoring is
non-intrusive, as it is built into the hardware and does not require any special setup.
The data collected provides valuable insights into system behavior, especially the
role played by buses, I/O and disk, which are typically not modeled in simulators.
The rest of the paper is organized as follows: Section 2 presents an overview
of the Montecito microarchitecture. Sections 3–10 provide in-depth performance
characterization results for Montecito and compare it with the previous-generation
Madison processor. Finally, we conclude in Section 11.

2 Processor Description

In the following subsections we briefly introduce the core and then the memory
sub-system of Intel's Montecito processor. A high-level block diagram of
Montecito is shown in Figure 1.

2.1 Execution Core

The Itanium 2 execution core consists of a front-end, which is responsible for
delivering instructions ready to execute, and a back-end, which completes execution
and forwards requests to the memory sub-system.
38 D. Desai et al.

The front-end, with two levels of branch prediction, two levels of translation
look-aside buffers (TLBs) and a zero-cycle branch predictor, feeds two bundles
(with 3 instructions each) into the 8-bundle-deep instruction buffer every cycle.
Instruction fetch and branch prediction require only two pipe stages (the
Montecito pipeline is shown in Figure 2): the IPG and ROT stages.
The instruction buffer allows the front-end to continue to deliver instructions to
the back-end even when the back-end is stalled, and can be completely bypassed,
adding no pipe stages to execution. The instruction buffer delivers two bundles
of any alignment to the remaining six pipeline stages. The dispersal logic
determines issue groups from the two oldest bundles in the instruction buffer and
allocates up to six instructions to the 11 available functional units (two integer,
four memory, two floating-point, and three branch) in the EXP stage. These
instructions form an issue group and travel down the back-end pipeline,
experiencing stall conditions in unison.
The register renaming logic maps virtual registers in the instruction to
physical registers in the REN stage to support software pipelining and stacked
registers, which are managed by the Register Save Engine (RSE) (which provides
seemingly unlimited virtual registers). Further, 32 of the integer registers
are direct-mapped and do not require renaming, while 96 registers are stacked.
The physical register identifiers access the actual 128 integer or 128
floating-point registers in the REG stage. The register files are highly
ported to support 6 instruction accesses per cycle (12 integer read ports,
since most integer instructions require 2 sources, and eight floating-point
read ports, since floating-point operations may require 4 sources per
instruction).

Fig. 1. Block diagram of a single core of Montecito

Fig. 2. Montecito pipeline (IPG, ROT, EXP, REN, REG, EXE, DET, WRB; FP1-FP4
for floating-point execution). IPG: instruction pointer generation and fetch;
ROT: instruction rotation; EXP: instruction template decode, expand and
disperse; REN: rename (for register stack and rotating registers) and decode;
REG: register file read; EXE: ALU execution; DET: exception detection;
WRB: write back; FPx: floating-point pipe stage.
The instructions in the issue group perform their operation in the EXE pipe
stage, which acts as the primary coupling point between the L1D cache and the
execution core. Scoreboard logic, which tracks long-latency operations, may
stall the instructions in the issue group at the EXE stage to prevent an
instruction from accessing an older instruction's destination until the
register is written. The full bypass network allows nearly immediate access
to previous instruction
results. Some instructions may fault or trap, while branch instructions may be
mispredicted.

Performance Characterization of Itanium® 2-Based Montecito Processor

2.2 Memory Subsystem


Montecito supports three levels of on-chip cache. Each core contains a complete
cache hierarchy, with nearly 13.3 MB per core and a total of nearly 27 MB of
processor cache. The first-level data cache (L1D) is a multi-ported, 16KB, four-
way set associative, physically-addressed cache with a 64-byte line size. The L1D
is non-blocking and in-order. Lower virtual address bits 11:0, which represent
the minimum virtual page, are never translated and are used for cache indexing.
The access latency of the L1D is one cycle unless the use is for an address of
another load operation (i.e., pointer chasing) in which case it is two cycles. The
L1D enforces a write-through, with no write-allocate policy. All stores go to
the second-level cache whether they hit or miss in the L1D. If a store hits in the
L1D, the data is kept in a store buffer until the L1D array becomes available for
update (see Figure 6). These store buffers are capable of merging store data and
forwarding it to later loads with restrictions. The L1D allocates on load misses
according to temporal hints, load type, and available resources.
The major enhancement to the Montecito cache hierarchy starts at the L2 caches,
where the L2 cache is split into dedicated instruction and data caches. This
separation makes it possible to have dedicated access paths to the caches,
thereby eliminating contention and capacity pressures at the L2 level. The L2I holds 1
MB, is 8-way set associative and has a 128-byte line size but has the same seven-
cycle instruction-access latency as the smaller Itanium 2 unified cache. The tag and
data arrays of L2I are single ported, but the control logic supports out-of-order and
pipelined accesses, which enable a high utilization rate. L2D has the same structure
and organization as the unified 256-Kbyte L2 cache of Itanium 2 but with several
microarchitectural improvements to increase throughput and reduce latency and
core stalls. In Itanium 2, any accesses to the same cache line beyond the first access
that misses L2 will access the L2 tags periodically (recirculate) until the tags detect
a hit. The repeated tag accesses consume bandwidth from the core and increase the
L2 miss latency. The L2D suspends such secondary misses until the L2D fill occurs.
At that point, the fill immediately satisfies the suspended request. This approach
greatly reduces bandwidth contention and final latency. The L2D also manages the
32-entry L2 OzQ more efficiently, through pseudo-compression, to increase concur-
rency and reduce core stalls.
The L3 is a multi-way (actual number of ways depends on the model and
configuration of the particular processor) on-chip cache. It has a 128 byte line
size, matching the L2, and only supports entire-line accesses. The L3 tag and
data arrays are single-ported but pipelined, allowing several accesses to each
to be in flight at the same time. The L3 will allocate for read misses
according to temporal hints, but will not allocate for L2 dirty-victim misses
(L3 write requests).
The hardware page walker (HPW) is the third level of address translation and
performs page look-ups from the virtual hash page table (VHPT). On an L2
DTLB/ITLB miss, the HPW will access the L2 cache and (if necessary) the L3

Fig. 3. IPC

cache and the memory to obtain the page entry. If the HPW does not find the
page, it will generate a page fault.

3 IPC

The IPC (instructions per cycle) value signifies the amount of instruction level
parallelism (ILP) that can be achieved using a given compiler and processor. The
IPC was computed by taking the ratio of the number of events corresponding to
the following hardware performance counters:
IA64_INST_RETIRED: This event counts the number of retired Itanium
instructions. This includes NOP instructions and instructions which were
squashed due to an off predicate. We subtract the latter two (measured using
the counters NOPS_RETIRED and PREDICATE_SQUASHED_RETIRED) to compute the
effective IPC.
CPU_OP_CYCLES: This event counts the number of CPU operating cycles.
From Figure 3 we observe that Montecito achieves a higher IPC than its
predecessor Madison across the entire CPU2006 suite. To compare the performance
of Montecito and Madison, we first compute the ratio of the IPC on Montecito
and Madison for each benchmark and then compute the geometric mean of the
ratios. Our analysis shows that Montecito achieves 1.14× and 1.16× higher IPC
on CINT2006 and CFP2006 respectively. The higher IPC can be attributed to a
number of factors: larger caches and other cache-related microarchitectural
enhancements (discussed further in Section 4), and better TLB performance
(discussed further in Section 6). The low IPC of applications such as 429.mcf,
471.omnetpp, 450.soplex and 459.GemsFDTD can, in part, be ascribed to the
large number of L3 cache misses (see Figure 4). Also, note that applications
such as 456.hmmer and 444.namd achieve an IPC of more than 4.
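The effective-IPC and mean-speedup computations just described can be sketched
as follows (an illustrative sketch; the function names are ours, and the
counter values fed in would come from the monitoring runs):

```python
from math import prod

def effective_ipc(inst_retired, nops, pred_squashed, cycles):
    """Effective IPC: retired instructions minus NOPs and
    predicated-off (squashed) instructions, divided by CPU cycles."""
    return (inst_retired - nops - pred_squashed) / cycles

def mean_speedup(ipc_pairs):
    """Geometric mean of per-benchmark (Montecito, Madison) IPC ratios."""
    ratios = [montecito / madison for montecito, madison in ipc_pairs]
    return prod(ratios) ** (1.0 / len(ratios))
```

The geometric mean is used rather than the arithmetic mean so that each
benchmark's ratio contributes multiplicatively and no single benchmark
dominates the suite-level figure.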

4 Cache Performance

In this section, we present a detailed analysis of the performance of the data


cache. Recall that each core on Montecito has a unified L3 cache. Therefore,
while evaluating the L3 performance w.r.t. the data stream, it is critical to measure

Fig. 4. Number of L1D, L2D and L3 data misses per 1000 retired instructions

Fig. 5. L2D Buffers

the L3 misses which occur due to data reads and writes only. We measured
the performance of the data cache using the following hardware performance
counters:
❶ L1D_READ_MISSES.ALL: This event counts the number of L1D read misses.
❷ L2D_MISSES: This event counts the number of L2D misses (in terms of the
L2D cache line requests sent to L3).
❸ L3_READS.DATA_READ.MISS: This event counts the number of L3 load misses.
❹ L3_WRITES.DATA_WRITE.MISS: This event counts the number of L3 store
misses (excludes L2D write backs, includes L3 read-for-ownership requests
that satisfy stores).

The total number of L3 data misses is computed as the sum of ❸ and ❹. This
does not include the L3 instruction misses. From Figure 4 we observe that, on
average, Montecito incurs fewer data cache misses than Madison at any level of
the cache hierarchy. This can be attributed to the larger caches on Montecito.
For example, reductions of up to 1.38× (429.mcf) and 1.76× (470.lbm) are
achieved in L1D and L2D cache misses respectively. In general, we note that
the reduction in data cache misses is higher in CINT2006 than in CFP2006.
Even then, the L1D miss rate is higher in CINT2006 than in CFP2006.
(1) L3D in the figure refers to (unified) L3 data read and write misses. It
should not be interpreted as misses corresponding to a separate L3 data cache.
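The per-level rates plotted in Figure 4 are derived from the raw event counts
as misses per 1000 retired instructions; a minimal sketch of that arithmetic
(the helper name and dictionary layout are ours, the keys mirror the event
names listed above):

```python
def data_mpki(counters, inst_retired):
    """Data-cache misses per 1000 retired instructions at each level.

    L3 data misses are the sum of load misses and store misses
    (events 3 and 4 above); L3 instruction misses are excluded.
    """
    l3_data = (counters["L3_READS.DATA_READ.MISS"]
               + counters["L3_WRITES.DATA_WRITE.MISS"])
    per_kilo = lambda n: 1000.0 * n / inst_retired
    return {
        "L1D": per_kilo(counters["L1D_READ_MISSES.ALL"]),
        "L2D": per_kilo(counters["L2D_MISSES"]),
        "L3D": per_kilo(l3_data),
    }
```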

For applications such as 433.milc and 459.GemsFDTD, we observe that the


number of L2D cache misses is higher than L1D cache misses. This stems from
the fact that in Itanium floating-point loads do not access the L1D. The benefit
of this is that it enables the issue of floating-point loads on any of the 4 memory
ports with minimal restrictions. It also explains the higher L2D and L3 data
cache miss rate in CFP2006 as compared to CINT2006. For example, CFP2006
incurs over 2-fold L3 data cache misses compared to CINT2006. This in turn
increases the memory bus pressure thereby affecting performance adversely. The
higher number of floating-point loads also explains the higher L2D cache miss
rate than L1D cache miss rate in the integer application 429.mcf. In the rest of
this subsection, we discuss the performance of the individual L2D buffers.

4.1 OzQ Buffer


The L2 OzQ and its control logic provide the non-blocking and re-ordering
capabilities of the L2 (see Figure 6). This structure holds up to 32 operations
that cannot be satisfied by the L1D. L1D requests which are conflict-free at
the L2 require fewer than 32 L2 OzQ entries for full streaming. The additional
entries allow the L1D pipeline to continue servicing core memory requests that
hit and issuing additional requests to the L2 while the L2 resolves the conflicts.
The OzQ control logic maintains 3 round robin pointers to track head, tail,
and issue. New requests are allocated at the tail pointer which indicates where
the youngest OzQ operation exists. The head pointer is the oldest operation in
the OzQ. The issue pointer is always between head and tail and indicates where

Fig. 6. Memory subsystem and system interface



the issue logic should look for new operations to send down the L2 pipeline.
A consequence of this head and tail organization is that holes may appear in
the OzQ from operations that have issued (OzQ entries between head and tail
that are no longer valid). The OzQ is not compressed when these holes develop.
Without compression, these holes are not available to new L1D requests. Thus,
there may be instances where the OzQ control logic indicates that there is no
more room for new L1D requests, despite the fact that only a few OzQ entries
are valid. Every cycle the L2 OzQ searches 16 requests, starting at head, for
requests to issue to the L2 data array (L2 hits), the system bus/L3 (L2 misses),
or back to the L1D for another L2 tag lookup (recirculate).
The L2 OzQ control logic allocates up to four contiguous entries per cycle
starting from the last entry allocated in the previous cycle (the tail). If there
are too few entries available (between 4 and 12), the L1D pipeline is stalled to
prohibit any additional operations from being passed to the L2. Requests are
removed from the L2 OzQ when they complete at the L2, that is, when a store
updates the data array, when a load returns correct data to the core, or when
an L2 miss request is accepted by the system bus/L3.
Whenever the OzQ is full, there is increased L2D back pressure, which results
in back-end stalls. Figure 5 reports the time (as a percentage of the total
execution time) for which the OzQ was full. We measured the number of times
the L2D OzQ was full using the L2D_OZQ_FULL hardware performance counter. From
the figure we see that the OzQ is rarely (< 2% on average) full in the case of
CINT2006. On the contrary, in the case of applications such as 410.bwaves and
433.milc of CFP2006, the OzQ is full for more than 50% of the total execution
time. This results in a high percentage of data stalls (see Figure 12), which
adversely affects overall performance. Support for elimination/minimization
of the number of holes in the OzQ can potentially reduce the number of data
stalls. Alternatively, a larger OzQ may yield better performance.
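The head/tail organization and the resulting holes can be illustrated with a
toy model (purely illustrative, not the hardware algorithm; the allocation and
completion policies are simplified from the description above):

```python
class OzQModel:
    """Toy model of the 32-entry L2 OzQ: a ring buffer whose completed
    entries ("holes") between head and tail are not reclaimed until the
    head pointer advances past them (no compression)."""
    SIZE = 32

    def __init__(self):
        self.valid = [False] * self.SIZE
        self.head = 0   # oldest operation
        self.tail = 0   # next allocation slot

    def occupancy(self):
        return (self.tail - self.head) % self.SIZE

    def allocate(self, n):
        """Allocate up to n contiguous entries at the tail; returns how
        many were allocated. Holes between head and tail do not count as
        free space, so the queue can appear full while few entries are
        actually valid."""
        free = self.SIZE - 1 - self.occupancy()
        n = min(n, free)
        for _ in range(n):
            self.valid[self.tail] = True
            self.tail = (self.tail + 1) % self.SIZE
        return n

    def complete(self, idx):
        """Mark an entry complete; it becomes a hole unless it is at the
        head, in which case head advances over it and any further holes."""
        self.valid[idx] = False
        while self.head != self.tail and not self.valid[self.head]:
            self.head = (self.head + 1) % self.SIZE
```

In the model, completing an entry in the middle of the queue leaves the
occupancy unchanged until the head pointer reaches it, mirroring the lack of
compression described above.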

4.2 Fill Buffer

An entry in the L2D fill buffer is allocated when the L2 speculatively issues
an access to the L3/system. The 16 entries in the fill buffer correspond to
the maximal 16 simultaneous outstanding L3/memory requests. When the buffer
gets full, outstanding L3 requests have to be served before new requests can
be submitted. We measured the number of times the L2D fill buffer was full
using the L2D_FILLB_FULL
hardware performance counter. From Figure 5 we see that the fill buffer is rarely
full in the case of CINT2006. In contrast, akin to OzQ, the fill buffer is full
for ≈ 10% of the total execution time for applications such as 410.bwaves and
437.leslie3d. However, on average, the fill buffer is full for only 3% of the
total execution time for CFP2006.

4.3 OzD Buffer

Stores that miss in the L2 record data in the 24-entry L2 Oz Data buffer and their
address in the OzQ. The data needs to be merged with the 128 bytes delivered

Fig. 7. Number of L1I, L2I and L3 data misses per 1000 retired instructions

from the L3/system interface.2 When the buffer is full for a missing store request,
the processor stalls until entries can be freed. We measured the number of times
the L2D Oz data buffer was full using the L2D_OZD_FULL hardware performance
counter. From Figure 5 we note that the OzD buffer is rarely (< 1% on
average) full in both CINT2006 and CFP2006. This suggests that the OzD buffer
is not a performance bottleneck for CPU2006.

4.4 Victim Buffer

The victim buffer holds L2 dirty victim data until it can be issued to the
L3/system interface. Operations are issued, up to four at a time, to access the L2
data array when the conflicts are resolved and resources are available. The buffer
can hold up to 16 entries. If the buffer is full for a request that misses the L2, the
request will recirculate. This in turn increases the L2D back pressure and can
cause back-end stalls. We measured the number of times the L2D victim buffer
was full using the L2D_VICTIMB_FULL hardware performance counter. From
Figure 5 we note that the victim buffer is rarely (< 1%) full in both CINT2006
and CFP2006. From this we conclude that the victim buffer getting full does
not impact overall performance in a significant manner.

4.5 Instruction Cache

Recall that Montecito has a unified L3 cache. Therefore, while evaluating the
L3 performance w.r.t. the instruction stream, it is critical to measure the L3
misses which occur due to instruction reads only. We measured the performance
of the instruction cache using the following hardware performance counters:
➀ L2I_DEMAND_READS: This event counts the number of L1I and ISB (instruction
stream buffer) misses, regardless of whether they hit or miss in the RAB
(Request Address Buffer).
(2) Assume there is an L2 miss and an L3 hit. The L3 cache line size is 128
bytes. The memory system reads 128 bytes out of the L3, merges the data from
the L2 Oz data buffer and writes it back to the L3.

Fig. 8. Data Misspeculation

➁ L2I_PREFETCHES: This event counts the number of prefetch requests issued
to the L2I.
➂ L2I_READS.MISS.ALL: This event counts the fetches which miss the L2I cache.
➃ L3_READS.ALL.MISS: This event counts the L3 read misses.
➄ L3_READS.DATA_READ.MISS: This event counts the number of L3 load misses.
The L1I misses are computed as the sum of ➀ and ➁, whereas the L3 instruction
misses are computed as the difference of ➃ and ➄. From Figure 7 we see that
the integer programs incur a higher number of L1I misses than the
floating-point programs. This is due to the fact that integer codes are very
control-flow intensive and thus very irregular in nature, which results in
higher instruction cache misses. Except for 483.xalancbmk, the number of L2I
misses is negligible in both CINT2006 and CFP2006. This is primarily due to
the presence of the large L2I cache (1 MB, see Table 1 for the detailed
configuration).
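The derived instruction-miss counts described above amount to the following
arithmetic (a sketch; the helper name is ours, and the dictionary keys mirror
the event names listed above):

```python
def instruction_misses(counters):
    """L1I misses = demand reads + prefetches (events 1 and 2);
    L3 instruction misses = all L3 read misses minus L3 data-load
    misses (event 4 minus event 5)."""
    l1i = counters["L2I_DEMAND_READS"] + counters["L2I_PREFETCHES"]
    l3i = (counters["L3_READS.ALL.MISS"]
           - counters["L3_READS.DATA_READ.MISS"])
    return l1i, l3i
```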

5 Data Speculation
Itanium supports data speculation for scheduling a load in advance of one or
more stores. The advanced load records information including memory address,
size and target register number into a hardware structure, the Advanced Load
Address Table (ALAT). The ALAT is implemented as a fully associative data
cache with 32 entries, tagged by the physical register number. An ALAT entry
is invalidated when a subsequent store address collides (overlaps). This
condition is checked by the chk.a instruction: in the case of a collision,
program execution resumes at the compiler-generated recovery code, which
executes the non-speculative version of the load and returns to the point
after the chk.a. Let us consider the example code shown below.

Source:
    int *g;
    int *h;
    foo() {
        int t;
        ...
        *g = 1;
        t = *h + 1;
        ...
    }

Assembly:
        ld4.a rx = [ra] ;;     // advanced load of *h
        add   ry = rx, 1       // t = *h + 1
        ...
        st4   [rb] = 1         // *g = 1
        chk.a rx, rec_code
    resume:
        ...
    rec_code:
        ld4   rx = [ra] ;;
        add   ry = rx, 1
        br    resume ;;

(3) L3D in Figure 7 refers to (unified) L3 instruction misses. It should not
be interpreted as misses corresponding to a separate L3 instruction cache.

In

the above example, assume that the compiler does not have enough information
about whether or not the addresses of g and h overlap. In this case one can
use data speculation to hoist the load above the store. We measured the data
misspeculation rate using the following counters:
INST_FAILED_CHKA_LDC_ALAT.ALL: This provides information on the number of
failed advanced check load (chk.a) and check load (ld.c) instructions that
reach retirement.
INST_CHKA_LDC_ALAT.ALL: This provides information on the number of all
advanced check load (chk.a) and check load (ld.c) instructions that reach
retirement.
Figure 8 shows the data misspeculation percentage for CPU2006 on Montecito.
From the figure we see that only two applications, viz., 435.gromacs and
454.calculix, incur a data misspeculation rate of more than 5%. On average,
CINT2006 and CFP2006 incur data misspeculation rates of 0.65% and 2.62%
respectively. Since chk.a and ld.c constitute only 0.28% and 0.46% of the
total number of retired instructions in CINT2006 and CFP2006 respectively,
data misspeculation does not play a key role in determining overall performance.
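The ALAT mechanism described in this section can be modeled roughly as follows
(a toy sketch, not the hardware algorithm: entries are keyed by target
register, and a store invalidates any entry whose recorded address range
overlaps it):

```python
class ALAT:
    """Toy model of the Advanced Load Address Table (32 entries,
    tagged by target physical register number)."""
    CAPACITY = 32

    def __init__(self):
        self.entries = {}  # register -> (address, size)

    def advanced_load(self, reg, addr, size):
        """ld4.a: record the load's address and size under its target
        register (evicting an arbitrary entry when the table is full)."""
        if len(self.entries) >= self.CAPACITY and reg not in self.entries:
            self.entries.pop(next(iter(self.entries)))
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """A store invalidates entries whose address ranges overlap it."""
        self.entries = {r: (a, s) for r, (a, s) in self.entries.items()
                        if a + s <= addr or addr + size <= a}

    def check(self, reg):
        """chk.a: True if the entry survived (speculation succeeded);
        False means recovery code must re-execute the load."""
        return reg in self.entries
```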

6 TLB Performance
Akin to other processor parameters, the TLB performance is also directly de-
pendent on the nature of the applications [5]. In this section we report the data
and instruction TLB performance using CPU2006.

6.1 DTLB
This subsection compares the DTLB performance of Montecito and Madison. Both
processors have a 2-level DTLB. We measured the performance of each DTLB
level using the following hardware performance counters:
L1DTLB_TRANSFER: This event counts the number of times an L1DTLB miss hits
in the L2DTLB for an access counted in L1D_READS.
L2DTLB_MISSES: This event counts the number of L2DTLB misses (which is the
same as references to the HPW (hardware page walker); DTLB_HIT=0) for demand
requests [6].

Fig. 9. DTLB Performance



Fig. 10. ITLB Performance

From Figure 9 we observe that the DTLB performance of Montecito is better
than that of Madison for both CINT2006 and CFP2006. In general, we note that
integer applications incur a higher number of DTLB misses than floating-point
applications. This suggests that the former have a more irregular memory
access pattern than the latter. Contrary to intuition, we note that the number
of L2DTLB misses is higher than L1DTLB misses in CFP2006. This stems from the
fact that in Montecito floating-point memory operations bypass the L1DTLB and
access the L2DTLB directly (for details see page 94 in [6]). This corresponds
to an increase in hardware page walker utilization, which may affect
performance adversely. Use of larger pages (> 16 KB) for data can potentially
mitigate the above in such cases. The high number of floating-point operations
also explains the high L2DTLB miss rate in the integer applications 429.mcf
and 483.xalancbmk.
Applications such as 434.zeusmp incur a small cache miss rate; likewise, the
branch misprediction rate is also negligible. However, from Figure 9 we note
that 434.zeusmp incurs a large number of DTLB misses, which serves as the
primary performance bottleneck. For such applications, exploration of the TLB
design space is required to achieve better performance.

6.2 ITLB

This subsection compares the ITLB performance of Montecito and Madison. The
Itanium 2-based Montecito has a 2-level ITLB. We measured the performance
of each ITLB level using the following hardware performance counters:
ITLB_MISSES_FETCH.L1ITLB: This event counts the number of misses in L1ITLB,
even if L1ITLB is not updated for an access (uncacheable/NaT page/not-present
page/faulting/some flushed).
ITLB_MISSES_FETCH.L2ITLB: This event counts the total number of misses in
L1ITLB which also missed in L2ITLB.
Unlike the DTLB, from Figure 10 we note that the ITLB performance of
Montecito is the same as that of Madison for both CINT2006 and CFP2006. Akin
to the DTLB behavior, we observe that integer applications incur a higher
number of ITLB misses than floating-point applications. This suggests that
integer applications have a more irregular memory access pattern than
floating-point applications.

7 Memory Bus Transactions


In this section we analyze the memory bus pressure exerted by the CPU2006
applications on Montecito and contrast it with that of Madison. For this, we
measured the bus memory transactions using the following hardware performance
counters, and correlated them with the number of L3 data read misses.
BUS_MEMORY.ALL.SELF: This event counts the number of bus memory transactions
(i.e., memory-read-invalidate, reserved-memory-read, memory-read, and
memory-write transactions).
L3_READS.DATA_READ.MISS: This event counts the number of L3 load misses
(excludes reads for ownership used to satisfy stores).
From Figure 11 we observe that the memory bus pressure is, on average, higher
in CFP2006 than in CINT2006 on Montecito. This can be ascribed to the higher
L3 data miss rates incurred by floating-point applications, which have a
larger memory footprint than integer applications. On the other hand, we note
that CINT2006 achieves a higher reduction in memory bus pressure (1.51× vs.
1.1× for CFP2006) when migrating from Madison to Montecito. This correlates
with the higher reduction in L3 data cache misses for integer, as compared to
floating-point, applications.
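The correlation between bus memory transactions and L3 data read misses can be
quantified with a simple Pearson coefficient over per-benchmark counts (a
sketch; the helper is ours, and the input series would be the per-benchmark
values of the two counters above):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g. per-benchmark BUS_MEMORY.ALL.SELF vs. L3_READS.DATA_READ.MISS."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```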

Fig. 11. Memory Bus Pressure

8 Stalls
In this section we analyze the relative impact of the various resource and data
stalls.

8.1 Resource Stalls


Figure 12 shows the different components [7] of the total execution time for
applications in CINT2006 and CFP2006. The descriptions of the different
components are given below:

Fig. 12. Resource Stalls incurred

❐ Data stalls correspond to full pipe bubbles in the main pipe caused by the
L1D or the execution unit (discussed further in the next subsection).
❐ RSE stalls correspond to full pipe bubbles in the main pipe caused by the
Register Stack Engine. We measured this using the BE_RSE_BUBBLE.ALL hardware
performance counter.
❐ Branch misprediction stalls correspond to full pipe bubbles in the main pipe
due to flushes. We measured this using the BE_FLUSH_BUBBLE.ALL hardware
performance counter.
❐ Front-end stalls in the figure correspond to full pipe bubbles in the main
pipe due to the front end. The front end can in turn be stalled for the
following reasons: FEFLUSH, TLBMISS, IMISS, branch, FILL-RECIRC,
BUBBLE, IBFULL (listed in priority from high to low). We measured this
using the BACK_END_BUBBLE.FE hardware performance counter.
❐ Scoreboarding corresponds to full pipe bubbles in the main pipe due to the
FPU. We measured the above using the BE_L1D_FPU_BUBBLE.ALL hardware
performance counter.
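Given the bubble-cycle counts above and CPU_OP_CYCLES, the per-component
breakdown of Figure 12 reduces to simple percentages (a sketch; the helper
name is ours, and since some bubble counters may overlap, the residual is
only approximate):

```python
def stall_breakdown(bubbles, cycles):
    """Each stall component (bubble-cycle count) as a percentage of total
    execution cycles; the remainder is treated as unstalled execution."""
    pct = {name: 100.0 * c / cycles for name, c in bubbles.items()}
    pct["unstalled"] = 100.0 - sum(pct.values())
    return pct
```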
From the figure we see that data stalls are most prominent amongst the different
types of stalls mentioned above. More importantly, note that in applications
such as 429.mcf, 471.omnetpp and 433.milc, data stalls account for more than
50%(!) of the total execution time. This can, in part, be attributed to their high
L3 cache miss rate (refer to Figure 4). This highlights the high sensitivity of the
performance of the emerging applications, represented by CPU2006, w.r.t. the
cache sub-system. Further, we note that stalls due to branch mispredictions
are second to data stalls. Specifically, stalls due to branch mispredictions
account for 5% and 1.6% of the total execution time, on average, in CINT2006
and CFP2006 respectively. On the other hand, front-end stalls account for
4.6% and 1.4% of the total execution time, on average, in CINT2006 and
CFP2006 respectively.

8.2 Data Stalls


The breakdown of data stalls is shown in Figure 13. As mentioned earlier, the
data stalls occur due to either the L1D or the execution unit.

Fig. 13. Data Stalls incurred

An L1D stall can potentially occur for several reasons, such as: the store
buffer being full, a recirculate, a hardware page walk, a store in conflict
with a returning fill, L2D back pressure, or an L2DTLB-to-L1DTLB transfer.
Register load stalls were measured using the following hardware performance
counters and are computed as ① − ② + ③:
① BE_EXE_BUBBLE.GRALL: This corresponds to the case when the back-end was
stalled by EXE due to a GR/GR or GR/load dependency.
② BE_EXE_BUBBLE.GRGR: This corresponds to the case when the back-end was
stalled by EXE due to a GR/GR dependency.
③ BE_EXE_BUBBLE.FRALL: This corresponds to the case when the back-end was
stalled by EXE due to an FR/FR or FR/load dependency.
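The register-load-stall arithmetic (subtracting the GR/GR-only component so
that only load-dependency stalls remain on the integer side) can be stated
directly (a trivial sketch; the function name is ours):

```python
def register_load_stalls(grall, grgr, frall):
    """Register load stall cycles = GRALL - GRGR + FRALL. GRALL counts
    both GR/GR and GR/load dependency stalls, so subtracting GRGR leaves
    the GR/load stalls, to which the FR-side stalls (FRALL) are added."""
    return grall - grgr + frall
```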
Other data stall components were measured using the following hardware
performance counters:
BE_L1D_FPU_BUBBLE.L1D_HPW: This measures the back-end stalls due to the
hardware page walker.
BE_L1D_FPU_BUBBLE.L1D_PIPE_RECIRC: This measures the back-end stalls due to
a recirculate. The most predictable reason for a request to recirculate is
that the request misses a line that is already being serviced by the system
bus/L3, but has not yet returned to the L2. The L2 only retires L2 hits and
primary L2 misses to an L2 line. It does not retire multiple L2 miss requests;
additional misses remain in the L2 OzQ and recirculate until the tag lookup
returns a hit. The request then issues from the L2 OzQ and returns data (for
a load) or updates the array (for a store) as a normal L2 hit request.
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS: This measures the back-end stalls due to
L2D back pressure (L2BP).
BE_L1D_FPU_BUBBLE.L1D_TLB: This measures the back-end stalls due to an
L2DTLB to L1DTLB transfer.
Note that the various components of data stalls are not mutually exclusive.
In other words, there may be overlap between the different components. From
Figure 13 we note that register load stalls dominate CINT2006, except for
462.libquantum and 456.hmmer, in which recirculates dominate the data stalls.
On the other hand, in CFP2006, 11 out of 17 benchmarks are dominated by
register load stalls, while others such as 433.milc and 459.GemsFDTD are

Fig. 14. Branch Misprediction %

dominated by L2BP and/or L2 recirculates. The latter stems from a high number
of L3 data cache misses (see Figure 4). From the data we conjecture that
applications in which register load stalls are not the dominating component
are memory-bandwidth bound.

9 Branch Prediction
The Itanium 2 processor's branch prediction performance relies on a two-level
prediction algorithm and two levels of branch history storage. The first level
of branch prediction storage is tightly coupled to the L1I cache. This
coupling allows a branch's taken/not-taken history and a predicted target to
be delivered with every L1I demand access in one cycle. The branch prediction
logic uses the history to access a pattern history table and determine a
branch's final taken/not-taken prediction, or trigger, according to the
Yeh-Patt algorithm [8]. The L2 branch cache saves the histories and triggers
of branches evicted from the L1I so that they are available when the branch
is revisited, providing the second storage level.
We measured the branch misprediction rate using the following hardware
performance counters:
BR_MISPRED_DETAIL.ALL.ALL_PRED: This event counts the number of retired
branches, regardless of the prediction result. We denote this by ➀.
BR_MISPRED_DETAIL.ALL.CORRECT_PRED: This event counts the number of correctly
predicted (both outcome and target) retired branches. We denote this by ➁.
The branch misprediction percentage is computed as follows:

    Branch Misprediction % = (➀ − ➁) / ➀ × 100
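The misprediction-percentage computation above, expressed in code (a trivial
sketch; the function name is ours):

```python
def branch_mispred_pct(all_pred, correct_pred):
    """Mispredicted fraction of retired branches, as a percentage:
    (all retired branches - correctly predicted) / all * 100."""
    return 100.0 * (all_pred - correct_pred) / all_pred
```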

Figure 14 shows the branch misprediction percentage on Montecito and Madison
for the applications in CPU2006. From the figure we see that, as expected,
CINT2006 incurs a higher branch misprediction rate than CFP2006. This explains
the higher number of stalls caused by branch mispredictions for integer codes
as compared to floating-point codes (refer to Figure 12). The performance of
the branch predictor on the two machines is almost the same. In the rest of

Fig. 15. Branch Classification

the section, we present the classification of branches and report results for the
prediction accuracy for each type of branch.

9.1 Branch Classification


The branches can be broadly classified into the following categories:
IP-relative branches, indirect branches and return branches. The breakdown of
the total number of branches (see Figure 19) is shown in Figure 15. The
description of each branch type is given later in this subsection. From the
figure we see that IP-relative branches account for more than 90% of the
total branches in both CINT2006 and CFP2006. The penalty associated with each
type of branch is given in Table 2.

Table 2. Branch Misprediction Penalty

Branch Type   Whether Prediction   Target Prediction   Penalty (cycles)
IP Relative   Correct              Correct             0
IP Relative   Correct              Incorrect           1
Return        Correct              Correct             1
Return        Correct              Incorrect           6+
Indirect      Correct              Correct             2
Indirect      Correct              Incorrect           6+
Any           Incorrect            n/a                 6+
On the Itanium 2, the "whether" branch hints are .sptk, .spnt, .dptk and
.dpnt (sp = static prediction, dp = dynamic prediction). They are confidence
hints generated by the compiler. For example, .sptk means the code generator
is certain that the branch will be taken, while .dptk means the code generator
'thinks' the branch will be taken, but is not sure. At the counter level, the
WRONG_PATH events count the number of mispredicted branches due to a wrong
"whether" prediction hint. On the other hand, if the target address is
predicted wrong, the branch gets accounted under WRONG_TARGET.
IP Relative Branches. We measured the misprediction rate of IP relative
branches on Montecito using the following hardware performance counters:
BR_MISPRED_DETAIL.IPREL.CORRECT_PRED: This event counts the number of correctly predicted (outcome and target) IP relative branches.
BR_MISPRED_DETAIL.IPREL.WRONG_PATH: This event counts the number of mispredicted IP relative branches due to wrong branch direction.
BR_MISPRED_DETAIL.IPREL.WRONG_TARGET: This event counts the number of mispredicted IP relative branches due to wrong target for taken branches.
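Given raw values of these three events (however they are collected), the misprediction breakdown plotted in Figure 16 follows directly; the counter readings below are made-up placeholders, not measurements.

```python
# Derive the Figure 16 breakdown from raw event counts.
# The three counter values passed in are hypothetical placeholders.
def mispredict_breakdown(correct_pred, wrong_path, wrong_target):
    """Return (wrong-path %, wrong-target %) over all IP relative branches."""
    total = correct_pred + wrong_path + wrong_target
    return 100.0 * wrong_path / total, 100.0 * wrong_target / total

wp_pct, wt_pct = mispredict_breakdown(correct_pred=9_500_000,
                                      wrong_path=300_000,
                                      wrong_target=200_000)
# Prediction accuracy is then the remainder.
accuracy = 100.0 - (wp_pct + wt_pct)
```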
Performance Characterization of Itanium® 2-Based Montecito Processor 53

Fig. 16. Misprediction behavior of IP relative branches

Fig. 17. Misprediction behavior of indirect branches

For better readability, we only show the percentage of the latter two in Fig-
ure 16. From the figure, we note that a high prediction accuracy is achieved
for the IP relative branches. Specifically, an accuracy of 95.7% and 98.39% is
achieved, on average, for CINT2006 and CFP2006 applications respectively.
Improving the prediction accuracy for IP relative branches can potentially boost
the performance of integer codes, albeit by a small amount.

Indirect Branches. Indirect branches are predicted on the basis of the current
value in the referenced branch register. There is always a 2 cycle penalty for a
correctly predicted indirect branch. We measured the misprediction rate of indirect
branches in CPU2006 on Montecito using the following hardware performance
counters:
BR_MISPRED_DETAIL.NRETIND.CORRECT_PRED: This event counts the number of correctly predicted (outcome and target) non-return indirect branches.
BR_MISPRED_DETAIL.NRETIND.WRONG_PATH: This event counts the number of mispredicted non-return indirect branches due to wrong branch direction.
BR_MISPRED_DETAIL.NRETIND.WRONG_TARGET: This event counts the number of mispredicted non-return indirect branches due to wrong target for taken branches.

Fig. 18. Misprediction behavior of return branches

For better readability, we only show the percentage of the latter two in Fig-
ure 17. From the figure we note that indirect branches incur a large misprediction
rate. Specifically, 50.79% and 54.2% of the total indirect branches are mispre-
dicted in CINT2006 and CFP2006 respectively. In each case, the misprediction
occurs due to the wrong target. On the other hand, from Figures 15 and 19, we
note that indirect branches constitute a small (< 1%) percentage of the total
number of branches. From this, we conclude that improving the prediction ac-
curacy of indirect branches is unlikely to benefit the overall performance in a
significant fashion.
Return Branches. All predictions for return branches come from an eight-
entry return stack buffer (RSB). A branch call pushes both the caller’s IP and
its current function state onto the RSB. A return pops off this information.
There is always a 1 cycle penalty for a correctly predicted return. We measured
the misprediction rate of return branches in CPU2006 on Montecito using the
following hardware performance counters:
BR_MISPRED_DETAIL.RETURN.CORRECT_PRED: This event counts the number of correctly predicted (outcome and target) return type branches. A return misprediction occurs when the return address popped from the RSB does not match the actual return address. The RSB has 8 entries, so in applications with call stacks deeper than 8 such mispredicts are likely to occur.
BR_MISPRED_DETAIL.RETURN.WRONG_PATH: This event counts the number of mispredicted return type branches due to wrong branch direction.
BR_MISPRED_DETAIL.RETURN.WRONG_TARGET: This event counts the mispredicted return type branches due to wrong target for taken branches. This can happen in two cases. First, for predicated returns [(qp) br.ret], e.g., the return is predicted taken although the qualifying predicate (qp) is clear. Second, when the return branch inherits a “wrong” predication hint from another branch that has been issued within a 2 bundle window of the return.
For better readability, we only show the percentage of the latter two in Fig-
ure 18. From the figure we observe that, on average, returns incur a misprediction
rate of less than 1% in both CINT2006 and CFP2006. From this and Table 2 we conclude that
reducing mispredictions due to returns will not yield significant performance
gains.

Fig. 19. Breakdown of retired instructions

10 Instruction Breakdown
Figure 19 presents the instruction breakdown for both CINT2006 and CFP2006.
We measured this using the following hardware performance counters:
LOADS_RETIRED: The event counts the number of retired loads, excluding predicated off loads. The count includes integer, floating-point, RSE, semaphore, VHPT, uncacheable loads and check loads (ld.c) which missed in ALAT and L1D (because this is the only time this looks like any other load). Also included are loads generated by squashed HPW walks.
STORES_RETIRED: The event counts the number of retired stores, excluding those that were predicated off. The count includes integer, floating-point, semaphore, RSE, VHPT, and uncacheable stores.
NOPS_RETIRED: This event provides information on the number of retired nop.i, nop.m, nop.b, and nop.f instructions, excluding nop instructions that were predicated off.
BR_MISPRED_DETAIL.ALL.ALL_PRED: This event counts the number of branches retired of all types, regardless of the prediction result.
PREDICATE_SQUASHED_RETIRED: This event provides information on the number of instructions squashed due to a false qualifying predicate. It includes all non-B-syllable instructions which reached retirement with a false predicate.
FP_OPS_RETIRED: This event provides information on the number of retired floating-point operations, excluding all predicated off instructions.
From the figure we see that loads and stores constitute, on average, 18%
and 17% of the total number of retired instructions in CINT2006 and CFP2006
respectively. Also, we note that the percentage of NOPs is higher, on average,
in CFP2006 (32%) than in CINT2006 (18.2%). This is due to the longer latency
of the floating-point instructions, e.g., the floating-point multiply add (fma) has
a 5 cycle latency [6].
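The breakdown of Figure 19 is a simple normalization of the retirement event counts listed above. A minimal sketch, using illustrative placeholder counts rather than measured values:

```python
# Instruction-mix percentages from retirement event counts
# (e.g. LOADS_RETIRED, STORES_RETIRED, NOPS_RETIRED, ...).
# The raw counts below are illustrative placeholders only.
def breakdown_pct(counts):
    """Map each category to its percentage of all retired instructions."""
    total = sum(counts.values())
    return {name: 100.0 * c / total for name, c in counts.items()}

counts = {
    "loads": 180, "stores": 170, "nops": 250,
    "branches": 120, "fp_ops": 80, "other": 200,
}
pct = breakdown_pct(counts)
```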

11 Conclusion
This paper presented a detailed performance characterization, using the built-in
hardware performance counters, of the dual-core dual-threaded Itanium

Montecito processor. To the best of our knowledge, this is the first work which
uses the SPEC CPU2006 benchmark suite for evaluation of an IA-64 architec-
ture. It also compared the performance of Montecito with the previous generation
Madison processor.
Based on our analysis we make the following conclusions:
❐ First, Montecito achieves, on a geometric mean basis, 14% and 16% higher
IPC for the integer and floating-point applications respectively. These gains
are primarily due to the better cache design on Montecito as compared to
Madison.
❐ Second, a relatively low IPC value is achieved for the C++ benchmarks and
429.mcf in CINT2006 and 5 applications in CFP2006. This is primarily due
to a high cache miss rate and/or a high DTLB miss rate.
❐ Third, the performance gain achievable using an oracle branch predictor
on Itanium is only 5% and 1.5%, on average, for integer and floating-
point applications respectively. From this, we conclude that the performance
potential for a “better” branch predictor on an Itanium-based platform is
relatively low for the SPEC CPU2006 benchmarks.
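An oracle-predictor estimate of this form amounts to removing all cycles stalled on branch mispredictions; a sketch of that arithmetic, with hypothetical cycle counts chosen to reproduce a roughly 5% gain (not the paper's actual counter data):

```python
# Upper bound on speedup from an oracle branch predictor: execute in the
# cycles that remain after removing misprediction stalls.
# Both cycle counts below are hypothetical placeholders.
def oracle_speedup(total_cycles, mispredict_stall_cycles):
    return total_cycles / (total_cycles - mispredict_stall_cycles)

s = oracle_speedup(total_cycles=1_000_000, mispredict_stall_cycles=47_600)
gain_pct = (s - 1.0) * 100.0  # close to the ~5% integer-suite figure
```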

Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable
feedback.

References
1. Naffziger, S., Stackhouse, B., Grutkowski, T., Josephson, D., Desai, J., Alon, E.,
Horowitz, M.: The implementation of a 2-core multi-threaded Itanium®-family
processor. IEEE Journal of Solid-State Circuits 41(1), 197–209 (2006)
2. McNairy, C., Bhatia, R.: Montecito: A dual-core, dual-thread Itanium processor.
IEEE Micro. 25(2), 10–20 (2005)
3. SPEC CPU (2006), http://www.spec.org/cpu2006
4. Caliper, http://ieeexplore.ieee.org/iel5/4434/19364/00895108.pdf
5. Kandiraju, G.B., Sivasubramaniam, A.: Characterizing the d-TLB behavior of
SPEC CPU 2000 benchmarks. In: Proceedings of the 2002 ACM SIGMETRICS
International Conference on Measurement and Modeling of Computer Systems, Ma-
rina Del Rey, CA, pp. 129–139 (2002)
6. Dual-Core Update to the Intel® Itanium® 2 Processor Reference Manual, Revision
0.9 (January 2006),
http://download.intel.com/design/Itanium2/manuals/30806501.pdf
7. Cvetanovic, Z., Bhandarkar, D.: Performance characterization of the Alpha 21164
microprocessor using TP and SPEC workloads. In: Proceedings of the 2nd Interna-
tional Symposium on High-Performance Computer Architecture, San Jose, CA, pp.
270–280 (February 1996)
8. Yeh, T.-Y., Patt, Y.N.: Alternative implementations of two-level adaptive branch
prediction. In: Proceedings of the 19th International Symposium on Computer Ar-
chitecture, Queensland, Australia, pp. 124–134 (1992)
A Tale of Two Processors: Revisiting the RISC-CISC
Debate

Ciji Isen1, Lizy K. John1, and Eugene John2


1
ECE Department, The University of Texas at Austin
2
ECE Department, The University of Texas at San Antonio
{isen,ljohn}@ece.utexas.edu, [email protected]

Abstract. The contentious debates between RISC and CISC have died down,
and a CISC ISA, the x86 continues to be popular. Nowadays, processors with
CISC-ISAs translate the CISC instructions into RISC style micro-operations
(e.g., uops of Intel and ROPS of AMD). The use of the uops (or ROPS) allows
the use of RISC-style execution cores, and use of various micro-architectural
techniques that can be easily implemented in RISC cores. This can easily allow
CISC processors to approach RISC performance. However, CISC ISAs do have
the additional burden of translating instructions to micro-operations. In a 1991
study between VAX and MIPS, Bhandarkar and Clark showed that after cancel-
ing out the code size advantage of CISC and the CPI advantage of RISC, the
MIPS processor had an average 2.7x advantage over the studied CISC proces-
sor (VAX). A 1997 study on Alpha 21064 and the Intel Pentium Pro still
showed 5% to 200% advantage for RISC for various SPEC CPU95 programs. A
decade later and after introduction of interesting techniques such as fusion of
micro-operations in the x86, we set off to compare a recent RISC and a recent
CISC processor, the IBM POWER5+ and the Intel Woodcrest. We find that the
SPEC CPU2006 programs are divided between those showing an advantage on
POWER5+ or Woodcrest, narrowing down the 2.7x advantage to nearly 1.0.
Our study points to the fact that if aggressive micro-architectural techniques for
ILP and high performance can be carefully applied, a CISC ISA can be imple-
mented to yield similar performance as RISC processors. Another interesting
observation is that approximately 40% of all work done on the Woodcrest is
wasteful execution in the mispredicted path.

1 Introduction
Interesting debates on CISC and RISC instruction set architecture styles were fought
over the years, e.g.: the Hennessy-Gelsinger debate at the Microprocessor Forum [8]
and Bhandarkar publications [3, 4]. In the Bhandarkar and Clark study of 1991 [3],
the comparison was between Digital's VAX and an early RISC processor, the MIPS.
As expected, MIPS had larger instruction counts (expected disadvantage for RISC)
and VAX had larger CPIs (expected disadvantage for CISC). Bhandarkar et al. pre-
sented a metric to indicate the advantage of RISC called the RISC factor. The average
RISC factor on SPEC89 benchmarks was shown to be approximately 2.7. Not even
one of the SPEC89 programs showed an advantage on the CISC.

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 57–76, 2009.
© Springer-Verlag Berlin Heidelberg 2009

The Microprocessor Forum debate between John Hennessy and Pat Gelsinger in-
cluded the following two quotes:
"Over the last five years, the performance gap has been steadily diminishing. It
is an unfounded myth that the gap between RISC and CISC, or between x86 and
everyone else, is large. It's not large today. Furthermore, it is getting smaller."
- Pat Gelsinger, Intel
"At the time that the CISC machines were able to do 32-bit microprocessors,
the RISC machines were able to build pipelined 32-bit microprocessors. At the time
you could do a basic pipelining in CISC machine, in a RISC machine you could do
superscalar designs, like the RS/6000, or superpipelined designs like the R4000. I
think that will continue. At the time you can do multiple instruction issue with rea-
sonable efficiency on an x86, I believe you will be able to put second-level caches, or
perhaps even two processors on the same piece of silicon, with a RISC machine."
- John Hennessy, Stanford
Many things have changed since the early RISC comparisons such as the VAX-
MIPS comparison in 1991 [3]. The debates have died down in the last decade, and
most of the new ISAs conceived during the last 2 decades have been mainly RISC.
However, a CISC ISA, the x86 continues to be popular. It translates the x86 macro-
instructions into micro-operations (uops of Intel and ROPS of AMD). The use of the
uops (or ROPS) allows the use of RISC-style execution cores, and use of various mi-
cro-architectural techniques that can be easily implemented in RISC cores. A 1997
study of the Alpha and the Pentium Pro [4] showed that the performance gap was nar-
rowing, however the RISC Alpha still showed significant performance advantage.
Many see CISC performance approaching RISC performance, but exceeding it is
probably unlikely. The hardware for translating the CISC instructions to RISC-style is
expected to consume area, power and delay. Uniform-width RISC ISAs do have an
advantage for decoding, and the runtime translations that are required in CISC are
definitely not an advantage for CISC.
Fifteen years after the heated debates and comparisons, and at a time when all the
architectural ideas in Hennessy's quote (on chip second level caches, multiple proc-
essors) have been put into practice, we set out to compare a modern CISC and RISC
processor. The processors are Intel's Woodcrest (Xeon 5160) and IBM's POWER5+
[11, 16]. A quick comparison of key processor features can be found in Table 1.
Though the processors do not have identical micro-architectures, there is a signifi-
cant similarity. They were released around the same time frame and have similar
transistor counts (276 million for P5+ and 291 million for x86). The main differ-
ence between the processors is in the memory hierarchy. The Woodcrest has larger
L2 cache while the POWER5+ includes a large L3 cache. The SPEC CPU2006 re-
sults of Woodcrest (18.9 for INT/17.1 for FP) are significantly higher than that of
POWER5+ (10.5 for INT/12.9 for FP). The Woodcrest has a 3 GHz frequency
while the POWER5+ has a 2.2 GHz frequency. Even if one were to scale up the
POWER5+ results to account for the frequency difference and compare the scores
for CPU2006 integer programs, it is clear that the CISC processor is exhibiting an ad-
vantage over the RISC processor. In this paper, we set out to investigate the per-
formance differences of these 2 processors.
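The frequency-scaling argument above can be checked with the published numbers: even naively scaling the POWER5+ SPECint2006 score linearly to Woodcrest's clock (an optimistic assumption, since memory latency does not scale), Woodcrest still leads.

```python
# Frequency-normalized comparison of the published SPECint2006 scores:
# 10.5 for POWER5+ at 2.2 GHz versus 18.9 for Woodcrest at 3 GHz.
# Linear scaling with clock is an optimistic simplification.
p5_scaled = 10.5 * (3.0 / 2.2)  # POWER5+ score if naively run at 3 GHz
woodcrest = 18.9
```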

Table 1. Key Features of the IBM POWER5+ and Intel Woodcrest [13]

                             IBM POWER5+       Intel Woodcrest (Xeon 5160)
Bit width                    64 bit            32/64 bit
Cores/chip x Threads/core    2x2               2x1
Clock Frequency              2.2 GHz           3.0 GHz
L1 I/D                       2x64K/32K         2x32K/32K
L2                           1.92M             4M
L3                           36M (off-chip)    None
Execution Rate/Core          5 issue           5 uops
Pipeline Stages              15                14
Out of Order                 200 inst          126 uops
Memory B/W                   12.8 GB/s         10.5 GB/s
Process technology           90nm              65nm
Die Size                     245mm2            144mm2
Transistors                  276 million       291 million
Power (Max)                  100W              80W
SPECint/fp2006 [cores]       10.5 / 12.9       18.9 / 17.1 [4]
SPECint/fp2006_rate [cores]  197 / 229 [16]    60.0 / 44.1 [4]

Other interesting processor studies in the past include a comparison of the


PowerPC601 and Alpha 21064 [12], a detailed study of the Pentium Pro processor
[5], a comparison of the SPARC and MIPS [7], etc.

2 The Two Chips

2.1 POWER5+

The IBM POWER5+ is an out of order superscalar processor. The core contains one
instruction fetch unit, one decode unit, two load/store pipelines, two fixed-point exe-
cution pipelines, two floating-point execution pipelines, and two branch execution
pipelines. It has the ability to fetch up to 8 instructions per cycle and dispatch and re-
tire 5 instructions per cycle. POWER5+ is a multi-core chip with two processor cores
per chip. The core has a 64KB L1 instruction cache and a 32KB L1 data cache. The
chip has a 1.9MB unified L2 cache shared by the two cores. An additional 36MB L3
cache is available off-chip with its controller and directory on the chip.
The POWER5+ memory management unit has 3 types of caches to help address
translation: a translation look-aside buffer (TLB), a segment look-aside buffer (SLB)
and an effective-to-real address table (ERAT). The translation process starts its
search with the ERAT. Only on that failing does it search the SLB and TLB. This
processor supports simultaneous multithreading.
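The ERAT-first lookup order can be sketched as a simple fallback search with refill. This is a toy model for illustration, not IBM's implementation; the address values are arbitrary.

```python
# Toy model of the POWER5+ translation order: consult the ERAT first and
# fall back to the SLB/TLB only on an ERAT miss, refilling the ERAT.
def translate(addr, erat, slb_tlb):
    """Return (real_addr, structure_that_served_the_lookup)."""
    if addr in erat:              # fast first-level lookup
        return erat[addr], "ERAT"
    real = slb_tlb[addr]          # slower SLB/TLB search on ERAT miss
    erat[addr] = real             # refill the ERAT for next time
    return real, "SLB/TLB"

erat, backing = {}, {0x1000: 0x9000}
first = translate(0x1000, erat, backing)   # miss: served by the SLB/TLB
second = translate(0x1000, erat, backing)  # hit: served by the refilled ERAT
```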

2.2 Woodcrest

The Xeon 5160 is based on Intel’s Woodcrest micro-architecture, the server
variant of the Core micro-architecture. It is a dual core, 64 bit, 4-issue
superscalar, moderately pipelined (14 stages), out-of-order MPU, and implemented
in a 65nm process. The processor can address 36 bits of physical memory and 48
bits of virtual. An 8 way 32KB L1 I cache and a dual ported 32KB L1 D cache,
along with a shared 4MB L2 cache, feed data and instructions to the core. Unlike
the POWER5+ it has no L3 cache. The branch prediction occurs inside the
Instruction Fetch Unit. The Core micro-architecture employs the traditional
Branch Target Buffer (BTB), a Branch Address Calculator (BAC), the Return
Address Stack (RAS) and two more predictors: the loop detector (LD), which
predicts loop exits, and the Indirect Branch Predictor (IBP), which picks targets
based on global history and helps for branches to a calculated address. A queue
has been added between the branch target predictors and the instruction fetch to
hide single cycle bubbles introduced by taken branches. The x86 instructions are
generally broken down into simpler micro-operations (uops), but in certain
specialized cases, the processor fuses certain micro-operations to create
integrated or chained operations. Two types of fusion operations are used:
macro-fusion and micro-fusion.

Fig. 1. IBM POWER5+ Processor [16]

Fig. 2. Front-End of the Intel Woodcrest processor [17]

3 Methodology
In this study we use the 12 integer and 17 floating-point programs of the SPEC
CPU2006 [18] benchmark suite and measure performance using the on chip perform-
ance counters. Both POWER5+ and Woodcrest microprocessors provide on-chip
logic to monitor processor related performance events. The POWER5+ Performance

Monitor Unit contains two dedicated registers that count instructions completed and
total cycles as well as four programmable registers, which can count more than 300
hardware events occurring in the processor or memory system. The Woodcrest archi-
tecture has a similar set of registers, two dedicated and two programmable registers.
These registers can count various performance events such as cache misses, TLB
misses, instruction types, branch mispredictions, and so forth. The perfex utility from
the Perfctr tool is used to perform the counter measurements on Woodcrest. A tool
from IBM was used for making the measurements on POWER5+.
The Intel Woodcrest processor supports both 32-bit as well as 64-bit binaries. The
data we present for Woodcrest corresponds to the best runtime for each benchmark
(hence is a mix of 64-bit and 32-bit applications). Except for gcc, gobmk, omnetpp,
xalancbmk and soplex, all other programs were in the 64-bit mode. The benchmarks
for POWER5+ were compiled using Compilers: XL Fortran Enterprise Edition 10.01
for AIX and XL C/C++ Enterprise Edition 8.0 for AIX. The POWER5+ binaries were
compiled using the flags:
C/C++: -O5 -qlargepage -qipa=noobject -D_ILS_MACROS -qalias=noansi -qalloca + PDF (-qpdf1/-qpdf2)
FP: -O5 -qlargepage -qsmallstack=dynlenonheap -qalias=nostd + PDF (-qpdf1/-qpdf2).
The OS used was AIX 5L V5.3 TL05. The benchmarks on Woodcrest were com-
piled using Intel’s compilers - Intel(R) C Compiler for 32-bit applications/ EM64T-
based applications Version 9.1 and Intel(R) Fortran Compiler for 32-bit applications/
EM64T-based applications, Version 9.1. The binaries were compiled using the flag:
-xP -O3 -ipo -no-prec-div / -prof-gen -prof-use.
Woodcrest was configured to run using SUSE LINUX 10.1 (X86-64).

4 Execution Characteristics of the Two Processors

4.1 Instruction Count (path length) and CPI

According to the traditional RISC vs. CISC tradeoff, we expect POWER5+ to have a
larger instruction count and a lower CPI compared to Intel Woodcrest, but we observe
that this distinction is blurred. Figure 3 shows the path length (dynamic instruction
count) of the two systems for SPEC CPU2006. As expected, the instruction counts on
the RISC POWER5+ are higher in most cases; however, the POWER5+ has lower in-
struction counts than the Woodcrest in 5 out of 12 integer programs and 7 out of 17
floating-point programs (indicated with * in Figure 3). The path length ratio is de-
fined as the ratio of the instructions retired by POWER5+ to the number of instruc-
tions retired by Woodcrest. The path length ratio (instruction count ratio) ranges
from 0.7 to 1.23 for integer programs and 0.73 to 1.83 for floating-point programs.
The lack of bias is evident since the geometric mean is about 1 for both integer and
floating-point applications. Figure 4 presents the CPIs of the two systems for SPEC
CPU2006. As expected, the POWER5+ has better CPIs than the Woodcrest in most
cases. However, in 5 out of 12 integer programs and 7 out of 17 floating-point pro-
grams, the Woodcrest CPI is better (indicated with * in Figure 4). The CPI ratio is the

Fig. 3. a) Instruction Count (Path Length)-INT

Fig. 3. b) Instruction Count (Path Length) – FP

ratio of the CPI of Woodcrest to that of POWER5+. The CPI ratio ranges from 0.78
to 4.3 for integer programs and 0.75 to 4.4 for floating-point applications. This data is
a sharp contrast to what was observed in the Bhandarkar-Clark study. They obtained
an instruction count ratio in the range of 1 to 4 and a CPI ratio ranging from 3 to 10.5.
In their study, the RISC instruction count was always higher than CISC and the CISC
CPI was always higher than the RISC CPI.

Fig. 4. a) CPI of the 2 processors for INT

Fig. 4. b) CPI of the 2 processors for FP

Figure 5 illustrates an interesting metric, the RISC factor and its change from the
Bhandarkar-Clark study to our study. Bhandarkar–Clark defined RISC factor as the ratio
of CPI ratio to path length (instruction count) ratio. The x-axis indicates the CPI ratio
(CISC to RISC) and the y-axis indicates the instruction count ratio (RISC to CISC).
The SPEC 89 data-points from the Bhandarkar-Clark study are clustered to the
right side of the figure, whereas most of the SPEC CPU2006 points are located closer
to the line representing RISC factor=1 (i.e. no advantage for RISC or CISC). This line
represents the situation where the CPI advantage for RISC is cancelled out by the path
length advantage for CISC. The shift highlights the sharp contrast between the results
observed in the early days of RISC and the current results.
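The RISC factor itself is a one-line computation. The example points below are hypothetical: one chosen near the RISC factor = 1 line that characterizes the CPU2006 data, and one chosen from within the ranges Bhandarkar and Clark reported for SPEC89.

```python
# Bhandarkar-Clark RISC factor: the (CISC/RISC) CPI ratio divided by the
# (RISC/CISC) path length ratio. A value of 1 means neither style wins;
# > 1 favors RISC. The sample ratios below are illustrative placeholders.
def risc_factor(cpi_ratio, path_length_ratio):
    return cpi_ratio / path_length_ratio

rf_today = risc_factor(cpi_ratio=1.1, path_length_ratio=1.05)  # near 1
rf_1991 = risc_factor(cpi_ratio=8.0, path_length_ratio=3.0)    # near 2.7
```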

4.2 Micro-operations per Instruction (uops/inst)

Woodcrest converts its instructions into simpler instructions called micro-ops
(uops). The number of uops per instruction gives an indication of the complexity
of the x86 instructions used in each benchmark. Past studies by Bhandarkar and
Ding [5] have recorded the uops per instruction to be in the 1.2 to 1.7 range for
SPEC 89 benchmarks. A higher uops/inst ratio would imply that more work is done
per instruction for CISC, something that is expected of CISC. Our observation on
Woodcrest shows the uops per instruction ratio to be much lower than past
studies [5]: an average very close to 1. Table 2 presents the uops/inst for both
SPEC CPU2006 integer and floating-point suites. The integer programs have an
average of 1.03 uops/inst and the FP programs have an average of 1.07 uops/inst.
Only 482.sphinx3 has a uops/inst ratio that is similar to what is observed by
Bhandarkar et al. [5] (a ratio of 1.34). Among the integer benchmarks, mcf has
the highest uops/inst ratio – 1.14.

Fig. 5.(a) CPI ratio vs. Path length ratio – INT

Fig. 5.(b) CPI ratio vs. Path length ratio – FP
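The ratio is simply retired uops over retired instructions, taken from two retirement counters; the counter totals below are placeholders for illustration, not measured data.

```python
# uops per instruction from two retirement counter totals.
# The counts below are hypothetical placeholders, not measurements.
def uops_per_inst(uops_retired, inst_retired):
    return uops_retired / inst_retired

ratio = uops_per_inst(uops_retired=1_070_000, inst_retired=1_000_000)
```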

4.3 Instruction Mix

In this section, we present the instruction mix to help the reader better understand the
later sections on branch predictor performance, and cache performance. The instruc-
tion mix can give us an indication of the difference between the benchmarks. It is far
from a clear indicator of bottlenecks but it can still provide some useful information.
Table 3 contains the instruction mix for the integer programs while Table 4

Table 2. Micro-ops per instruction for CPU2006 on Intel Woodcrest

BENCHMARK uops/inst BENCHMARK uops/inst


400.perlbench 1.06 433.milc 1.01
401.bzip2 1.03 434.zeusmp 1.02
403.gcc 0.97 435.gromacs 1.01
429.mcf 1.14 436.cactusADM 1.12
445.gobmk 0.93 437.leslie3d 1.09
456.hmmer 1.08 444.namd 1.02
458.sjeng 1.06 447.dealII 1.04
462.libquantum 1.05 450.soplex 1.00
464.h264ref 1.02 453.povray 1.07
471.omnetpp 0.98 454.calculix 1.05
473.astar 1.07 459.GemsFDTD 1.16
483.xalancbmk 0.96 465.tonto 1.08
470.lbm 1.00
481.wrf 1.16
482.sphinx3 1.34
410.bwaves.input1 1.01
416.gamess 1.02
INT - geomean 1.03 FP – geomean 1.07

Table 3. Instruction mix for SPEC CPU2006 integer benchmarks

                        POWER5+                            Woodcrest
BENCHMARK     Branches  Stores  Loads  Others    Branches  Stores  Loads  Others
400.perlbench 18% 15% 25% 41% 23% 11% 24% 41%
401.bzip2 15% 8% 23% 54% 15% 9% 26% 49%
403.gcc 19% 17% 18% 46% 22% 13% 26% 39%
429.mcf 17% 9% 26% 48% 19% 9% 31% 42%
445.gobmk 16% 11% 20% 53% 21% 14% 28% 37%
456.hmmer 14% 11% 28% 47% 8% 16% 41% 35%
458.sjeng 18% 6% 20% 56% 21% 8% 21% 50%
462.libquantum 21% 8% 21% 50% 27% 5% 14% 53%
464.h264ref 7% 16% 35% 42% 8% 12% 35% 45%
471.omnetpp 19% 17% 26% 38% 21% 18% 34% 27%
473.astar 13% 8% 27% 52% 17% 5% 27% 52%
483.xalancbmk 20% 9% 23% 47% 26% 9% 32% 33%

contains the same information for floating-point benchmarks. In comparing the com-
position of instructions in the binaries of POWER5+ and Woodcrest, the instruction
mix seems to be largely similar for both architectures. We do observe that some
Woodcrest binaries have a larger fraction of load instructions compared to their
POWER5+ counterparts. For example, the execution of hmmer on POWER5+ has
28% load instructions while the Woodcrest version has 41% loads. Among integer
programs, gcc, gobmk and xalancbmk are other programs where the percentage of
loads in Woodcrest is higher than that of POWER5+.

Table 4. Instruction mix for SPEC CPU2006 floating-point benchmarks

                        POWER5+                            Woodcrest
BENCHMARK     Branches  Stores  Loads  Others    Branches  Stores  Loads  Others
410.bwaves 1% 7% 46% 46% 1% 8% 47% 44%
416.gamess 8% 8% 31% 53% 8% 9% 35% 48%
433.milc 3% 18% 34% 46% 2% 11% 37% 50%
434.zeusmp 2% 11% 26% 61% 4% 8% 29% 59%
435.gromacs 4% 14% 28% 54% 3% 14% 29% 53%
436.cactusADM 0% 14% 38% 48% 0% 13% 46% 40%
437.leslie3d 1% 12% 28% 59% 3% 11% 45% 41%
444.namd 5% 6% 28% 61% 5% 6% 23% 66%
447.dealII 15% 9% 32% 45% 17% 7% 35% 41%
450.soplex 15% 6% 26% 53% 16% 8% 39% 37%
453.povray 12% 14% 31% 44% 14% 9% 30% 47%
454.calculix 4% 6% 25% 65% 5% 3% 32% 60%
459.GemsFDTD 2% 10% 31% 57% 1% 10% 45% 43%
465.tonto 6% 13% 29% 52% 6% 11% 35% 49%
470.lbm 1% 9% 18% 72% 1% 9% 26% 64%
481.wrf 4% 11% 31% 54% 6% 8% 31% 56%
482.sphinx3 8% 3% 31% 59% 10% 3% 30% 56%

We also find a difference in the fraction of branch instructions, though not as sig-
nificant as the differences observed for load instructions. For example, xalancbmk has
20% branches in a POWER5+ execution and 26% branches in the case of Woodcrest.
A similar difference exists for gobmk and libquantum. In the case of hmmer, unlike
the previous cases, the number of branches is lower for Woodcrest (14% for
POWER5+ and only 8% for Woodcrest). Similar examples for difference in the frac-
tion of load and branch instructions can be found in the floating-point programs. A
few examples are cactusADM, leslie3d, soplex, gemsFDTD and lbm. FP programs
have traditionally had a lower fraction of branch instructions, but three of the pro-
grams exhibit more than 12% branches. This observation holds for both POWER5+
and Woodcrest. Interestingly these three programs (dealII, soplex and povray) are
C++ programs.

4.4 Branch Prediction

Branch prediction is a key feature in modern processors allowing out of order execu-
tion. Branch misprediction rate and misprediction penalty significantly influence the
stalls in the pipeline, and the amount of instructions that will be executed specula-
tively and wastefully in the misprediction path. In Figure 6 we present the branch
misprediction statistics for both architectures. We find that Woodcrest outperforms
POWER5+ in this aspect. The misprediction rate for Woodcrest among integer
benchmarks ranges from a low 1% for xalancbmk to a high 14% for astar. Only

gobmk and astar have a misprediction rate higher than 10% for Woodcrest. On the
other hand, the misprediction rate for POWER5+ ranges from 1.74% for xalancbmk
to 15% for astar. On average the misprediction rate for integer benchmarks is 7% for
POWER5+ and 5.5% for Woodcrest. In the case of floating-point benchmarks it is
5% for POWER5+ and 2% for Woodcrest. We see that, in the case of the floating-
point programs, POWER5+ branch prediction performs poorly relative to Woodcrest.
This is particularly noticeable in programs like gamess, dealII, tonto and sphinx.
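The suite-level figures quoted above are averages of the per-benchmark misprediction rates; a minimal sketch, with placeholder per-benchmark percentages rather than the measured ones:

```python
# Suite-level average misprediction rate as the arithmetic mean of
# per-benchmark rates. The percentages below are placeholders only.
def suite_average(rates_pct):
    return sum(rates_pct) / len(rates_pct)

int_rates = [1.0, 2.5, 4.0, 5.5, 8.0, 12.0]  # hypothetical per-benchmark %
avg = suite_average(int_rates)
```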

Fig. 6. a) Branch misprediction – INT

Fig. 6. b) Branch misprediction – FP



4.5 Cache Misses

The cache hierarchy is one of the important micro-architectural features that differ
between the systems. POWER5+ has a smaller L2 cache (1.9M instead of 4M in
Woodcrest), but it has a large shared L3 cache. This makes the performance of the
cache hierarchies of the two processors of particular interest. Figure 7 shows the L1
data cache misses per thousand instructions for both integer and floating-point bench-
marks. Among integer programs mcf stands out, while there are no floating-point pro-
grams with a similar behavior. POWER5+ has a higher L1 D cache miss rate for gcc,
milc and lbm even though both processors have the same L1 D cache size. In general,
the L1 data cache miss rates are under 40 misses per 1k instructions. In spite of the
small L2 cache, the L2 miss ratio on POWER5+ is lower than that on Woodcrest.
While no data is available to further analyze this, we suspect that differences in the

[Figure omitted: bar chart of L1 D cache misses per 1k instructions per SPEC CPU2006 integer benchmark, comparing P5+ and WC.]

Fig. 7. a) L1 D cache misses per 1k Instructions – INT

[Figure omitted: bar chart of L1 D cache misses per 1k instructions per SPEC CPU2006 floating-point benchmark, comparing P5+ and WC.]

Fig. 7. b) L1 D cache misses per 1k Instructions - FP


A Tale of Two Processors: Revisiting the RISC-CISC Debate 69

[Figure omitted: bar chart of L2 cache misses per 1k instructions per SPEC CPU2006 integer benchmark, comparing P5+ and WC.]

Fig. 8. a) L2 cache misses per 1k Instructions – INT

[Figure omitted: bar chart of L2 cache misses per 1k instructions per SPEC CPU2006 floating-point benchmark, comparing P5+ and WC.]

Fig. 8. b) L2 cache misses per 1k Instructions – FP

amount of loads in the instruction mix (as discussed earlier), differences in
instruction cache misses (POWER5+ has a bigger I-cache), and similar factors can lead to this.
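The cache results above are normalized as misses per thousand instructions. A minimal sketch of that normalization, with purely illustrative counter values (the paper reads the raw counts from hardware performance counters):

```cpp
#include <cstdint>

// Misses per kilo-instruction (MPKI): the normalization used for
// the cache miss results. Counter values are illustrative only.
double mpki(uint64_t misses, uint64_t retired_instructions) {
    return 1000.0 * static_cast<double>(misses)
                  / static_cast<double>(retired_instructions);
}
// e.g. 3,200,000 L1 D misses over 80,000,000 instructions -> 40.0 MPKI
```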

4.6 Speculative Execution

Over the years out-of-order processors have achieved significant performance gains
from various speculation techniques. The techniques have primarily focused on con-
trol flow prediction and memory disambiguation. In Figure 9 we present speculation
percentage, a measure of the amount of wasteful execution, for different benchmarks.
We define the speculation % as the ratio of instructions that are executed speculatively
but not retired to the number of instructions retired, i.e.
(dispatched_inst_cnt / retired_inst_cnt) - 1. We find the amount of speculation in integer benchmarks to be

[Figure omitted: bar chart of speculation % per SPEC CPU2006 integer benchmark, comparing P5+ (inst disp/compl) and WC (uops disp/retired).]

Fig. 9. (a) Percentage of instructions executed speculatively - INT

[Figure omitted: bar chart of speculation % per SPEC CPU2006 floating-point benchmark, comparing P5+ (inst disp/compl) and WC (uops disp/retired).]

Fig. 9. (b) Percentage of instructions executed speculatively - FP

higher than in floating-point benchmarks, which is not surprising considering the higher
percentage of branches and branch mispredictions in integer programs.
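The definition given above translates directly into code; a minimal sketch with illustrative counter values:

```cpp
#include <cstdint>

// Speculation % as defined in the text: (dispatched / retired) - 1,
// i.e. the fraction of executed work that was later discarded.
// Counter values here are illustrative only.
double speculation_fraction(uint64_t dispatched, uint64_t retired) {
    return static_cast<double>(dispatched)
         / static_cast<double>(retired) - 1.0;
}
// e.g. 140 dispatched per 100 retired -> roughly 0.40 (40% speculative)
```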
In general, the Woodcrest micro-architecture speculates much more aggressively
than POWER5+. On average, 40% of instructions on Woodcrest and 29% of
instructions on POWER5+ are speculative for integer benchmarks. The amount of
speculation for FP programs on average is 20% for Woodcrest and 9% for
POWER5+. Despite concerns about power consumption, the fraction of instructions
spent on the mispredicted path has increased from the average of 20% (25% for INT
and 15% for FP) seen in the 1997 Pentium Pro study. Among the floating-point
programs, POWER5+ speculates more than Woodcrest in four of the benchmarks: dealII,
soplex, povray and sphinx. It is interesting to note that three of these benchmarks are C++
programs. With limits on power and energy consumption, wasted execution on the
speculative path is of great concern.

5 Techniques That Aid Woodcrest


Part of Woodcrest’s performance advantage comes from the reduction of micro-
operations through fusion. Another important technique is early load address resolu-
tion. In this section, we analyze these specific techniques.

5.1 Macro-fusion and Micro-op Fusion

Although the Woodcrest breaks instructions into micro-operations, in certain cases it
also fuses micro-operations, combining specific uops into integrated operations
and thus taking advantage of simple or complex operations as it sees fit. Macro-
fusion [11] is a new feature of Intel's Core micro-architecture, designed to
decrease the number of micro-ops in the instruction stream. Select pairs of compare
and branch instructions are fused together during the pre-decode phase and then sent
through any one of the four decoders. The decoder then produces a single micro-op from the
fused pair of instructions. The hardware can perform at most one macro-fusion
per cycle.
Table 5 and Table 6 show the percentage of fused operations for integer and float-
ing-point benchmarks. In the tables, fused operations are classified as macro-fusion
and micro-fusion. Micro-fusion is further classified into two: Loads that are fused
with arithmetic operations or an indirect branch (LD_IND_BR) and store address
computations fused with the data store (STD_STA). As stated before, the version of each
benchmark (32-bit vs. 64-bit) was selected based on overall performance. This was
done to give the maximum performance benefit to CISC. It turns out that most of the
programs performed best in 64-bit mode, but in this mode macro-fusion does not work
well. Since our primary focus is comparing POWER5+ with Woodcrest, we used
the binaries that yielded the best performance for this study as well.
The best case runs (runs with the highest performance) for integer benchmarks have an
average of 19% operations that can be fused by micro or macro-fusion. This implies
that the average uops/inst would go up from 1.03 to 1.23 if there were no fusion.
The majority of the fusion comes from micro-fusion, an average of 14%, with the
rest from macro-fusion. Macro-fusion in integer benchmarks ranges from 0.13% for
hmmer to 21% for xalancbmk. Micro-fusion ranges from 6% (astar)
to 29% (hmmer). Among the two sub-components of micro-fusion, store address
computation fusion is predominant. 'Store address and store' fusion ranges from 4%, for
astar, to 18%, for omnetpp. On the other hand, load fusion (LD_IND_BR – loads
fused with arithmetic operations or an indirect branch) is lowest for mcf and
highest for hmmer. The best case runs (runs with the highest performance) for the FP
benchmarks have an average of 15% uops that can be fused by micro or macro-fusion.
Almost all of this fusion is micro-fusion. The percentage of uops that can be
fused via micro-fusion in FP programs ranges from 4% (sphinx) to 21% (leslie3d).

Table 5. Micro & macro-fusion in SPEC CPU2006 integer benchmarks

BENCHMARK        uops/inst  %macro-fusion uops  %micro-fusion uops  %fusion uops  %LD_IND_BR uops  %STD_STA uops
400.perlbench 1.06 0% 13% 13% 3% 11%
401.bzip2 1.03 0% 12% 12% 4% 9%
403.gcc 0.97 15% 16% 31% 4% 13%
429.mcf 1.14 0% 8% 8% 0% 8%
445.gobmk 0.93 12% 19% 31% 5% 15%
456.hmmer 1.08 0% 29% 29% 14% 15%
458.sjeng 1.06 0% 9% 9% 2% 7%
462.libquantum 1.05 0% 8% 8% 3% 5%
464.h264ref 1.02 0% 18% 18% 6% 12%
471.omnetpp 0.98 10% 22% 31% 5% 18%
473.astar 1.07 0% 6% 6% 1% 4%
483.xalancbmk 0.96 21% 13% 34% 13% 10%
Average 1.03 5% 14% 19% 5% 11%

Table 6. Micro & macro-fusion in SPEC CPU2006 – FP benchmarks

BENCHMARK        uops/inst  %macro-fusion uops  %micro-fusion uops  %fusion uops  %LD_IND_BR uops  %STD_STA uops
410.bwaves 1.01 0% 19% 19% 11% 8%
416.gamess 1.02 0% 20% 20% 11% 9%
433.milc 1.01 0% 13% 13% 3% 11%
434.zeusmp 1.02 0% 13% 13% 5% 8%
435.gromacs 1.01 0% 18% 18% 3% 14%
436.cactusADM 1.12 0% 20% 20% 8% 12%
437.leslie3d 1.09 0% 21% 21% 12% 10%
444.namd 1.02 0% 9% 9% 3% 6%
447.dealII 1.04 0% 19% 19% 12% 7%
450.soplex 1.00 4% 15% 20% 8% 7%
453.povray 1.07 0% 13% 13% 5% 8%
454.calculix 1.05 0% 9% 9% 6% 3%
459.GemsFDTD 1.16 0% 13% 13% 5% 9%
465.tonto 1.08 0% 20% 20% 10% 10%
470.lbm 1.00 0% 19% 19% 10% 9%
481.wrf 1.16 0% 13% 13% 7% 6%
482.sphinx3 1.34 0% 4% 4% 2% 2%
Average 1.07 0% 15% 15% 7% 8%

Hypothetically, not having fusion would increase the uops/inst for floating-point
programs from 1.07 uops/inst to 1.23 uops/inst and for integer programs from 1.03
uops/inst to 1.23 uops/inst. It is clear that this micro-architectural technique has
played a significant part in blunting the advantage of RISC by reducing the number of
uops that are executed per instruction.
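The hypothetical 1.23 uops/inst figures can be reproduced with a simple back-of-the-envelope model, assuming (as the tables suggest) that each fused operation would otherwise be exactly two uops:

```cpp
// If a fraction f of executed operations are fused, removing fusion turns
// each of them back into two uops, scaling uops/inst by (1 + f).
// This is an illustrative model, not a measurement.
double uops_without_fusion(double uops_per_inst, double fused_fraction) {
    return uops_per_inst * (1.0 + fused_fraction);
}
// INT: 1.03 * (1 + 0.19) = 1.2257, which rounds to the 1.23 quoted above
// FP:  1.07 * (1 + 0.15) = 1.2305, again roughly 1.23
```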

5.2 Early Load Address Resolution

The cost of memory access has been accentuated by the higher performance of the
logic unit of the processor (the memory wall). The Woodcrest architecture is said to
perform an optimization aimed at reducing the load latency of operations involving
the stack pointer [2]. The work by Bekerman et al. [2] proposes tracking the
ESP register, and simple operations on it of the form reg±immediate, to enable quick
resolution of load addresses at decode time. The ESP register in IA32 holds the
stack pointer and is almost never used for any other purpose. Instructions such as
CALL/RET, PUSH/POP, and ENTER/LEAVE can implicitly modify the stack
pointer. There can also be general-purpose instructions that modify the ESP in the
fashion ESP←ESP±immediate. These instructions are heavily used for procedure
calls and are translated into uops as given below in Table 7. The value of the immedi-
ate operand is provided explicitly in the uop.

Table 7. Early load address prediction - Example

PUSH EAX              ESP ← ESP − immediate
                      mem[ESP] ← EAX
POP EAX               EAX ← mem[ESP]
                      ESP ← ESP + immediate
LOAD EAX from stack   EAX ← mem[ESP + imm]

These ESP modifications can be tracked easily after decode. Once the initial ESP
value is known later values can be computed after each instruction decode. In essence
this method caches a copy of the ESP value in the decode unit. Whenever a simple
modification to the ESP value is detected the cached value is used to compute the ESP
value without waiting for the uops to reach execution stage. The cached copy is also
updated with the newly computed value. In some cases the uops cause operations that
are not easy to track and compute; for example loads from memory into the ESP or
computations that involve other registers. In these cases the cached value of ESP is
flagged and it is not used for computations until the uop passes the execution stage
and the new ESP value is obtained. Meanwhile, if any other instruction that
follows attempts to modify the ESP value, the decoder tracks the change operation
and the delta value it causes. Once the new ESP value is obtained from the uop that
passed the execution stage, the observed delta is applied to bring the ESP
register up to date. Having the ESP value at hand allows quick resolution of load
addresses, thereby avoiding the associated stalls. This technique is expected to bear
fruit in workloads that make significant use of the stack, most likely for function
calls. Further details on this optimization can be found in Bekerman et al. [2].
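The tracking scheme described above can be sketched as a small software model. All names here are illustrative; the actual hardware state is not documented at this level of detail:

```cpp
#include <cstdint>

// Rough software model of decode-time ESP tracking: a cached ESP copy,
// a validity flag, and a delta accumulated while the real value is pending.
struct EspTracker {
    uint64_t cached_esp = 0;
    bool     valid      = false;  // cleared on untrackable updates
    int64_t  delta      = 0;      // changes seen while waiting for execute

    // Simple update of the form ESP <- ESP +/- immediate.
    void on_simple_update(int64_t imm) {
        if (valid) cached_esp += imm;   // resolved at decode, no stall
        else       delta      += imm;   // remember until real value arrives
    }

    // Untrackable update, e.g. a load from memory into ESP.
    void on_complex_update() { valid = false; delta = 0; }

    // The uop passed the execution stage and produced the real ESP value.
    void on_executed_value(uint64_t esp_from_execute) {
        cached_esp = esp_from_execute + delta;  // apply accumulated delta
        delta = 0;
        valid = true;
    }
};
```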

In Table 8 we present data related to the ESP optimization. The percentage of
ESP.SYNC refers to the number of times the ESP value had to be synchronized with
the delta value, as a percent of the total number of instructions. A high number is not
desirable, as it implies a frequent need to synchronize the ESP data, i.e. the ESP
value cannot be computed at the decoder because it has to wait for the value from the
execution stage. % ESP.ADDITIONS is a similar percentage for the number of ESP addition
operations performed in the decode unit – an indication of the scope of this optimization.
A high value for this metric is desirable: the larger the percentage of
instructions that use the addition operation, the more cycles are saved. The
stack optimization is more prominent in the integer benchmarks than in
the floating-point benchmarks. The % ESP addition in integer
benchmarks ranges from 0.1% for hmmer to 11.3% for xalancbmk. The % ESP
synchronization is low even for benchmarks with a high % ESP addition; for example,
xalancbmk exhibits 11.3% ESP addition but only 3.76% ESP synchronization.
C++ programs are expected to have more function calls and hence more scope for
this optimization. Among the integer programs, omnetpp and xalancbmk are among the
ones with a large % ESP addition. The others are gcc and gobmk; the modular and
highly control-flow-intensive nature of gcc allows for these optimizations. Although
astar is a C++ application, it makes very little use of C++ features [19] and we find
that it has a low % ESP addition. Among the floating-point applications, dealII
and povray, both C++ applications, have a higher % ESP addition.

Table 8. Percentage of instructions on which early load address resolutions were applied
BENCHMARK        % ESP SYNCH  % ESP ADDITIONS  BENCHMARK        % ESP SYNCH  % ESP ADDITIONS
400.perlbench 0.90% 6.88% 433.milc 0.00% 0.04%
401.bzip2 0.30% 1.41% 434.zeusmp 0.00% 0.00%
403.gcc 1.80% 7.99% 435.gromacs 0.03% 0.14%
429.mcf 0.17% 0.24% 436.cactusADM 0.00% 0.00%
445.gobmk 1.81% 8.45% 437.leslie3d 0.00% 0.00%
456.hmmer 0.00% 0.11% 444.namd 0.00% 0.01%
458.sjeng 0.41% 3.19% 447.dealII 0.20% 3.05%
462.libquantum 0.12% 0.13% 450.soplex 0.11% 0.54%
464.h264ref 0.12% 1.44% 453.povray 0.67% 2.77%
471.omnetpp 3.06% 7.60% 454.calculix 0.03% 0.09%
473.astar 0.01% 0.14% 459.GemsFDTD 0.08% 0.33%
483.xalancbmk 3.76% 11.30% 465.tonto 0.26% 0.77%
470.lbm 0.00% 0.00%
481.wrf 0.19% 0.35%
482.sphinx3 0.17% 0.90%
410.bwaves 0.03% 0.04%
416.gamess 0.15% 0.76%
INT - geomean 1.04% 4.07% FP - geomean 0.12% 0.60%

On average, the benefit from the ESP-based optimization is 4% for integer programs
and 0.6% for FP programs. Each ESP-based addition that is avoided amounts to
avoiding the execution of one uop. Although the average benefit is low, some
applications benefit significantly, reducing unnecessary computation and thereby
helping the performance of those applications in relation to their POWER5+
counterparts.

6 Conclusion

Using the SPEC CPU2006 benchmarks, we analyze the performance of a recent CISC
processor, the Intel Woodcrest (Xeon 5160), against a recent RISC processor, the IBM
POWER5+. In a CISC vs. RISC comparison in 1991, the RISC processor showed an
advantage of 2.7x, and in a 1997 study of the Alpha 21064 and the Pentium Pro, the
RISC Alpha showed a 5% to 200% advantage on the SPEC CPU92 benchmarks. Our
study shows that the performance difference between RISC and CISC has narrowed
further. In contrast to the earlier studies, where the RISC processors showed
dominance on all SPEC CPU programs, neither the RISC nor CISC dominates in this
study. In our experiments, the Woodcrest shows advantage on several of the SPEC
CPU2006 programs and the POWER5+ shows advantage on several other programs.
Various factors have helped the Woodcrest obtain its RISC-like performance.
Splitting x86 instructions into micro-operations of uniform complexity has helped;
interestingly, the Woodcrest also combines (fuses) some micro-operations into
a single macro-operation. In some programs, up to a third of all micro-operations are
seen to benefit from fusion, resulting in chained operations that are executed in a
single step by the relevant functional unit. Fusion also reduces the demand on reservation
station and reorder buffer entries. Additionally, it reduces the net uops per instruction.
The average uops per instruction for Woodcrest in 2007 is 1.03 for integer
programs and 1.07 for floating-point programs, while in Bhandarkar and Ding's 1997
study [5] using SPEC CPU95 programs the average was around 1.35 uops/inst.
Although the POWER5+ has a smaller L2 cache than the Woodcrest, it is seen to achieve
equal or better L2 cache performance than the Woodcrest. The Woodcrest has better
branch prediction performance than the POWER5+. Approximately 40%/20% (int/fp)
of instructions in Woodcrest and 29%/9% (int/fp) of instructions in the POWER5+
are seen to be in the speculative path.
Our study points out that with aggressive micro-architectural techniques for ILP,
CISC and RISC ISAs can be implemented to yield very similar performance.

Acknowledgement
We would like to acknowledge Alex Mericas, Venkat R. Indukuru and Lorena
Pesantez at IBM Austin for their guidance. The authors are supported in part by NSF
grant 0702694, and an IBM Faculty award. Any opinions, findings and conclusions
expressed in this paper are those of the authors and do not necessarily reflect the
views of the National Science Foundation (NSF) or other research sponsors.

References
1. Agerwala, T., Cocke, J.: High-performance reduced instruction set processors. Technical
report, IBM Computer Science (1987)
2. Bekerman, M., Yoaz, A., Gabbay, F., Jourdan, S., Kalaev, M., Ronen, R.: Early load ad-
dress resolution via register tracking. In: Proceedings of the 27th Annual international
Symposium on Computer Architecture, pp. 306–315
3. Bhandarkar, D., Clark, D.W.: Performance from architecture: comparing a RISC and a
CISC with similar hardware organization. In: Proceedings of ASPLOS 1991, pp. 310–319
(1991)
4. Bhandarkar, D.: A Tale of two Chips. ACM SIGARCH Computer Architecture
News 25(1), 1–12 (1997)
5. Bhandarkar, D., Ding, J.: Performance Characterization of the Pentium® Pro Processor.
In: Proceedings of the 3rd IEEE Symposium on High Performance Computer Architecture,
February 01-05, 1997, pp. 288–297 (1997)
6. Chow, F., Correll, S., Himelstein, M., Killian, E., Weber, L.: How many addressing modes
are enough. In: Proceedings of ASPLOS-2, pp. 117–121 (1987)
7. Cmelik, et al.: An analysis of MIPS and SPARC instruction set utilization on the SPEC
benchmarks. In: ASPLOS 1991, pp. 290–302 (1991)
8. Hennessy–Gelsinger Debate: Can the 386 Architecture Keep Up? John Hennessy and Pat
Gelsinger Debate the Future of RISC vs. CISC. Microprocessor Report
9. Hennessy, J.: VLSI Processor Architecture. IEEE Transactions on Computers C-33(11),
1221–1246 (1984)
10. Hennessy, J.: VLSI RISC Processors. VLSI Systems Design, VI:10, pp. 22–32 (October
1985)
11. Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient
Performance,
http://www.intel.com/technology/architecture-silicon/core/
12. Smith, J.E., Weiss, S.: PowerPC 601 and Alpha 21064: A Tale of Two RISCs. IEEE
Computer
13. Microprocessor Report – Chart Watch - Server Processors. Data as of (October 2007)
http://www.mdronline.com/mpr/cw/cw_wks.html
14. Patterson, D.A., Ditzel, D.R.: The case for the reduced instruction set computer. Computer
architecture News 8(6), 25–33 (1980)
15. Patterson, D.: Reduced Instruction Set Computers. Communications of the ACM 28(1),
8–21 (1985)
16. Kanter, D.: Fall Processor Forum 2006: IBM’s POWER6,
http://www.realworldtech.com/
17. Kanter, D.: Intel’s Next Generation Microarchitecture Unveiled. Real World Technologies
(March 2006), http://www.realworldtech.com
18. SPEC Benchmarks, http://www.spec.org
19. Wong, M.: C++ benchmarks in SPEC CPU 2006. SIGARCH Computer Architecture
News 35(1), 77–83 (2007)
Investigating Cache Parameters of x86 Family
Processors

Vlastimil Babka and Petr Tůma

Department of Software Engineering


Faculty of Mathematics and Physics, Charles University
Malostranské náměstí 25, Prague 1, 118 00, Czech Republic
{vlastimil.babka,petr.tuma}@dsrg.mff.cuni.cz

Abstract. The excellent performance of the contemporary x86 processors
is partially due to the complexity of their memory architecture,
which therefore plays a role in performance engineering efforts. Unfortu-
nately, the detailed parameters of the memory architecture are often not
easily available, which makes it difficult to design experiments and eval-
uate results when the memory architecture is involved. To remedy this
lack of information, we present experiments that investigate detailed pa-
rameters of the memory architecture, focusing on such information that
is typically not available elsewhere.

1 Introduction

The memory architecture of the x86 processor family has evolved over more than
a quarter of a century – by all standards, an ample time to achieve consider-
able complexity. Equipped with advanced features such as translation buffers
and memory caches, the architecture represents an essential contribution to the
overall performance of the contemporary x86 family processors. As such, it is a
natural target of performance engineering efforts, ranging from software perfor-
mance modeling to computing kernel optimizations.
Among such efforts is the investigation of the performance related effects
caused by sharing of the memory architecture among multiple software com-
ponents, carried out within the framework of the Q-ImPrESS project1 . The
Q-ImPrESS project aims to deliver a comprehensive framework for multicrite-
rial quality of service modeling in the context of software service development.
The investigation, necessary to achieve a reasonable modeling precision, is based
on evaluating a series of experiments that subject the memory architecture to
various workloads.
In order to design and evaluate the experiments, detailed information about
the memory architecture exercised by the workloads is required. Lack of information
about features such as hardware prefetching, associativity or inclusivity
1
This work is supported by the European Union under the ICT priority of the Seventh
Research Framework Program contract FP7-215013 and by the Czech Academy of
Sciences project 1ET400300504.

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 77–96, 2009.

© Springer-Verlag Berlin Heidelberg 2009
78 V. Babka and P. Tůma

could result in naive experiment designs, where the workload behavior does not
really target the intended part of the memory architecture, or in naive exper-
iment evaluations, where incidental interference between various parts of the
memory architecture is interpreted as the workload performance.
Within the Q-ImPrESS project, we have carried out multiple experiments on
both AMD and Intel processors. Surprisingly, the documentation provided by
both vendors for their processors has turned out to be somewhat less complete
and correct than necessary – some features of the memory architecture are only
presented in a general manner applicable to an entire family of processors, other
details are buried among hundreds of pages of assorted optimization guidelines.
To overcome the lack of detailed information, we have constructed additional
experiments intended specifically to investigate the parameters of the memory
architecture. These experiments are the topic of this paper.
We believe that the experiments investigating the parameters of the memory
architecture can prove useful to other researchers – some performance relevant
aspects of the memory architecture are extremely sensitive to minute details,
which makes the investigation tedious and error prone. We present both an
overview of some of the more interesting experiments and an overview of the
framework used to execute the experiments – Section 2 focuses on the parameters
of the translation buffers, Section 3 focuses on the parameters of the memory
caches, Section 4 presents the framework.
After a careful consideration, we have decided against providing an overview of
the memory architecture of the x86 processor family. In the following, we assume
familiarity with the x86 processor family on the level of the vendor supplied user
guides [1,2], or at least on the general programmer level [3].

1.1 Experimental Platforms

For the experiments, we have chosen two platforms that represent common
servers with both Intel and AMD processors, further referred to as Intel Server
and AMD Server.
Intel Server. A server configuration with an Intel processor is represented by
the Dell PowerEdge 1955 machine, equipped with two Quad-Core Intel Xeon
CPU E5345 2.33 GHz (Family 6 Model 15 Stepping 11) processors with inter-
nal 32 KB L1 caches and 4 MB L2 caches, and 8 GB Hynix FBD DDR2-667
synchronous memory connected via Intel 5000P memory controller.
AMD Server. A server configuration with an AMD processor is represented
by the Dell PowerEdge SC1435 machine, equipped with two Quad-Core AMD
Opteron 2356 2.3 GHz (Family 16 model 2 stepping 3) processors with internal
64 KB L1 caches, 512 KB L2 caches and 2 MB L3 caches, integrated memory
controller with 16 GB DDR2-667 unbuffered, ECC, synchronous memory.
To collect the timing information, the RDTSC processor instruction is used.
In addition to the timing information, we collect the values of the performance
counters for events related to the experiments using the PAPI library [4] running
on top of perfctr [5]. The performance events supported by the platforms are
described in [1, Appendix A.3] and [6, Section 3.14]. For overhead incurred by
the measurement framework, see [7].
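A typical way to read the time stamp counter from C++ on these platforms is via inline assembly. The sketch below shows the general technique only, not the paper's actual measurement code:

```cpp
#include <cstdint>

// Read the x86 time stamp counter (RDTSC). One common GCC-style
// implementation; the paper's exact measurement harness is not shown.
static inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Typical use: bracket the measured operation and subtract.
//   uint64_t start = rdtsc();
//   /* measured workload */
//   uint64_t cycles = rdtsc() - start;
```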
Although mostly irrelevant, both platforms are running Fedora Linux 8 with
kernel 2.6.25.4-10.fc8.x86_64, gcc-4.1.2-33.x86_64, glibc-2.7-2.x86_64. Only 4-level
paging with 4 KB pages is investigated.

1.2 Presenting Results


To illustrate the results, we typically provide plots of values such as the dura-
tion of the measured operation or the value of a performance counter, typically
plotted as a dependency on one of the experiment parameters. Durations are
expressed in processor clocks. On Platform Intel Server, a single clock tick cor-
responds to 0.429 ns. On Platform AMD Server, a single clock tick corresponds
to 0.435 ns.
To capture the statistical variability of the results, we use boxplots of indi-
vidual samples, or, where the duration of individual operations approaches the
measurement overhead, boxplots of averages. The boxplots are scaled to fit the
boxes with the whiskers, but not necessarily to fit all the outliers, which are
usually not related to the experiment. Where boxplots would lead to poorly
readable graphs, we use lines to plot the trimmed means.
When averages are used in a plot, the legend of the plot informs about the
details. The Avg acronym is used to denote standard mean of the individual ob-
servations – for example, 1000 Avg indicates that the plotted values are standard
means from 1000 operations performed by the experiment. The Trim acronym
is used to denote trimmed mean of the individual observations where 1 % of
minimum and maximum observations was discarded – for example, 1000 Trim
indicates that the plotted values are trimmed means from 1000 operations per-
formed by the experiment. The acronyms can be combined – for example, 1000
walks Avg Trim means that observations from 1000 walks performed by the ex-
periment were the input of a standard mean calculation, whose outputs were the
input of a trimmed mean calculation, whose output is plotted.
Since the plots that use averages do not give information about the statistical
variability of the results, we point out in text those few cases where the standard
deviation of the results is above 0.5 processor clock cycles or 0.2 performance
event counts.
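A minimal sketch of the trimmed mean as described here (the paper's exact implementation is not shown; the 1% trim fraction is a parameter):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Trimmed mean: sort the samples, discard the given fraction (1% in the
// text) of the smallest and largest observations, average the rest.
double trimmed_mean(std::vector<double> samples, double trim_frac) {
    std::sort(samples.begin(), samples.end());
    std::size_t cut = static_cast<std::size_t>(samples.size() * trim_frac);
    double sum = 0.0;
    std::size_t kept = 0;
    for (std::size_t i = cut; i + cut < samples.size(); ++i) {
        sum += samples[i];
        ++kept;
    }
    return sum / kept;
}
```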

2 Investigating Translation Buffers


On Platform Intel Server, the translation buffers include an instruction TLB
(ITLB), two levels of data TLB (DTLB0, DTLB1), a cache of the third level
paging structures (PDE cache), and a cache of the second level paging struc-
tures (PDPTE cache). On Platform AMD Server, the translation buffers in-
clude two levels of instruction TLB (L1 ITLB, L2 ITLB), two levels of data TLB

Table 1. Translation Buffer Parameters

Buffer Entries Associativity Miss [cycles]


Platform Intel Server
ITLB 128 4-way 18.5
DTLB0 16 4-way 2
DTLB1 256 4-way +7
PDE cache present +4
PDPTE cache present +8
PML4TE cache not present N/A
Platform AMD Server
L1 ITLB 32 full 4
L2 ITLB 512 4-way +40
L1 DTLB 48 full 5
L2 DTLB 512 4-way +35
PDE cache present +21
PDPTE cache present +21
PML4TE cache present +21

(L1 DTLB, L2 DTLB), a cache of the third level paging structures (PDE cache),
a cache of the second level paging structures (PDPTE cache), and a cache of the
first level paging structures (PML4TE cache). The following table summarizes
the basic parameters of the translation buffers on the two platforms, with the
parameters not available in vendor documentation emphasized.
We begin our translation buffer investigation by describing experiments targeted
at the translation miss penalties, which are not available in vendor
documentation.

2.1 Translation Miss Penalties


The experiments we perform are based on measuring durations of memory ac-
cesses using various access patterns, constructed to trigger hits and misses as
necessary. Underlying the construction of the patterns is an assumption that
accesses to the same address generally trigger hits, while accesses to different
addresses generally trigger misses, and the choice of addresses determines which
part of the memory architecture hits or misses.
Due to measurement overhead, it is not possible to measure the memory
accesses alone. To minimize the distortion of the experiment results, the mea-
sured workload should perform as few additional memory accesses and additional
processor instructions as possible. To achieve this, we create the access pattern
in advance and store it in memory as the very data that the measured work-
load accesses. The access pattern forms a chain of pointers and the measured
workload uses the pointer that it reads in each access as an address for the next
access. The workload is illustrated in Listing 1.1.

Listing 1.1. Pointer walk workload.

// Variable start is initialized by an access pattern generator
uintptr_t *ptr = start;

for (int i = 0; i < loopCount; i++)
    ptr = (uintptr_t *) *ptr;

Experiments with instruction access use a similar workload, replacing chains
of pointers with chains of jump instructions. A necessary difference from using
the chains of pointers is that the chains of jump instructions must not wrap,
but must contain additional instructions that control the access loop. To achieve
a reasonably homogeneous workload, the access loop is partially unrolled, as
presented in Listing 1.2.

Listing 1.2. Instruction walk workload.

// The jump_walk function contains the jump instructions
int len = loopCount / 16;

while (len--)
    jump_walk (); // The function is invoked 16 times

To measure the translation miss penalties, the experiments need to access addresses that miss in the TLB but hit in the L1 cache. This is done by accessing addresses that map to the same associativity set in the TLB but to different associativity sets in the L1 cache. With a TLB of size S and associativity A mapping pages of size P, the associativity set is selected by log2(S/A) bits starting with bit log2(P) of the virtual address. Similarly, with a virtually indexed L1 cache of size S and associativity A caching lines of size L, the associativity set is selected by log2(S/A) bits starting with bit log2(L) of the virtual address. The two groups of bits can partially overlap, making a choice of an associativity set in the TLB limit the choices of an associativity set in the L1 cache. We generate an access pattern that addresses a single associativity set in the TLB and chooses a random associativity set from the available sets in the L1 cache.
The code of the set collision access pattern generator is presented in Listing 1.3
and accepts these parameters:

– numPages The number of different addresses to choose from.
– numAccesses The number of different addresses to actually access.
– pageStride The stride of accesses in units of page size.
– accessOffset Offset of addresses inside pages when not randomized.
– accessOffsetRandom Tells whether to randomize offsets inside pages.
82 V. Babka and P. Tůma
Listing 1.3. Set collision access pattern generator.

// Create array of pointers to the allocated pages
uintptr_t **pages = new uintptr_t * [numPages];
for (int i = 0; i < numPages; i++)
    pages [i] = (uintptr_t *) (buf + i * pageStride * PAGE_SIZE);

// Cache line size is considered in units of pointer size
int numOffsets = PAGE_SIZE / LINE_SIZE;

// Create array of offsets in a page
int *offsets = new int [numOffsets];
for (int i = 0; i < numOffsets; i++)
    offsets [i] = i * LINE_SIZE;

// Randomize the order of pages and offsets
random_shuffle (pages, pages + numPages);
random_shuffle (offsets, offsets + numOffsets);

// Create the pointer walk from pointers and offsets
uintptr_t *start = pages [0];
if (accessOffsetRandom) start += offsets [0];
else start += accessOffset;

uintptr_t **ptr = (uintptr_t **) start;

for (int i = 1; i < numAccesses; i++) {
    uintptr_t *next = pages [i];
    if (accessOffsetRandom) next += offsets [i % numOffsets];
    else next += accessOffset;

    (*ptr) = next;
    ptr = (uintptr_t **) next;
}

// Wrap the pointer walk
(*ptr) = start;
delete [] pages;

2.2 Experiment: TLB Miss Penalties

For every DTLB present in the system, the experiments that determine the penalties of translation misses use the set collision pointer walk from Listings 1.1 and 1.3 with pageStride set to the number of entries divided by the associativity, numPages set to a value higher than the associativity and numAccesses varying
from 1 to numPages. When numAccesses is less than or equal to the associativity, all accesses should hit; afterwards, the accesses should start missing, depending on the replacement policy. For ITLBs, we analogously use a jump emitting version of the code from Listing 1.3 with the code from Listing 1.2.

Fig. 1. DTLB0 miss penalty and related performance events on Intel Server
Since the plots that illustrate the results for each TLB are similar in shape, we include only representative examples and comment on the results in writing. All plots are available in [7].
Starting with an example of a well documented result, we choose the experiment with DTLB0 on Platform Intel Server, which requires pageStride set to 4 and numAccesses varying from 1 to 32. The results in Fig. 1 contain both the average access duration and the counts of the related performance events. We see that the access duration increases from 3 to 5 cycles at 5 accessed pages. At the same time, the number of misses in DTLB0 (DTLB MISSES.L0 MISS LD events) increases from 0 to 1, but there are no DTLB1 misses (DTLB MISSES:ANY events). The experiment therefore confirms the well documented parameters of DTLB0, such as the 4-way associativity and the miss penalty of 2 cycles [1, page A-9]. It also suggests that the replacement policy behavior approximates LRU for our access pattern.
Experimenting with DTLB1 on Platform Intel Server requires changing the pageStride parameter to 64 and yields an increase in the average access duration from 3 to 12 cycles at 5 accessed pages. Figure 2 shows the counts of the related performance events, attributing the increase to DTLB1 misses and confirming the 4-way associativity. Since there are no DTLB0 misses that would hit in the DTLB1, the figure also suggests a non-exclusive policy between DTLB0 and DTLB1. The experiment therefore estimates the miss penalty, which is not available in vendor documentation, at 7 cycles. Interestingly, the counter of cycles spent in page walks (PAGE WALKS:CYCLES events) reports only 5 cycles per access and therefore does not fully capture this penalty.
As additional information not available in vendor documentation, we can see that exceeding the DTLB1 capacity increases the number of L1 data cache references (L1D ALL REF events) from 1 to 2. This suggests that page tables are cached in the L1 data cache, and that the PDE cache is present and the page table accesses hit there, since only the last level page walk step is needed.
Experimenting with L1 DTLB on Platform AMD Server requires changing
pageStride to 1 for full associativity. The results show a change from 3 to 8 cycles at 49 accessed pages, which confirms the full associativity and the 48 entries in the L1 DTLB; the replacement policy behavior approximates LRU for our access pattern. The performance counters show a change from 0 to 1 in the L1 DTLB miss and L2 DTLB hit events; the L2 DTLB miss event does not occur. The experiment therefore estimates the miss penalty, which is not available in vendor documentation, at 5 cycles. Note that the value of the L1 DTLB hit counter (L1 DTLB HIT:L1 4K TLB HIT) is always 1, indicating a possible problem with this counter on the particular experiment platform.

Fig. 2. Performance event counters related to L1 DTLB misses on Intel Server (left) and L2 DTLB misses on AMD Server (right)
For L2 DTLB on Platform AMD Server, pageStride is set to 128. The results
show an increase from 3 to 43 cycles at 49 accessed pages, which means that we
observe L2 DTLB misses and also indicates a non-exclusive policy between L1
DTLB and L2 DTLB. The L2 associativity, however, is difficult to confirm due
to full L1 associativity. The event counters in Fig. 2 show a change from 0 to 1 in
the L2 miss event (L1 DTLB AND L2 DTLB MISS:4K TLB RELOAD event).
The penalty of the L2 DTLB miss is thus estimated at 35 cycles in addition to
the L1 DTLB miss penalty, or 40 cycles in total.
On Platform AMD Server, the paging structures are not cached in the L1
cache. The value of the REQUESTS TO L2:TLB WALK event counter shows
that each L2 DTLB miss in this experiment results in one page walk step that
accesses the L2 cache. This means that a PDE cache is present, as is further
examined in the next experiment. Note that the problem with the value of the L1 DTLB HIT:L1 4K TLB HIT event counter persists; it is always 1 even in the presence of L2 DTLB misses.

2.3 Additional Translation Caches

Our experiments targeted at the translation miss penalties indicate that a TLB miss can be resolved with only one additional memory access, rather than as many accesses as there are levels in the paging structures. This means that a cache of the third level paging structures is present on both investigated platforms, and since the presence of such additional translation caches is only discussed in general terms in vendor documentation [8], we investigate these caches next.

Fig. 3. Extra translation caches miss penalty (left) and related L1 data cache reference events (right) on Intel Server

2.4 Experiment: Extra Translation Buffers

With the presence of the third level paging structure cache (PDE cache) already confirmed, we focus on determining the presence of caches for the second level (PDPTE cache) and the first level (PML4TE cache).
The experiments use the set collision pointer walk from Listings 1.1 and 1.3. The numAccesses and pageStride parameters are initially set to values that make each access miss in the last level of the DTLB and hit in the PDE cache. By repeatedly doubling pageStride, we should eventually reach a point where only a single associativity set in the PDE cache is accessed, triggering misses when numAccesses exceeds the associativity. This should be observed as an increase of the average access duration and an increase of the data cache access count during page walks. Eventually, the accessed memory range pageStride × numPages exceeds the 512 × 512 pages translated by a single third level paging structure, making the accesses map to different entries in the second level paging structure and thus to different entries in the PDPTE cache, if present. Further increase of pageStride extends the scenario analogously to the PML4TE cache.
The change of the average access durations and the corresponding change
in the data cache access count for different values of pageStride on Platform
Intel Server are illustrated in Fig. 3. Only those values of pageStride that lead to different results are displayed; the results for the values that are not displayed are the same as the results for the previous value.
For the 512 pages stride, the average access duration changes from 3 to 12 cycles at 5 accessed pages, which means we hit the PDE cache as in the previous experiment. We also observe an increase of the access duration from 12 to 23 cycles and a change in the L1 cache miss (L1D REPL event) counts from 0 to 1 at 9 accessed pages. These misses are not caused by the accessed data but by the page walks, since with this particular stride and alignment, we always read the first entry of a page table and therefore the same cache set. We see that the penalty of this miss is 11 cycles, also reflected in the value of the PAGE WALKS:CYCLES event counter, which changes from 5 to 16. Later experiments will show that the L1 data cache miss penalty for a data load on this platform is indeed 11 cycles, which means that the L1 data cache miss penalty simply adds up with the DTLB miss penalty.
Fig. 4. Extra translation caches miss penalty (left) and related page walk requests to
L2 cache (right) on AMD Server

As we increase the stride, we start to trigger misses in the PDE cache. With the stride of 8192 pages, which spans 16 PDE entries, and 5 or more accessed pages, the PDE cache misses on each access. The L1 data cache reference event counter indicates that there are three L1 data cache references per memory access; two of them are therefore caused by the page walk. This means that a PDPTE cache is also present and the PDE miss penalty is 4 cycles.
Further increasing the stride results in a gradual increase of the PDPTE cache misses. With the 512 × 512 pages stride, each access maps to a different PDPTE entry. At 5 accessed pages, the L1D ALL REF event counter increases to 5 L1 data cache references per access. This indicates that there is no PML4TE cache, since all four levels of the paging structures are traversed, and that the PDPTE cache has at most 4 entries. Compared to the 8192 pages stride, the PDPTE miss adds approximately 19 cycles per access. Out of those, 11 cycles are added by an extra L1 data cache miss, as both the PDE and PTE entries miss the L1 data cache due to being mapped to the same set. The remaining 8 cycles are the cost of walking two additional levels of page tables due to the PDPTE miss.
The standard deviation of the results exceeds the limit of 0.5 cycles only when
the L1 cache associativity is about to be exceeded – up to 3.5 cycles, and when
the translation cache level is about to be exhausted – up to 8 cycles.
The observed access durations and the corresponding change in the data cache
access count from an analogous experiment on Platform AMD Server are shown
in Fig. 4. We can see that for a stride of 128 pages, we still hit the PDE cache
as in the previous experiment. Strides of 512 pages and more need 2 page walk
steps and thus hit the PDPTE cache. Strides of 256 K pages need 3 steps and
thus hit the PML4TE cache. Finally, strides of 128 M pages need all 4 steps.
The access duration increases by 21 cycles for each additional page walk step.
With a 128 M stride, we see an additional penalty due to page walks triggering
L2 cache misses.
The standard deviation of the results exceeds the limit of 0.5 cycles only when
the L2 cache capacity is exceeded – up to 18 cycles, and when the translation
cache level is about to be exhausted – up to 10 cycles.
Table 2. Cache parameters

Cache       Size    Associativity  Index     Miss [cycles]
Platform Intel Server
L1 data     32 KB   8-way          virtual   11
L1 code     32 KB   8-way          virtual   30 [2]
L2 unified  4 MB    16-way         physical  256-286 [3]
Platform AMD Server
L1 data     64 KB   2-way          virtual   12 random, 27-40 single set [4]
L1 code     64 KB   2-way          virtual   20 random [5], 25 single set [6]
L2 unified  512 KB  16-way         physical  +32-35 random [7], +16-63 single set
L3 unified  2 MB    32-way         physical  +208 random [8], +159-211 single set

3 Investigating Memory Caches

On Platform Intel Server, the memory caches include an L1 instruction cache per core, an L1 data cache per core, and a shared L2 unified cache per every two cores. Both L1 caches are virtually indexed; the L2 cache is physically indexed. On Platform AMD Server, the memory caches include an L1 instruction cache per core, an L1 data cache per core, an L2 unified cache per core, and a shared L3 unified cache per every four cores. Table 2 summarizes the basic parameters of the memory caches on the two platforms, with the parameters not available in vendor documentation emphasized.

We begin our memory caches investigation by describing experiments targeted at the cache line sizes, which differ between vendor documentation and reported research.

3.1 Cache Line Size

The experiments we perform are still based on measuring durations of memory accesses using various access patterns in the pointer walk from Listing 1.1. To avoid the effects of hardware prefetching, we use a random access pattern generated by the code from Listing 1.4. First, an array of pointers to the buffer of allocSize bytes is created, with a distance of accessStride bytes between two consecutive pointers. Next, the array is shuffled randomly. Finally, the array is used to create the access pattern of a length of accessSize divided by accessStride.
[2] Includes penalty of branch misprediction.
[3] Depends on the cache line set where misses occur. Also includes associated DTLB1 miss and L1 data cache miss due to page walk.
[4] Differs from the 9 cycles penalty stated in vendor documentation [9, page 223].
[5] Includes partial penalty of branch misprediction and L1 ITLB miss.
[6] Includes partial penalty of branch misprediction.
[7] Depends on the offset of the word accessed. Also includes penalty of L1 DTLB miss.
[8] Includes penalty of L2 DTLB miss.
Listing 1.4. Random access pattern generator.

// Create array of pointers in the allocated buffer
int numPtrs = allocSize / accessStride;
uintptr_t **ptrs = new uintptr_t * [numPtrs];
for (int i = 0; i < numPtrs; i++)
    ptrs [i] = (uintptr_t *) (buffer + i * accessStride);

// Randomize the order of the pointers
random_shuffle (ptrs, ptrs + numPtrs);

// Create the pointer walk from selected pointers
uintptr_t *start = ptrs [0];
uintptr_t **ptr = (uintptr_t **) start;
int numAccesses = accessSize / accessStride;
for (int i = 1; i < numAccesses; i++) {
    uintptr_t *next = ptrs [i];
    (*ptr) = next;
    ptr = (uintptr_t **) next;
}

// Wrap the pointer walk
(*ptr) = start;
delete [] ptrs;

3.2 Experiment: Cache Line Size

In order to determine the cache line size, the experiment executes a measured
workload that randomly accesses half of the cache lines, interleaved with an inter-
fering workload that randomly accesses all the cache lines. For data caches, both
workloads use a pointer emitting version of code from Listing 1.4 to initialize the
access pattern and code from Listing 1.1 to traverse the pattern. For instruction
caches, both workloads use a jump emitting version of code from Listing 1.4 to
initialize the access pattern and code from Listing 1.2 to traverse the pattern.
The measured workload uses the smallest possible access stride, which is 8 B for 64-bit aligned pointer variables and 16 B for jump instructions. The interfering
workload varies its access stride. When the stride exceeds the cache line size,
the interfering workload should no longer access all cache lines, which should
be observed as a decrease in the measured workload duration, compared to the
situation when the interfering workload accesses all cache lines.
The results from both platforms and all cache levels and types, except the L2
cache on Platform Intel Server, show a decrease in the access duration when the
access stride of the interfering workload increases from 64 B to 128 B. The counts
of the related cache miss events confirm that the decrease in access duration is
caused by the decrease in cache misses. Except for the L2 cache on Platform Intel Server, we can therefore conclude that the line size is 64 B for all cache levels, as stated in the vendor documentation.

Fig. 5. The effect of interfering workload access stride on the L2 cache eviction (left); streamer prefetches triggered by the interfering workload during the L2 cache eviction on Intel Server (right)
Figure 5 shows the results for the L2 cache on Platform Intel Server. These
results are peculiar in that they would indicate the cache line size of the L2
cache is 128 B rather than 64 B, a result that was already reported in [10]. The
reason behind the observed results is the behavior of the streamer prefetcher
[11, page 3-73], which causes the interfering workload to fetch two adjacent lines
to the L2 cache on every miss, even though the second line is never accessed.
The interfering workload with a 128 B stride thus evicts two 64 B cache lines.
Figure 5 contains values of the L2 prefetch miss (L2 LINES IN:PREFETCH)
event counter collected from the interfering workload rather than the measured
workload, and confirms that L2 cache misses triggered by prefetches occur.
Because the vendor documentation does not explain the exact behavior of
the streamer prefetcher when fetching two adjacent lines, we have performed a
slightly modified experiment to determine which two lines are fetched together.
Both workloads of the experiment access 4 MB with 256 B stride, the measured
workload with offset 0 B, the interfering workload with offsets 0, 64, 128 and
192 B. The offset therefore determines whether both workloads access the same
cache associativity sets or not. The offset of 0 B should always evict lines accessed
by the measured code, the offset of 128 B should always avoid them. If the
streamer prefetcher fetches a 128 B aligned pair of cache lines, using the 64 B
offset should also evict the lines of the measured workload, while the 192 B offset
should avoid them. If the streamer prefetcher fetches any pair of consecutive
cache lines, using both the 64 B offset and the 192 B offset should avoid the lines
of the measured workload.
The results on Fig. 6 indicate that the streamer prefetcher always fetches
128 B aligned pair of cache lines, rather than any pair of consecutive cache lines.
Additional experiments also show that the streamer prefetcher does not
prefetch the second line of a pair when the L2 cache is saturated with another
workload. Running two workloads on cores that share the cache therefore results
in fewer prefetches than running the same two workloads on cores that do not
share the cache.
Fig. 6. Access duration (left) and L2 cache misses by accesses only (right) investigating
streamer prefetch on Intel Server

3.3 Cache Indexing

We continue by determining whether the cache is virtually or physically indexed, since this information is also not always available in vendor documentation. Knowing whether the cache is virtually or physically indexed is essential for later experiments that determine cache miss penalties.
We again use the pointer walk code from Listing 1.1 and create the access pattern so that all accesses map to the same cache line set. To achieve this, we reuse the pointer walk initialization code from the TLB experiments in Listing 1.3, because the stride we need is always a multiple of the page size on our platforms. The difference is that we do not use the offset randomization.
For physically indexed caches, the task of constructing the access pattern
where all accesses map to the same cache line set is complicated by the fact
that the cache line set is determined by physical rather than virtual address. To
overcome this complication, our framework provides an allocation function that
returns pages whose physical and virtual addresses are identical in the bits that
determine the cache line set. This allocation function, further called colored
allocation, is used in all experiments that define strides in physically indexed
caches.
Note that we do not have to determine cache indexing for the L1 caches on
Platform Intel Server, where the combination of 32 KB size and 8-way associa-
tivity means that an offset within a page entirely determines the cache line set.

3.4 Experiment: Cache Set Indexing

We measure the average access time in a set collision pointer walk from Listings 1.1 and 1.3, with the buffer allocated using either the standard allocation or the colored allocation. The number of accessed pages is selected to exceed the cache associativity. If a particular cache is virtually indexed, the results should show an increase in access duration when the number of accesses exceeds the associativity for both modes of allocation. If the cache is physically indexed, there should be no increase in access duration with the standard allocation, because the stride in virtual addresses does not imply the same stride in physical addresses.
Fig. 7. Dependency of associativity misses in L2 cache on page coloring on Intel Server

The results from Platform Intel Server show that colored allocation is needed to trigger L2 cache misses, as illustrated in Fig. 7. The L2 cache is therefore physically indexed. Without colored allocation, the standard deviation of the results grows when the L1 cache misses start occurring, staying below 3.2 cycles for 8 accessed pages and below 1 cycle for 9 and more accessed pages. Similarly, with colored allocation, the standard deviation stays below 5.5 cycles for 7 and 8 accessed pages when the L1 cache starts missing, and below 10.5 cycles for 16 and 17 accessed pages when the L2 cache starts missing.
The results from Platform AMD Server on Fig. 8 also show that colored allo-
cation is needed to trigger L2 cache misses with 19 and more accesses. Colored
allocation also seems to make a difference for the L1 data cache, but values of
the event counters on Fig. 8 show that the L1 data cache misses occur with both
modes of allocation, the difference in the observed duration therefore should not
be attributed to indexing. The standard deviation of the results exceeds the limit
of 0.5 cycles for small numbers of accesses, with a maximum standard deviation
of 2.1 cycles at 3 accesses.

3.5 Cache Miss Penalties

Finally, we measure the memory cache miss penalties, which appear to include
effects not described in vendor documentation.

Fig. 8. Dependency of associativity misses in L1 data and L2 cache on page coloring (left) and related performance events (right) on AMD Server
Fig. 9. L2 cache miss penalty when accessing single cache line set (left); dependency
on cache line set selection in pages of color 0 (right) on Intel Server

3.6 Experiment: Cache Miss Penalties and Their Dependencies

The experiment determines the penalties of misses in all levels of the cache
hierarchy and their possible dependency on the offset of accesses triggering the
misses. We rely again on the set collision access pattern from Listing 1.1 and 1.3,
increasing the number of repeatedly accessed addresses and varying the offset
within a cache line to determine its influence on the access duration. The results
are summarized in Table 2, more can be found in [7].
On Platform Intel Server, we observe an unexpected increase in the average access duration when about 80 different addresses map to the same cache line set. The increase, visible in Fig. 9, is not reflected by any of the relevant event counters. Further experiments, also illustrated in Fig. 9, reveal a difference between accessing odd and even cache line sets within a page. We see that the difference varies with the number of accessed addresses, with accesses to the even cache lines faster than to the odd cache lines for 32 and 64 addresses, and the other way around for 128 addresses. The standard deviation in these results is under 3 clocks.
On Platform AMD Server, we observe an unusually high penalty for the L1
data cache miss, with an even higher peak when the number of accessed addresses
just exceeds the associativity, as illustrated in Fig. 10.

Fig. 10. L1 data cache miss penalty when accessing a single cache line set (left) and random sets (right) on AMD Server

Fig. 11. Dependency of L2 cache miss penalty on access offset in a cache line when accessing random cache line sets (left) and 20 cache lines in the same set (right) on AMD Server

Determined this way, the penalty would be 27 cycles, 40 cycles for the peak, which is significantly more
than the stated L2 access latency of 9 cycles [9, page 223]. Without additional
experiments, we speculate that the peak is caused by the workload attempting
to access data that is still in transit from the L1 data cache to the L2 cache.
More light is shed on the unusually high penalty by another experiment,
one which uses the random access pattern from Listing 1.4 rather than the set
collision pattern from Listing 1.3. The workload allocates memory range twice
the cache size and varies the portion that is actually accessed. Accessing the full
range triggers cache misses on each access, the misses are randomly distributed
to all cache sets. With this approach, we observe a penalty of approximately
12 cycles per miss, as illustrated in Fig. 10. We have extended this experiment to cover all caches on Platform AMD Server; the differences in penalties when accessing a single cache line set and when accessing multiple cache line sets are summarized in Table 2.
For the L2 cache, we have also observed a small dependency of the access duration on the access offset within the cache line when accessing random cache sets, as illustrated in Fig. 11. The access duration increases with each 16 B of the offset and can add almost 3 cycles to the L2 miss penalty. A similar dependency was also observed when accessing multiple addresses mapped to the same cache line set, as illustrated in Fig. 11.
Again, we believe that illustrating the many variables that determine the
cache miss penalties is preferable to the incomplete information available in
vendor documentation, especially when results of more complex experiments
which include such effects are to be analyzed.

4 Experimental Framework

The experiments described here were performed within a generic benchmarking framework, designed to investigate performance related effects due to sharing of resources such as the processor core or the memory architecture among multiple software components. The framework source is available for download at http://dsrg.mff.cuni.cz/benchmark together with multiple benchmarks, including all the benchmarks described in this paper, implemented in the form of extensible workload modules. The support provided by the framework includes:

– Creating and executing parametrized benchmarks. The user can specify ranges of individual parameters; the framework executes the benchmark with all the specified combinations of the parameter values.
– Collecting precise timing information through the RDTSC instruction and performance counter values through PAPI [4].
– Executing either isolated benchmarks or combinations of benchmarks to investigate the sharing effects.
– Plotting of results through R [12]. Supports boxplots for examining dependency on one benchmark parameter and plots with multiple lines for different values of other benchmark parameters.

Besides providing the execution environment for the benchmarks, the framework bundles utility functions, such as the colored allocation used in experiments with physically indexed caches in Section 3.
The colored allocation is based on page coloring [13], where the bits deter-
mining the associativity set are the same in virtual and physical address. The
number of the associativity set is called a color. As an example, the L2 cache on
Platform Intel Server has a size of 4 MB and 16-way associativity, which means
that addresses with a stride of 256 KB will be mapped to the same cache line set
[11, page 3-61]. With 4 KB page size, this yields 64 different colors, determined
by the 6 least significant bits of the page address.
Although the operating system on our experimental platforms does not support page allocation with coloring, it does provide a way for the executed program to determine its current mapping. Our colored allocation uses this information together with the mremap function to allocate a continuous virtual memory area, determine its mapping and remap the allocated pages one by one to a different virtual memory area with the target virtual addresses matching the color of the physical addresses. This way, the allocator can construct a continuous virtual memory area with virtual pages having the same color as the physical frames that the pages are mapped to.

5 Conclusion

We have described a series of experiments designed to investigate some of the detailed parameters of the memory architecture of the x86 processor family. Although the knowledge of the detailed parameters is of limited practical use in general software development, where it is simply too involved and too specialized, we believe it is of significant importance in designing and evaluating research experiments that exercise the memory architecture. Without this knowledge, it
is difficult to design experiments that target the intended part of the memory architecture and to distinguish results that are characteristic of the experiment workload from results that are due to incidental interference. We should point out that the detailed parameters are often not available in vendor documentation, or – since claiming to know all vendor documentation would be somewhat preposterous – at least are often only available as fragmented information buried among hundreds of pages of text.
Among the detailed parameters investigated in this paper are the address
translation miss penalties (which are partially documented for Platform
Intel Server and not documented for Platform AMD Server), the parameters
of the additional translation caches (which are not documented for Platform
Intel Server and not even mentioned for Platform AMD Server), the cache line
size (which is well documented but measured incorrectly in [10]) together with
the reasons for the cited incorrect measurement, the cache indexing (which seems
to be generally known but is not documented for Platform AMD Server), and
the cache miss penalties (which seem to be more complex than documented even
when abstracting from the memory itself). Additionally, we show some interesting anomalies such as suspect values of performance counters.
We also provide a framework that makes it possible to easily reproduce our
experiments, or to execute our experiments on different experiment platforms.
The framework is used within the Q-ImPrESS project and many more collected
results are available in [7].
To our knowledge, the experiments that we have performed are not available
elsewhere. Closest to our work are the results in [10] and [14], which describe
algorithms for automatic assessment of basic memory architecture parameters,
especially the size and associativity of the memory caches. The workloads used
in [10] and [14] share common features with some of our workloads, especially
where the random pointer walk is concerned. Our workloads are more varied and
therefore provide more results, although the comparison is not quite fair since
we did not aim for automated analysis. We also show some effects that the cited
workloads would not reveal.
Although this paper is primarily targeted at performance evaluation
professionals involved in detailed measurements related to the memory archi-
tecture of the x86 processor family, our results in [7] demonstrate that the
observed effects can impact performance modeling precision at much higher
levels.
As far as the general applicability of our results is concerned, it should be
noted that they are very much tied to the particular experimental platforms,
and can change even with minor platform parameters such as processor or
chipset stepping. For different experimental platforms, our results can serve
to illustrate what effects can be observed, but not to guarantee what effects
will really be present. The availability of our experimental framework, how-
ever, makes it possible to repeat our experiments with very little effort, leav-
ing only the evaluation of the different results to be carried out where
applicable.
96 V. Babka and P. Tůma

References
1. Intel Corporation: Intel 64 and IA-32 Architectures Software Developer Manual,
Volume 3: System Programming, Order Nr. 253668-027 and 253669-027 (July 2008)
2. Advanced Micro Devices, Inc.: AMD64 Architecture Programmer’s Manual Volume
2: System Programming, Publication Number 24593, Revision 3.14. (September
2007)
3. Drepper, U.: What every programmer should know about memory (2007),
http://people.redhat.com/drepper/cpumemory.pdf
4. PAPI: Performance application programming interface, http://icl.cs.utk.edu/papi
5. Pettersson, M.: Perfctr, http://user.it.uu.se/~mikpe/linux/perfctr/
6. Advanced Micro Devices, Inc.: AMD BIOS and Kernel Developer’s Guide For
AMD Family 10h Processors, Publication Number 31116, Revision 3.06 (March
2008)
7. Babka, V., Bulej, L., Děcký, M., Kraft, J., Libič, P., Marek, L., Seceleanu, C.,
Tůma, P.: Resource usage modeling, Q-ImPrESS deliverable 3.3 (September 2008),
http://www.q-impress.eu
8. Intel Corporation: Intel 64 and IA-32 Architectures Application Note: TLBs,
Paging-Structure Caches, and Their Invalidation, Order Nr. 317080-002 (April
2008)
9. Advanced Micro Devices, Inc.: AMD Software Optimization Guide for AMD Family
10h Processors, Publication Number 40546, Revision 3.06 (April 2008)
10. Yotov, K., Pingali, K., Stodghill, P.: Automatic measurement of memory hierarchy
parameters. In: Proceedings of the 2005 ACM SIGMETRICS International Con-
ference on Measurement and Modeling of Computer Systems, pp. 181–192. ACM,
New York (2005)
11. Intel Corporation: Intel 64 and IA-32 Architectures Optimization Reference Man-
ual, Order Nr. 248966-016 (November 2007)
12. R: The R Project for Statistical Computing, http://www.r-project.org/
13. Kessler, R.E., Hill, M.D.: Page placement algorithms for large real-indexed caches.
ACM Trans. Comput. Syst. 10(4), 338–359 (1992)
14. Yotov, K., Jackson, S., Steele, T., Pingali, K.K., Stodghill, P.: Automatic measure-
ment of instruction cache capacity. In: Ayguadé, E., Baumgartner, G., Ramanujam,
J., Sadayappan, P. (eds.) LCPC 2005. LNCS, vol. 4339, pp. 230–243. Springer, Hei-
delberg (2006)
The Next Frontier for Power/Performance
Benchmarking: Energy Efficiency of Storage Subsystems

Klaus-Dieter Lange

Hewlett-Packard Company, 11445 Compaq Center Dr. W, Houston, TX-77070, USA


[email protected]

Abstract. The increasing concern about energy usage in datacenters has drastically changed how the IT industry evaluates servers. The energy conscious selection
of storage subsystems is the next logical step. This paper first quantifies the
possible energy savings of utilizing modern storage subsystems by identifying
inherent energy characteristics of next-generation disk I/O subsystems. Additionally, the power consumption of a variety of workload patterns is demonstrated.

Keywords: SPEC, Benchmark, Power, Energy, Performance, Server, Storage, Datacenter.

1 Introduction
Today’s challenge for datacenters is their high energy consumption [1]. The demand for efficient datacenter real estate has shifted the focus toward more power-efficient datacenters. This increasing concern about energy usage in datacenters has drastically changed how
the IT industry evaluates servers. In response, the Standard Performance Evaluation
Corporation (SPEC) [2] has developed and released SPECpower_ssj2008 [3], the first
industry-standard benchmark that evaluates the power and performance characteris-
tics of server class computers. The need for this type of measurement was so urgent
and necessary that the US Environmental Protection Agency (US EPA) included it in
their ENERGY STAR® Program Requirements for Computer Servers [4]. The
SPECpower_ssj2008 results [5] are also already being utilized for energy conscious
purchase decisions. With the competitive marketplace driving server innovation even
further, the next logical phase is adopting an energy conscious evaluation of storage
subsystems.

2 Power Consumption of Server and Storage

In order to show the significant impact of the storage subsystem on overall power consumption, we configured a server with external storage, similar to the publicly released
SPECweb2005 result [6]. Two AC power analyzers were connected to separately
measure the power consumption of the server and the external storage. The config-
ured system was then benchmarked with the SPECweb2005 (Banking) workload from
idle to 100% in 10% increments and the power measurements were automatically

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 97–101, 2009.
© Springer-Verlag Berlin Heidelberg 2009
recorded in 1s intervals. The server power consumption ranged from ~286W at idle to
~312W at 100% performance; while the external storage ranged from ~305W at idle
to ~400W at 100% performance. Figure 1 represents a graphical view of these data.
This test configuration shows that the power consumption of the external storage
subsystem can be significantly higher than that of the server itself; most of the current public
SPECweb2005 results exhibit similar tendencies. Another recent study [7] on the
energy cost of datacenters shows that in database setups, 63% of power is consumed
by the storage systems. For at least these application areas (web serving and database)
an industry standard method to measure the energy usage for storage subsystems is
necessary.
Another interesting discovery was the range in power consumption between idle
and 100% performance. For our baseline benchmark configuration this equated to
~9% range for the server and ~30% range for the storage. For comparison, in only one
year after its release, SPECpower_ssj2008 results show that companies pushed the
server range as far as 50%.

[Line chart: average power consumption (W), 0 to 400, for Server and External Storage at each SPECweb2005 (Banking) load level from idle to 100%]
Fig. 1. Average Power Consumption - Graduated SPECweb2005 Workload

3 Saving Energy by Utilizing Modern Storage Subsystems

To demonstrate the energy savings when using the latest technology, two generations
of storage enclosures, both with a standard rack form factor of 3U, were compared.
The older generation storage enclosure holds 14 large form factor (LFF) 3.5” SCSI
drives and the current generation storage enclosure holds 25 small form factor (SFF)
2.5” SAS drives. A drive capacity of 32GB was chosen for each drive. Each empty
enclosure was attached to a server and then loaded with drives, one drive at a time
every 66 seconds. The idle power of the empty SAS enclosure (72W) was slightly
[Line chart: idle power consumption (W), 0 to 250, versus drive count (1 to 25), comparing the previous generation (LFF SCSI) with the current generation (SFF SAS)]
Fig. 2. Large Form Factor SCSI vs. Small Form Factor SAS

higher than the SCSI enclosure (58W). Nevertheless, with an idle power of ~12W for an individual LFF SCSI drive and ~6.25W for an individual SFF SAS drive, this advantage was surpassed after the third drive was added. When reaching the maximum
drive capacity, the 14 LFF drive enclosure used ~227W; the SFF SAS enclosure with
14 drives used only ~156W, approximately 71W power savings. The SFF SAS en-
closure needed to be fully equipped with 25 drives before it would reach the power
consumption of the LFF SCSI drive enclosure.
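The crossover behavior described above follows from a simple linear model: enclosure idle power plus per-drive idle power. The sketch below plugs in the idle figures quoted in the text (58W and 72W enclosures, ~12W and ~6.25W per drive); it reproduces the measured 14-drive totals only approximately (226W vs. ~227W, 159.5W vs. ~156W), since the real curves are not perfectly linear.

```python
# Linear idle-power model using the per-enclosure and per-drive idle
# figures quoted in the text.

def idle_power(enclosure_w, per_drive_w, drives):
    """Total idle power of an enclosure holding the given drive count."""
    return enclosure_w + per_drive_w * drives

def scsi(n):   # previous-generation LFF SCSI enclosure
    return idle_power(58.0, 12.0, n)

def sas(n):    # current-generation SFF SAS enclosure
    return idle_power(72.0, 6.25, n)

# First drive count at which the SAS enclosure draws less power:
crossover = next(n for n in range(1, 26) if sas(n) < scsi(n))  # third drive

# 14 drives each: the model gives 226 W (SCSI) vs. 159.5 W (SAS).
scsi_14, sas_14 = scsi(14), sas(14)

# A fully loaded SAS enclosure (25 drives) roughly matches the
# 14-drive SCSI enclosure: 228.25 W vs. 226 W in this model.
sas_full = sas(25)
```

This mirrors the observations above: the per-drive advantage overtakes the higher enclosure baseline at the third drive, and only a fully equipped 25-drive SAS enclosure reaches the 14-drive SCSI enclosure's power draw.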

4 Various Workload Patterns

The power consumption of a server depends on the CPU stress pattern; different stress patterns cause different levels of power consumption (Figure 3).
To demonstrate a similar behavior on storage subsystems, the same hardware
configuration was utilized as in section 3 and different workloads were applied.
Five different workloads were selected for this experiment: 100% random write,
75% random write with 25% random read, 100% random read, 100% sequential
write and 100% sequential read. The resulting power consumptions are shown in
Figure 4.
The findings indicate that random access causes higher power consumption than
sequential access – this could be caused by the additional head movement of the
drives.
For these workloads the SFF SAS enclosure needed to be fully equipped with 25
drives before it would reach the power consumption of the LFF SCSI drive
enclosure.
[Line chart: power consumption (W), 230 to 310, over a 3:40 run cycling through SPEC CPU2000 workloads (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi)]
Fig. 3. CPU – power consumption for various workloads

[Bar chart: power consumption (W), 150 to 350, for five workloads (100% random write, 75% random write / 25% random read, 100% random read, 100% sequential write, 100% sequential read) at queue depth 128 and block size 32, comparing the previous generation (max 14 LFF SCSI drives), the current generation (max 25 SFF SAS drives), and the current generation with 14 SFF SAS drives]
Fig. 4. Storage Subsystem – power consumption for various workloads

5 Conclusion
The power consumption of the external storage subsystem has been identified as significantly higher than that of the server itself in the application areas of web serving and database.
The experiments in Sections 3 and 4 show that modern storage subsystems save significantly more energy than their predecessors; however, as of December 2008 there
is no industry standard benchmark available that can demonstrate these or similar real
energy savings.
There will be many challenges along the way to create benchmarks that measure the
power/performance of server storage subsystems. As in the development of
SPECpower_ssj2008, I am convinced that SPEC will again step up to these challenges
and convene the best talents from the industry to lead the exploration in this next frontier.

6 Future Work
Preliminary measurements of the power/performance characteristics of solid-state
drives (SSD) show very promising results which warrant further investigation. An-
other area of interest is to analyze the impact of energy preserving storage enclosures
and advanced power supplies. Once we have studied these measurements, we will
provide the results to SPEC to support their benchmark development.
Actively supporting the addition of a power component to all applicable SPEC benchmarks will be in the industry's best interest, since it will enable the fair evaluation of servers and their subsystems under a wide variety of workloads.

Acknowledgement
The author would like to acknowledge Richard Tomaszewski and Steve Fairchild for
their guidance; Kris Langenfeld, Jonathan Koomey, Roger Tipley, Mark Thompson
and Raghunath Nambiar for their comments and feedback; Bryon Georgson, David
Rogers, Daniel Ames and David Schmidt for their support conducting the power and
performance measurements; Dwight Barron, Mike Nikolaiev and Tracey Stewart for
their continuous support.
SPEC and the benchmark names SPECpower_ssj2008 and SPECweb2005 are reg-
istered trademarks of the Standard Performance Evaluation Corporation.

References
1. Koomey, J.: Worldwide electricity used in data centers. Environmental Research Letters 3(034008) (September 23, 2008), http://www.iop.org/EJ/abstract/1748-9326/3/3/034008/
2. Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
3. SPECpower_ssj2008, http://www.spec.org/power_ssj2008
4. US EPA’s Energy Star for Enterprise Servers, http://www.energystar.gov/index.cfm?c=new_specs.enterprise_servers
5. SPECpower_ssj2008 results, http://www.spec.org/power_ssj2008/results/power_ssj2008.html
6. SPECweb2005 result, http://www.spec.org/web2005/results/res2006q4/web2005-20061019-00048.html
7. Poess, M., Nambiar, R.: Energy Cost, The Key Challenge of Today’s Data Centers: A Power Consumption Analysis of TPC-C Results, http://www.vldb.org/pvldb/1/1454162.pdf
Thermal Design Space Exploration of 3D Die Stacked
Multi-core Processors Using Geospatial-Based Predictive
Models

Chang-Burm Cho, Wangyuan Zhang, and Tao Li

Intelligent Design of Efficient Architecture Lab (IDEAL),
Department of ECE, University of Florida
{choreno,zhangwy}@ufl.edu, [email protected]

Abstract. This paper presents novel 2D geospatial-based predictive models for exploring the complex thermal spatial behavior of three-dimensional (3D) die
stacked multi-core processors at the early design stage. Unlike other analytical
techniques, our predictive models can forecast the location, size and tempera-
ture of thermal hotspots. We evaluate the efficiency of using the models for
predicting within-die and cross-dies thermal spatial characteristics of 3D multi-
core architectures with widely varied design choices (e.g. microarchitecture,
floor-plan and packaging). Our results show the models achieve high accuracy
while maintaining low complexity and computation overhead.

Keywords: Thermal/power characterization, multi-core architecture, 3D die stacking, analytical modeling.

1 Introduction
Three-dimensional (3D) integrated circuit design [1] is an emerging technology that
greatly improves transistor integration density and reduces on-chip wire communica-
tion latency. It places planar circuit layers in the vertical dimension and connects
these layers with a high-density, low-latency interface. In addition, 3D offers the opportunity of bonding dies that are implemented with different process technologies, enabling the integration of heterogeneous active layers for new system architectures. Leveraging
3D die stacking technologies to build uni-/multi-core processors has drawn increased attention from both the chip design industry and the research community [2-8].
The realization of 3D chips faces many challenges. One of the most daunting of
these challenges is the problem of inefficient heat dissipation. In conventional 2D
chips, the generated heat is dissipated through an external heat sink. In 3D chips, all
of the layers contribute to the generation of heat. Stacking multiple dies vertically
increases power density, and dissipating heat from the layers far away from the heat sink is more challenging due to the distance between the heat source and the external heat sink.
Therefore, 3D technologies not only exacerbate existing on-chip hotspots but also
create new thermal hotspots. High die temperature leads to thermal-induced performance degradation and reduced chip lifetime, which threatens the reliability of the whole system, making modeling and analyzing thermal characteristics crucial to effective 3D microprocessor design.

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 102–120, 2009.
© Springer-Verlag Berlin Heidelberg 2009
[Thermal maps: columns Die1 to Die4; rows CPU, MEM, and MIX workloads]
Fig. 1. 2D within-die and cross-dies thermal variation in 3D die stacked multi-core processors

[Thermal maps of die 4: columns Config. A to Config. D; rows CPU, MEM, and MIX workloads]
Fig. 2. 2D thermal variation on die 4 under different microarchitecture and floor-plan configu-
rations

Previous studies [5, 6] show that 3D chip temperature is affected by factors such as
configuration and floor-plan of microarchitectural components. For example, instead
of putting hot components together, thermal-aware floor-planning places the hot com-
ponents by cooler components, reducing the global temperature. Thermal-aware floor-
planning [5] uses intensive and iterative simulations to estimate the thermal effect of
microarchitecture components at early architectural design stage. However, using
detailed yet slow cycle-level simulations to explore thermal effects across large de-
sign space of 3D multi-core processors is very expensive in terms of time and cost.
To achieve thermal efficient 3D multi-core processor design, architects and chip
designers need models with low computation overhead, which allow them to quickly
explore the design space and compare different design options. One challenge in
modeling the thermal behavior of 3D die stacked multi-core architecture is that the
manifested thermal patterns show significant variation within each die and across
different dies (as shown in Fig. 1). The results were obtained by simulating a 3D die stacked quad-core processor running multi-programmed CPU (bzip2, eon, gcc, perlbmk), MEM (mcf, equake, vpr, swim) and MIX (gcc, mcf, vpr, perlbmk) workloads. Each program within a multi-programmed workload was assigned to a die
that contains a processor core and caches. More details on our experimental method-
ologies can be found in Section 4.
Figure 2 shows the 2D thermal variation on die 4 under different microarchitecture
and floor-plan configurations. On the given die, the 2-dimensional thermal spatial
characteristics vary widely with different design choices. As the number of architec-
tural parameters in the design space increases, the complex thermal variation and
characteristics cannot be captured without using slow and detailed simulations. As
shown in Figs. 1 and 2, to explore the thermal-aware design space accurately and
informatively, we need computationally effective methods that not only predict ag-
gregate thermal behavior but also identify both size and geographic distribution of
thermal hotspots. In this work, we aim to develop fast and accurate predictive models
to achieve this goal.
Prior work has proposed various predictive models [9, 10, 11, 12, 13, 14, 15] to
cost-effectively reason processor performance and power characteristics at the design
exploration stage. A common weakness of existing analytical models is that they
assume centralized and monolithic hardware structures and therefore lack the ability
to forecast the complex and heterogeneous thermal behavior across large and distributed 3D multi-core architecture substrates. In this paper, we address this important and urgent research task by developing novel 2D multi-scale predictive models, which can efficiently reason about the geo-spatial thermal characteristics within a die and across different dies during the design space exploration stage without using detailed
cycle-level simulations. Instead of quantifying the complex geo-spatial thermal char-
acteristics using a single number or a simple statistical distribution, our proposed
techniques employ 2D wavelet multiresolution analysis and neural network non-linear
regression modeling. With our schemes, the thermal spatial characteristics are de-
composed into a series of wavelet coefficients. In the transform domain, each individ-
ual wavelet coefficient is modeled by a separate neural network. By predicting only a
small set of wavelet coefficients, our models can accurately reconstruct 2D spatial
thermal behavior across the design space.
The rest of the paper is organized as follows: Section 2 briefly describes the wavelet transform, in particular the 2D wavelet transform, along with the principles of neural networks. Section 3 presents our wavelet-based neural networks for 2D thermal behavior prediction and system details. Section 4 introduces our experimental setup.
thermal behavior prediction and analyzes the tradeoff between model complexity,
configuration, and prediction accuracy. Section 6 discusses related work. Section 7
concludes the paper.

2 Background

To familiarize the reader with the general methods used in this paper, we provide a
brief overview of wavelet multiresolution analysis and neural network regression
prediction in this section. To learn more details about wavelets and neural networks,
the reader is encouraged to read [16, 17].
2.1 1D Wavelet Transform

Wavelets are mathematical tools that use a simple, fixed prototype function (called
the analyzing or mother wavelet) to transform data of interest into different frequency
components and study each component with a resolution that matches its scale. A
wavelet transform, which decomposes data of interest by wavelets, provides a com-
pact and effective mathematical representation of the original data. In contrast to
Fourier transforms, which only offer frequency representations, wavelets are capable
of providing time and frequency localizations simultaneously. Wavelet analysis em-
ploys two functions, often referred to as the scaling filter ( H ) and the wavelet filter
( G ), to generate a family of functions that break down the original data. The scaling
filter is similar in concept to an approximation function, while the wavelet filter quan-
tifies the differences between the original data and the approximation generated by
the scaling function. Wavelet analysis allows one to choose the pair of scaling and
wavelet filters from numerous functions. In this section, we provide a quick primer on
wavelet analysis using the Haar wavelet, which is the simplest form of wavelets [18].
Equation (1) shows the scaling and wavelet filters for Haar wavelets, respectively.

H = (1/√2, 1/√2)    G = (−1/√2, 1/√2) . (1)

The Haar discrete wavelet transform (DWT) works by averaging two adjacent val-
ues on a series of data at a given scale to form smoothed, lower-dimensional data (i.e.
approximations), and the resulting coefficients (i.e. details), which are the differences
between the values and their averages. By recursively repeating the decomposition
process on the averaged sequence, we achieve multi-resolution decomposition. The
process continues by decomposing the scaling coefficient (approximation) vector
repeating the same steps, and completes when only one coefficient remains. As a
result, wavelet decomposition is the collection of average and detail coefficients at all
scales.
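The averaging-and-differencing recursion just described can be sketched in a few lines. This is an illustrative implementation of the orthonormal Haar DWT for power-of-two input lengths, not the authors' code.

```python
import math

def haar_step(data):
    """One Haar DWT level: scaled pairwise averages (the approximation)
    and scaled pairwise differences (the details), per the filters in (1)."""
    s = math.sqrt(2.0)
    approx = [(a + b) / s for a, b in zip(data[0::2], data[1::2])]
    detail = [(b - a) / s for a, b in zip(data[0::2], data[1::2])]
    return approx, detail

def haar_dwt(data):
    """Full multiresolution decomposition: recurse on the approximation
    until one coefficient remains. Returns the overall approximation
    followed by detail coefficients from coarsest to finest scale."""
    details = []
    while len(data) > 1:
        data, d = haar_step(data)
        details.append(d)
    return data + [c for d in reversed(details) for c in d]
```

For [5, 3, 2, 4] this yields [7.0, -1.0, -1.414..., 1.414...]; because the 1/√2 scaling makes the filters orthonormal, the coefficients preserve the Euclidean norm of the input.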

H* = (1/√2, 1/√2)    G* = (1/√2, −1/√2) . (2)

The original data can be reconstructed from the wavelet coefficients using a pair of wavelet synthesis filters (H* and G*), as shown in (2). With the Haar wavelets, this
inverse wavelet transform can be achieved by adding difference values back or sub-
tracting differences from the averages. This process can be performed recursively
until the finest scale is reached. The original data can be perfectly recovered if all
wavelet coefficients are involved. Alternatively, an approximation of the data can be
reconstructed using a subset of wavelet coefficients. Using a wavelet transform gives
time-frequency localization of the original data. As a result, the original data can be
accurately approximated using only a few wavelet coefficients since they capture
most of the energy of the input data. Thus, keeping only the most significant coefficients enables us to represent the original data in a lower dimension. Note that in (1) and (2) we use √2 instead of 2 as a scaling factor since just averaging cannot preserve Euclidean distance in the transformed data.
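The synthesis step in (2) can likewise be sketched. This is again an illustrative Haar implementation, with a helper that reconstructs from a truncated coefficient set by treating the missing details as zero.

```python
import math

def haar_inverse_step(approx, detail):
    """Invert one Haar level per the synthesis filters in (2): each
    (average, difference) pair recovers the two original values."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a - d) / s)  # first value of the pair
        out.append((a + d) / s)  # second value of the pair
    return out

def haar_reconstruct(approx, details_per_level):
    """Apply inverse steps from coarsest to finest level. Passing zeroed
    detail vectors yields an approximation of the original data."""
    for detail in details_per_level:
        approx = haar_inverse_step(approx, detail)
    return approx
```

Reconstructing from the full coefficient set recovers the data exactly (e.g. [5, 3, 2, 4] from approximation 7.0 and details [-1.0] and [-√2, √2]); zeroing the finest details instead yields the pairwise averages [4, 4, 3, 3].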
2.2 2D Wavelet Transform

To capture the 2D spatial thermal characteristics effectively in 3D integrated multi-core chips, we propose to use 2D wavelet analysis in this study. With 1D wavelet
analysis that uses Haar wavelet filters, each adjacent pair of data in a discrete interval
is replaced with its average and difference. A similar concept can be applied to obtain
a 2D wavelet transform of data in a discrete plane.

Fig. 3. Illustration of 2D wavelet transforms

As shown in Fig. 3, the 1D analysis filter bank is first applied to the rows of the data (horizontal filtering) and then to the columns (vertical filtering). This kind of 2D DWT decomposes the approximation coefficients at level j into four components: the approximation (LL) at level j+1, and the details in three orientations, i.e., horizontal (LH), vertical (HL), and diagonal (HH).

[(a) Original thermal behavior (temperature scale approximately 340-346); (b) 2D wavelet transformed thermal behavior, showing the approximation and detail subbands (LL, LH, HL, HH) at two decomposition levels]

Fig. 4. An example of using 2D DWT to capture thermal spatial characteristics

To obtain wavelet coefficients for 2D data, we apply a 1D wavelet transform to the data along the horizontal axis first, resulting in low-pass and high-pass signals (aver-
age and difference). Next, we apply 1D wavelet transforms to both signals along the
vertical axis generating one averaged and three detailed signals. Consequently, 2D
wavelet decomposition is obtained by recursively repeating this procedure on the
averaged signal.
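One level of this row-then-column procedure can be sketched as follows. This is an illustrative implementation for even-sized matrices, not the authors' code; the subbands are named here by filter order, first letter for the row pass and second for the column pass.

```python
import math

def _pairs(seq):
    """Haar low-pass/high-pass filtering of one row or column."""
    s = math.sqrt(2.0)
    low = [(a + b) / s for a, b in zip(seq[0::2], seq[1::2])]
    high = [(b - a) / s for a, b in zip(seq[0::2], seq[1::2])]
    return low, high

def dwt2_level(matrix):
    """One 2D Haar level: filter rows first, then columns, producing
    the four subbands LL (approximation) and LH, HL, HH (details)."""
    t = lambda m: [list(r) for r in zip(*m)]          # transpose helper
    lows, highs = zip(*(_pairs(row) for row in matrix))   # horizontal pass
    LLc, LHc = zip(*(_pairs(col) for col in t(lows)))     # vertical pass on L
    HLc, HHc = zip(*(_pairs(col) for col in t(highs)))    # vertical pass on H
    return t(LLc), t(LHc), t(HLc), t(HHc)
```

For the 2x2 block [[1, 2], [3, 4]], the single LL coefficient is the scaled overall average (5.0) and the detail coefficients capture the row, column and diagonal differences (2.0, 1.0 and 0.0).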
Fig. 4 illustrates the original thermal behavior and 2D wavelet transformed thermal
behavior. As can be seen, the 2D thermal characteristics can be effectively captured
using a small number of wavelet coefficients (e.g. Average (LL=1) or Average
(LL=2)). Since a small set of wavelet coefficients provides concise yet insightful in-
formation on 2D thermal spatial characteristics, we use predictive models (i.e. neural
networks) to relate them individually to various design parameters. Through inverse
2D wavelet transform, we use the small set of predicted wavelet coefficients to syn-
thesize 2D thermal spatial characteristics across the design space. Compared with a
simulation-based method, predicting a small set of wavelet coefficients using analyti-
cal models is computationally efficient and is scalable to explore the large thermal
design space of 3D multi-core architecture.

2.3 Neural Network

An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems process information. It is composed of
a set of interconnected processing elements working in unison to solve problems.

Fig. 5. The radial basis function network

The most common type of neural network (shown in Fig. 5) consists of three layers
of units: a layer of input units is connected to a layer of hidden units, which is con-
nected to a layer of output units. The input is fed into network through input units.
Each hidden unit receives the entire input vector and generates a response. The output
of a hidden unit is determined by the input-output transfer function that is specified
for that unit. Commonly used transfer functions include the sigmoid, linear threshold
function and radial basis function (RBF) [19]. The RBF is a special class of function
with response decreasing monotonically with distance from a central point. The cen-
ter, the distance scale, and the precise shape of the radial function are parameters of
the model. A typical radial function is the Gaussian which, in the case of a scalar
input, is
h(x) = exp( −(x − c)² / r² ) . (3)

Its parameters are its center c and its radius r. A neural network that uses RBF can
be expressed as
f(x) = Σ_{j=1}^{n} w_j h_j(x) . (4)

where w ∈ ℝⁿ is an adaptable (trainable) weight vector and {h_j(·)}_{j=1}^{n} are radial basis
functions of the hidden units. As shown in (4), the ANN output, which is determined
by the output unit, is computed using the responses of the hidden units and the
weights between the hidden and output units. Neural networks outperform linear
models in capturing complex, non-linear relations between input and output, which
make them a promising technique for tracking and forecasting complex thermal be-
havior.
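Equations (3) and (4) combine into a very small evaluation routine. The sketch below is illustrative; in practice the centers, radii and weights come from training rather than being chosen by hand.

```python
import math

def rbf(x, center, radius):
    """Gaussian radial basis function h(x) of Eq. (3)."""
    return math.exp(-((x - center) ** 2) / radius ** 2)

def rbf_network(x, centers, radii, weights):
    """Network output f(x) of Eq. (4): the weighted sum of the
    hidden units' radial responses."""
    return sum(w * rbf(x, c, r) for c, r, w in zip(centers, radii, weights))
```

At a hidden unit's center the Gaussian response is exactly 1, so a single-unit network returns its weight there; the response decays monotonically with distance from the center, as described above.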

3 Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction
We view the 2D spatial thermal characteristics yielded in 3D integrated multi-core
chips as a nonlinear function of architecture design parameters. Instead of inferring
the spatial thermal behavior via exhaustively obtaining temperature on each individ-
ual location, we employ wavelet analysis to approximate it and then use a neural net-
work to forecast the approximated thermal behavior across a large architectural design
space.

Fig. 6. Hybrid neuro-wavelet thermal prediction framework

Previous work [9, 10, 11, 12] shows that neural networks can accurately predict the
aggregated workload behavior across varied architecture configurations. Nevertheless,
monolithic global neural network models lack the ability to reveal complex thermal
behavior on a large scale. To overcome this disadvantage, we propose combining 2D
wavelet transforms and neural networks that incorporate multiresolution analysis into
a set of neural networks for spatial thermal characteristics prediction of 3D die
stacked multi-core design.
The 2D wavelet transform is a very powerful tool for characterizing spatial behav-
ior since it captures both global trend and local variation of large data sets using a
small set of wavelet coefficients. The local characteristics are decomposed into lower
scales of wavelet coefficients (high frequencies) which are utilized for detailed analy-
sis and prediction of individual or subsets of components, while the global trend is
decomposed into higher scales of wavelet coefficients (low frequencies) that are used
for the analysis and prediction of slow trends across each die. Collectively, these
wavelet coefficients provide an accurate interpretation of the spatial trend and details
of complex thermal behavior at a large scale. Our framework uses a separate RBF
neural network to predict each wavelet coefficient, and these predictions proceed
independently. Predicting each wavelet coefficient with a separate neural network
simplifies the training of each sub-network, and the sub-networks can be trained
concurrently. The prediction results for the wavelet
coefficients can be combined directly by the inverse wavelet transforms to synthesize
the 2D spatial thermal patterns across each die. Fig. 6 shows our hybrid neuro-wavelet
scheme for 2D spatial thermal characteristics prediction. Given the observed spatial
thermal behavior on training data, our aim is to predict the 2D thermal behavior of
each die in 3D die stacked multi-core processors under different design configura-
tions. The hybrid scheme involves three stages. In the first stage, the observed spatial
thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In
the second stage, each wavelet coefficient is predicted by a separate ANN. In the third
stage, the approximated 2D thermal characteristics are recovered from the predicted
wavelet coefficients. Each RBF neural network receives the entire architecture design
space vector and predicts a wavelet coefficient. The training of an RBF network in-
volves determining the center point and a radius for each RBF, and the weights of
each RBF, which determine the wavelet coefficients.
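The three-stage pipeline above can be sketched in a few dozen lines of numpy. This is a minimal illustration, not the paper's implementation: it keeps only the Haar approximation coefficients (the paper's multiresolution analysis also models selected detail scales), uses a synthetic stand-in for HotSpot output, and all function and variable names are ours.

```python
import numpy as np

def decompose(grid, levels=4):
    """Haar approximation path: average 2x2 blocks `levels` times.
    Detail coefficients are dropped here for brevity."""
    a = grid
    for _ in range(levels):
        a = 0.25 * (a[::2, ::2] + a[1::2, ::2] + a[::2, 1::2] + a[1::2, 1::2])
    return a.ravel()                      # 64x64 -> 4x4 -> 16 coefficients

def reconstruct(coeffs, levels=4):
    """Inverse of `decompose` with zeroed details (piecewise constant)."""
    a = coeffs.reshape(4, 4)
    for _ in range(levels):
        a = np.kron(a, np.ones((2, 2)))
    return a

class RBFNet:
    """Minimal Gaussian RBF regressor: centers at the training points,
    linear output weights solved by least squares."""
    def __init__(self, width=0.3):
        self.width = width

    def _phi(self, X):
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.width ** 2))

    def fit(self, X, y):
        self.centers = X
        self.w, *_ = np.linalg.lstsq(self._phi(X), y, rcond=None)
        return self

    def predict(self, X):
        return self._phi(X) @ self.w

def synth_thermal(theta, n=64):
    """Stand-in for a HotSpot run: a single hotspot whose position and
    height depend on a 3-entry design vector theta (purely illustrative)."""
    yy, xx = np.mgrid[0:n, 0:n] / n
    return 310.0 + 20.0 * theta[2] * np.exp(
        -((xx - theta[0]) ** 2 + (yy - theta[1]) ** 2) / 0.02)

# Stage 1: decompose the observed thermal maps of the training designs.
rng = np.random.default_rng(0)
train_thetas = rng.uniform(0.2, 0.8, size=(200, 3))
train_coeffs = np.array([decompose(synth_thermal(t)) for t in train_thetas])

# Stage 2: one RBF network per wavelet coefficient, trained independently.
nets = [RBFNet().fit(train_thetas, train_coeffs[:, k]) for k in range(16)]

# Stage 3: predict the coefficients of an unseen design and invert.
theta_new = np.array([0.4, 0.6, 0.5])
pred_coeffs = np.array([net.predict(theta_new[None, :])[0] for net in nets])
pred_map = reconstruct(pred_coeffs)
true_map = synth_thermal(theta_new)
me = float(np.mean(np.abs(pred_map - true_map) / true_map))
print(f"mean error: {me:.3%}")
```

Because each coefficient is handled by its own network, the 16 fits in stage 2 are independent and could run in parallel, which is the training simplification the text describes.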

4 Experimental Methodology

4.1 Floorplanning and Hotspot Thermal Model

In this study, we model four floor-plans that involve processor core and cache struc-
tures as illustrated in Fig. 7.
As can be seen, the processor core is placed at a different location in each floor-plan.
Any of the four floor-plans can be chosen for a layer in the studied 3D die-stacked
quad-core processors. The size and adjacency of blocks are critical parameters for
deriving the thermal model. The baseline core architecture and floorplan we modeled
are those of an Alpha processor, closely resembling the Alpha 21264.

Fig. 7. The selected floor-plans


110 C.-B. Cho, W. Zhang, and T. Li

Fig. 8 shows the baseline core floorplan. We assume a 65 nm process technology,
and the floor-plan is scaled accordingly. The entire die size is 21×21 mm and the core
size is 5.8×5.8mm. We consider three core configurations: 2-issue (5.8×5.8 mm), 4-
issue (8.14×8.14 mm) and 8-issue (11.5×11.5 mm). Since the total die area is fixed,
the more aggressive core configurations lead to smaller L2 caches. For all three types
of core configurations, we calculate the size of the L2 caches based on the remaining
die area available.

Fig. 8. Processor core floor-plan (blocks: RF, window (ROB+IQ), ALU, BPRED, LSQ, il1, dl1, and L2)

Table 1 lists the detailed processor core and cache configurations. We use Hotspot-
4.0 [20] to simulate the thermal behavior of the 3D quad-core chip shown in Fig. 9. The
Hotspot tool can specify the multiple layers of silicon and metal required to model a
three-dimensional IC. We choose the grid-based thermal modeling mode by specifying a set
of 64 × 64 thermal grid cells per die; the average temperature of each cell (≈328 µm ×
328 µm on the 21 mm die) is represented by a single value. Hotspot takes power consumption data for each
component block, the layer parameters and the floor-plans as inputs and generates the
steady-state temperature for each active layer.

Fig. 9. Cross section view of the simulated 3D quad-core chip



Table 1. Architecture configurations for different issue widths

Parameter            | 2-issue                                            | 4-issue                                            | 8-issue
Processor width      | 2-wide fetch/issue/commit                          | 4-wide fetch/issue/commit                          | 8-wide fetch/issue/commit
Issue queue          | 32 entries                                         | 64 entries                                         | 128 entries
ITLB                 | 32 entries, 4-way, 200 cycle miss                  | 64 entries, 4-way, 200 cycle miss                  | 128 entries, 4-way, 200 cycle miss
Branch predictor     | 512 entries Gshare, 10-bit global history          | 1K entries Gshare, 10-bit global history           | 2K entries Gshare, 10-bit global history
BTB                  | 512 entries, 4-way                                 | 1K entries, 4-way                                  | 2K entries, 4-way
Return address stack | 8 entries RAS                                      | 16 entries RAS                                     | 32 entries RAS
L1 I-cache           | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access  | 64K, 2-way, 32 Byte/line, 2 ports, 1 cycle access  | 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access
ROB size             | 32 entries                                         | 64 entries                                         | 96 entries
Load/store queue     | 24 entries                                         | 48 entries                                         | 72 entries
Integer ALU          | 2 I-ALU, 1 I-MUL/DIV, 2 load/store                 | 4 I-ALU, 2 I-MUL/DIV, 2 load/store                 | 8 I-ALU, 4 I-MUL/DIV, 4 load/store
FP ALU               | 1 FP-ALU, 1 FP-MUL/DIV/SQRT                        | 2 FP-ALU, 2 FP-MUL/DIV/SQRT                        | 4 FP-ALU, 4 FP-MUL/DIV/SQRT
DTLB                 | 64 entries, 4-way, 200 cycle miss                  | 128 entries, 4-way, 200 cycle miss                 | 256 entries, 4-way, 200 cycle miss
L1 D-cache           | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access  | 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle access | 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access
L2 cache             | unified 4MB, 4-way, 128 Byte/line, 12 cycle access | unified 3.7MB, 4-way, 128 Byte/line, 12 cycle access | unified 3.2MB, 4-way, 128 Byte/line, 12 cycle access
Memory               | 32-bit wide, 200 cycle access latency              | 64-bit wide, 200 cycle access latency              | 64-bit wide, 200 cycle access latency

To build a 3D multi-core processor simulator, we heavily modified and extended
the M-Sim simulator [21] and incorporated the Wattch power model [22]. The power
trace is generated from the developed framework with an interval size of 500K cycles.
We simulate a 3D-stacked quad-core processor with one core assigned to each layer.

4.2 Workloads and System Configurations

We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite
(e.g. bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf,
swim, vortex and vpr) to compose our experimental multiprogrammed workloads (see
Table 2). We categorize all benchmarks into two classes: CPU-bound and MEM
bound applications. We design three types of experimental workloads: CPU, MEM
and MIX. The CPU and MEM workloads consist of programs from only the CPU
intensive and memory intensive categories respectively. MIX workloads are the com-
bination of two benchmarks from the CPU intensive group and two from the memory
intensive group.
These multi-programmed workloads were simulated on our multi-core simulator
configured as 3D quad-core processors. We use the Simpoint tool [23] to obtain a
representative slice for each benchmark (with the full reference input set), and each
benchmark is fast-forwarded to its representative point before detailed simulation
takes place. The simulations continue until one benchmark within a workload finishes
the execution of the representative interval of 250M instructions.

Table 2. Simulation configurations

Chip frequency     | 3 GHz
Voltage            | 1.2 V
Process technology | 65 nm
Die size           | 21 mm × 21 mm

Workloads:
CPU1 | bzip2, eon, gcc, perlbmk
CPU2 | perlbmk, mesa, facerec, lucas
CPU3 | gap, parser, eon, mesa
MIX1 | gcc, mcf, vpr, perlbmk
MIX2 | perlbmk, mesa, twolf, applu
MIX3 | eon, gap, mcf, vpr
MEM1 | mcf, equake, vpr, swim
MEM2 | twolf, galgel, applu, lucas
MEM3 | mcf, twolf, swim, vpr

4.3 Design Parameters

In this study, we consider a design space that consists of 23 parameters (see Table 3)
spanning from floor-planning to packaging technologies. These design parameters have
been shown to have a large impact on processor thermal behavior. The ranges for these
parameters were set to include both typical and feasible design points within the explored
design space. Using detailed cycle-accurate simulations, we measure processor power
and thermal characteristics on all design points within both training and testing data sets.
We build a separate model for each benchmark domain and use the model to predict
thermal behavior at unexplored points in the design space. The training data set is used to
build the wavelet-based neural network models. An estimate of the model’s accuracy is
obtained by using the design points in the testing data set.
To train an accurate neural network prediction model promptly, one needs to ensure
that the sample data sets disperse points throughout the design space while keeping
the sample count small enough to maintain a low model-building cost. To achieve this
goal, we use a variant of Latin Hypercube Sampling (LHS) [24] as our sampling
strategy, since it provides better coverage than a naive random sampling scheme. We
generate multiple LHS matrices and apply a space-filling metric called the L2-star
discrepancy [25] to each, selecting the matrix with the lowest discrepancy as the
representative sample of the design space. We use a
randomly and independently generated set of test data points to empirically estimate
the predictive accuracy of the resulting models. In this work, we used 200 training
and 50 test design points, since our study shows that this offers a good tradeoff
between simulation time and prediction accuracy
for the design space we considered. In our study, the thermal characteristics across
each die are represented by 64×64 samples.
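The sampling step can be sketched as follows. This is a hedged illustration: the paper uses a variant of LHS, while the sketch below uses plain LHS over the unit hypercube, and the discrepancy is computed with Warnock's closed form for the L2-star discrepancy; the point counts (200 designs, 23 parameters) follow the text.

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """One n-point LHS matrix in [0,1)^d: each dimension is cut into n
    strata and every stratum is sampled exactly once."""
    strata = np.array([rng.permutation(n) for _ in range(d)]).T
    return (strata + rng.random((n, d))) / n

def l2_star_discrepancy(x):
    """Warnock's closed form for the L2-star discrepancy of a point set."""
    n, d = x.shape
    t1 = 3.0 ** (-d)
    t2 = (2.0 ** (1 - d) / n) * np.prod(1.0 - x ** 2, axis=1).sum()
    pair_max = np.maximum(x[:, None, :], x[None, :, :])
    t3 = np.prod(1.0 - pair_max, axis=2).sum() / n ** 2
    return float(np.sqrt(t1 - t2 + t3))

# 200 training designs over 23 parameters (Table 3): generate several
# LHS matrices and keep the one with the lowest discrepancy.
rng = np.random.default_rng(1)
candidates = [latin_hypercube(200, 23, rng) for _ in range(10)]
best = min(candidates, key=l2_star_discrepancy)
print(f"best L2-star discrepancy: {l2_star_discrepancy(best):.3e}")
```

In practice each row of the chosen matrix would then be rescaled to the low/high ranges of Table 3 before being fed to the simulator.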

Table 3. Design space parameters

Group | Parameter | Key | Low | High (or choices)
3D config., layer 0 | Thickness (m) | ly0_th | 5e-5 | 3e-4
3D config., layer 0 | Floorplan | ly0_fl | Flp 1/2/3/4
3D config., layer 0 | Bench | ly0_bench | CPU/MEM/MIX
3D config., layer 1 | Thickness (m) | ly1_th | 5e-5 | 3e-4
3D config., layer 1 | Floorplan | ly1_fl | Flp 1/2/3/4
3D config., layer 1 | Bench | ly1_bench | CPU/MEM/MIX
3D config., layer 2 | Thickness (m) | ly2_th | 5e-5 | 3e-4
3D config., layer 2 | Floorplan | ly2_fl | Flp 1/2/3/4
3D config., layer 2 | Bench | ly2_bench | CPU/MEM/MIX
3D config., layer 3 | Thickness (m) | ly3_th | 5e-5 | 3e-4
3D config., layer 3 | Floorplan | ly3_fl | Flp 1/2/3/4
3D config., layer 3 | Bench | ly3_bench | CPU/MEM/MIX
TIM (thermal interface material) | Heat capacity (J/m^3 K) | TIM_cap | 2e6 | 4e6
TIM (thermal interface material) | Resistivity (m K/W) | TIM_res | 2e-3 | 5e-2
TIM (thermal interface material) | Thickness (m) | TIM_th | 2e-5 | 75e-6
Heat sink | Convection capacity (J/K) | HS_cap | 140.4 | 1698
Heat sink | Convection resistance (K/W) | HS_res | 0.1 | 0.5
Heat sink | Side (m) | HS_side | 0.045 | 0.08
Heat sink | Thickness (m) | HS_th | 0.02 | 0.08
Heat spreader | Side (m) | HP_side | 0.025 | 0.045
Heat spreader | Thickness (m) | HP_th | 5e-4 | 5e-3
Others | Ambient temperature (K) | Am_temp | 293.15 | 323.15
Architecture | Issue width | Iss_size | 2/4/8

5 Experimental Results

In this section, we present detailed experimental results using 2D wavelet neural net-
works to forecast thermal behaviors of large scale 3D multi-core structures running
various CPU/MIX/MEM workloads without using detailed simulation.

5.1 Simulation Time vs. Prediction Time

To evaluate the effectiveness of our thermal prediction models, we compute the
speedup metric (defined as the ratio of simulation time to prediction time) across all
experimented workloads (shown in Table 4).
To calculate simulation time, we measured the time that the Hotspot simulator
takes to obtain steady thermal characteristics on a given design configuration. As can
be seen, the Hotspot tool simulation time varies with design configurations. We report
both shortest (best) and longest (worst) simulation time in Table 4. The prediction
time, which includes the time for the neural networks to predict the targeted thermal
behavior, remains constant for all studied cases.

Table 4. Simulation time vs. prediction time

Workload | Simulation (sec) [best : worst] | Prediction (sec) | Speedup (Sim./Pred.) [best : worst]
CPU1     | 362 : 6,091                     | 1.23             | 294 : 4,952
CPU2     | 366 : 6,567                     | 1.23             | 298 : 5,339
CPU3     | 365 : 6,218                     | 1.23             | 297 : 5,055
MEM1     | 351 : 5,890                     | 1.23             | 285 : 4,789
MEM2     | 355 : 6,343                     | 1.23             | 289 : 5,157
MEM3     | 367 : 5,997                     | 1.23             | 298 : 4,876
MIX1     | 352 : 5,944                     | 1.23             | 286 : 4,833
MIX2     | 365 : 6,091                     | 1.23             | 297 : 4,952
MIX3     | 360 : 6,024                     | 1.23             | 293 : 4,898

In our experiment, a total of 16 neural networks were used to predict 16 2D wavelet
coefficients, which efficiently capture workload thermal spatial characteristics. As can
be seen, our predictive models achieve a speedup ranging from 285
(MEM1) to 5339 (CPU2), making them suitable for rapidly exploring large thermal
design space.
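Since the prediction time is a constant 1.23 s, the tabulated speedups follow directly from the simulation times; for example:

```python
# Prediction time is a constant 1.23 s (Table 4); each speedup entry is
# the HotSpot simulation time divided by that constant.
PREDICTION_SEC = 1.23
for name, sim_sec in [("CPU2, worst case", 6567), ("MEM1, best case", 351)]:
    print(f"{name}: {sim_sec / PREDICTION_SEC:.0f}x speedup")
```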

5.2 Prediction Accuracy

The prediction accuracy measure is the mean error defined as follows:

ME = \frac{1}{N} \sum_{k=1}^{N} \frac{|\tilde{x}(k) - x(k)|}{x(k)}    (5)

where x(k) is the actual value generated by the Hotspot thermal model, \tilde{x}(k) is the
predicted value, and N is the total number of samples (a set of 64 × 64 temperature
samples per layer, detailed in Section 4.1). As prediction accuracy increases, the ME
becomes smaller.
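Eq. (5) translates directly into code; the numbers below are illustrative, not taken from the experiments:

```python
import numpy as np

def mean_error(actual, predicted):
    """Eq. (5): mean of |x~(k) - x(k)| / x(k) over all N samples."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.mean(np.abs(predicted - actual) / actual))

# Illustrative only: a 64 x 64 layer at a uniform 320 K, predicted 1% high.
actual = np.full((64, 64), 320.0)
predicted = actual * 1.01
print(mean_error(actual, predicted))
```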
We present boxplots to observe the average prediction errors and their deviations
for the 50 test configurations against Hotspot simulation results. Boxplots are graphi-
cal displays that measure location (median) and dispersion (interquartile range), iden-
tify possible outliers, and indicate the symmetry or skewness of the distribution. The
central box shows the data between “hinges” which are approximately the first and
third quartiles of the ME values. Thus, about 50% of the data are located within the
box and its height is equal to the interquartile range. The horizontal line in the interior
of the box is located at the median of the data; it shows the center of the distribution
for the ME values. The whiskers (the dotted lines extending from the top and bottom
of the box) extend to the extreme values of the data or a distance 1.5 times the inter-
quartile range from the median, whichever is less. The outliers are marked as circles.
In Fig. 10, the blue line with diamond shape markers indicates the statistics average
of ME across all benchmarks.
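The boxplot statistics described above can be computed as follows. Note that the whisker rule follows the definition given in the text (at most 1.5 × IQR from the median), and the error values are synthetic, for illustration only:

```python
import numpy as np

def boxplot_stats(values):
    """Summary used in the boxplots: hinges at the first and third
    quartiles; whiskers at the extreme values or at most 1.5 * IQR from
    the median (the definition used in the text); points beyond the
    whiskers are outliers."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo = x[x >= med - 1.5 * iqr].min()
    hi = x[x <= med + 1.5 * iqr].max()
    return {"median": med, "q1": q1, "q3": q3,
            "whiskers": (lo, hi),
            "outliers": x[(x < lo) | (x > hi)]}

# 50 synthetic test-configuration MEs around 7%, plus one extreme point.
rng = np.random.default_rng(7)
errors = np.append(rng.normal(0.07, 0.01, 50), 0.175)
s = boxplot_stats(errors)
print(s["median"], s["whiskers"], s["outliers"])
```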

Fig. 10. ME boxplots of prediction accuracies (number of wavelet coefficients = 16); the y-axis shows prediction error (%) per workload (CPU1–CPU3, MEM1–MEM3, MIX1–MIX3)

Fig. 10 shows that using 16 wavelet coefficients, the predictive models achieve
median errors ranging from 2.8% (CPU1) to 15.5% (MEM1) with an overall median
error of 6.9% across all experimented workloads. As can be seen, the maximum error
at any design point for any benchmark is 17.5% (MEM1), and most benchmarks show
an error less than 9%. This indicates that our hybrid neuro-wavelet framework can
predict 2D spatial thermal behavior across large and sophisticated 3D multi-core
architectures with high accuracy. Fig. 10 also indicates that CPU workloads (average
error 4.4%) have smaller error rates than MEM (average 9.4%) and MIX (average
6.7%) workloads. This is because the CPU workloads usually exhibit higher
temperatures in the small core area than in the large L2 cache area; these small and
sharp hotspots can be easily captured using just a few wavelet coefficients. On MEM
and MIX workloads, the complex thermal pattern can spread across the entire die area,
resulting in higher prediction error.

Fig. 11. The simulated and predicted thermal behavior (prediction vs. simulation panels for CPU1, MEM1 and MIX1)

Fig. 11 illustrates the simulated and predicted 2D thermal spatial behavior of die 4
(for one configuration) on CPU1, MEM1 and MIX1 workloads. The results show that
our predictive models can track both the size and location of thermal hotspots. We further
examine the accuracy of predicting locations and area of the hottest spots and the
results are similar to those presented in Figure 10.
Fig. 12 shows the prediction accuracies with different number of wavelet coeffi-
cients on multi-programmed workloads CPU1, MEM1 and MIX1. In general, the 2D

thermal spatial pattern prediction accuracy increases as more wavelet coefficients are
involved. However, the complexity of the predictive models is proportional to the
number of wavelet coefficients, so cost-effective models should provide high
prediction accuracy while maintaining low complexity. The trend of prediction
accuracy shown in Fig. 12 suggests that for the programs we studied, a set of 16
wavelet coefficients combines good accuracy with low model complexity; increasing
the number of wavelet coefficients beyond this point reduces error at a diminishing
rate, except on the MEM1 workload. Thus, we select 16 wavelet coefficients in this
work to minimize the complexity of prediction models while achieving good
accuracy.

Fig. 12. ME boxplots of prediction accuracies with different numbers of wavelet coefficients (16, 32, 64, 96, 128, 256) on CPU1, MEM1 and MIX1; the y-axes show error (%)

We further compare the accuracy of our proposed scheme with that of approximating
3D stacked-die spatial thermal patterns by predicting the temperature at 16 evenly
distributed locations across the 2D plane. The results shown in Fig. 13 indicate that,
using the same number of neural networks, our scheme yields significantly higher
accuracy than conventional predictive models. This is because wavelets provide good
time and locality characterization, and most of the energy is captured by a limited set
of important wavelet coefficients. Together, these wavelet coefficients provide a
superior interpretation of the spatial patterns across scales in the time and frequency
domains.
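The energy-compaction argument can be illustrated with an orthonormal Haar transform of a synthetic hotspot map: a handful of coefficients carry almost all of the signal energy. The map and its parameters below are ours, chosen purely for illustration, not the paper's data:

```python
import numpy as np

def haar_step(a):
    """One level of the orthonormal 2D Haar transform."""
    r = 2.0 ** -0.5
    s, d = r * (a[::2] + a[1::2]), r * (a[::2] - a[1::2])   # rows
    def split(m):                                           # columns
        return r * (m[:, ::2] + m[:, 1::2]), r * (m[:, ::2] - m[:, 1::2])
    (ll, lh), (hl, hh) = split(s), split(d)
    return ll, lh, hl, hh

def haar_coeffs(a):
    """Full decomposition: all detail coefficients plus the final average."""
    out, ll = [], a
    while ll.shape[0] > 1:
        ll, lh, hl, hh = haar_step(ll)
        out += [lh.ravel(), hl.ravel(), hh.ravel()]
    out.append(ll.ravel())
    return np.concatenate(out)

# A synthetic 64 x 64 map with one hotspot (illustrative only).
n = 64
yy, xx = np.mgrid[0:n, 0:n] / n
temp = 310.0 + 15.0 * np.exp(-((xx - 0.3) ** 2 + (yy - 0.4) ** 2) / 0.03)

c = haar_coeffs(temp)
top16 = np.sort(np.abs(c))[::-1][:16]
energy_share = float((top16 ** 2).sum() / (c ** 2).sum())
print(f"energy in the 16 largest coefficients: {energy_share:.5f}")
```

Because the transform is orthonormal, total energy is preserved, so the share captured by the 16 largest coefficients can be read off directly.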

Fig. 13. The benefit of predicting wavelet coefficients: error (%) when predicting the wavelet coefficients vs. predicting the raw data, for all nine workloads

Our RBF neural networks were built using a regression tree based method. In the
regression tree algorithm, all input parameters (refer to Table 3) were ranked based on
split frequency. The input parameters which cause the most output variation tend to
be split frequently in the constructed regression tree. Therefore, the input parameters
that largely determine the values of a wavelet coefficient have a larger number of
splits. We present in Fig. 14 (as a star plot) the most frequent splits within the
regression tree that models the most significant wavelet coefficient.
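The split-frequency ranking can be reproduced in miniature with a greedy CART-style regression tree that tallies the split variable at every internal node. This is a toy sketch on synthetic data: the parameter names are taken from Table 3, but the response function is invented for illustration.

```python
import numpy as np
from collections import Counter

def build_tree(X, y, counts, depth=0, max_depth=4, min_leaf=5):
    """Greedy CART-style regression tree that only records which input
    parameter is chosen as the split variable at each internal node."""
    if depth == max_depth or len(y) < 2 * min_leaf or y.var() == 0.0:
        return
    best = None
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for i in range(min_leaf, len(y) - min_leaf):
            if xs[i] == xs[i - 1]:
                continue
            # Sum of within-child squared errors for this candidate split.
            sse = ys[:i].var() * i + ys[i:].var() * (len(y) - i)
            if best is None or sse < best[0]:
                best = (sse, j, 0.5 * (xs[i] + xs[i - 1]))
    if best is None:
        return
    _, j, thr = best
    counts[j] += 1
    left = X[:, j] <= thr
    build_tree(X[left], y[left], counts, depth + 1, max_depth, min_leaf)
    build_tree(X[~left], y[~left], counts, depth + 1, max_depth, min_leaf)

# Synthetic study: four of the Table 3 keys, with an invented response
# dominated by HS_res -- so HS_res should collect the most splits.
names = ["ly0_th", "HS_res", "am_temp", "Iss_size"]
rng = np.random.default_rng(3)
X = rng.random((200, 4))
y = 10.0 * X[:, 1] ** 2 + 0.3 * X[:, 2] + 0.01 * rng.random(200)
counts = Counter()
build_tree(X, y, counts)
ranking = [names[j] for j, _ in counts.most_common()]
print(ranking, dict(counts))
```

The resulting per-parameter split counts are exactly the quantities a star plot like Fig. 14 visualizes.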

Fig. 14. The roles of input parameters: a star plot of split frequencies for the design parameters of Table 3 (ly0_th through Iss_size), drawn clockwise for CPU1, MEM1 and MIX1

A star plot [15] is a graphical data analysis method for representing the relative
behavior of all variables in a multivariate data set. The size of each parameter's sector
is proportional to the magnitude of that variable for the data point, relative to the
maximum magnitude of the variable across all data points. From the star plot, we can
obtain information such as: Which variables are dominant in a given data set? Which
observations show similar behavior? As can be seen, the floor-planning of each layer
and the core configuration largely affect the thermal spatial behavior of the studied
workloads.

6 Related Work
There have been several attempts to build thermal-aware microarchitectures [3, 20, 27,
28]. [27, 28] propose invoking energy-saving techniques when the temperature
exceeds a predefined threshold. [5] proposes a performance- and thermal-aware
floor-planning algorithm to estimate power and thermal effects for 2D and 3D
architectures using an automated floor-planner with iterative simulations. To our
knowledge, little research has been completed so far on developing accurate and
informative analytical methods to forecast the complex thermal spatial behavior of
emerging 3D multi-core processors at an early architecture design stage.
Researchers have successfully applied wavelet techniques in many fields, including
image and video compression, financial data analysis, and various fields in computer
science and engineering [29, 30]. In [31], Joseph and Martonosi used wavelets to
analyze and predict the change of processor voltage over time. In [32], wavelets were
used to improve accuracy, scalability, and robustness in program phase analysis. In
[33], the multiresolution analysis capability of wavelets was exploited to analyze
phase complexity. These studies, however, made no attempt to link architecture wave-
let domain behavior to various design parameters.
In [13] Joseph et al. developed linear models using D-optimal designs to identify
significant parameters and their interactions. Lee and Brooks [14, 15] proposed re-
gression on cubic splines for predicting the performance and power of applications
executing on microprocessor configurations in a large microarchitectural design
space. Neural networks have been used in [9, 10, 11, 12] to construct predictive mod-
els that correlate processor performance characteristics with the design parameters.
The above studies all focus on analyzing and predicting aggregated architecture char-
acteristics and assume monolithic architecture designs while our work aims to model
heterogeneous 2D thermal behavior. Our work significantly extends the scope of
these existing studies and is distinct in its use of 2D multiscale analysis to character-
ize the spatial thermal behavior of large-scale 3D multi-core architecture substrate.

7 Conclusions
Leveraging 3D die stacking technologies in multi-core processor design has gained
increasing momentum in both the chip design industry and the research community.
One of the major roadblocks to realizing 3D multi-core designs is their inefficient heat
dissipation. To ensure thermal efficiency, processor architects and chip designers rely
on detailed yet slow simulations to model thermal characteristics and analyze various
design tradeoffs. However, due to the sheer size of the design space, such techniques
are very expensive in terms of time and cost.
In this work, we aim to develop computationally efficient methods and models
which allow architects and designers to rapidly yet informatively explore the large
thermal design space of 3D multi-core architecture. Our models achieve several or-
ders of magnitude speedup compared to simulation-based methods. Meanwhile, our
model significantly improves prediction accuracy compared to conventional predictive
models of the same complexity. More attractively, our models can capture complex
2D thermal spatial patterns and can be used to forecast both the

location and the area of thermal hotspots during thermal-aware design exploration. In
light of the emerging 3D multi-core design era, we believe that the proposed thermal
predictive models will be valuable for architects to quickly and informatively examine
a rich set of thermal-aware design alternatives and thermal-oriented optimizations for
large and sophisticated architecture substrates at an early design stage.

References
[1] Banerjee, K., Souri, S., Kapur, P., Saraswat, K.: 3-D ICs: A Novel Chip Design for Im-
proving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integra-
tion. Proceedings of the IEEE 89, 602–633 (2001)
[2] Tsai, Y.F., Wang, F., Xie, Y., Vijaykrishnan, N., Irwin, M.J.: Design Space Exploration
for 3-D Cache. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 16(4)
(April 2008)
[3] Black, B., Nelson, D., Webb, C., Samra, N.: 3D Processing Technology and its Impact on
IA32 Microprocessors. In: Proc. of the 22nd International Conference on Computer De-
sign, pp. 316–318 (2004)
[4] Reed, P., Yeung, G., Black, B.: Design Aspects of a Microprocessor Data Cache using 3D
Die Interconnect Technology. In: Proc. of the International Conference on Integrated Cir-
cuit Design and Technology, pp. 15–18 (2005)
[5] Healy, M., Vittes, M., Ekpanyapong, M., Ballapuram, C.S., Lim, S.K., Lee, H.S., Loh,
G.H.: Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs. IEEE Trans.
on Computer Aided Design of IC and Systems 26(1), 38–52 (2007)
[6] Lim, S.K.: Physical design for 3D system on package. IEEE Design & Test of Com-
puters 22(6), 532–539 (2005)
[7] Puttaswamy, K., Loh, G.H.: Thermal Herding: Microarchitecture Techniques for Control-
ling Hotspots in High-Performance 3D-Integrated Processors. In: HPCA (2007)
[8] Wu, Y., Chang, Y.: Joint Exploration of Architectural and Physical Design Spaces with
Thermal Consideration. In: ISLPED (2005)
[9] Joseph, P.J., Vaswani, K., Thazhuthaveetil, M.J.: A Predictive Performance Model for
Superscalar Processors. In: MICRO (2006)
[10] Ipek, E., McKee, S.A., Supinski, B.R., Schulz, M., Caruana, R.: Efficiently Exploring Ar-
chitectural Design Spaces via Predictive Modeling. In: ASPLOS (2006)
[11] Yoo, R.M., Lee, H., Chow, K., Lee, H.H.S.: Constructing a Non-Linear Model with Neu-
ral Networks For Workload Characterization. In: IISWC (2006)
[12] Lee, B., Brooks, D., Supinski, B., Schulz, M., Singh, K., McKee, S.: Methods of Infer-
ence and Learning for Performance Modeling of Parallel Applications. In: PPoPP 2007
(2007)
[13] Joseph, P.J., Vaswani, K., Thazhuthaveetil, M.J.: Construction and Use of Linear Regres-
sion Models for Processor Performance Analysis. In: HPCA (2006)
[14] Lee, B., Brooks, D.: Accurate and Efficient Regression Modeling for Microarchitectural
Performance and Power Prediction. In: ASPLOS (2006)
[15] Lee, B., Brooks, D.: Illustrative Design Space Studies with Microarchitectural Regression
Models. In: HPCA (2007)
[16] Daubechies, I.: Ten Lectures on Wavelets. Capital City Press, Montpelier (1992)
[17] Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Englewood
Cliffs (1999)

[18] Daubechies, I.: Orthonormal Bases of Compactly Supported Wavelets. Communications
on Pure and Applied Mathematics 41, 906–966 (1988)
[19] Orr, M., Takezawa, K., Murray, A., Ninomiya, S., Leonard, T.: Combining Regression
Trees and Radial Basis Function Networks. International Journal of Neural Systems
(2000)
[20] Skadron, K., et al.: Temperature-Aware Microarchitecture. In: ISCA (2003)
[21] http://www.cs.binghamton.edu/~jsharke/m-sim/
[22] Brooks, D., Tiwari, V., Martonosi, M.: Wattch: A framework for architectural-level
power analysis and optimizations. In: ISCA (2000)
[23] Sherwood, T., Perelman, E., Hamerly, G., Calder, B.: Automatically Characterizing Large
Scale Program Behavior. In: ASPLOS (2002)
[24] Cheng, J., Druzdzel, M.J.: Latin Hypercube Sampling in Bayesian Networks. In: FLAIRS
(2000)
[25] Vandewoestyne, B., Cools, R.: Good Permutations for Deterministic Scrambled Halton
Sequences in Terms of L2-discrepancy. Journal of Computational and Applied Mathemat-
ics 189(1-2) (2006)
[26] Chambers, J., Cleveland, W., Kleiner, B., Tukey, P.: Graphical Methods for Data Analy-
sis. Wadsworth (1983)
[27] Brooks, D., Martonosi, M.: Dynamic Thermal Management for High-Performance Mi-
croprocessors. In: HPCA (2001)
[28] Gunther, S., Binns, F., Carmean, D.M., Hall, J.C.: Managing the Impact of Increasing
Microprocessor Power Consumption. Intel Technology Journal, Q1 (2001)
[29] Mallat, S.: Multifrequency Channel Decompositions of Images and Wavelet Models.
IEEE Trans. on Acoustic, Speech, and Signal Processing 37, 2091–2110 (1989)
[30] Feldmann, A., Gilbert, A.C., Willinger, W., Kurtz, T.G.: The Changing Nature of Net-
work Traffic: Scaling Phenomena. ACM Computer Communication Review 28, 5–29
(1998)
[31] Joseph, R., Hu, Z.G., Martonosi, M.: Wavelet Analysis for Microprocessor Design: Ex-
periences with Wavelet-Based dI/dt Characterization. In: HPCA (2004)
[32] Cho, C.B., Li, T.: Using Wavelet Domain Workload Execution Characteristics to Improve
Accuracy, Scalability and Robustness in Program Phase Analysis. In: ISPASS (2007)
[33] Cho, C.B., Li, T.: Complexity-based Program Phase Analysis and Classification.
In: PACT (2006)
Generation, Validation and Analysis of SPEC
CPU2006 Simulation Points Based on Branch,
Memory and TLB Characteristics

Karthik Ganesan, Deepak Panwar, and Lizy K. John

University of Texas at Austin,


1 University Station C0803, Austin, TX 78712, USA

Abstract. The SPEC CPU2006 suite, released in August 2006, is the cur-
rent industry-standard, CPU-intensive benchmark suite, created from a
collection of popular modern workloads. These workloads take machine-
weeks to months of time when fed to cycle-accurate simulators and
exhibit widely varying behavior even over large scales of time [1].
Notably, simulation-based papers using SPEC CPU2006 have not
appeared even 1.5 years after its release. A well-known technique to
solve this problem is the use of simulation points [2]. We have gener-
ated the simulation points for SPEC CPU2006 and made it available at
[3]. We also report the accuracies of these simulation points based on
the CPI, branch mispredictions, and cache & TLB miss ratios by comparing
with the full runs for a subset of the benchmarks. Until now, simulation
points have only been used for cache, branch and CPI studies; this is
the first attempt at validating them for TLB studies. They have been
found to be equally representative in depicting the TLB characteristics.
Using the generated simulation points, we
provide an analysis of the behavior of the workloads in the suite for dif-
ferent branch predictor & cache configurations and report the optimally
performing configurations. The simulations for the different TLB con-
figurations revealed that using large page sizes significantly reduces
translation misses and helps improve the overall CPI of the modern
workloads.

1 Introduction
Understanding program behaviors through simulations is the foundation for com-
puter architecture research and program optimization. These cycle accurate sim-
ulations take machine weeks of time on most modern realistic benchmarks like
the SPEC [4] [5] [6] suites, incurring a prohibitively large time cost. This problem
is further aggravated by the need to simulate different micro-architectures to test
the efficacy of a proposed enhancement. This motivates techniques [7] [8] that
facilitate faster simulation of large workloads like the SPEC suites. One such
well-known technique is simulation points.
While there are Simulation Points for the SPEC CPU2000 suite widely available
and used, the simulation points are not available for the SPEC CPU2006 suite.

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 121–137, 2009.

c Springer-Verlag Berlin Heidelberg 2009

We used the SimPoint [9] [10] [11] tool to generate these simulation points for
the SPEC2006 benchmark suite and provide it for use at [3].
The contributions of this paper are two-fold. The first contribution is the
creation of the simulation points, which we make available at [3] to the rest
of the architecture research community. We also provide the accuracy of these
simulation points by comparing the results with the full runs of select benchmarks.
It must be noted that 1.5 years after the release of SPEC CPU2006, simulation-
based papers using CPU2006 are still not appearing in architecture conferences.
The availability of simulation points for CPU2006 will change this situation.
The second contribution is the use of CPU2006 simulation points for branch
predictor, cache & TLB studies. Our ultimate goal was to find the optimal branch
predictor, the cache and the TLB configurations which provide the best perfor-
mance on most of the benchmarks. For this, we analyzed the benchmark results
for different set of static and dynamic branch predictors [12] and tried to come
up with the ones that perform reasonably well on most of the benchmarks. We
then varied the size of one of these branch predictors to come up with the best
possible size for a hardware budget. A similar exercise was performed to come
up with the optimum instruction and data cache design parameters. We varied
both the associativity and size of caches to get an insight into the best perform-
ing cache designs for the modern SPEC CPU workloads. The performance for
different TLB configurations was also studied to infer the effect of different TLB
parameters like the TLB size, page size and associativity.
It should be noted that such a study without simulation points would take
several machine-weeks. Since the accuracy of the simulation points was verified
with several full runs, we are fairly confident of the usefulness of the results.

2 Background

Considerable work has been done in investigating the dynamic behavior of
current-day programs. It has been seen that dynamic behavior varies over
time in a way that is not random but structured [1] [13], as sequences of a
number of short recurring behaviors. The SimPoint [2] tool tries to intelli-
gently choose and cluster these representative samples together so that they
represent the entire execution of the program. This small set of samples,
called simulation points, when simulated and weighted appropriately, provides
an accurate picture of the complete execution of the program with a large
reduction in simulation time.
Using Basic Block Vectors [14], the SimPoint tool [9][10][11] employs the
K-means clustering algorithm to group intervals of execution such that the
intervals in one cluster are similar to each other and the intervals in different
clusters are different from one another. The Manhattan distance between two Basic
Block Vectors serves as the metric for the similarity of the corresponding
intervals. The SimPoint tool takes the maximum number of clusters as input and
generates a representative simulation point for each cluster. The representative
simulation point is chosen as the one with the minimum distance from the
Generation, Validation and Analysis of SPEC CPU2006 Simulation Points 123

centroid of the cluster. Each simulation point is assigned a weight based
on the number of intervals grouped into its corresponding cluster. These weights
are normalized so that they sum to unity.
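The selection procedure described above can be sketched in a few lines. This is an illustrative reimplementation of the idea, not the SimPoint tool itself: the function names, the deterministic initialization, and the toy two-dimensional BBVs are our own simplifications (real SimPoint adds random projection of the BBVs and BIC-based selection of the cluster count).

```python
def manhattan(a, b):
    # Manhattan distance between two Basic Block Vectors
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans(bbvs, k, iters=50):
    # Plain k-means with Manhattan distance (illustrative only)
    centroids = [list(v) for v in bbvs[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in bbvs:
            c = min(range(k), key=lambda i: manhattan(v, centroids[i]))
            clusters[c].append(v)
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

def simulation_points(bbvs, k):
    centroids, clusters = kmeans(bbvs, k)
    points, weights = [], []
    for cent, cl in zip(centroids, clusters):
        if not cl:
            continue
        # Representative interval = the one closest to the cluster centroid
        rep = min(cl, key=lambda v: manhattan(v, cent))
        points.append(bbvs.index(rep))       # interval number to simulate
        weights.append(len(cl) / len(bbvs))  # normalized weight
    return points, weights

# Five toy 2-dimensional BBVs forming two obvious program phases
bbvs = [[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1], [0.1, 0.9]]
print(simulation_points(bbvs, 2))  # two interval indices, weights summing to 1
```

The returned weights are exactly the normalized cluster sizes, which is what makes the weighted aggregation of per-point metrics in Section 3 valid.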

3 Methodology
In this paper we used the sim-fast and sim-outorder simulators of the SimpleScalar
toolset [6] along with the SimPoint tool to generate the simulation points for the SPEC
CPU2006 suite. Figure 1 shows a flowchart of the methodology.
We used the sim-fast simulator to identify the different basic blocks in the static code
of the benchmark and to generate a Basic Block Vector for every fixed dynamic
interval of execution of the program. We chose an interval size of 100 million
instructions. Further, these basic block vectors are fed as input to the clustering
algorithm of the SimPoint tool, which generates the different simulation points
(collection of Basic Block Vectors) and their corresponding weights. Having ob-
tained the simulation points and their corresponding weights, the simulation
points are tested by fast-forwarding (i.e., executing the program without per-
forming any cycle accurate simulation, as described in [3]) up to the simulation
point, and then running a cycle accurate simulation for 100 million instructions.
The sim-outorder tool provides a convenient method of fast-forwarding to simulate
programs in the manner described above. Fast-forwarding a program implies
only a functional simulation and avoids any time-consuming detailed cycle-accurate
measurements. Statistics such as CPI (Cycles Per Instruction), cache
misses, and branch mispredictions are recorded for each simulation point. The
metrics for the overall program were computed based on the weight of each simulation
point. Each individual simulation point is simulated in parallel, and
the results are aggregated based on the corresponding normalized weights.
For example, the CPI was computed by multiplying the CPI of each individual
simulation point with its corresponding weight, as in Eqn. (1).

CPI = \sum_{i=0}^{n} CPI_i \cdot weight_i    (1)
On the other hand, ratio-based metrics such as the branch misprediction rate and
the cache miss ratio were computed by weighting the numerator and the denominator
correspondingly, as in Eqn. (2).
MissRatio = \frac{\sum_{i=0}^{n} misses_i \cdot weight_i}{\sum_{i=0}^{n} lookups_i \cdot weight_i}    (2)
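In code, the aggregation of Eqns. (1) and (2) amounts to the following sketch; the per-point statistics are made-up illustrative numbers, not measured values.

```python
def aggregate_cpi(cpis, weights):
    # Eqn (1): overall CPI is the weighted sum of per-point CPIs
    return sum(c * w for c, w in zip(cpis, weights))

def aggregate_miss_ratio(misses, lookups, weights):
    # Eqn (2): weight numerator and denominator separately, then divide
    num = sum(m * w for m, w in zip(misses, weights))
    den = sum(l * w for l, w in zip(lookups, weights))
    return num / den

# Hypothetical statistics for three simulation points
weights = [0.5, 0.3, 0.2]            # normalized cluster weights, sum to 1
cpis    = [1.2, 0.8, 2.0]
misses  = [400, 100, 900]
lookups = [10000, 10000, 10000]

print(aggregate_cpi(cpis, weights))                    # ~1.24
print(aggregate_miss_ratio(misses, lookups, weights))  # ~0.041
```

Note that weighting the numerator and denominator separately is what keeps the ratio unbiased; averaging the per-point miss ratios directly would over-weight points with few lookups.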
The accuracy of the generated simulation points was studied by performing
full program simulations using the sim-outorder simulator and comparing
metrics such as CPI, cache miss ratios, and branch mispredictions. This validation
was performed to assess the effectiveness of the SimPoint methodology on the SPEC
CPU2006 [15] suite in depicting the true behavior of the programs. Since
sim-outorder runs on SPEC CPU2006 take machine-weeks of time, we restricted
ourselves to running only a few selected benchmarks for this purpose.
124 K. Ganesan, D. Panwar, and L.K. John

[Fig. 1 flowchart: benchmark → sim-fast → BBVs → SimPoint engine → simulation points + weights → n parallel sim-outorder runs → aggregate data, compared against a full sim-outorder run to obtain error % and speedup]
Fig. 1. Simulation point Generation and Verification

For studying the branch behavior of the suite we once again used the sim-
outorder simulator available in SimpleScalar [6]. This tool has in-built imple-
mentation for most of the common static and dynamic branch predictors namely
Always Taken, Always Not-Taken, Bimodal, Gshare and other Twoway adaptive
predictors. We studied the influence of above predictors on the program behav-
ior in terms of common metrics like execution time, CPI, branch misprediction.
One of the best performing predictors was chosen and the Pattern History Table
(PHT) size was varied and the results were analyzed to come up with an optimal
size for the PHT.
To get an insight into the memory and TLB behavior of the Suite, the same
sim-outorder simulator was employed, using which the configurations for the
different levels of the cache hierarchy and TLB were specified. We obtained
the corresponding hit and miss rate for various configurations along with their
respective CPIs.

4 Simulation Points Generation and Verification


Figure 2 shows the sim-fast results for the SPECINT and SPECFP benchmarks.
The tables in Figures 2 and 3 show the number of simulation points

Fig. 2. SPEC CPU2006 - number of simulation points, total number of instructions,
and the simulation time taken by the sim-fast simulator of SimpleScalar LLC.
Note that sim-outorder takes an order of magnitude more time than sim-fast

Fig. 3. Speedup obtained by using the simulation points. The simulation point runs
were done at the Texas Advanced Computing Center and the full runs on a quad-core
2 GHz Xeon processor

generated for each of the benchmarks along with their instruction count and
simulation time on a 2 GHz Xeon machine. The interval of execution given
to the sim-fast simulator was 100 million instructions, and the maximum number
of clusters given to the SimPoint tool was 30. These simulation points were
launched as parallel jobs on the Texas Advanced Computing Center (TACC) using
the sim-outorder simulator. A node on TACC could be 2x to 3x faster
than the Xeon machine to which the execution times are compared; but the
speedup numbers here are so high that this discrepancy in machine
speeds can be safely ignored. The final aggregated metrics for the simulation
point runs were calculated using the formulae given in the previous section.
Full-run simulations were also carried out for a few integer and floating point

Fig. 4. CPI comparison between full runs and simulation point runs


Fig. 5. Branch misprediction rate comparison between full runs and simulation point
runs

benchmarks, and the accuracy of the generated simulation points was obtained
by comparing the results.
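Since each simulation point is an independent job, the achievable wall-clock speedup is governed by the slowest single point rather than by the sum of all points. The sketch below makes this concrete; the 400-hour full run and 2-hour per-point timings are invented for illustration only.

```python
def speedup(full_run_hours, point_hours):
    # Simulation points run as independent parallel jobs, so the elapsed
    # time is set by the slowest point; a serial machine pays the sum.
    parallel = full_run_hours / max(point_hours)
    serial = full_run_hours / sum(point_hours)
    return parallel, serial

# Hypothetical timings: a 400-hour full run vs. 12 points of 2 hours each
par, ser = speedup(400, [2.0] * 12)
print(par, ser)  # 200x in parallel, ~16.7x even if the points run serially
```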
To verify the accuracy of the simulation points, we further compared the CPIs
and cache miss ratios of the simulation point runs to those of the full runs and
analyzed the speedup obtained from using simulation points. The configuration
that we used for both the full and the simulation point runs has an RUU size of 128,
an LSQ size of 64, decode, issue, and commit widths of 8, L1 data
and instruction caches of 256 sets with a 64B block size and an associativity of 2, and L2
Fig. 6. Instruction cache miss ratio comparison between full runs and simulation point
runs

data and instruction caches of 4096 sets with a 64B block size and an associativity of
4. The ITLB size used was 32 sets with a 4K block size and an associativity of 4.
The DTLB size used was 64 sets, a 4K block size, and an associativity of 4. The
number of integer ALUs was set to 4 and the number of floating point ALUs
to 2. A combined branch predictor with a meta-table size of 2048 was used. The
error percentage in CPI and the speedup obtained due to the use of simulation
points are given in Figures 3 and 4. Clearly, performing the simulation using
the generated simulation points results in considerable speedup without much
loss in accuracy, reducing machine-weeks of time to a few hours. The CPI
values obtained using simulation points were within 5 percent of the full-run CPI
values for all the benchmarks except 401.bzip2, where the value was off by around
8 percent. Even the errors in data and instruction cache miss rates, DTLB miss rates,
and branch misprediction ratios were within a limit of 5 percent for most of
the benchmarks, except bzip2 and libquantum, which have errors of 11% and
13% in the branch misprediction rates. Figures 4, 5, 6, and 7 show the errors in the
values of CPI, branch mispredictions, data cache, instruction cache, and DTLB
miss rates for a set of benchmarks. Though the concept of simulation points has
been widely used in various studies of caches, branch predictors, etc., this is
the first attempt at validating and studying the TLB characteristics based
on simulation points. It is quite evident from the results that these simulation
points are representative of the whole benchmark even in terms of TLB
characteristics. Though the methodology used by SimPoint is microarchitecture
independent, this validation was performed by taking one specific platform (Alpha)
as a case study, and the error rates may vary for other platforms.
Fig. 7. Data cache and DTLB miss rate comparison between full runs and simulation
point runs

We hope that these simulation points, provided at [3], will serve as a
powerful tool to aid in carrying out faster simulations using the large and
representative benchmarks of the SPEC CPU2006 suite. The reference provided has
the simulation points for 21 benchmarks; we are in the process of generating
the remaining simulation points, which will also be added to the same reference.

5 Simulation Results and Analysis


5.1 Branch Characteristics
As mentioned earlier, sim-outorder supports both static and dynamic branch
predictors. Static predictors are well suited to embedded applications due to
their simplicity and low power requirements. Static predictors are also employed
in the simple cores of single-chip multiprocessors like Niagara [15],
where there are strict bounds on the area and power consumption of each core.
They are also commonly used as backup predictors in superscalar processors, which
require an early rough prediction during training time and when there are misses
in the Branch Target Buffer. On the other hand, dynamic predictors give superior
performance compared to static ones, but at the cost of increased power and
area, as implemented in modern complex x86 processors.
Fig. 8 shows the CPI results for two common types of static branch predictors,
viz., Always Taken and Always Not-Taken. As expected, it is clear from Fig. 8
and Fig. 10 that the performance of static predictors is quite poor compared to
the perfect predictor. Always Taken has the overhead of branch target calculation,
but most of the branches in loops are taken.
Fig. 9 shows the CPI results for some common dynamic branch predictors. In
this paper, we have studied the performance of the following dynamic predictors
viz., Bimodal, Combined, Gshare, PAg and GAp. The configurations that were
used for these predictors respectively are,

– Bimodal - 2048
– Combined - 2048 (Meta table size)
– Gshare - 1:8192:13:1
– PAg - 256:4096:12:0
– GAp - 1:8192:10:0

Gshare, PAg, and GAp are 2-level predictors, and their configurations are given
in the format {l1size:l2size:hist size:xor}. Clearly, the CPI values obtained using
dynamic predictors are much closer to the values obtained from the perfect
predictor. Also, among these predictors, the Gshare and Combined branch predictors
perform much better than the others. Taking a closer look at the graphs,
we see that the Gshare predictor is ideal in the case of the FP benchmarks, while
the combined predictor fares better for the integer benchmarks. Also, PAg performs
better than the GAp predictor, which indicates that a predictor with a global
Pattern History Table (PHT) performs better than one with a private PHT.
This clearly shows that constructive interference in a global PHT helps the
modern workloads and results in an improved CPI.
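For concreteness, the gshare scheme can be sketched as follows. This is a textbook-style model with 2-bit saturating counters, not SimpleScalar's implementation; the program counter value is arbitrary and the always-taken loop branch is an invented training example.

```python
class Gshare:
    # Gshare: the global branch-history register XORed with the branch PC
    # indexes a Pattern History Table (PHT) of 2-bit saturating counters.
    def __init__(self, hist_bits):
        self.mask = (1 << hist_bits) - 1
        self.history = 0
        self.pht = [1] * (1 << hist_bits)  # initialized weakly not-taken

    def predict(self, pc):
        idx = (pc ^ self.history) & self.mask
        return self.pht[idx] >= 2          # True means "predict taken"

    def update(self, pc, taken):
        idx = (pc ^ self.history) & self.mask
        if taken:
            self.pht[idx] = min(3, self.pht[idx] + 1)
        else:
            self.pht[idx] = max(0, self.pht[idx] - 1)
        # Shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & self.mask

# 13 history bits -> 8192-entry PHT, as in the 1:8192:13:1 configuration
bp = Gshare(hist_bits=13)
hits = 0
for _ in range(100):                  # an always-taken loop branch
    hits += bp.predict(0x400123)
    bp.update(0x400123, True)
print(hits)  # 86: mispredicts only while the history register warms up
```

The XOR of history into the index is what distinguishes gshare from GAg: it spreads different (PC, history) pairs across the shared PHT, so the constructive interference noted above can occur without wholesale aliasing.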
Looking at the performance of the private and global configurations of
the Branch History Shift Register (BHSR), it is evident that each of them performs
well on specific benchmarks. Fig. 11 shows the misprediction rates for the
different dynamic predictors. The improvement in CPI and misprediction rate
from using a dynamic rather than a static predictor is drastic for
471.omnetpp and 416.gamess. Both of these benchmarks are fairly
small workloads, so their branch behavior is easily captured by these history-based
branch predictors. 462.libquantum and 450.soplex also show a significant
improvement in CPI compared to their static counterparts, which can be
attributed to the fact that the dynamic predictors are able to efficiently capture the
branch behavior of these benchmarks.
Fig. 8. Static branch predictor CPI


Fig. 9. Dynamic branch predictor CPI


Fig. 10. Static branch predictor misprediction rate

To analyze the effect of the PHT size on program behavior, we chose one of
the best performing predictors from the previous analysis, i.e., Gshare, and varied
the size of its PHT. We used PHT indices of 12, 13, and 14 bits and observed the
improvement in both CPI and branch misprediction rate (Figs. 12 and 13).
Different benchmarks responded differently to the increase in the PHT size.
It can be observed that the integer benchmarks respond more to the increase
in the PHT size than the floating point benchmarks. The floating point benchmarks
show the least CPI change for an increase in the PHT size. This is because the floating

Fig. 11. Dynamic branch predictor misprediction rate


Fig. 12. Misprediction rate for Gshare configurations given as L1size:L2size:hist size & xor


Fig. 13. CPI for Gshare configurations given as L1size:L2size:hist size & xor

point benchmarks have fewer branches, and thus their behavior can be captured
with a smaller PHT.
For instance, considering 435.gromacs, although there is a significant reduction
in the misprediction rate with an increase in the PHT size, there is not
much improvement in the CPI. After analyzing this benchmark, we
found that 435.gromacs has only 2 percent of its instructions as branches, so
improving the accuracy of the branch predictor does not have much effect on the CPI
of the FP benchmarks. On the other hand, for 445.gobmk, which is an
integer benchmark, the improvement in misprediction rate shows a proportional
change in CPI. This is expected, since 445.gobmk has a higher percentage of
branches (15 percent of the total instructions).
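This difference in sensitivity follows from a back-of-the-envelope model: the CPI added by mispredictions is roughly the branch fraction times the misprediction rate times the flush penalty. The 10-cycle penalty and 5% misprediction rate below are illustrative assumptions; only the branch fractions come from the text above.

```python
def mispred_cpi(branch_frac, mispred_rate, penalty_cycles):
    # CPI contribution of branch mispredictions:
    # mispredicted branches per instruction x pipeline-flush penalty
    return branch_frac * mispred_rate * penalty_cycles

PENALTY = 10    # assumed flush penalty, cycles (illustrative)
MISPRED = 0.05  # assumed misprediction rate (illustrative)

gromacs = mispred_cpi(0.02, MISPRED, PENALTY)  # ~2% branches (435.gromacs)
gobmk   = mispred_cpi(0.15, MISPRED, PENALTY)  # ~15% branches (445.gobmk)
print(gromacs, gobmk)  # ~0.01 vs ~0.075 CPI lost to mispredictions
```

With these numbers, halving the misprediction rate buys gobmk over seven times as much CPI as it buys gromacs, matching the observed sensitivity.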

Fig. 14. CPI for DL1 configurations in format name:no.sets:blk size:associativity & repl. policy


Fig. 15. Miss rate for DL1 configs in format name:no.sets:blk size:associativity & repl. policy


Fig. 16. CPI for IL1 configs in format name:no.sets:blk size:associativity&repl. policy

5.2 Memory Characteristics

The memory hierarchy design is of paramount importance in modern superscalar
processors because of the performance loss due to the von Neumann bottleneck.
This necessitates finding the optimal cache design parameters,
so that the hierarchy is capable of hiding memory latencies efficiently. In this paper, we
analyzed both the instruction and data level-1 caches and tried to come up with
the optimal design parameters.

Fig. 17. Miss rate for IL1 configs in format name:no.sets:blk size:associativity & repl. policy

Fig. 18. CPI for varying associativity with 16KB page sizes

For the purpose of analyzing the L1 caches, we varied both the cache size and
the associativity and compared the resulting CPI values and miss ratios. We used
the LRU replacement policy for all our experiments, which appears as the final '1' in
the cache configurations given in the figures. From the graphs in Figs. 14
& 15, it is evident that increasing the associativity has a more prominent
effect on performance than merely increasing the size of the data cache. For some
benchmarks like 445.gobmk, increasing the associativity to 2 results in a dramatic
reduction in the miss ratio, which can be attributed to the smaller footprints of
these benchmarks. Other benchmarks where associativity provided significant
benefit are 456.hmmer, 458.sjeng, and 482.sphinx3, where increasing the
associativity to 2 resulted in a more than 50 percent reduction in the miss ratio.
However, some benchmarks like 473.astar and 450.soplex responded more to size
than to associativity. It can be concluded that 473.astar and 450.soplex have a lot of
sequential data and hence do not benefit much from increased associativity.
The CPIs of the benchmarks 462.libquantum and 433.milc respond neither
to an increase in cache size nor to an increase in associativity. This may be
due to the small memory footprints of these benchmarks, which can be captured
completely by just a small direct-mapped cache.
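The geometry notation used in the figures (number of sets, block size, associativity, replacement policy) can be made concrete with a minimal cache model. This is a generic sketch rather than sim-outorder's actual cache code; the parameters below match our DL1 baseline (256 sets, 64B blocks, 2-way, LRU).

```python
def decompose(addr, nsets, block_size):
    # Split an address into (tag, set index, block offset)
    offset = addr % block_size
    index = (addr // block_size) % nsets
    tag = addr // (block_size * nsets)
    return tag, index, offset

class SetAssocCache:
    # Minimal set-associative cache with LRU replacement
    def __init__(self, nsets, block_size, assoc):
        self.nsets, self.block_size, self.assoc = nsets, block_size, assoc
        self.sets = [[] for _ in range(nsets)]  # each set: tags, LRU first

    def access(self, addr):
        tag, index, _ = decompose(addr, self.nsets, self.block_size)
        ways = self.sets[index]
        hit = tag in ways
        if hit:
            ways.remove(tag)       # will re-append as most recently used
        elif len(ways) == self.assoc:
            ways.pop(0)            # evict the least recently used tag
        ways.append(tag)
        return hit

# DL1 baseline from Section 4: 256 sets, 64B blocks, 2-way, LRU
dl1 = SetAssocCache(256, 64, 2)
print(dl1.access(0x1000))  # False: cold miss
print(dl1.access(0x1000))  # True: hit
print(dl1.access(0x5000))  # False: same set (stride 256*64B), second way
print(dl1.access(0x9000))  # False: same set, evicts the LRU tag
print(dl1.access(0x1000))  # False: conflict miss after eviction
```

The last three accesses illustrate why associativity, not size, cures the conflict-heavy benchmarks: three blocks whose addresses differ by sets x block size collide in one set regardless of total capacity.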

Fig. 19. TLB miss ratios for varying associativity with 16KB page sizes


Fig. 20. CPI for varying page sizes with 2-way associative TLB

The CPI and the miss ratios for the different level-1 instruction cache configurations
are shown in Figs. 16 and 17. As expected, the miss ratios of the instruction
cache are much lower than those of the data cache because of the uniformity of
the access pattern to the instruction cache. For some of the benchmarks,
like 473.astar, 456.hmmer, and 435.gromacs, the miss ratio is almost negligible, and
hence a further increase in cache size or associativity does not have any effect
on performance. The performance benefit of increasing associativity
relative to cache size is not as large in the instruction cache as in the data cache.
This is because the instruction cache responds more to an
increase in cache size than to an increase in associativity, owing to the high spatial
locality of its references. Considering the tradeoff between performance and
complexity, an associativity of two at the instruction cache level seems to be
optimal.

5.3 TLB Characteristics


Although designing the data cache is an important step in processor design, it
has to be coupled with efficient TLB usage to achieve good performance.
Choosing the TLB page size is becoming critical for modern memory-intensive
workloads with large footprints. This can be attributed to the recent addition
of features like multiple page sizes to modern operating systems.
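The effect of page size can be quantified with two simple relations: the virtual page number is addr // page_size, and the TLB "reach" (the span of address space mapped without a miss) is entries x page_size. The sketch below uses our DTLB configuration (64 sets x 4-way = 256 entries) and an invented access pattern.

```python
def vpn(addr, page_size):
    # Virtual page number: the address bits above the page offset
    return addr // page_size

def tlb_reach(entries, page_size):
    # Bytes of address space the TLB can map without missing
    return entries * page_size

entries = 64 * 4  # DTLB from Section 4: 64 sets, 4-way
print(tlb_reach(entries, 4 * 1024))          # 1048576:   1 MiB with 4 KB pages
print(tlb_reach(entries, 16 * 1024 * 1024))  # 4294967296: 4 GiB with 16 MB pages

# Distinct pages touched by a walk over a 2 MiB region, one access per 4 KB
walk = range(0, 2 << 20, 4096)
print(len({vpn(a, 4 * 1024) for a in walk}))          # 512 pages
print(len({vpn(a, 16 * 1024 * 1024) for a in walk}))  # 1 page
```

The 4096x jump in reach from 4 KB to 16 MB pages is what drives the near-zero miss ratios observed below, while the coarser mapping granularity explains the fragmentation downside discussed for 445.gobmk and 450.soplex.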

Fig. 21. TLB miss ratio for varying page sizes with 2-way associative TLB

Using SimpleScalar, we performed simulations on the SPEC CPU2006 suite for
different TLB page sizes and associativities and observed the TLB miss ratio, which
characterizes the part of the CPI due to the time incurred in page translation.
First, we fixed the page size at 16KB and varied the associativity to see the
corresponding impact on the miss ratios and CPI. As expected, the direct-mapped
TLB performed worse than the 2-way and 4-way TLBs, as seen in Figs. 19
& 20. The improvement in performance from 2-way to 4-way is
small and does not seem worth the extra hardware complexity required for it.
Thus, an associativity of two seems to be optimal for the modern workloads.
As we increased the page size from 16KB to 16MB, we found that the change
in associativity did not have any effect on performance, which can be
attributed to the fact that a page size of 16MB is large enough to reduce the
conflict misses to zero.
Second, we performed simulations with various page sizes for a 2-way associative
TLB. Our results, shown in Figs. 20 & 21, closely match the results
reported in [16] for a POWER5 processor. We found that large page
sizes resulted in the fewest translation misses, leading to a better CPI. First, it
can be observed that there is a reduction in the TLB miss ratio of around 30% for
471.omnetpp and 80% for 473.astar when the page size is increased from 4KB to
16KB. There is a consistent improvement in the performance of all the benchmarks
as the page size increases. When a page size of 16MB is used, the
TLB misses reduce to nearly zero for most of the benchmarks except 445.gobmk
and 450.soplex. One possible cause of the increase in CPI for 445.gobmk and
450.soplex at a 16MB page size could be serious wastage of memory
caused by internal fragmentation. Another could be a higher number of conflicts
amongst the cache lines if the virtual address bits used in cache tag matches are
insufficiently distinct from each other under larger-sized TLB mappings.

6 Conclusion
The simulation points have proved to be an effective technique for reducing the
simulation time to a large extent without much loss of accuracy on the SPEC
CPU2006 suite. Using simulation points not only reduces the number of dynamic
instructions to be simulated but also makes the workload parallel, making it
ideal for present-day parallel computers.
Further, simulating the different benchmarks with the different branch predictors
gave insight into the branch behavior of modern workloads,
which helped in arriving at the best performing predictor configurations.
We observed Gshare and the combined (Bimodal & 2-level) predictors to be the ideal
ones, predicting most of the branches to near perfection. Looking at the
effect of the different cache parameters, we observed that the level-1 data
cache parameters prove to be more important in affecting the CPI than
the instruction cache parameters. Instruction accesses, due to their inherent
uniformity, tend to miss less frequently, which makes the task of designing the
instruction cache much easier. The line size of the instruction cache seems to
be the most important parameter, while for the data cache both the line size and
the associativity need to be tailored appropriately to get the best performance. The
simulations for the different TLB configurations revealed that the use of large page
sizes significantly reduces translation misses and aids in improving the overall
CPI of the modern workloads.

Acknowledgement
We would like to thank the Texas Advanced Computing Center (TACC) for the
excellent simulation environment provided for performing all the time-consuming
simulations of SPEC CPU2006 with enough parallelism. Our thanks to Lieven
Eeckhout and Kenneth Hoste of Ghent University, Belgium, for providing
us with the Alpha binaries for the SPEC suite. This work is also supported in part
through NSF award 0702694. Any opinions, findings, and conclusions expressed
in this paper are those of the authors and do not necessarily reflect the
views of the National Science Foundation (NSF).

References
1. Sherwood, T., Calder, B.: Time varying behavior of programs. Technical Report
UCSD-CS99-630, UC San Diego (August 1999)
2. Sherwood, T., Perelman, E., Hamerly, G., Calder, B.: Automatically characterizing
large scale program behavior. In: ASPLOS (October 2002)
3. http://www.freewebs.com/gkofwarf/simpoints.htm
4. SPEC. Standard performance evaluation corporation, http://www.spec.org
5. Henning, J.L.: SPEC CPU 2000: Measuring cpu performance in the new millen-
nium. IEEE Computer 33(7), 28–35 (2000)
6. Charney, M.J., Puzak, T.R.: Prefetching and memory system behavior of the
SPEC95 benchmark suite. IBM Journal of Research and Development 41(3) (May
1997)
7. Haskins, J., Skadron, K.: Minimal subset evaluation: warmup for simulated hard-
ware state. In: Proceedings of the 2001 International Conference on Computer
Design (September 2000)
Generation, Validation and Analysis of SPEC CPU2006 Simulation Points 137

8. Phansalkar, A., Joshi, A., John, L.K.: Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. In: The 34th International Symposium on Computer Architecture (ISCA) (June 2007)
9. Hamerly, G., Perelman, E., Lau, J., Calder, B.: SimPoint 3.0: Faster and more flexible program analysis. In: Workshop on Modeling, Benchmarking and Simulation (June 2005)
10. Hamerly, G., Perelman, E., Calder, B.: How to use SimPoint to pick simulation points. ACM SIGMETRICS Performance Evaluation Review (March 2004)
11. Perelman, E., Hamerly, G., Calder, B.: Picking statistically valid and early simulation points. In: International Conference on Parallel Architectures and Compilation Techniques (September 2003)
12. Yeh, T.-Y., Patt, Y.N.: Alternative implementations of two-level adaptive branch prediction. In: 19th Annual International Symposium on Computer Architecture (May 1992)
13. Lau, J., Sampson, J., Perelman, E., Hamerly, G., Calder, B.: The strong correlation between code signatures and performance. In: IEEE International Symposium on Performance Analysis of Systems and Software (March 2005)
14. Perelman, E., Sherwood, T., Calder, B.: Basic block distribution analysis to find periodic behavior and simulation points in applications. In: International Conference on Parallel Architectures and Compilation Techniques (September 2001)
15. Kongetira, P., Aingaran, K., Olukotun, K.: Niagara: A 32-way multithreaded SPARC processor. IEEE Micro 25(2), 21–29 (2005)
16. Korn, W., Chang, M.S.: SPEC CPU2006 sensitivity to memory page sizes. ACM SIGARCH Computer Architecture News (March 2007)
A Note on the Effects of Service Time Distribution in the M/G/1 Queue

Alexandre Brandwajn1 and Thomas Begin2

1 Baskin School of Engineering, University of California Santa Cruz, USA
2 Université Pierre et Marie Curie, LIP6, France
[email protected], [email protected]

Abstract. The M/G/1 queue is a classical model used to represent a large num-
ber of real-life computer and networking applications. In this note, we show
that, for coefficients of variation of the service time in excess of one, higher-
order properties of the service time distribution may have an important effect on
the steady-state probability distribution for the number of customers in the
M/G/1 queue. As a result, markedly different state probabilities can be observed
even though the mean numbers of customers remain the same. This should be
kept in mind when sizing buffers based on the mean number of customers in the
queue. Influence of higher-order distributional properties can also be important
in the M/G/1/K queue where it extends to the mean number of customers itself.
Our results have potential implications for the design of benchmarks, as well as
the interpretation of their results.

Keywords: performance evaluation, M/G/1 queue, higher-order effects, finite buffers.

1 Introduction

The M/G/1 queue is a classical model used to represent a large number of real-life
computer and networking applications. For example, M/G/1 queues have been applied
to evaluate the performance of devices such as volumes in a storage subsystem [1],
Web servers [13], or nodes in an optical ring network [3]. In many applications re-
lated to networking, the service times may exhibit significant variability, and it may
be important to account for the fact that the buffer space is finite. It is well known
that, in the steady state, the mean number of users in the unrestricted M/G/1 queue
depends only on the first two moments of the service time distribution [11]. It is also
known [4] that the first three (respectively, the first four) moments of the service time
distribution enter into the expression for the second (respectively, the third) moment
of the waiting time. In this note our goal is to illustrate the effect of properties of the
service time distribution beyond its mean and coefficient of variation on the shape of
the stationary distribution of the number of customers in the M/G/1 queue. In particu-
lar, we point out the risk involved in dimensioning buffers based on the mean number
of users in the system.

D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 138–144, 2009.
© Springer-Verlag Berlin Heidelberg 2009

2 M/G/1 Queue
Assuming a Poisson arrival process, a quick approach to assess the required capacity for buffers in a system is to evaluate it as some multiplier (e.g. three or six) times the mean number of customers in an open M/G/1 queue (e.g. [12]). By the Pollaczek-Khintchine formula [11], this amounts to dimensioning the buffers based on only the first two moments of the service time distribution. Unfortunately, the steady-state distribution of the number of customers in the M/G/1 queue can exhibit a strong dependence on higher-order properties of the service time distribution.
This is illustrated in Figure 1, which compares the distribution of the number of
customers for two different Cox-2 service time distributions with the same first two
moments, and thus yielding the same mean number of customers in the system. The
parameters of these distributions are given in Table 1. Note that both distributions I
and II correspond to a coefficient of variation of 3 but have different higher-order
properties such as skewness and kurtosis [14]. Similarly, distributions III and IV both
correspond to a coefficient of variation of 5 but again different higher-order proper-
ties. The stationary distribution of the number of customers in this M/G/1 queue was
computed using a recently published recurrence method [2]. We observe that, per-
haps not surprisingly, the effects of the distribution tend to be more significant as the
server utilization and the coefficient of variation of the service time distribution in-
crease. It is quite instructive to note, for instance, that with a coefficient of variation of 3 and server utilization of 0.5, the probability of exceeding 20 users in the queue (a little over 6 times the mean) is about 0.1% in one case, while it is an order of magnitude larger for another service time distribution with the same first two moments.

Table 1. Parameters and properties of the service time distributions used in Figure 1

Distribution | Mean service time | Coefficient of variation | Skewness | Kurtosis  | Rate of service at stage 1 | Probability to go to stage 2 | Rate of service at stage 2
Dist. I      | 1                 | 3                        | 4.5      | 27.3      | 10000.0                    | 2.00*10^-1                   | 2.00*10^-1
Dist. II     | 1                 | 3                        | 3557.4   | 1.90*10^7 | 1.0                        | 2.50*10^-7                   | 2.50*10^-4
Dist. III    | 1                 | 5                        | 7.5      | 75.1      | 10000.0                    | 7.69*10^-2                   | 7.69*10^-2
Dist. IV     | 1                 | 5                        | 6913.2   | 6.63*10^7 | 1.0                        | 8.33*10^-8                   | 8.33*10^-5
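The Cox-2 parameters in Table 1 determine the distribution's moments in closed form: a Cox-2 service time is an exponential stage with rate mu1, followed with probability p by a second exponential stage with rate mu2. A sketch recovering the mean, coefficient of variation, and skewness from the stage parameters (the moment formulas are standard properties of Coxian distributions, not taken from the paper):

```python
def cox2_moments(mu1, p, mu2):
    """Mean, coefficient of variation, and skewness of a Cox-2 distribution.

    S = X1 + B*X2, with X1 ~ Exp(mu1), X2 ~ Exp(mu2), B ~ Bernoulli(p),
    all independent.
    """
    # Raw moments of X1 and of the second-stage contribution B*X2.
    e1, e1_2, e1_3 = 1 / mu1, 2 / mu1**2, 6 / mu1**3
    e2, e2_2, e2_3 = p / mu2, 2 * p / mu2**2, 6 * p / mu2**3
    # Raw moments of the sum (independence lets cross terms factor).
    m1 = e1 + e2
    m2 = e1_2 + 2 * e1 * e2 + e2_2
    m3 = e1_3 + 3 * e1_2 * e2 + 3 * e1 * e2_2 + e2_3
    var = m2 - m1**2
    skew = (m3 - 3 * m1 * m2 + 2 * m1**3) / var**1.5
    return m1, var**0.5 / m1, skew


# Distribution I from Table 1: mean ~1, CoV ~3, skewness ~4.5.
mean, cov, skew = cox2_moments(10000.0, 0.2, 0.2)
```

Plugging in the Table 1 stage parameters for Distributions I through IV reproduces the tabulated mean, coefficient of variation, and skewness.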

3 M/G/1/K Queue
Clearly, using the M/G/1/K, i.e., the M/G/1 queue with a finite queueing room, would be a more direct way to dimension buffers. There seem to be fewer theoretical results for the M/G/1/K queue than for the unrestricted M/G/1 queue, but it is well known that the steady-state distribution for the M/G/1/K queue can be obtained from that for the unrestricted M/G/1 queue after appropriate transformations [10, 7, 4]. Of course, this approach can only work if the arrival rate does not exceed the service rate, since otherwise the unrestricted M/G/1 queue would not be stable.
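When the transformation route is unavailable, or when finite-buffer metrics such as the blocking probability are wanted directly, a discrete-event simulation of the M/G/1/K queue is straightforward. The sketch below is our own illustration (it is not the recurrence method of [2]); the service-time sampler is a free parameter, so any service distribution can be plugged in:

```python
import random

def simulate_mg1k(lam, service_sampler, K, num_arrivals, seed=1):
    """Event-driven M/G/1/K simulation; returns the blocking probability.

    K counts the total queueing room, including the customer in service.
    By PASTA, the fraction of blocked Poisson arrivals estimates the
    steady-state probability that the buffer is full.
    """
    rng = random.Random(seed)
    t, n, blocked = 0.0, 0, 0
    next_arrival = rng.expovariate(lam)
    next_departure = float("inf")
    for _ in range(num_arrivals):
        # Process all departures that occur before the next arrival.
        while next_departure < next_arrival:
            t = next_departure
            n -= 1
            next_departure = t + service_sampler(rng) if n > 0 else float("inf")
        t = next_arrival
        if n == K:
            blocked += 1          # arrival finds the buffer full and is lost
        else:
            n += 1
            if n == 1:
                next_departure = t + service_sampler(rng)
        next_arrival = t + rng.expovariate(lam)
    return blocked / num_arrivals


# Sanity check against the exact M/M/1/K blocking probability:
rho, K = 0.9, 10
p_block = simulate_mg1k(rho, lambda r: r.expovariate(1.0), K, 200_000)
exact = (1 - rho) * rho**K / (1 - rho**(K + 1))   # about 0.0508
```

With an exponential service sampler the estimate should agree with the closed-form M/M/1/K result; swapping in a Cox-2 sampler with the Table 1 parameters exposes exactly the higher-order sensitivity discussed in this section.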

[Bar charts omitted: steady-state queue-length distributions for Dist. I vs. Dist. II and Dist. III vs. Dist. IV.]

(a) Coefficient of variation: 3, server utilization: 0.5
(b) Coefficient of variation: 3, server utilization: 0.9
(c) Coefficient of variation: 5, server utilization: 0.5

Fig. 1. Effect of service time distributions on the number of customers in the M/G/1 queue
(d) Coefficient of variation: 5, server utilization: 0.9

Fig. 1. (continued)

While the steady-state distribution for the M/G/1/K queue can be derived from the one for the unrestricted M/G/1 queue, and the mean number of users in the latter depends only on the first two moments of the service time distribution, this is not the case for the M/G/1/K queue. Table 2 shows that even the first three moments of the service time distribution do not generally suffice to determine the mean number of customers in the M/G/1/K queue. Here we illustrate the results obtained for two Cox-3 distributions sharing the first three moments but with different higher-order properties.
Since the mean number of customers in the unrestricted M/G/1 queue depends only on the first two moments of the service time distribution, and in the M/G/1/K queue for K=1 there is no distributional dependence at all (since there is no queueing), it is interesting to see how the dependence on higher-order properties varies with K, the size of the queueing room. This is the objective of Figure 2, where we have represented the relative difference in the probabilities of having exactly one customer in the system, as well as in the probabilities of the buffer being full, for distributions I and II of Table 1. We observe that, although the first two moments of the service time distribution are the same for both distributions, higher-order properties lead to drastically different values for the selected probabilities. Interestingly, for the probability of the buffer being full, although the relative difference between the distributions considered decreases as the size of the queueing room, K, increases, it remains significant even for large values of the latter.
To further illustrate the dependence on higher-order properties of the service time
distribution, we consider read performance for two simple cached storage devices.
When the information requested is found in the cache, a hit occurs and the service
time is viewed as a constant (assuming a fixed record length). When the information
is not in the cache, it must be fetched from the underlying physical storage device. In
Table 3 we show simulation results [8] obtained for two different storage systems
with the same first two moments of the service time (resulting from the combination

Table 2. Effect of properties beyond the third moment on the mean number in the M/G/1/K queue

                           | First Cox-3 | Second Cox-3 | Relative difference
Rate of arrivals           | 1           | 1            |
Size of queueing room      | 30          | 30           |
Mean service time          | 1           | 1            |
Coefficient of variation   | 6.40        | 6.40         |
Skewness                   | 2331.54     | 2331.54      |
Kurtosis                   | 7.43*10^6   | 1.44*10^7    |
Mean number in the M/G/1/K | 3.98        | 5.07         | 27.4 %

Fig. 2. Relative difference in selected probabilities for distributions I and II as a function of the
queueing room in the M/G/1/K queue

of hit and miss service times), and queueing room limited to 10. In one case the ser-
vice time of the underlying physical device (i.e. miss service time) is represented by a
uniform distribution, and in the other by a truncated exponential [9]. We are inter-
ested in the achievable I/O rate such that the mean response time does not exceed 5
ms. We observe that the corresponding I/O rates differ by over 20% in this example
(the coefficient of variation of the service time being a little over 1.6).
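The service time in this storage example is a hit/miss mixture: a constant hit time with probability h, and a miss time drawn from some distribution with probability 1-h. Its first two moments follow by conditioning on hit vs. miss; the sketch below checks the uniform-miss case against the values in Table 3 (the moment bookkeeping is ours, not from the paper):

```python
def mixture_moments(h, hit_time, miss_mean, miss_var):
    """Mean and coefficient of variation of a hit/miss service-time mixture.

    With probability h the service time is the constant hit_time;
    otherwise it follows a miss distribution with the given mean/variance.
    """
    m1 = h * hit_time + (1 - h) * miss_mean
    # Second raw moment by conditioning; the hit time is a constant.
    m2 = h * hit_time**2 + (1 - h) * (miss_var + miss_mean**2)
    var = m2 - m1**2
    return m1, var**0.5 / m1


# First system of Table 3: 90% hits of time 1, misses Uniform[2, 18].
miss_mean, miss_var = 10.0, (18 - 2) ** 2 / 12   # uniform distribution moments
mean, cov = mixture_moments(0.9, 1.0, miss_mean, miss_var)
# mean ~1.9 and CoV ~1.62, matching the Table 3 values
```

The second system would need the mean and variance of the truncated exponential miss time (see [9] for truncated-distribution moments) fed into the same function.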
It has been our experience that the influence of higher-order properties tends to in-
crease as the coefficient of variation and the skewness of the service time increase. It
is interesting to note that this is precisely the case when one considers instruction
execution times in programs running on modern processors where most frequent

Table 3. I/O rate for same mean I/O time in two storage subsystems

                         | Uniform miss service time | Truncated exponential miss service time   | Relative difference
Mean service time        | 1.9                       | 1.9                                       |
Coefficient of variation | 1.62                      | 1.62                                      |
Hit probability          | 0.9                       | 0.985                                     |
Hit service time         | 1                         | 1.64                                      |
Miss service time        | Uniform [2, 18]           | Truncated exponential, mean: 20, max: 100 |
Attainable I/O rate for mean I/O time of 5 ms | 0.257 | 0.312                                     | 21.4 %

instructions are highly optimized, less frequent instructions can be significantly slower, and certain even less frequent instructions may be implemented as subroutine calls with order-of-magnitude longer execution times.
As another example of the effects of higher-order properties of the service time in an M/G/1 queue, consider the probability that a small buffer of 10 messages at an optical network node is full. Incoming packets can be of three different lengths. In the first case, abstracted from reported IP traffic, the packet lengths are 40, 300 and 1500 bytes with probabilities 0.5, 0.3 and 0.2, respectively. In the second case, longer packets are used: 150, 500 and 5000 bytes, with respective probabilities 0.426, 0.561 and 0.013. Both packet length distributions have the same mean of 410 bytes with a coefficient of variation of 1.36, but different higher-order properties. With the average packet arrival rate at 1 per mean packet service time, simulation results indicate that the probability of the buffer being full differs by some 20% depending on the packet mix (12.5% in the first case vs. 10.5% in the second), even though both packet mixes have the same first two moments.
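The matching first two moments (and the diverging higher-order ones) of the two packet mixes can be verified directly from the discrete length distributions. A quick check (our own arithmetic, not from the paper):

```python
def discrete_moments(lengths, probs):
    """Mean, coefficient of variation, and skewness of a discrete distribution."""
    m1 = sum(x * p for x, p in zip(lengths, probs))
    m2 = sum(x**2 * p for x, p in zip(lengths, probs))
    m3 = sum(x**3 * p for x, p in zip(lengths, probs))
    var = m2 - m1**2
    skew = (m3 - 3 * m1 * m2 + 2 * m1**3) / var**1.5
    return m1, var**0.5 / m1, skew


mix1 = discrete_moments([40, 300, 1500], [0.5, 0.3, 0.2])
mix2 = discrete_moments([150, 500, 5000], [0.426, 0.561, 0.013])
# Both mixes: mean ~410 bytes, CoV ~1.36; their skewness differs markedly
```

Running this confirms that the two mixes are indistinguishable through the first two moments while the second mix is far more skewed, which is exactly the distinction driving the difference in buffer-full probabilities.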

4 Conclusion
In conclusion, we have shown that, for coefficients of variation of the service time in
excess of one, higher-order properties of the service time distribution may have an
important effect on the steady-state probability distribution for the number of custom-
ers in the M/G/1 queue. As a result, markedly different state probabilities can be ob-
served even though the mean numbers of customers remain the same. This should be
kept in mind when sizing buffers based on the mean number of customers in the
queue. Influence of higher-order distributional properties can also be important in the
M/G/1/K queue where it extends to the mean number of customers itself. The poten-
tially significant impact of higher-order distributional properties of the service times
should be kept in mind also when interpreting benchmark results for systems that may

be viewed as instances of the M/G/1 or M/G/1/K queue, in particular, transaction-oriented systems. Our results imply that it may not be sufficient to look just at the mean, or even the mean and the variance, of the system execution times to correctly assess the overall system performance. Another implication relates to benchmark design since, unless one is dealing with a system that satisfies the assumptions of a product-form queueing network, it may not be sufficient to simply preserve the mean of the system load [6].

Acknowledgments. The authors wish to thank colleagues for their constructive re-
marks on an earlier version of this note.

References
1. Brandwajn, A.: Models of DASD Subsystems with Multiple Access Paths: A Throughput-
Driven Approach. IEEE Transactions on Computers C-32(5), 451–463 (1983)
2. Brandwajn, A., Wang, H.: Conditional Probability Approach to M/G/1-like Queues. Per-
formance Evaluation 65(5), 366–381 (2008)
3. Bouabdallah, N., Beylot, A.-L., Dotaro, E., Pujolle, G.: Resolving the Fairness Issues in
Bus-Based Optical Access Networks. IEEE Journal on Selected Areas in Communica-
tions 23(8), 1444–1457 (2005)
4. Cohen, J.W.: On Regenerative Processes in Queueing Theory. Lecture Notes in Economics
and Mathematical Systems. Springer, Berlin (1976)
5. Cohen, J.W.: The Single Server Queue, 2nd edn. North-Holland, Amsterdam (1982)
6. Ferrari, D.: On the foundations of artificial workload design. SIGMETRICS Perform. Eval.
Rev. 12(3), 8–14 (1984)
7. Glasserman, P., Gong, W.: Time-changing and truncating K-capacity queues from one K
to another. Journal of Applied Probability 28(3), 647–655 (1991)
8. Gross, D., Juttijudata, M.: Sensitivity of Output Performance Measures to Input Distributions in Queueing Simulation Modeling. In: Proceedings of the 1997 Winter Simulation Conference, pp. 296–302 (1997)
9. Jawitz, J.W.: Moments of truncated continuous univariate distributions. Advances in Water
Resources 27(3), 269–281 (2004)
10. Keilson, J.: The Ergodic Queue Length Distribution for Queueing Systems with Finite Ca-
pacity. Journal of the Royal Statistical Society 28(1), 190–201 (1966)
11. Kleinrock, L.: Queueing Systems, Volume I: Theory. Wiley, New York (1975)
12. Mitrou, N.M., Kavidopoulos, K.: Traffic engineering using a class of M/G/1 models. Jour-
nal of Network and Computer Applications 21, 239–271 (1998)
13. Molina, M., Castelli, P., Foddis, G.: Web traffic modeling exploiting TCP connections' temporal clustering through HTML-REDUCE. IEEE Network 14(3), 46–55 (2000)
14. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall /
CRC, Boca Raton (1986)
Author Index

Babka, Vlastimil 77
Begin, Thomas 138
Brandwajn, Alexandre 138
Cho, Chang-Burm 102
Chow, Kingsum 17
Desai, Darshan 36
Ganesan, Karthik 121
Henning, John L. 1
Hoflehner, Gerolf F. 36
Isen, Ciji 57
John, Eugene 57
John, Lizy K. 57, 121
Kejariwal, Arun 36
Lange, Klaus-Dieter 97
Lavery, Daniel M. 36
Li, Tao 102
McNairy, Cameron 36
Nicolau, Alexandru 36
Panwar, Deepak 121
Petrochenko, Dmitry 17
Shiv, Kumar 17
Tůma, Petr 77
Veidenbaum, Alexander V. 36
Wang, Yanping 17
Zhang, Wangyuan 102

You might also like