Dark Silicon and the End of Multicore Scaling

Hadi Esmaeilzadeh, University of Washington
Emily Blem, University of Wisconsin-Madison
Renée St. Amant, University of Texas at Austin
Karthikeyan Sankaralingam, University of Wisconsin-Madison
Doug Burger, Microsoft Research

Whether scaling multicores will provide the performance and value needed to scale down many more technology generations is uncertain. To answer this question, this article presents a comprehensive study that projects the speedup of future multicores and the extent of dark silicon under technology scaling.
Related Work in Modeling Multicore Speedup and Dark Silicon
Hill and Marty extend Amdahl's law to model multicore speedup with symmetric, asymmetric, and dynamic topologies and conclude that dynamic multicores are superior.1 Their model uses area as the primary constraint and models the single-core area/performance tradeoff using Pollack's rule (Performance ∝ √Area) without considering technology trends.2 Azizi et al. derive the single-core energy/performance Pareto frontiers using architecture-level statistical models combined with circuit-level energy/performance tradeoff functions.3 For modeling single-core power/performance and area/performance tradeoffs, our core model derives two separate Pareto frontiers from real measurements. Furthermore, we project these tradeoff functions to future technology nodes using our device model.

Chakraborty considers device scaling and estimates a simultaneous activity factor for technology nodes down to 32 nm.4 Hempstead et al. introduce a variant of Amdahl's law to estimate the amount of specialization required to maintain 1.5× performance growth per year, assuming completely parallelizable code.5 Chung et al. study unconventional cores, including custom logic, field-programmable gate arrays (FPGAs), and GPUs, in heterogeneous single-chip designs.6 They rely on Pollack's rule for the area/performance and power/performance tradeoffs. Using International Technology Roadmap for Semiconductors (ITRS) projections, they report on the potential for unconventional cores considering parallel kernels. Hardavellas et al. forecast the limits of multicore scaling and the emergence of dark silicon in servers with workloads that have an inherent abundance of parallelism.7 Using ITRS projections, Venkatesh et al. estimate technology-imposed utilization limits and motivate energy-efficient and application-specific core designs.8

Previous work largely abstracts away processor organization and application details. Our study provides a comprehensive model that considers the implications of process technology scaling; decouples power/area constraints; uses real measurements to model single-core design tradeoffs; and exhaustively considers multicore organizations, microarchitectural features, and the behavior of real applications.

References
1. M.D. Hill and M.R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
2. F. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Proc. 32nd Ann. ACM/IEEE Int'l Symp. Microarchitecture (Micro 99), IEEE CS, 1999, p. 2.
3. O. Azizi et al., "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM, 2010, pp. 26-36.
4. K. Chakraborty, "Over-Provisioned Multicore Systems," doctoral thesis, Dept. of Computer Sciences, Univ. of Wisconsin-Madison, 2008.
5. M. Hempstead, G.-Y. Wei, and D. Brooks, "Navigo: An Early-Stage Model to Study Power-Constrained Architectures and Specialization," Workshop on Modeling, Benchmarking, and Simulation (MoBS), 2009.
6. E.S. Chung et al., "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs?" Proc. 43rd Ann. IEEE/ACM Int'l Symp. Microarchitecture (Micro 43), IEEE CS, 2010, pp. 225-236.
7. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
8. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 10), ACM, 2010, pp. 205-218.
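As a concrete point of comparison, Hill and Marty's symmetric-topology model is compact enough to evaluate directly. The sketch below (our illustration, not taken from their paper; the 256-BCE chip budget and f = 0.99 are arbitrary example values) combines their Amdahl's law extension with Pollack's rule:

    import math

    def symmetric_speedup(f, n, r):
        """Hill-Marty speedup for a symmetric multicore built from n
        base-core equivalents (BCEs), grouped into cores of r BCEs each.
        Pollack's rule: a core of r BCEs has performance sqrt(r)."""
        perf = math.sqrt(r)
        serial = (1 - f) / perf           # serial fraction runs on one core
        parallel = f * r / (perf * n)     # parallel fraction uses all n/r cores
        return 1 / (serial + parallel)

    # Example: f = 0.99 on a 256-BCE chip; sweep the core size r.
    for r in (1, 4, 16, 64, 256):
        print(r, round(symmetric_speedup(0.99, 256, r), 1))
    # Larger cores speed up the serial fraction, but sqrt(r) performance
    # per r BCEs wastes parallel throughput; of these points, r = 4 wins.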
Figure 1. Overview of the methodology and models. By combining the device scaling model (DevM), core scaling model (CorM), and multicore scaling model (CmpM), we project performance speedup and reveal a gap between the projected speedup and the speedup expected with each technology generation, indicated as the dark silicon gap. The three-tier model also projects the percentage of dark silicon as technology scales.
multicore design compared to a baseline design. The model projects performance for each hybrid configuration based on high-level application properties and microarchitectural features. We modeled the two mainstream classes of multicore organizations, multicore CPUs and many-thread GPUs, which represent two extreme points in the threads-per-core spectrum. The CPU multicore organization represents Intel Nehalem-like, heavyweight multicore designs with fast caches and high single-thread performance. The GPU multicore organization represents Nvidia Tesla-like lightweight cores with heavy multithreading support and poor single-thread performance. For each multicore organization, we considered four topologies: symmetric, asymmetric, dynamic, and composed (fused).

Table 1 outlines the four topologies in the design space and the cores' roles during serial and parallel portions of applications. Single-thread (ST) cores are uniprocessor-style cores with large caches, and many-thread (MT) cores are GPU-style cores with smaller caches.

Table 1. The four multicore topologies for CPU-like and GPU-like organizations. (ST core: single-thread core; MT core: many-thread core.)

Combining the device model with the core model provided power/performance and area/performance Pareto frontiers at future technology nodes. Any performance improvements for future cores will come only at the cost of area or power, as defined by these curves. Finally, combining all three models and performing an exhaustive design-space search produced the optimal multicore configuration and the maximum multicore speedups for each benchmark at future technology nodes while enforcing area, power, and benchmark constraints.
Future directions
As the rest of the article will elaborate, we model an upper bound on the parallel application performance available from multicore and CMOS scaling, assuming no major disruptions in process scaling or core efficiency. Using a constant area and power budget, this study shows that the space of known multicore designs (CPUs, GPUs, and their hybrids) or novel heterogeneous topologies (for example, dynamic or composable) falls far short of the historical performance gains our industry is accustomed to. Even with aggressive ITRS scaling projections, multicore scaling achieves only a 7.9× geometric mean speedup through 2024 at 8 nm. With conservative scaling, only a 3.7× geometric mean speedup is achievable at 8 nm. Furthermore, with ITRS projections, at 22 nm, 21 percent of the chip will be dark, and at 8 nm, more than 50 percent of the chip cannot be utilized.

The article's findings and methodology are both significant and indicate that, without process breakthroughs, directions beyond multicore are needed to provide performance scaling. For decades, Dennard scaling permitted more transistors, faster transistors, and more energy-efficient transistors with each new process node, which justified the enormous costs required to develop each node. Dennard scaling's failure led industry to race down the multicore path, which for some time permitted performance scaling for parallel and multitasked workloads and allowed the economics of process scaling to hold. A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. Are we in a long-term multicore era, or will industry need to move in different, perhaps radical, directions to justify the cost of scaling?

The glass is half-empty
A pessimistic interpretation of this study is that the performance improvements to which we have grown accustomed over the past 30 years are unlikely to continue with multicore scaling as the primary driver. The transition from multicore to a new approach is likely to be more disruptive than the transition to multicore and, to sustain the current cadence of Moore's law, must occur in only a few years. This period is much shorter than the traditional academic time frame required for research and technology transfer. Major architecture breakthroughs in alternative directions such as neuromorphic computing, quantum computing, or biointegration will require even more time to enter the industry product cycle. Furthermore, while a slowing of Moore's law will obviously not be fatal, it has significant economic implications for the semiconductor industry.

The glass is half-full
If energy-efficiency breakthroughs are made in supply voltage and process scaling, the performance improvement potential is high for applications with very high degrees of parallelism.

Rethinking multicore's long-term potential
We hope that our quantitative findings trigger analyses in both academia and industry on the long-term potential of the multicore strategy. Academia is now
making a major investment in research focusing on multicore and its related problems of expressing and managing parallelism. Research projects assuming hundreds or thousands of capable cores should consider this model and the power requirements under various scaling projections before assuming that the cores will inevitably arrive. The paradigm shift toward multicores that started in the high-performance, general-purpose market has already percolated to mobile and embedded markets. The qualitative trends we predict and our modeling methodology hold true for all markets, even though our study considers the high-end desktop market. This study's results could help break industry's current widespread consensus that multicore scaling is the viable forward path.

Model points to opportunities
Our study is based on a model that takes into account properties of devices, the processor core, multicore organization, and topology. Thus, the model inherently identifies the places to focus on for innovation. To surpass the dark silicon performance barrier highlighted by our work, designers must develop systems that use significantly more energy-efficient techniques. Some examples include device abstractions beyond digital logic (error-prone devices); processing paradigms beyond superscalar, single instruction, multiple data (SIMD), and single instruction, multiple threads (SIMT); and program semantic abstractions allowing probabilistic and approximate computation. The results show that radical departures are needed, and the model shows quantitative ways to measure the impact of such techniques.

A case for microarchitecture innovation
Our study also shows that fundamental processing limitations emanate from the processor core. Clearly, architectures that move well past the power/performance Pareto-optimal frontier of today's designs are necessary to bridge the dark silicon gap and use transistor integration capacity. Thus, improvements to the core's efficiency will impact performance improvement and will enable technology scaling, even though the core consumes only 20 percent of the power budget for an entire laptop, smartphone, or tablet. We believe this study will revitalize and trigger microarchitecture innovations, making the case for their urgency and potential impact.

A case for specialization
There is emerging consensus that specialization is a promising alternative to efficiently use transistors to improve performance. Our study serves as a quantitative motivation for such work's urgency and potential impact. Furthermore, our study shows quantitatively the levels of energy improvement that specialization techniques must deliver.

A case for complementing the core
Our study also shows that when performance becomes limited, techniques that occasionally use parts of the chip to deliver outcomes orthogonal to performance are ways to sustain the industry's economics. However, techniques that focus on using the device integration capacity for improving security, programmer productivity, software maintainability, and so forth must consider energy efficiency as a primary factor.

Device scaling model (DevM)
The device model (DevM) provides transistor area, power, and frequency scaling factors from a base technology node (for example, 45 nm) to future technologies. The area scaling factor corresponds to the shrinkage in transistor dimensions. The DevM model calculates the frequency scaling factor based on the fanout-of-four (FO4) delay reduction. The model computes the power scaling factor using the predicted frequency, voltage, and gate capacitance scaling factors in accordance with the equation P = αC·VDD²·f.

We generated two device scaling models: ITRS scaling and conservative scaling. The ITRS model uses projections from the 2010 ITRS. The conservative model is based on predictions presented by Borkar3 and represents a less optimistic view. Table 2 summarizes the parameters used for calculating the power and performance scaling factors. We allocated 20 percent of the chip power budget to leakage power and assumed chip designers can maintain this ratio.
Table 2. Scaling factors with International Technology Roadmap for Semiconductors (ITRS) and conservative projections. ITRS projections show an average 31 percent frequency increase and 35 percent power reduction per node, compared to an average 6 percent frequency increase and 23 percent power reduction per node for conservative projections. All scaling factors are relative to the 45 nm node.

Device scaling model    Year   Node (nm)   Frequency factor   VDD factor   Capacitance factor   Power factor
ITRS scaling            2010   45*         1.00               1.00         1.00                 1.00
                        2012   32*         1.09               0.93         0.70                 0.66
                        2015   22†         2.38               0.84         0.33                 0.54
                        2018   16†         3.21               0.75         0.21                 0.38
                        2021   11†         4.17               0.68         0.13                 0.25
                        2024    8†         3.85               0.62         0.08                 0.12
Conservative scaling    2008   45          1.00               1.00         1.00                 1.00
                        2010   32          1.10               0.93         0.75                 0.71
                        2012   22          1.19               0.88         0.56                 0.52
                        2014   16          1.25               0.86         0.42                 0.39
                        2016   11          1.30               0.84         0.32                 0.29
                        2018    8          1.34               0.84         0.24                 0.22

*Extended planar bulk transistors; †multigate transistors.
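The power column of Table 2 follows directly from the other three columns via P = αC·VDD²·f, since α cancels when taking ratios against the 45 nm baseline. As a check, a small sketch (ours, not from the article; the dictionary below transcribes the ITRS rows of Table 2):

    # ITRS factors relative to 45 nm, transcribed from Table 2:
    # node -> (frequency factor, VDD factor, capacitance factor, power factor)
    ITRS = {
        45: (1.00, 1.00, 1.00, 1.00),
        32: (1.09, 0.93, 0.70, 0.66),
        22: (2.38, 0.84, 0.33, 0.54),
        16: (3.21, 0.75, 0.21, 0.38),
        11: (4.17, 0.68, 0.13, 0.25),
         8: (3.85, 0.62, 0.08, 0.12),
    }

    for node, (f, v, c, p_table) in ITRS.items():
        p = c * v**2 * f    # P = alpha*C*VDD^2*f; alpha cancels in the ratio
        # Each computed factor matches the table's power column to within
        # its two-digit rounding (for example, 0.55 vs. 0.54 at 22 nm).
        print("%2d nm: computed %.2f, Table 2 %.2f" % (node, p, p_table))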
Core scaling model (CorM)
We built the technology-scalable core model (CorM) by populating the area/performance and power/performance design spaces with data collected for a set of processors, all fabricated in the same technology node. The core model is the combination of the area/performance Pareto frontier, A(q), and the power/performance Pareto frontier, P(q), for these two design spaces, where q is a core's single-threaded performance. These frontiers capture the optimal area/performance and power/performance tradeoffs for a core while abstracting away specific details of the core.

As Figure 2 shows, we populated the two design spaces at 45 nm using 20 representative Intel and Advanced Micro Devices (AMD) processors and derived the Pareto frontiers. The curve that bounds all power/performance (area/performance) points in the design space, and thus indicates the minimum amount of power (area) required for a given performance level, constitutes the Pareto frontier. The P(q) and A(q) pair, which are polynomial equations, constitute the core model. The core performance (q) is the processor's SPECmark score, collected from the SPEC website (http://www.spec.org). We estimated the core power budget using the thermal design power (TDP) reported in processor datasheets. The TDP is the chip power budget, or the amount of power the chip can dissipate without exceeding the transistor junction temperature. After excluding the uncore components' share of the power budget, we divided the power budget allocated to the cores by the number of cores to estimate the core power budget. We used die photos of the four microarchitectures (Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem) to estimate the core areas (excluding Level-2 [L2] and Level-3 [L3] caches). Because this work's focus is to study the impact of technology constraints on logic scaling rather than cache scaling, we derived the Pareto frontiers using only the portion of the power budget and area allocated to the core in each processor, excluding the uncore components' share.

As Figure 2 illustrates, we fit a cubic polynomial, P(q), to the points along the edge of the power/performance design space, and a quadratic polynomial (Pollack's rule4), A(q), to the points along the edge of the area/performance design space.
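The frontier fit itself is a pair of least-squares polynomial fits. A minimal sketch (ours; the design points below are invented stand-ins for the 20 measured processors, whose individual values the article does not tabulate) of the fitting and of scaling the points to a future node with DevM factors:

    import numpy as np

    # Invented stand-ins for measured 45 nm frontier points
    # (SPECmark q, core power in W, core area in mm^2).
    q    = np.array([ 5.0,  8.0, 12.0, 18.0, 25.0, 32.0, 40.0])
    powr = np.array([ 1.9,  3.0,  6.0, 11.0, 17.0, 24.0, 31.25])
    area = np.array([ 3.5,  5.0,  8.0, 12.0, 17.0, 22.0, 28.0])

    # Fit the frontiers as described: a cubic for power/performance and
    # a quadratic (Pollack's rule) for area/performance.
    P = np.poly1d(np.polyfit(q, powr, 3))   # P(q): watts at performance q
    A = np.poly1d(np.polyfit(q, area, 2))   # A(q): mm^2 at performance q

    # Device x core step: scale each point with DevM factors (example:
    # ITRS 8 nm from Table 2, 3.85x frequency and 0.12x power, assuming
    # performance scales linearly with frequency), then refit.
    P_8nm = np.poly1d(np.polyfit(q * 3.85, powr * 0.12, 3))

    print(P(20.0), A(20.0), P_8nm(20.0 * 3.85))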
Figure 2. Design space and the derived Pareto frontiers. Power/performance frontier, 45 nm (a); area/performance frontier,
45 nm (b).
The Intel Atom Z520, with an estimated 1.89 W core TDP, represents the lowest-power design (lower-left frontier point), and the Nehalem-based Intel Core i7-965 Extreme Edition, with an estimated 31.25 W core TDP, represents the highest-performing design (upper-right frontier point). We used the points along the scaled Pareto frontier as the search space for determining the best core configuration by the multicore scaling model.

Using this model, we consider single-threaded cores with large caches to cover the CPU multicore design space and massively threaded cores with minimal caches to cover the GPU multicore design space across all four topologies, as described in Table 1. Table 3 lists the input parameters to the model and how the multicore design choices impact them, if at all.
The core utilization (η) is a function of the number of threads per core (T), the average time spent waiting for each memory access (t), the fraction of instructions that access memory (rm), and the CPIexe:

    η = min(1, T / (1 + t · rm/CPIexe))    (2)

The average time spent waiting for memory accesses (t) is a function of the time to access the caches (tL1 and tL2), the time to visit memory (tmem), and the predicted cache miss rates (mL1 and mL2):

    t = (1 − mL1)tL1 + mL1(1 − mL2)tL2 + mL1mL2tmem    (3)

    mL1 = (CL1/(T·βL1))^(1−αL1) and mL2 = (CL2/(T·βL2))^(1−αL2)    (4)

Here, CL1 and CL2 are the cache sizes, and αL1, βL1, αL2, and βL2 are application-dependent cache parameters.

Multicore topologies
The multicore model is an extended Amdahl's law6 equation that incorporates the multicore performance (Perf) calculated from Equations 1 through 4:

    Speedup = 1 / (f/SParallel + (1 − f)/SSerial)    (5)

The CmpM model (Equation 5) measures the multicore speedup with respect to a baseline multicore (PerfB). That is, the parallel portion of code (f) is sped up by SParallel = PerfP/PerfB, and the serial portion of code (1 − f) is sped up by SSerial = PerfS/PerfB.

We calculated the number of cores that can fit on the chip based on the multicore's topology, area budget (AREA), power budget (TDP), and each core's area [A(q)] and power [P(q)]:

    NSymm(q) = min(AREA/A(q), TDP/P(q))

    NAsym(qL, qS) = min((AREA − A(qL))/A(qS), (TDP − P(qL))/P(qS))

    NDynm(qL, qS) = min((AREA − A(qL))/A(qS), TDP/P(qS))

    NComp(qL, qS) = min(AREA/((1 + τ)A(qS)), TDP/P(qS))

For heterogeneous multicores, qS is the single-threaded performance of the small cores and qL is the large core's single-threaded performance. The area overhead of supporting composability is τ, while no power overhead is assumed for composability support.
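A compact sketch of how Equations 2 and 5 and NSymm combine for a symmetric multicore (ours; the frontier polynomials and all parameter values below are illustrative placeholders rather than the article's fitted values):

    # Toy stand-ins for the fitted Pareto frontiers.
    def P(q): return 0.3 + 0.05 * q + 0.001 * q**3   # core power (W) at SPECmark q
    def A(q): return 2.0 + 0.02 * q**2               # core area (mm^2) at SPECmark q

    def utilization(T, t, r_m, cpi_exe):
        # Equation 2: fraction of time a thread keeps its core busy.
        return min(1.0, T / (1 + t * r_m / cpi_exe))

    def speedup_symm(q, f, AREA=111.0, TDP=125.0, T=1, t=5.0, r_m=0.3,
                     cpi_exe=1.0, perf_base=40.0):
        # NSymm(q): cores that fit under both the area and power budgets.
        n = min(AREA / A(q), TDP / P(q))
        eta = utilization(T, t, r_m, cpi_exe)
        s_serial = (q * eta) / perf_base          # serial code: one core active
        s_parallel = (n * q * eta) / perf_base    # parallel code: all n cores
        # Equation 5: extended Amdahl's law.
        return 1.0 / (f / s_parallel + (1 - f) / s_serial)

    # Exhaustive search over frontier points, as in the article's methodology.
    best_q = max(range(5, 41), key=lambda q: speedup_symm(q, f=0.95))
    print(best_q, round(speedup_symm(best_q, f=0.95), 2))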
Model implementation
One of the contributions of this work is the incorporation of Pareto frontiers, physical constraints, real application behavior, and realistic microarchitectural features into the multicore speedup projections.

The input parameters that characterize an application are its cache behavior, fraction of instructions that are loads or stores, and fraction of parallel code. For the PARSEC benchmarks, we obtained this data from two previous studies.7,8 To obtain the fraction of parallel code (f) for each benchmark, we fit an Amdahl's law-based curve to the reported speedups across different numbers of cores from both studies. This fit shows values of f between 0.75 and 0.9999 for individual benchmarks.

To incorporate the Pareto-optimal curves into the CmpM model, we converted the SPECmark scores (q) into an estimated CPIexe and core frequency. We assumed that core frequency scales linearly with performance, from 1.5 GHz for an Atom core to 3.2 GHz for a Nehalem core. Each application's CPIexe depends on its instruction mix and use of hardware optimizations such as functional units and out-of-order processing. Since the measured CPIexe for each benchmark at each technology node is not available, we used the CmpM model to generate per-benchmark CPIexe estimates for each design point along the Pareto frontier. With all other model inputs kept constant, we iteratively searched for the CPIexe at each processor design point. We started by assuming a baseline CPIexe for the Nehalem core. Then, the smallest core, an Atom processor, should have a CPIexe such that the ratio of its CmpM performance to the Nehalem core's CmpM performance is the same as the ratio of their SPECmark scores (q). We assumed that the CPIexe does not change with technology node, while frequency scales.
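The iterative CPIexe search can be realized as a simple bisection. A sketch under stated assumptions (the performance function, frequencies, SPECmark values, and the Nehalem baseline CPIexe below are all placeholders; the article's model also folds in utilization and memory behavior):

    def cmpm_perf(freq_ghz, cpi_exe):
        """Illustrative stand-in for single-core CmpM performance,
        which grows with freq/CPIexe."""
        return freq_ghz / cpi_exe

    def solve_cpi(q_core, q_nehalem, freq_core, freq_nehalem=3.2,
                  cpi_nehalem=1.0):
        """Bisect for the core's CPIexe so that its CmpM performance
        relative to Nehalem matches the ratio of SPECmark scores."""
        target = q_core / q_nehalem
        base = cmpm_perf(freq_nehalem, cpi_nehalem)
        lo, hi = 0.1, 100.0
        for _ in range(60):                      # ratio falls as CPIexe rises
            mid = (lo + hi) / 2
            if cmpm_perf(freq_core, mid) / base > target:
                lo = mid                         # still too fast: raise CPIexe
            else:
                hi = mid
        return (lo + hi) / 2

    # Example: an Atom-class point (q = 5, 1.5 GHz) against Nehalem (q = 40).
    print(round(solve_cpi(5.0, 40.0, 1.5), 2))   # 3.75 with these placeholders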
A key component of the detailed model is the set of input parameters modeling the core's microarchitecture. For single-thread cores, we assumed that each core has a 64-Kbyte L1 cache, and chips with only single-thread cores have an L2 cache that is 30 percent of the chip area. MT cores have small L1 caches (32 Kbytes for every eight cores), support multiple hardware contexts (1,024 threads per eight cores) and a thread register file, and have no L2 cache. From Atom and Tesla die photos, we estimated that eight small many-thread cores, their shared L1 cache, and their thread register file can fit in the same area as one Atom processor.

We assumed that off-chip bandwidth (BWmax) increases linearly as process technology scales down and that the memory access time is constant.

We assumed that τ increases from 10 percent up to 400 percent, depending on the composed core's total area. The composed core's performance cannot exceed the performance of a single Nehalem core at 45 nm.

We derived the area and power budgets from the same quad-core Nehalem multicore at 45 nm, excluding the L2 and L3 caches. They are 111 mm² and 125 W, respectively. The reported dark silicon projections are for the area budget that's solely allocated to the cores, not caches and other uncore components. The CmpM's speedup baseline is a quad-Nehalem multicore.

Combining models
Our three-tier modeling approach allows us to exhaustively explore the design space of future multicores, project their upper-bound performance, and estimate the amount of integration capacity underutilization: dark silicon.

Device × core model
To study core scaling in future technology nodes, we scaled the 45 nm Pareto frontiers down to 8 nm by scaling each processor data point's power and performance using the DevM model and then refitting the Pareto-optimal curves at each technology node. We assumed that performance, which we measured in SPECmark, would scale linearly with frequency. By making this assumption, we ignored the effects of memory latency and bandwidth on core performance; thus, actual performance gains through scaling could be lower. Based on the optimistic ITRS model, scaling a microarchitecture
Figure 3. Speedup across process technology nodes across all organizations and topologies with PARSEC benchmarks.
The exponential performance curve matches transistor count growth. Conservative scaling (a); ITRS scaling (b).
Finding: With ITRS projections, at 22 nm, 21 percent of the chip will be dark, and at 8 nm, more than 50 percent of the chip cannot be utilized.

Finding: The level of parallelism in PARSEC applications is the primary contributor to the dark silicon speedup gap. However, in realistic settings, the dark silicon resulting from power constraints limits the achievable speedup.

Core count projections
Different applications saturate performance improvements at different core counts. We considered the chip configuration that provided the best speedups for all applications to be an ideal configuration. Figure 4 shows the number of cores (solid line) for the ideal CPU-like dynamic multicore configuration across technology generations, because dynamic configurations performed best. The dashed line illustrates the number of cores required to achieve 90 percent of the ideal configuration's geometric mean speedup across PARSEC benchmarks. As depicted, with ITRS scaling, the ideal configuration integrates 442 cores at 8 nm; however, 35 cores reach 90 percent of the speedup achievable by 442 cores. With conservative scaling, the 90 percent speedup core count is 20 at 8 nm.

Finding: Due to limited parallelism in the PARSEC benchmark suite, even with novel heterogeneous topologies and optimistic ITRS scaling, integrating more than 35 cores improves performance only slightly for CPU-like topologies.

Sensitivity studies
We performed sensitivity studies on the impact of various features, including L2 cache sizes, memory bandwidth, simultaneous multithreading (SMT) support, and the percentage of total power allocated to leakage. Quantitatively, these studies show that these features have limited impact on multicore performance.

Limitations
Our device and core models do not explicitly consider dynamic voltage and frequency scaling (DVFS). Instead, we take an optimistic approach to account for its best-case impact. When deriving the Pareto frontiers, we assume that each processor data point operates at its optimal voltage and frequency setting.
Figure 4. Number of cores for the ideal CPU-like dynamic multicore configurations and the number of cores delivering 90
percent of the speedup achievable by the ideal configurations across the PARSEC benchmarks. Conservative scaling (a);
ITRS scaling (b).
References
1. G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, no. 8, 1965, pp. 56-59.
2. R.H. Dennard et al., "Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, 1974, pp. 256-268.
3. S. Borkar, "The Exascale Challenge," Proc. Int'l Symp. on VLSI Design, Automation and Test (VLSI-DAT 10), IEEE CS, 2010, pp. 2-3.
4. F. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Proc. 32nd Ann. ACM/IEEE Int'l Symp. Microarchitecture (Micro 99), IEEE CS, 1999, p. 2.
5. Z. Guz et al., "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE Computer Architecture Letters, vol. 8, no. 1, 2009, pp. 25-28.
6. G.M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Proc. Joint Computer Conf. American Federation of Information Processing Societies (AFIPS 67), ACM, 1967, doi:10.1145/1465482.1465560.
7. M. Bhadauria, V. Weaver, and S. McKee, "Understanding PARSEC Performance on Contemporary CMPs," Proc. IEEE Int'l Symp. Workload Characterization (IISWC 09), IEEE CS, 2009, pp. 98-107.
8. C. Bienia et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 08), ACM, 2008, pp. 72-81.

Hadi Esmaeilzadeh is a PhD student in the Department of Computer Science and Engineering at the University of Washington. His research interests include power-efficient architectures, approximate general-purpose computing, mixed-signal architectures, machine learning, and compilers. Esmaeilzadeh has an MS in computer science from the University of Texas at Austin and an MS in electrical and computer engineering from the University of Tehran.

Emily Blem is a PhD student in the Department of Computer Sciences at the University of Wisconsin-Madison. Her research interests include energy and performance tradeoffs in computer architecture and quantifying them using analytic performance modeling. Blem has an MS in computer science from the University of Wisconsin-Madison.

Renée St. Amant is a PhD student in the Department of Computer Science at the University of Texas at Austin. Her research interests include computer architecture, low-power microarchitectures, mixed-signal approximate computation, new computing technologies, and storage design for approximate computing. St. Amant has an MS in computer science from the University of Texas at Austin.

Karthikeyan Sankaralingam is an assistant professor in the Department of Computer Sciences at the University of Wisconsin-Madison, where he also leads the Vertical Research Group. His research interests include microarchitecture, architecture, and very-large-scale integration (VLSI). Sankaralingam has a PhD in computer science from the University of Texas at Austin.

Doug Burger is the director of client and cloud applications at Microsoft Research, where he manages multiple strategic research projects covering new user interfaces, datacenter specialization, cloud architectures, and platforms that support personalized online services. Burger has a PhD in computer science from the University of Wisconsin. He is a fellow of IEEE and the ACM.

Direct questions and comments about this article to Hadi Esmaeilzadeh, University of Washington, Computer Science & Engineering, Box 352350, AC 101, 185 Stevens Way, Seattle, WA 98195; [email protected].