
Dark Silicon and the End of Multicore Scaling
Hadi Esmaeilzadeh, University of Washington
Emily Blem, University of Wisconsin–Madison
Renée St. Amant, University of Texas at Austin
Karthikeyan Sankaralingam, University of Wisconsin–Madison
Doug Burger, Microsoft Research

A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

Moore's law (the doubling of transistors on chip every 18 months) has been a fundamental driver of computing.1 For the past three decades, through device, circuit, microarchitecture, architecture, and compiler advances, Moore's law, coupled with Dennard scaling, has resulted in commensurate exponential performance increases.2 The recent shift to multicore designs aims to increase the number of cores, using the increasing transistor count to continue the proportional scaling of performance.

With the end of Dennard scaling, future technology generations can sustain the doubling of devices every generation, but with significantly less improvement in energy efficiency at the device level. This device scaling trend presages a divergence between energy-efficiency gains and transistor-density increases. For the architecture community, it is crucial to understand how effectively multicore scaling will use increased device integration capacity to deliver performance speedups in the long term. While everyone understands that power and energy are critical problems, no detailed, quantitative study has addressed how severe (or not) the power problem will be for multicore scaling, especially given the large multicore design space (CPU-like, GPU-like, symmetric, asymmetric, dynamic, composed/fused, and so forth).

To explore the speedup potential of future multicores, we conducted a decade-long performance scaling projection for multicore designs assuming fixed power and area budgets. It considers devices, core microarchitectures, chip organizations, and benchmark characteristics, applying area and power constraints at future technology nodes. Through our models we also estimate the effects of nonideal device scaling on integration capacity utilization and estimate the percentage of dark silicon (transistor integration capacity underutilization) on future multicore chips. For more information on related research, see the "Related Work in Modeling Multicore Speedup and Dark Silicon" sidebar.

Sidebar: Related Work in Modeling Multicore Speedup and Dark Silicon

Hill and Marty extend Amdahl's law to model multicore speedup with symmetric, asymmetric, and dynamic topologies and conclude that dynamic multicores are superior.1 Their model uses area as the primary constraint and models the single-core area/performance tradeoff using Pollack's rule (Performance ∝ √Area) without considering technology trends.2 Azizi et al. derive single-core energy/performance Pareto frontiers using architecture-level statistical models combined with circuit-level energy/performance tradeoff functions.3 For modeling single-core power/performance and area/performance tradeoffs, our core model derives two separate Pareto frontiers from real measurements. Furthermore, we project these tradeoff functions to future technology nodes using our device model.

Chakraborty considers device scaling and estimates a simultaneous activity factor for technology nodes down to 32 nm.4 Hempstead et al. introduce a variant of Amdahl's law to estimate the amount of specialization required to maintain 1.5× performance growth per year, assuming completely parallelizable code.5 Chung et al. study unconventional cores, including custom logic, field-programmable gate arrays (FPGAs), or GPUs, in heterogeneous single-chip design.6 They rely on Pollack's rule for the area/performance and power/performance tradeoffs. Using International Technology Roadmap for Semiconductors (ITRS) projections, they report on the potential for unconventional cores considering parallel kernels. Hardavellas et al. forecast the limits of multicore scaling and the emergence of dark silicon in servers with workloads that have an inherent abundance of parallelism.7 Using ITRS projections, Venkatesh et al. estimate technology-imposed utilization limits and motivate energy-efficient and application-specific core designs.8

Previous work largely abstracts away processor organization and application details. Our study provides a comprehensive model that considers the implications of process technology scaling; decouples power/area constraints; uses real measurements to model single-core design tradeoffs; and exhaustively considers multicore organizations, microarchitectural features, and the behavior of real applications.

Sidebar references

1. M.D. Hill and M.R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
2. F. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Proc. 32nd Ann. ACM/IEEE Int'l Symp. Microarchitecture (Micro 32), IEEE CS, 1999, p. 2.
3. O. Azizi et al., "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM, 2010, pp. 26-36.
4. K. Chakraborty, "Over-Provisioned Multicore Systems," doctoral thesis, Department of Computer Sciences, Univ. of Wisconsin–Madison, 2008.
5. M. Hempstead, G.-Y. Wei, and D. Brooks, "Navigo: An Early-Stage Model to Study Power-Constrained Architectures and Specialization," Workshop on Modeling, Benchmarking, and Simulations (MoBS), 2009.
6. E.S. Chung et al., "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs?" Proc. 43rd Ann. IEEE/ACM Int'l Symp. Microarchitecture (Micro 43), IEEE CS, 2010, pp. 225-236.
7. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
8. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 10), ACM, 2010, pp. 205-218.

Modeling multicore scaling

To project the upper-bound performance achievable through multicore scaling (under current scaling assumptions), we considered technology scaling projections, single-core design scaling, multicore design choices, actual application behavior, and microarchitectural features. We considered fixed-size and fixed-power-budget chips. We built and combined three models to project performance, as Figure 1 shows. The three models are the device scaling model (DevM), the core scaling model (CorM), and the multicore scaling model (CmpM). The models predict performance speedup and show a gap between our projected speedup and the speedup we have come to expect with each technology generation. This gap is referred to as the dark silicon gap. The models also project the percentage of dark silicon as the process technology scales.

We built a device scaling model that provides the area, power, and frequency scaling factors at technology nodes from 45 nm through 8 nm. We consider aggressive International Technology Roadmap for Semiconductors (ITRS; http://www.itrs.net) projections and conservative projections from Borkar's recent study.3

We modeled the power/performance and area/performance of single-core designs using Pareto frontiers derived from real measurements. Through Pareto-optimal curves, the core-level model provides the maximum performance that a single core can sustain for any given area. Further, it provides the minimum power that must be consumed to sustain this level of performance.

[Figure 1. Overview of the methodology and models. By combining the device scaling model (DevM), core scaling model (CorM), and multicore scaling model (CmpM), we project performance speedup and reveal a gap between the projected speedup and the speedup expected with each technology generation, indicated as the dark silicon gap. The three-tier model also projects the percentage of dark silicon as technology scales. In the figure, DevM is built from ITRS and conservative projections; CorM from collected empirical data and derived Pareto frontiers; and CmpM from analytical models of microarchitectural features and application behavior. The pipeline spans 2 projection schemes, data for 152 processors, 2 chip organizations, 4 topologies, and 12 benchmarks, searching 800 configurations per benchmark.]

We developed an analytical model that provides the per-benchmark speedup of a multicore design compared to a baseline design. The model projects performance for each hybrid configuration based on high-level application properties and microarchitectural features. We modeled the two mainstream classes of multicore organizations, multicore CPUs and many-thread GPUs, which represent two extreme points in the threads-per-core spectrum. The CPU multicore organization represents Intel Nehalem-like, heavyweight multicore designs with fast caches and high single-thread performance. The GPU multicore organization represents Nvidia Tesla-like lightweight cores with heavy multithreading support and poor single-thread performance. For each multicore organization, we considered four topologies: symmetric, asymmetric, dynamic, and composed (fused).

Table 1 outlines the four topologies in the design space and the cores' roles during serial and parallel portions of applications. Single-thread (ST) cores are uniprocessor-style cores with large caches, and many-thread (MT) cores are GPU-style cores with smaller caches.

Combining the device model with the core model provided power/performance and area/performance Pareto frontiers at future technology nodes. Any performance improvements for future cores will come only at the cost of area or power, as defined by these curves. Finally, combining all three models and performing an exhaustive design-space search produced the optimal multicore configuration and the maximum multicore speedups for each benchmark at future technology nodes while enforcing area, power, and benchmark constraints.

Table 1. The four multicore topologies for CPU-like and GPU-like organizations. (ST core: single-thread core; MT core: many-thread core.)

Multicore organization | Portion of code | Symmetric topology | Asymmetric topology | Dynamic topology | Composed topology
CPU multicore | Serial | 1 ST core | 1 large ST core | 1 large ST core | 1 large ST core
CPU multicore | Parallel | N ST cores | 1 large ST core + N small ST cores | N small ST cores | N small ST cores
GPU multicore | Serial | 1 MT core (1 thread) | 1 large ST core (1 thread) | 1 large ST core (1 thread) | 1 large ST core (1 thread)
GPU multicore | Parallel | N MT cores (multiple threads) | 1 large ST core (1 thread) + N small MT cores (multiple threads) | N small MT cores (multiple threads) | N small MT cores (multiple threads)

Future directions

As the rest of the article will elaborate, we model an upper bound on parallel application performance available from multicore and CMOS scaling, assuming no major disruptions in process scaling or core efficiency. Using a constant area and power budget, this study shows that the space of known multicore designs (CPUs, GPUs, and their hybrids) or novel heterogeneous topologies (for example, dynamic or composable) falls far short of the historical performance gains to which our industry is accustomed. Even with aggressive ITRS scaling projections, scaling cores achieves a geometric mean 7.9× speedup through 2024 at 8 nm. With conservative scaling, only a 3.7× geometric mean speedup is achievable at 8 nm. Furthermore, with ITRS projections, at 22 nm, 21 percent of the chip will be dark, and at 8 nm, more than 50 percent of the chip cannot be utilized.

The article's findings and methodology are both significant and indicate that without process breakthroughs, directions beyond multicore are needed to provide performance scaling. For decades, Dennard scaling permitted more transistors, faster transistors, and more energy-efficient transistors with each new process node, which justified the enormous costs required to develop each new process node. Dennard scaling's failure led industry to race down the multicore path, which for some time permitted performance scaling for parallel and multitasked workloads, permitting the economics of process scaling to hold. A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. Are we in a long-term multicore era, or will industry need to move in different, perhaps radical, directions to justify the cost of scaling?

The glass is half-empty

A pessimistic interpretation of this study is that the performance improvements to which we have grown accustomed over the past 30 years are unlikely to continue with multicore scaling as the primary driver. The transition from multicore to a new approach is likely to be more disruptive than the transition to multicore and, to sustain the current cadence of Moore's law, must occur in only a few years. This period is much shorter than the traditional academic time frame required for research and technology transfer. Major architecture breakthroughs in alternative directions such as neuromorphic computing, quantum computing, or biointegration will require even more time to enter the industry product cycle. Furthermore, while a slowing of Moore's law will obviously not be fatal, it has significant economic implications for the semiconductor industry.

The glass is half-full

If energy-efficiency breakthroughs are made on supply voltage and process scaling, the performance improvement potential is high for applications with very high degrees of parallelism.

Rethinking multicore's long-term potential

We hope that our quantitative findings trigger some analyses in both academia and industry on the long-term potential of the multicore strategy.

Academia is now making a major investment in research focusing on multicore and its related problems of expressing and managing parallelism. Research projects assuming hundreds or thousands of capable cores should consider this model and the power requirements under various scaling projections before assuming that the cores will inevitably arrive. The paradigm shift toward multicores that started in the high-performance, general-purpose market has already percolated to mobile and embedded markets. The qualitative trends we predict and our modeling methodology hold true for all markets, even though our study considers the high-end desktop market. This study's results could help break industry's current widespread consensus that multicore scaling is the viable forward path.

Model points to opportunities

Our study is based on a model that takes into account properties of devices, the processor core, multicore organization, and topology. Thus the model inherently identifies the places to focus on for innovation. To surpass the dark silicon performance barrier highlighted by our work, designers must develop systems that use significantly more energy-efficient techniques. Some examples include device abstractions beyond digital logic (error-prone devices); processing paradigms beyond superscalar, single instruction, multiple data (SIMD), and single instruction, multiple threads (SIMT); and program semantic abstractions allowing probabilistic and approximate computation. The results show that radical departures are needed, and the model shows quantitative ways to measure the impact of such techniques.

A case for microarchitecture innovation

Our study also shows that fundamental processing limitations emanate from the processor core. Clearly, architectures that move well past the power/performance Pareto-optimal frontier of today's designs are necessary to bridge the dark silicon gap and use transistor integration capacity. Thus, improvements to the core's efficiency will impact performance improvement and will enable technology scaling, even though the core consumes only 20 percent of the power budget for an entire laptop, smartphone, or tablet. We believe this study will revitalize and trigger microarchitecture innovations, making the case for their urgency and potential impact.

A case for specialization

There is emerging consensus that specialization is a promising alternative to efficiently use transistors to improve performance. Our study serves as a quantitative motivation for such work's urgency and potential impact. Furthermore, our study shows quantitatively the levels of energy improvement that specialization techniques must deliver.

A case for complementing the core

Our study also shows that when performance becomes limited, techniques that occasionally use parts of the chip to deliver outcomes orthogonal to performance are ways to sustain the industry's economics. However, techniques that focus on using the device integration capacity for improving security, programmer productivity, software maintainability, and so forth must consider energy efficiency as a primary factor.

Device scaling model (DevM)

The device model (DevM) provides transistor-area, power, and frequency-scaling factors from a base technology node (for example, 45 nm) to future technologies. The area-scaling factor corresponds to the shrinkage in transistor dimensions. The DevM model calculates the frequency-scaling factor based on the fanout-of-four (FO4) delay reduction. The model computes the power-scaling factor using the predicted frequency, voltage, and gate capacitance scaling factors, in accordance with the equation $P = \alpha C V_{DD}^2 f$.

We generated two device scaling models: ITRS scaling and conservative scaling. The ITRS model uses projections from the 2010 ITRS. The conservative model is based on predictions presented by Borkar3 and represents a less optimistic view. Table 2 summarizes the parameters used for calculating the power- and performance-scaling factors. We allocated 20 percent of the chip power budget to leakage power and assumed chip designers can maintain this ratio.

Table 2. Scaling factors with International Technology Roadmap for Semiconductors (ITRS) and conservative projections. ITRS projections show an average 31 percent frequency increase and 35 percent power reduction per node, compared to an average 6 percent frequency increase and 23 percent power reduction per node for conservative projections. All factors are relative to 45 nm.

Device scaling model | Year | Technology node (nm) | Frequency scaling factor | VDD scaling factor | Capacitance scaling factor | Power scaling factor
ITRS scaling | 2010 | 45* | 1.00 | 1.00 | 1.00 | 1.00
ITRS scaling | 2012 | 32* | 1.09 | 0.93 | 0.70 | 0.66
ITRS scaling | 2015 | 22† | 2.38 | 0.84 | 0.33 | 0.54
ITRS scaling | 2018 | 16† | 3.21 | 0.75 | 0.21 | 0.38
ITRS scaling | 2021 | 11† | 4.17 | 0.68 | 0.13 | 0.25
ITRS scaling | 2024 | 8† | 3.85 | 0.62 | 0.08 | 0.12
Conservative scaling | 2008 | 45 | 1.00 | 1.00 | 1.00 | 1.00
Conservative scaling | 2010 | 32 | 1.10 | 0.93 | 0.75 | 0.71
Conservative scaling | 2012 | 22 | 1.19 | 0.88 | 0.56 | 0.52
Conservative scaling | 2014 | 16 | 1.25 | 0.86 | 0.42 | 0.39
Conservative scaling | 2016 | 11 | 1.30 | 0.84 | 0.32 | 0.29
Conservative scaling | 2018 | 8 | 1.34 | 0.84 | 0.24 | 0.22

* Extended planar bulk transistors; † multi-gate transistors.
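To make the DevM arithmetic concrete, the short Python sketch below (our illustration, not the authors' released code) recomputes Table 2's ITRS power-scaling column from the frequency, VDD, and capacitance factors; the activity factor α cancels when taking ratios against the 45-nm baseline.

```python
# A minimal sketch of the DevM power-factor calculation. Dynamic power
# follows P = alpha * C * VDD^2 * f, so the scaling factor relative to
# 45 nm is cap_factor * vdd_factor^2 * freq_factor.

ITRS_FACTORS = {
    # node_nm: (freq_factor, vdd_factor, cap_factor), from Table 2
    45: (1.00, 1.00, 1.00),
    32: (1.09, 0.93, 0.70),
    22: (2.38, 0.84, 0.33),
    16: (3.21, 0.75, 0.21),
    11: (4.17, 0.68, 0.13),
    8:  (3.85, 0.62, 0.08),
}

def power_factor(freq: float, vdd: float, cap: float) -> float:
    """Power-scaling factor relative to the 45-nm baseline."""
    return cap * vdd ** 2 * freq

for node, (f, v, c) in sorted(ITRS_FACTORS.items(), reverse=True):
    print(f"{node:2d} nm: power factor = {power_factor(f, v, c):.2f}")
```

Running this reproduces Table 2's ITRS power column to within rounding; for example, 0.70 × 0.93² × 1.09 ≈ 0.66 at 32 nm.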

Core scaling model (CorM)

We built the technology-scalable core model (CorM) by populating the area/performance and power/performance design spaces with data collected for a set of processors, all fabricated in the same technology node. The core model is the combination of the area/performance Pareto frontier, A(q), and the power/performance Pareto frontier, P(q), for these two design spaces, where q is a core's single-threaded performance. These frontiers capture the optimal area/performance and power/performance tradeoffs for a core while abstracting away specific details of the core.

As Figure 2 shows, we populated the two design spaces at 45 nm using 20 representative Intel and Advanced Micro Devices (AMD) processors and derived the Pareto frontiers. The curve that bounds all power/performance (area/performance) points in the design space, and indicates the minimum amount of power (area) required for a given performance level, constitutes the Pareto frontier. The P(q) and A(q) pair, which are polynomial equations, constitute the core model. The core performance (q) is the processor's SPECmark score, collected from the SPEC website (http://www.spec.org). We estimated the core power budget using the thermal design power (TDP) reported in processor datasheets. The TDP is the chip power budget, or the amount of power the chip can dissipate without exceeding the transistor junction temperature. After excluding the share of uncore components from the power budget, we divided the power budget allocated to the cores by the number of cores to estimate the core power budget. We used die photos of the four microarchitectures (Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem) to estimate the core areas, excluding Level-2 (L2) and Level-3 (L3) caches. Because this work's focus is to study the impact of technology constraints on logic scaling rather than cache scaling, we derive the Pareto frontiers using only the portion of the power budget and area allocated to the core in each processor, excluding the uncore components' share.

As Figure 2 illustrates, we fit a cubic polynomial, P(q), to the points along the edge of the power/performance design space, and a quadratic polynomial (Pollack's rule4), A(q), to the points along the edge of the area/performance design space.

The fitted frontier polynomials at 45 nm are

$P(q) = 0.0002q^3 + 0.0009q^2 + 0.3859q - 0.0301$
$A(q) = 0.0152q^2 + 0.0265q + 7.4393$

with P(q) in watts, A(q) in mm², and q in SPECmark.

[Figure 2. Design space and the derived Pareto frontiers. Power/performance frontier, 45 nm (a); area/performance frontier, 45 nm (b). Each panel plots the 45-nm Intel Nehalem, Intel Core, AMD Shanghai, and Intel Atom design points (performance in SPECmark versus core power in watts, or core area in mm²) along with the derived Pareto frontier.]
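For readers who want to use the frontiers directly, here is a small Python sketch (ours) that evaluates the printed 45-nm polynomials; the query point q = 20 is an arbitrary example, not a processor from the study.

```python
# Evaluate the printed 45-nm Pareto frontiers (a sketch, not the
# authors' released model code).

def pareto_power_w(q: float) -> float:
    """P(q): minimum core power (W) to sustain SPECmark q at 45 nm."""
    return 0.0002 * q**3 + 0.0009 * q**2 + 0.3859 * q - 0.0301

def pareto_area_mm2(q: float) -> float:
    """A(q): minimum core area (mm^2) to sustain SPECmark q at 45 nm."""
    return 0.0152 * q**2 + 0.0265 * q + 7.4393

q = 20.0  # hypothetical query point
print(f"P({q:.0f}) = {pareto_power_w(q):.2f} W")      # ~9.65 W
print(f"A({q:.0f}) = {pareto_area_mm2(q):.2f} mm^2")  # ~14.05 mm^2
```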

The Intel Atom Z520, with an estimated 1.89 W core TDP, represents the lowest-power design (the lower-left frontier point), and the Nehalem-based Intel Core i7-965 Extreme Edition, with an estimated 31.25 W core TDP, represents the highest-performing design (the upper-right frontier point). We used the points along the scaled Pareto frontier as the search space for determining the best core configuration by the multicore scaling model.

Multicore scaling model (CmpM)

We developed a detailed chip-level model (CmpM) that integrates the area and power frontiers, microarchitectural features, and application behavior, while accounting for the chip organization (CPU-like or GPU-like) and its topology (symmetric, asymmetric, dynamic, or composed). Guz et al. proposed a model for studying the first-order impacts of microarchitectural features (cache organization, memory bandwidth, threads per core, and so forth) and workload behavior (memory access patterns).5 Their model considers stalls due to memory dependences and resource constraints (bandwidth or functional units). We extended their approach to build our multicore model. Our extensions incorporate additional application behaviors, microarchitectural features, and physical constraints, and cover both homogeneous and heterogeneous multicore topologies.

Using this model, we consider single-threaded cores with large caches to cover the CPU multicore design space and massively threaded cores with minimal caches to cover the GPU multicore design space across all four topologies, as described in Table 1. Table 3 lists the input parameters to the model and how the multicore design choices impact them, if at all.

Microarchitectural features

Equation 1 calculates the multithreaded performance (Perf) of either a CPU-like or GPU-like multicore organization running a fully parallel (f = 1), multithreaded application in terms of instructions per second. It multiplies the number of cores (N) by the core utilization (η) and scales by the ratio of the processor frequency to CPIexe:

$$\mathit{Perf} = \min\left(N\,\frac{\mathit{freq}}{\mathit{CPI}_{exe}}\,\eta,\ \frac{\mathit{BW}_{max}}{r_m \times m_{L1} \times m_{L2} \times b}\right) \qquad (1)$$

The CPIexe parameter does not include stalls due to cache accesses, which are considered separately in the core utilization (η). The core utilization (η) is the fraction of time that a thread running on the core can keep it busy.
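Read as code, Equation 1 simply takes the minimum of a compute-limited rate and a bandwidth-limited rate. The Python sketch below is our transcription (parameter names follow Table 3, shown next); the miss rates and utilization it consumes are produced by Equations 2 through 4.

```python
# Sketch (ours) of Equation 1: performance in instructions per second.

def perf_eq1(N, freq_hz, cpi_exe, eta, bw_max_bytes, r_m, m_l1, m_l2, b):
    """min(compute-limited rate, memory-bandwidth-limited rate)."""
    core_limited = N * (freq_hz / cpi_exe) * eta            # left term
    bandwidth_limited = bw_max_bytes / (r_m * m_l1 * m_l2 * b)
    return min(core_limited, bandwidth_limited)
```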

Table 3. CmpM parameters with default values from the 45-nm Nehalem.

Parameter | Description | Default | Impacted by
N | Number of cores | 4 | Multicore topology
T | Number of threads per core | 1 | Core style
freq | Core frequency (MHz) | 3,200 | Core performance
CPIexe | Cycles per instruction (zero-latency cache accesses) | 1 | Core performance, application
CL1 | Level-1 (L1) cache size per core (Kbytes) | 64 | Core style
CL2 | Level-2 (L2) cache size per chip (Mbytes) | 2 | Core style, multicore topology
tL1 | L1 access time (cycles) | 3 | N/A
tL2 | L2 access time (cycles) | 20 | N/A
tmem | Memory access time (cycles) | 426 | Core performance
BWmax | Maximum memory bandwidth (Gbytes/s) | 200 | Technology node
b | Bytes per memory access (bytes) | 64 | N/A
f | Fraction of code that can be parallel | Varies | Application
rm | Fraction of instructions that are memory accesses | Varies | Application
αL1, βL1 | L1 cache miss rate function constants | Varies | Application
αL2, βL2 | L2 cache miss rate function constants | Varies | Application

It is modeled as a function of the average time spent waiting for each memory access (t), the fraction of instructions that access memory (rm), and the CPIexe:

$$\eta = \min\left(1,\ \frac{T}{1 + t\,\frac{r_m}{\mathit{CPI}_{exe}}}\right) \qquad (2)$$

The average time spent waiting for memory accesses (t) is a function of the time to access the caches (tL1 and tL2), the time to visit memory (tmem), and the predicted cache miss rates (mL1 and mL2):

$$t = (1 - m_{L1})\,t_{L1} + m_{L1}(1 - m_{L2})\,t_{L2} + m_{L1}m_{L2}\,t_{mem} \qquad (3)$$

$$m_{L1} = \left(\frac{C_{L1}}{T\,\beta_{L1}}\right)^{1-\alpha_{L1}} \quad\text{and}\quad m_{L2} = \left(\frac{C_{L2}}{T\,\beta_{L2}}\right)^{1-\alpha_{L2}} \qquad (4)$$

Multicore topologies

The multicore model is an extended Amdahl's law6 equation that incorporates the multicore performance (Perf) calculated from Equations 1 through 4:

$$\mathit{Speedup} = 1\bigg/\left(\frac{f}{S_{parallel}} + \frac{1-f}{S_{serial}}\right) \qquad (5)$$

The CmpM model (Equation 5) measures the multicore speedup with respect to a baseline multicore (PerfB). That is, the parallel portion of code (f) is sped up by $S_{parallel} = \mathit{Perf}_P / \mathit{Perf}_B$, and the serial portion of code (1 − f) is sped up by $S_{serial} = \mathit{Perf}_S / \mathit{Perf}_B$.

We calculated the number of cores that can fit on the chip based on the multicore's topology, area budget (AREA), power budget (TDP), and each core's area, A(q), and power, P(q):

$$N_{Symm}(q) = \min\left(\frac{\mathit{AREA}}{A(q)},\ \frac{\mathit{TDP}}{P(q)}\right)$$

$$N_{Asym}(q_L, q_S) = \min\left(\frac{\mathit{AREA} - A(q_L)}{A(q_S)},\ \frac{\mathit{TDP} - P(q_L)}{P(q_S)}\right)$$

$$N_{Dynm}(q_L, q_S) = \min\left(\frac{\mathit{AREA} - A(q_L)}{A(q_S)},\ \frac{\mathit{TDP}}{P(q_S)}\right)$$

$$N_{Comp}(q_L, q_S) = \min\left(\frac{\mathit{AREA}}{(1+\tau)\,A(q_S)},\ \frac{\mathit{TDP}}{P(q_S)}\right)$$

For heterogeneous multicores, qS is the small cores' single-threaded performance and qL is the large core's single-threaded performance. The area overhead of supporting composability is τ, while no power overhead is assumed for composability support.
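A compact transcription of Equations 2 through 5 (ours; it plugs into the `perf_eq1` sketch above, and all inputs follow Table 3) might look like this:

```python
# Sketch (ours) of Equations 2-5.

def miss_rate(cache_size, T, alpha, beta):
    """Power-law cache miss rate (Equation 4). With alpha > 1, a larger
    per-thread cache share yields a lower miss rate; clamp at 1.0."""
    return min(1.0, (cache_size / (T * beta)) ** (1.0 - alpha))

def mem_wait_cycles(m_l1, m_l2, t_l1, t_l2, t_mem):
    """Average cycles spent waiting per memory access (Equation 3)."""
    return (1 - m_l1) * t_l1 + m_l1 * (1 - m_l2) * t_l2 + m_l1 * m_l2 * t_mem

def utilization(T, t, r_m, cpi_exe):
    """Fraction of time a thread keeps its core busy (Equation 2)."""
    return min(1.0, T / (1.0 + t * r_m / cpi_exe))

def speedup(f, s_parallel, s_serial):
    """Extended Amdahl's law (Equation 5), relative to the baseline."""
    return 1.0 / (f / s_parallel + (1.0 - f) / s_serial)
```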
Model implementation

One of the contributions of this work is the incorporation of Pareto frontiers, physical constraints, real application behavior, and realistic microarchitectural features into the multicore speedup projections.

The input parameters that characterize an application are its cache behavior, the fraction of instructions that are loads or stores, and the fraction of parallel code. For the PARSEC benchmarks, we obtained this data from two previous studies.7,8 To obtain the fraction of parallel code (f) for each benchmark, we fit an Amdahl's law-based curve to the reported speedups across different numbers of cores from both studies. This fit shows values of f between 0.75 and 0.9999 for individual benchmarks.
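The f extraction is an ordinary one-parameter curve fit. A sketch using SciPy (ours; the speedup numbers below are made up for illustration, not PARSEC measurements):

```python
# Fit the parallel fraction f from reported speedups (a sketch; assumes
# numpy and scipy are available, and uses hypothetical data).
import numpy as np
from scipy.optimize import curve_fit

def amdahl_speedup(n, f):
    """Amdahl's-law speedup on n cores with parallel fraction f."""
    return 1.0 / ((1.0 - f) + f / n)

cores = np.array([1, 2, 4, 8, 16])
reported = np.array([1.0, 1.9, 3.4, 5.8, 8.9])  # hypothetical speedups
(f_hat,), _ = curve_fit(amdahl_speedup, cores, reported,
                        p0=[0.9], bounds=(0.0, 1.0))
print(f"fitted parallel fraction f = {f_hat:.4f}")
```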
To incorporate the Pareto-optimal curves into the CmpM model, we converted the SPECmark scores (q) into an estimated CPIexe and core frequency. We assumed that core frequency scales linearly with performance, from 1.5 GHz for an Atom core to 3.2 GHz for a Nehalem core. Each application's CPIexe depends on its instruction mix and use of hardware optimizations such as functional units and out-of-order processing. Since the measured CPIexe for each benchmark at each technology node is not available, we used the CmpM model to generate per-benchmark CPIexe estimates for each design point along the Pareto frontier. With all other model inputs kept constant, we iteratively searched for the CPIexe at each processor design point. We started by assuming that the Nehalem core has a CPIexe of 1 (the Table 3 default). Then the smallest core, an Atom processor, should have a CPIexe such that the ratio of its CmpM performance to the Nehalem core's CmpM performance is the same as the ratio of their SPECmark scores (q). We assumed that the CPIexe does not change with the technology node, while frequency scales.
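Because CmpM performance decreases monotonically as CPIexe grows, the iterative search can be a simple bisection. The sketch below is our guess at one workable procedure (the article does not specify the search method); `cmpm_perf` stands for Equation 1 evaluated with all other inputs fixed, a hypothetical helper.

```python
# Sketch (ours, hypothetical procedure) of the per-design-point CPIexe
# calibration: find cpi_exe so that cmpm_perf(cpi_exe) / perf_nehalem
# equals the SPECmark ratio q / q_nehalem.

def calibrate_cpi_exe(cmpm_perf, perf_nehalem, spec_ratio,
                      lo=0.05, hi=20.0, iters=60):
    target = spec_ratio * perf_nehalem
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cmpm_perf(mid) > target:
            lo = mid   # design point still too fast: raise CPIexe
        else:
            hi = mid   # too slow: lower CPIexe
    return 0.5 * (lo + hi)
```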
A key component of the detailed model is the set of input parameters modeling the core's microarchitecture. For single-thread cores, we assumed that each core has a 64-Kbyte L1 cache, and chips with only single-thread cores have an L2 cache that is 30 percent of the chip area. MT cores have small L1 caches (32 Kbytes for every eight cores), support multiple hardware contexts (1,024 threads per eight cores), a thread register file, and no L2 cache. From Atom and Tesla die photos, we estimated that eight small many-thread cores, their shared L1 cache, and their thread register file can fit in the same area as one Atom processor.

We assumed that off-chip bandwidth (BWmax) increases linearly as process technology scales down, while the memory access time stays constant.

We assumed that τ increases from 10 percent up to 400 percent, depending on the composed core's total area. The composed core's performance cannot exceed the performance of a single Nehalem core at 45 nm.

We derived the area and power budgets from the same quad-core Nehalem multicore at 45 nm, excluding the L2 and L3 caches. They are 111 mm² and 125 W, respectively. The reported dark silicon projections are for the area budget that is solely allocated to the cores, not caches and other uncore components. The CmpM's speedup baseline is a quad-Nehalem multicore.

Combining models

Our three-tier modeling approach allows us to exhaustively explore the design space of future multicores, project their upper-bound performance, and estimate the amount of integration capacity underutilization, that is, dark silicon.

Device × core model

To study core scaling in future technology nodes, we scaled the 45-nm Pareto frontiers down to 8 nm by scaling each processor data point's power and performance using the DevM model and then refitting the Pareto-optimal curves at each technology node. We assumed that performance, which we measured in SPECmark, would scale linearly with frequency. By making this assumption, we ignored the effects of memory latency and bandwidth on core performance; thus, actual performance gains through scaling could be lower. Based on the optimistic ITRS model, scaling a microarchitecture (core) from 45 nm to 8 nm will result in a 3.9× performance improvement and an 88 percent reduction in power consumption. Conservative scaling, however, suggests that performance will increase only by 34 percent and that power will decrease by 74 percent.
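Scaling the frontier is mechanical: multiply each measured point by the DevM factors and refit. A sketch with NumPy (ours; the five frontier points below are read off the printed 45-nm P(q) curve, and the 8-nm factors come from Table 2):

```python
# Sketch (ours): scale 45-nm Pareto points with DevM factors, then refit
# the cubic power/performance frontier at the new node.
import numpy as np

def scale_and_refit(perf_45, power_45, freq_factor, power_factor, deg=3):
    """Performance scales with frequency (the article's assumption);
    power scales by the DevM power factor. Returns the refit polynomial."""
    perf = np.asarray(perf_45) * freq_factor
    power = np.asarray(power_45) * power_factor
    return np.poly1d(np.polyfit(perf, power, deg))

# ITRS 8-nm factors from Table 2: frequency 3.85x, power 0.12x.
p_8nm = scale_and_refit(perf_45=[5, 10, 20, 30, 40],
                        power_45=[1.9, 4.1, 9.7, 17.8, 29.6],
                        freq_factor=3.85, power_factor=0.12)
print(p_8nm(77.0))  # minimum power (W) at a scaled performance point
```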

Device × core × multicore model

We combined all three models to produce final projections for the optimal multicore speedup, number of cores, and amount of dark silicon. To determine the best multicore configuration at each technology node, we swept the design points along the scaled area/performance and power/performance Pareto frontiers (DevM × CorM), because these points represent the most efficient designs. For each core design, we constructed a multicore consisting of one such core at each technology node. For a symmetric multicore, we iteratively added identical cores one by one until we hit the area or power budget, or until the performance improvement was limited. We swept the frontier and constructed a symmetric multicore for each processor design point. From this set of symmetric multicores, we picked the multicore with the best speedup as the optimal symmetric multicore for that technology node. The procedure is similar for the other topologies, and we performed it separately for CPU-like and GPU-like organizations. The amount of dark silicon is the difference between the area occupied by the optimal multicore's cores and the area budget allocated solely to the cores.
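For the symmetric topology, this sweep reduces to a few lines. The sketch below (ours; `cmpm_speedup` is a hypothetical helper standing for Equations 1 through 5 evaluated against the quad-Nehalem baseline) also reports the dark-silicon fraction for the winning design:

```python
# Sketch (ours) of the symmetric design-space sweep at one technology node.

def best_symmetric(frontier, area_budget, tdp, cmpm_speedup):
    """frontier: iterable of (q, area_mm2, power_w) Pareto design points."""
    best = None
    for q, a, p in frontier:
        n = int(min(area_budget / a, tdp / p))   # N_Symm(q)
        if n < 1:
            continue
        s = cmpm_speedup(n, q)                   # speedup vs. the baseline
        dark_fraction = 1.0 - (n * a) / area_budget
        if best is None or s > best["speedup"]:
            best = {"speedup": s, "cores": n, "q": q, "dark": dark_fraction}
    return best
```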
Scaling and future multicores

We used the combined models to study the future of multicore designs and their performance-limiting factors. The results from this study provide a detailed analysis of multicore behavior for 12 real applications from the PARSEC suite.

Speedup projections

Figure 3 summarizes all of the speedup projections in a single scatter plot. For every benchmark at each technology node, we plot the speedup of eight possible multicore configurations: (CPU-like or GPU-like) × (symmetric, asymmetric, dynamic, or composed). The exponential performance curve matches transistor count growth as process technology scales.

Finding: With optimal multicore configurations for each individual application, at 8 nm, only a 3.7× (conservative scaling) or 7.9× (ITRS scaling) geometric mean speedup is possible, as shown by the dashed line in Figure 3.

Finding: Highly parallel workloads with a degree of parallelism higher than 99 percent will continue to benefit from multicore scaling.

Finding: At 8 nm, the geometric mean speedup for dynamic and composed topologies is only 10 percent higher than the geometric mean speedup for symmetric topologies.

Dark silicon projections

To understand whether parallelism or the power budget is the primary source of the dark silicon speedup gap, we varied each of these factors in two experiments at 8 nm. First, we kept the power budget constant (our default budget is 125 W) and varied the level of parallelism in the PARSEC applications from 0.75 to 0.99, assuming that programmer effort can realize this improvement. Performance improved slowly as the parallelism level increased, with most benchmarks reaching a speedup of only about 15× at 99 percent parallelism. Provided that the power budget is the only limiting factor, typical upper-bound ITRS-scaling speedups will still be limited to 15×. With conservative scaling, this best-case speedup is limited to 6.3×.

For the second experiment, we kept each application's parallelism at its real level and varied the power budget from 50 W to 500 W. Eight of the 12 benchmarks showed no more than 10× speedup even with a practically unlimited power budget. In other words, increasing core counts beyond a certain point did not improve performance, because of the limited parallelism in the applications and Amdahl's law. Only four benchmarks have sufficient parallelism to even hypothetically sustain speedup levels that match the exponential transistor count growth of Moore's law.

[Figure 3. Speedup across process technology nodes across all organizations and topologies with PARSEC benchmarks. The exponential performance curve matches transistor count growth. Conservative scaling (a); ITRS scaling (b). Each panel plots speedup (0 to 32) versus technology node (45, 32, 22, 16, 11, and 8 nm), showing individual design points, the geometric mean, and the exponential performance curve.]

Finding: With ITRS projections, at 22 nm, 21 percent of the chip will be dark, and at 8 nm, more than 50 percent of the chip cannot be utilized.

Finding: The level of parallelism in PARSEC applications is the primary contributor to the dark silicon speedup gap. However, in realistic settings, the dark silicon resulting from power constraints limits the achievable speedup.

Core count projections

Different applications saturate performance improvements at different core counts. We considered the chip configuration that provided the best speedups for all applications to be an ideal configuration. Figure 4 shows the number of cores (solid line) for the ideal CPU-like dynamic multicore configuration across technology generations, because dynamic configurations performed best. The dashed line illustrates the number of cores required to achieve 90 percent of the ideal configuration's geometric mean speedup across the PARSEC benchmarks. As depicted, with ITRS scaling, the ideal configuration integrates 442 cores at 8 nm. However, 35 cores reach 90 percent of the speedup achievable by 442 cores. With conservative scaling, the 90 percent speedup core count is 20 at 8 nm.

Finding: Due to limited parallelism in the PARSEC benchmark suite, even with novel heterogeneous topologies and optimistic ITRS scaling, integrating more than 35 cores improves performance only slightly for CPU-like topologies.

Sensitivity studies

We performed sensitivity studies on the impact of various features, including L2 cache sizes, memory bandwidth, simultaneous multithreading (SMT) support, and the percentage of total power allocated to leakage. Quantitatively, these studies show that these features have limited impact on multicore performance.

Limitations

Our device and core models do not explicitly consider dynamic voltage and frequency scaling (DVFS). Instead, we take an optimistic approach to account for its best-case impact. When deriving the Pareto frontiers, we assume that each processor data point operates at its optimal voltage and frequency setting (VDDmin, Freqmax).

[Figure 4. Number of cores for the ideal CPU-like dynamic multicore configurations and the number of cores delivering 90 percent of the speedup achievable by the ideal configurations across the PARSEC benchmarks. Conservative scaling (a); ITRS scaling (b). Each panel plots the number of cores (0 to 256) versus technology node (45 to 8 nm); the ideal-configuration curve peaks at 406 cores (conservative) and 442 cores (ITRS) at 8 nm.]

At a fixed VDD setting, scaling down the frequency from Freqmax results in a power/performance point inside the optimal Pareto curve, which is a suboptimal design point. However, scaling voltage up and operating at a new (V'DDmin, Freq'max) setting results in a different power/performance point that is still on the optimal frontier. Because we investigate all of the points along the frontier to find the optimal multicore configuration, our study covers multicore designs that introduce heterogeneity to symmetric topologies through DVFS. The multicore model considers the first-order impact of caching, parallelism, and threading under assumptions that result only in optimistic projections. Comparing the CmpM model's output against published empirical results confirms that our model always overpredicts multicore performance. The model optimistically assumes that the workload is homogeneous; that work is infinitely parallel during parallel sections of code; that memory accesses never stall due to a previous access; and that no thread synchronization, operating system serialization, or swapping occurs.

This work makes two key contributions: projecting multicore speedup limits and quantifying the dark silicon effect, and providing a novel and extendible model that integrates device scaling trends, core design tradeoffs, and multicore configurations. While abstracting away many details, the model can find optimal configurations and project performance for CPU- and GPU-style multicores while considering microarchitectural features and high-level application properties. We made our model publicly available at http://research.cs.wisc.edu/vertical/DarkSilicon. We believe this study makes the case for innovation's urgency and its potential for high impact, while providing a model that researchers and engineers can adopt as a tool to study the limits of their solutions.

Acknowledgments

We thank Shekhar Borkar for sharing his personal views on how CMOS devices are likely to scale. Support for this research was provided by the NSF under grants CCF-0845751, CCF-0917238, and CNS-0917213.
References

1. G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, no. 8, 1965, pp. 56-59.
2. R.H. Dennard et al., "Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, 1974, pp. 256-268.
3. S. Borkar, "The Exascale Challenge," Proc. Int'l Symp. VLSI Design, Automation and Test (VLSI-DAT 10), IEEE CS, 2010, pp. 2-3.
4. F. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Proc. 32nd Ann. ACM/IEEE Int'l Symp. Microarchitecture (Micro 32), IEEE CS, 1999, p. 2.
5. Z. Guz et al., "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE Computer Architecture Letters, vol. 8, no. 1, 2009, pp. 25-28.
6. G.M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Proc. Joint Computer Conf. American Federation of Information Processing Societies (AFIPS 67), ACM, 1967, doi:10.1145/1465482.1465560.
7. M. Bhadauria, V. Weaver, and S. McKee, "Understanding PARSEC Performance on Contemporary CMPs," Proc. IEEE Int'l Symp. Workload Characterization (IISWC 09), IEEE CS, 2009, pp. 98-107.
8. C. Bienia et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 08), ACM, 2008, pp. 72-81.

Hadi Esmaeilzadeh is a PhD student in the Department of Computer Science and Engineering at the University of Washington. His research interests include power-efficient architectures, approximate general-purpose computing, mixed-signal architectures, machine learning, and compilers. Esmaeilzadeh has an MS in computer science from the University of Texas at Austin and an MS in electrical and computer engineering from the University of Tehran.

Emily Blem is a PhD student in the Department of Computer Sciences at the University of Wisconsin–Madison. Her research interests include energy and performance tradeoffs in computer architecture and quantifying them using analytic performance modeling. Blem has an MS in computer science from the University of Wisconsin–Madison.

Renée St. Amant is a PhD student in the Department of Computer Science at the University of Texas at Austin. Her research interests include computer architecture, low-power microarchitectures, mixed-signal approximate computation, new computing technologies, and storage design for approximate computing. St. Amant has an MS in computer science from the University of Texas at Austin.

Karthikeyan Sankaralingam is an assistant professor in the Department of Computer Sciences at the University of Wisconsin–Madison, where he also leads the Vertical Research Group. His research interests include microarchitecture, architecture, and very-large-scale integration (VLSI). Sankaralingam has a PhD in computer science from the University of Texas at Austin.

Doug Burger is the director of client and cloud applications at Microsoft Research, where he manages multiple strategic research projects covering new user interfaces, datacenter specialization, cloud architectures, and platforms that support personalized online services. Burger has a PhD in computer science from the University of Wisconsin. He is a fellow of IEEE and the ACM.

Direct questions and comments about this article to Hadi Esmaeilzadeh, University of Washington, Computer Science & Engineering, Box 352350, AC 101, 185 Stevens Way, Seattle, WA 98195; hadianeh@cs.washington.edu.
