Customizable Computing

Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
University of California, Los Angeles

Synthesis Lectures on Computer Architecture
Series Editor: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin
Series ISSN: 1935-3235

ABOUT SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis Digital Library of
Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of
important research and development topics, published quickly, in digital and print formats. For
more information visit www.morganclaypool.com

Morgan & Claypool Publishers
ISBN: 978-1-62705-767-7
www.morganclaypool.com
Customizable Computing
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.
Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
2015
Die-stacking Architecture
Yuan Xie and Jishen Zhao
2015
Shared-Memory Synchronization
Michael L. Scott
2013
Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus and Ravi Rajwar
2006
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
www.morganclaypool.com
DOI 10.2200/S00650ED1V01Y201505CAC033
Lecture #33
Series Editor: Margaret Martonosi, Princeton University
Series ISSN
Print 1935-3235 Electronic 1935-3243
Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
University of California, Los Angeles
Morgan & Claypool Publishers
ABSTRACT
Since the end of Dennard scaling in the early 2000s, improving the energy efficiency of compu-
tation has been the main concern of the research community and industry. The large energy
efficiency gap between general-purpose processors and application-specific integrated circuits
(ASICs) motivates the exploration of customizable architectures, where one can adapt the ar-
chitecture to the workload. In this Synthesis lecture, we present an overview and introduction of
the recent developments on energy-efficient customizable architectures, including customizable
cores and accelerators, on-chip memory customization, and interconnect optimization. In addi-
tion to a discussion of the general techniques and classification of different approaches used in
each area, we also highlight and illustrate some of the most successful design examples in each
category and discuss their impact on performance and energy efficiency. We hope that this work
captures the state-of-the-art research and development on customizable architectures and serves
as a useful reference basis for further research, design, and implementation for large-scale deploy-
ment in future computing systems.
KEYWORDS
accelerator architectures, memory architecture, multiprocessor interconnection, par-
allel architectures, reconfigurable architectures, memory, green computing
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Road Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Customizable System-On-Chip Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Compute Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 On-Chip Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 Network-On-Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Software Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Customization of Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Dynamic Core Scaling and Defeaturing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Core Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Customized Instruction Set Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4.1 Vector Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.2 Custom Compute Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.3 Reconfigurable Instruction Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.4 Compiler Support for Custom Instructions . . . . . . . . . . . . . . . . . . . . . . . 23
6 Interconnect Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Topology Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2.1 Application-Specific Topology Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.2 Reconfigurable Shortcut Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.3 Partial Crossbar Synthesis and Reconfiguration . . . . . . . . . . . . . . . . . . . . 72
6.3 Routing Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3.1 Application-Aware Deadlock-Free Routing . . . . . . . . . . . . . . . . . . . . . . 75
6.3.2 Data Flow Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4 Customization Enabled by New Device/Circuit Technologies . . . . . . . . . . . . . . 80
6.4.1 Optical Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4.2 Radio-Frequency Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4.3 RRAM-Based Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Acknowledgments
This research is supported by the NSF Expeditions in Computing Award CCF-0926127, by C-
FAR (one of six centers of STARnet, an SRC program sponsored by MARCO and DARPA),
and by the NSF Graduate Research Fellowship Grant #DGE-0707424.
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
June 2015
CHAPTER 1
Introduction
Since the introduction of the microprocessor in 1971, the improvement of processor performance
in its first thirty years was largely driven by the Dennard scaling of transistors [45]. This scaling
calls for a reduction of transistor dimensions by 30% every generation (roughly every two years)
while keeping electric fields constant everywhere in the transistor to maintain reliability (which
implies that the supply voltage needs to be reduced by 30% as well in each generation). Such
scaling doubles the transistor density each generation, reduces the transistor delay by 30%, and at
the same time improves the power by 50% and energy by 65% [7]. The increased transistor count
also leads to more architecture design innovations, such as better memory hierarchy designs and
more sophisticated instruction scheduling and pipelining supports. These factors combined led
to over 1,000 times performance improvement of Intel processors in 20 years (from the 1.5 µm
generation down to the 65 nm generation), as shown in [7].
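The quoted per-generation factors follow directly from a 0.7x linear shrink at constant electric field. The short sketch below is a back-of-envelope illustration (not taken from [7] or [45]) that reproduces them:

    # Illustrative Dennard-scaling arithmetic for one technology generation:
    # linear dimensions and supply voltage both shrink by ~30% (factor 0.7)
    # while the electric field stays constant.
    s = 0.7                     # linear scaling factor per generation

    density = 1 / s**2          # ~2x transistors per unit area
    delay   = s                 # gate delay drops by ~30%
    power   = s**2              # per-transistor switching power: C*V^2*f ~ s * s^2 / s
    energy  = s**3              # per-operation energy: C*V^2 ~ s * s^2

    print(f"density x{density:.2f}")   # ~2.04x
    print(f"delay   x{delay:.2f}")     # ~0.70x (30% faster)
    print(f"power   x{power:.2f}")     # ~0.49x (~50% lower)
    print(f"energy  x{energy:.2f}")    # ~0.34x (~65% lower)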
Unfortunately, Dennard scaling came to an end in the early 2000s. Although the transistor
dimension reduction by 30% per generation continues to follow Moore’s law, the supply voltage
scaling had to almost come to a halt due to the rapid increase of leakage power. In this case,
transistor density can continue to increase, but so can the power density. As a result, in order to
continue meeting the ever-increasing computing needs, yet maintaining a constant power bud-
get, in the past ten years the computing industry stopped simple processor frequency scaling and
entered the era of parallelization, with tens to hundreds of computing cores integrated in a sin-
gle processor, and hundreds to thousands of computing servers connected in a warehouse-scale
data center. However, such highly parallel, general-purpose computing systems now face serious
challenges in terms of performance, power, heat dissipation, space, and cost, as pointed out by a
number of researchers. The term “utilization wall” was introduced in [128], which shows that if
the chip fills up with 64-bit adders (with input and output registers) designed in a 45 nm TSMC
process technology running at the maximum operating frequency (5.2 GHz in their experiment),
only 6.5% of a 300 mm2 chip can be active at the same time. This utilization ratio drops
further to less than 3.5% in the 32 nm fabrication technology, roughly by a factor of two in each
technology generation following their leakage-limited scaling model [128].
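The utilization-wall argument is essentially a power-budget division. The sketch below uses hypothetical per-adder area and power values chosen only so the result lands near the reported figure; the exact numbers in [128] differ, but the structure of the calculation is the same:

    # Back-of-envelope utilization-wall calculation with hypothetical numbers
    # (chosen so the result lands near the ~6.5% reported in [128]).
    chip_area_mm2  = 300.0    # die area considered in the study
    adder_area_mm2 = 0.005    # assumed area of one 64-bit adder plus registers at 45 nm
    adder_power_w  = 0.02     # assumed power of one adder at its maximum frequency
    power_budget_w = 80.0     # assumed chip power budget

    num_adders      = chip_area_mm2 / adder_area_mm2       # adders that fit on the die
    max_active      = power_budget_w / adder_power_w       # adders the budget can power
    active_fraction = min(1.0, max_active / num_adders)
    print(f"{num_adders:.0f} adders fit, but only {active_fraction:.1%} can be active")
    # Next generation: density doubles while leakage-limited scaling barely reduces
    # per-adder power, so the active fraction roughly halves each generation.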
A similar but more detailed and realistic study on dark silicon projection was carried out
in [51]. It uses a set of 20 representative Intel and AMD cores to build up empirical models
which capture the relationship between area and performance, and the relationship between power
and performance. These models, together with the device-scaling models, are used for projection
of the core area, performance, and power in various technology generations. The study also considers
real parallel application workloads as represented by the PARSEC benchmark suite [9]. It further
considers different multicore models, including the symmetric multicores, asymmetric multicores
(consisting of both large and small cores), dynamic multicores (either large or small cores depend-
ing on if the power or area constraint is imposed), and composable multicores (where small cores
can be fused into large cores). Their study concludes that at 22 nm, 21% of a fixed-size chip must
be powered off, and at 8 nm, this dark silicon ratio grows to more than 50% [51]. This study also
points to the end of simple core scaling.
Given the limitation of core scaling, the computing industry and research community are
actively looking for new disruptive solutions beyond parallelization that can bring further signifi-
cant energy efficiency improvement. Recent studies suggest that the next opportunity for signifi-
cant power-performance efficiency improvement comes from customized computing, where one
may adapt the processor architecture to optimize for intended applications or application domains
[7, 38].
The performance gap between a totally customized solution using an application-specific
integrated circuit (ASIC) and a general-purpose processor can be very large, as documented in
several studies. An early case study of the 128-bit key AES encryption algorithm was presented
in [116]. An ASIC implementation of this algorithm in a 0.18 µm CMOS technology achieves
a 3.86 Gbits/second processing rate at 350 mW power consumption, while the same algorithm
coded in assembly language yields a 31 Mbits/second processing rate with 240 mW power run-
ning on a StrongARM processor, and a 648 Mbits/second processing rate with 41.4 W power
running on a Pentium III processor. This results in a performance/energy efficiency (measured in
Gbits/second/W) gap of a factor of 85X and 800X, respectively, when compared with the ASIC
implementation. In an extreme case, when the same algorithm is coded in the Java language and
executed on an embedded SPARC processor, it yields 450 bits/second with 120 mW power, re-
sulting in a performance/energy efficiency gap as large as a factor of 3 million (!) when compared
to the ASIC solution.
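A quick back-of-envelope check of the efficiency ratios implied by these published throughput and power numbers (in Gbits/second/W):

    # Efficiency (Gbits/second per watt) implied by the numbers quoted above.
    def gbits_per_s_per_watt(gbits_per_s, watts):
        return gbits_per_s / watts

    asic      = gbits_per_s_per_watt(3.86,   0.350)   # ASIC in 0.18 um CMOS
    strongarm = gbits_per_s_per_watt(0.031,  0.240)   # assembly on StrongARM
    pentium3  = gbits_per_s_per_watt(0.648,  41.4)    # assembly on Pentium III
    java      = gbits_per_s_per_watt(450e-9, 0.120)   # Java on an embedded SPARC

    print(f"ASIC vs. StrongARM  : {asic / strongarm:12.0f}x")   # ~85x
    print(f"ASIC vs. Pentium III: {asic / pentium3:12.0f}x")    # several hundred x
    print(f"ASIC vs. Java/SPARC : {asic / java:12,.0f}x")       # ~3 million x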
Recent work studied a much more complex application for such gap analysis [67]. It uses
a 720p high-definition H.264 encoder as the application driver, and a four-core CMP system
using the Tensilica extensible RISC cores [119] as the baseline processor configuration. Com-
pared to an optimized ASIC implementation, the baseline CMP is 250X slower and consumes
500X more energy. Adding 16-wide SIMD execution units to the baseline cores improves the
performance by 10X and energy efficiency by 7X. Addition of custom-fused instructions is also
considered, and it improves the performance and energy efficiency by an additional 1.4X. Despite
these enhancements, the resulting enhanced CMP is still 50X less energy efficient than ASIC.
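Multiplying out the quoted energy factors shows where the remaining gap comes from:

    # Energy-efficiency gap between the baseline CMP and the ASIC for H.264 [67].
    gap = 500.0        # baseline four-core CMP consumes 500x the ASIC energy
    gap /= 7.0         # 16-wide SIMD units improve energy efficiency by 7x
    gap /= 1.4         # custom-fused instructions add another 1.4x
    print(f"remaining gap vs. ASIC: ~{gap:.0f}x")   # ~51x, i.e., still about 50x less efficient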
The large energy efficiency gap between the ASIC and general-purpose processors is the
main motivation for architecture customization, which is the focus of this lecture. In particular,
one way to significantly improve the energy efficiency is to introduce many special-purpose on-
chip accelerators implemented in ASIC and share them among multiple processor cores, so that
as much computation as possible is carried out on accelerators instead of using general-purpose
cores. This leads to accelerator-rich architectures, which have received growing interest in recent
years [26, 28, 89]. Such architectures will be discussed in detail in Chapter 4.
There are two major concerns about using accelerators. One relates to their low utilization
and the other relates to their narrow workload coverage. However, given the utilization wall [128]
and the dark silicon problem [51] discussed earlier, low accelerator utilization is no longer a se-
rious problem, as only a fraction of computing resources on-chip can be activated at one time
in future technology generations, given the tight power and thermal budgets. So, it is perfectly
fine to populate the chip with many accelerators, knowing that many of them will be inactive at
any given time. But once an accelerator is used, it can deliver one to two orders of magnitude
improvement in energy efficiency over the general-purpose cores.
e problem of narrow workload coverage can be addressed by introducing reconfigurability
and using composable accelerators. Examples include the use of fine-grain field-programmable
gate arrays (FPGAs), coarse-grain reconfigurable arrays [61, 62, 91, 94, 118], or dynamically
composable accelerator building blocks [26, 27]. These approaches will be discussed in more detail
in Section 4.4.
Given the significant energy efficiency advantage of accelerators and the promising progress
in widening accelerator workload coverage, we increasingly believe that the future of processor
architecture should be rich in accelerators, as opposed to having many general-purpose cores.
To some extent, such accelerator-rich architectures are more like a human brain, which has many
specialized neural microcircuits (accelerators), each dedicated to a different function (such as navi-
gation, speech, vision, etc.). The high degree of customization in the human brain leads to a great
deal of efficiency; the brain can perform various highly sophisticated cognitive functions while
consuming only about 20W, an inspiring and challenging performance for computer architects
to match.
Not only can the compute engine be customized, but so can the memory system and on-
chip interconnects. For example, instead of only using a general-purpose cache, one may use
program-managed or accelerator-managed buffers (or scratchpad memories). Customization is
needed to flexibly partition these two types of on-chip memories. Memory customization will be
discussed in Chapter 5. Also, instead of using a general-purpose mesh-based network-on-chip
(NoC) for packet switching, one may prefer a customized circuit-switching topology between
accelerators and the memory system. Customization of on-chip interconnects will be discussed
in Chapter 6.
The remainder of this lecture is organized as follows. Chapter 2 gives a broad overview of
the trajectory of customization in computing. Customization of compute cores, such as custom
instructions, will be covered in Chapter 3. Loosely coupled compute engines will be discussed in
Chapter 4. Chapter 5 will discuss customizations to the memory system, and Chapter 6 discusses
custom network interconnect designs. Finally, Chapter 7 concludes the lecture with discussions
of industrial trends and future research topics.
CHAPTER 2
Road Map
Customized computing involves the specialization of hardware for a particular domain, and often
includes a software component to fully leverage this specialization in hardware. In this chapter, we
will lay the foundation for customized computing, enumerating the design trade-offs and defining
vocabulary along three dimensions:
• Programmability
• Specialization
• Reconfigurability
Programmability
A fixed function compute unit can do one operation on incoming data, and nothing else. For ex-
ample, a compute unit that is designed to perform an FFT operation on any incoming data is fixed
function. This inflexibility limits how much a compute unit may be leveraged, but it streamlines
the design of the unit such that it may be highly optimized for that particular task. The number
of bits used within the datapath of the unit and the types of mathematical operators included, for
example, can be precisely tuned to the particular operation the compute unit will perform.
Contrasting this, a programmable compute unit executes sequences of instructions to de-
fine the tasks it is to perform. The instructions understood by the programmable compute
unit constitute the instruction set architecture (ISA). The ISA is the interface for use of the pro-
grammable compute unit. Software that makes use of the programmable compute unit will consist
of these instructions, and these instructions are typically chosen to maximize the expressive nature
of the ISA to describe the nature of computation desired in the programmable unit. The hardware
of the programmable unit will handle these instructions in a generally more flexible datapath than
that of the fixed function compute unit. The fetching, decoding, and sequencing of instructions
leads to performance and power overhead that is not required in a fixed function design. But the
programmable compute unit is capable of executing different sequences of instructions to handle
a wider array of functions than a fixed function pipeline.
There exists a broad spectrum of design choices between these two alternatives. Pro-
grammable units may have a large number of instructions or a small number of instructions, for
example. A pure fixed function compute unit can be thought of as a programmable compute unit
that only has a single implicit instruction (i.e., perform an FFT). The more instructions supported
by the compute unit, the more compactly the software can express the desired functional-
ity. The fewer instructions supported by the compute unit, the simpler the hardware required to
implement these instructions and the more potential for an optimized and streamlined imple-
mentation. Thus the programmability of the compute unit refers to the degree to which it may
be controlled via a sequence of instructions, from fixed function compute units that require no
instructions at all to complex, expressive programmable designs with a large number of instruc-
tions.
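The distinction can be made concrete with a toy sketch (illustrative only; the three-instruction ISA and register file below are invented for the example): a fixed-function unit exposes a single implicit operation, while a programmable unit interprets whatever instruction sequence it is given.

    import numpy as np

    def fixed_function_fft(block):
        # Fixed-function unit: one implicit operation (an FFT) and nothing else.
        return np.fft.fft(block)

    def programmable_unit(program, regs):
        # Programmable unit: interprets a tiny invented ISA over a register file.
        for op, dst, a, b in program:
            if op == "add":
                regs[dst] = regs[a] + regs[b]
            elif op == "mul":
                regs[dst] = regs[a] * regs[b]
            else:
                raise ValueError(f"unsupported opcode {op}")
        return regs

    print(fixed_function_fft([1.0, 0.0, 1.0, 0.0]))          # only ever computes an FFT
    regs = {"r0": 2.0, "r1": 3.0, "r2": 0.0}
    programmable_unit([("mul", "r2", "r0", "r1"),
                       ("add", "r2", "r2", "r1")], regs)
    print(regs["r2"])   # 9.0: the same datapath reused for an arbitrary instruction sequence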
Specialization
Customized computing targets a smaller set of applications and algorithms within a domain to
improve performance and reduce power requirements. The degree to which components are cus-
tomized to a particular domain is the specialization of those components. There are a large number
of different specializations that a hardware designer may utilize, from the datapath width of the
compute unit, to the number and type of functional units, to the amount of cache, and more.
This is distinct from a general purpose design, which attempts to cover all applications
rather than providing a customized architecture for a particular domain. General purpose designs
may use a set of benchmarks from a target performance suite, but the idea is not to optimize
specifically for those benchmarks. Rather, that performance suite may simply be used to gauge
performance.
There is again a broad spectrum of design choices between specialized and general purpose
designs. One may consider general purpose designs to be those specialized for the domain of all
applications. In some cases, general purpose designs are more cost effective since the design time
may be amortized over more possible uses—an ALU that can be designed once and then used in
a variety of compute units may amortize the cost of the design of the ALU, for example.
Reconfigurability
Once a design has been implemented, it can be beneficial to allow further adaptation to continue
to customize the hardware to react to (1) changes in data usage patterns, (2) algorithmic changes
or advancements, and (3) domain expansion or unintended use. For example, a compute unit
may have been optimized to perform a particular algorithm for FFT, but a new approach may be
faster. Hardware that can flexibly adapt even after tape out is reconfigurable hardware. The degree
to which hardware may be reconfigured depends on the granularity of reconfiguration. While
finer granularity reconfiguration can allow greater flexibility, the overhead of reconfiguration can
mean that a reconfigurable design will perform worse and/or be less energy efficient than a static
(i.e., non-reconfigurable) alternative. One example of a fine-grain reconfigurable platform is an
FPGA, which can be used to implement a wide array of different compute units, from fixed
function to programmable units, with all levels of specialization. But an FPGA implementation
of a particular compute unit is less efficient than an ASIC implementation of the same compute
unit. However, the ASIC implementation is static, and cannot adapt after design tape out. We
will examine more coarse-grain alternatives for reconfigurable compute units in Section 4.4.
Examples
• Accelerators—early GPUs, MPEG/media decoders, crypto accelerators
• Programmable Cores—modern GPUs, general purpose cores, ASIPs
• Future designs may feature accelerators in a primary computational role
• Some programmable cores and/or programmable fabric are still included for
generality/longevity
Chapter 3 covers the customization of processor cores and Chapter 4 covers coprocessors
and accelerators. We split compute components into two sections to better manage the diversity
of the design space for these components.
Sharing
On-chip memory may be kept private to a particular compute unit or may be shared among
multiple compute units. Private on-chip memory means that the application will not need to
contend for space with another application, and will get the full benefit of the on-chip memory.
Shared on-chip memory can amortize the cost of on-chip memory over several compute units,
providing a potentially larger pool of space that can be leveraged by these compute units than if the
space was partitioned among the units as private memory. For example, four compute units can
each have 1MB of on-chip memory dedicated to them. Each compute unit will always have 1MB
regardless of the demand from other compute units. However, if four compute units share 4MB
of on-chip memory, and if the different compute units use different amounts of memory, one
compute unit may, for example, use more than 1MB of space at a particular time since there is a
large pool of memory available. Sharing works particularly well when compute units use different
amounts of memory at different times. Sharing is also extremely effective when compute units
make use of the same memory locations. For example, if compute units are all working on an image
in parallel, storing the image in a single memory shared among the units allows the compute units
to more effectively cooperate on the shared data.
2.1.3 NETWORK-ON-CHIP
On-chip memory stores the data needed by the compute units, but an important part of the over-
all CSoC is the communication infrastructure that allows this stored data to be distributed to
the compute units, that allows the data to be delivered to/from the on-chip memory from/to the
memory interfaces that communicate off-chip, and that allows compute units to synchronize and
communicate with one another. In many applications there is a considerable amount of data that
must be communicated to the compute units used to accelerate application performance. And
with multiple compute units often employed to maximize data level parallelism, there are of-
ten multiple data streams being communicated around the CSoC. These requirements transcend
the conventional bus-based approach of older multicore designs, with designers instead choosing
network-on-chip (NoC) designs. NoC designs enable the communication of more data between
more CSoC components.
Components interfacing with an NoC typically bundle transmitted data into packets,
which contain at least address information as to the desired communication destination and the
payload itself, which is some portion of the data to be transmitted to a particular destination.
NoCs transmit messages via packets to enable flexible and reliable data transport—packets may
be buffered at intermediate nodes within the network or reordered in some situations. Packet-
based communication also avoids long latency arbitration that is associated with communication
in a single hop over an entire chip. Each hop through a packet-based NoC performs local arbi-
tration instead.
The creation of an NoC involves a rich set of design decisions that may be highly cus-
tomized for a set of applications in a particular domain. Most design decisions impact the latency
or bandwidth of the NoC. In simple terms, the latency of the NoC is how long it takes a given
piece of data to pass through the NoC. The bandwidth of the NoC is how much data can be
communicated in the NoC at a particular time. Lower latency may be more important for syn-
chronizing communication, like locks or barriers that impact multiple computational threads in
an application. Higher bandwidth is more important for applications with streaming computation
(i.e., low data locality) for example.
One example of a design decision is the topology of an NoC. This refers to the pattern
of links that connect particular components of the NoC. A simple topology is a ring, where
each component in the NoC is connected to two neighboring components, forming a chain of
components. More complex communication patterns may be realized by more highly connected
topologies that allow more simultaneous communication or a shorter communication distance.
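A small brute-force calculation illustrates why more highly connected topologies shorten communication distance; this sketch counts idealized hop counts only and ignores router and link details:

    # Average shortest-path hop count for 16 nodes on a ring vs. a 4x4 mesh.
    N = 16

    def ring_hops(a, b, n=N):
        d = abs(a - b)
        return min(d, n - d)

    def mesh_hops(a, b, side=4):
        ax, ay, bx, by = a % side, a // side, b % side, b // side
        return abs(ax - bx) + abs(ay - by)

    pairs = [(a, b) for a in range(N) for b in range(N) if a != b]
    print(sum(ring_hops(a, b) for a, b in pairs) / len(pairs))   # ~4.27 hops on the ring
    print(sum(mesh_hops(a, b) for a, b in pairs) / len(pairs))   # ~2.67 hops on the mesh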
Another example is the bandwidth of an individual link in the topology—a wire that is tra-
versed in one cycle of the network’s clock. Larger links can improve bandwidth but require more
buffering space at intermediate network nodes, which can increase power cost.
An NoC is typically designed with a particular level of utilization in mind, where decisions
like topology or link bandwidth are chosen based on an expected level of service. For example,
NoCs may be designed for worst case behavior, where the bandwidth of individual links is sized
for peak traffic requirements, and every path in the network is capable of sustaining that peak
bandwidth requirement. This is a flexible design in that the worst case behavior can manifest
on any particular communication path in the NoC, and there will be sufficient bandwidth to
handle it. But it can mean overprovisioning the NoC if worst case behavior is infrequent or
sparsely exhibited in the NoC. In other words, the larger bandwidth components can mean wasted
power (even if only static power) or area in most cases. NoCs may also be designed for average case
behavior, where the bandwidth is sized according to the average traffic requirement, but in such
cases performance can suffer when worst case behavior is exhibited.
Topological Customization
Customized designs can specialize different parts of the NoC for different communication pat-
terns seen in applications within a domain. For example, an architecture may specialize the NoC
such that there is a high bandwidth connection between a memory interface and a particular com-
pute unit that performs workload balancing and sorting for particular tasks, and then there is a
lower bandwidth connection between that compute unit for workload balancing and the remain-
der of the compute units that perform the actual computation (i.e., work). More sophisticated
designs can adapt bandwidth to the dynamic requirements of the application in execution. Cus-
tomized designs may also adapt the topology of the NoC to the specific requirements of the
application in execution. Section 6.2 will explore such flexible designs, along with some of the
complexity in implementing NoC designs that are specialized for particular communication pat-
terns.
Routing Customization
Another approach to specialization is to change the routing of packets in the NoC. Packets may
be scheduled in different ways to avoid congestion in the NoC, for example. Another example
would be circuit switching, where a particular route through the NoC is reserved for a particular
communication, allowing packets in that communication to be expedited through the NoC with-
out intermediate arbitration. This is useful in bursty communication where the cost of arbitration
may be amortized over the communication of many packets.
CHAPTER 3
Customization of Cores
3.1 INTRODUCTION
Because processing cores contribute greatly to energy consumption in modern processors, the
conventional processing core is a good place to start looking for customizations to computation
engines. Processing cores are pervasive, and their architecture and compilation flow are mature.
Modifications made to processing cores then have the advantage that existing hardware mod-
ules and infrastructure invested in building efficient and high-performance processors can be
leveraged, without having to necessarily abandon existing software stacks as may be required
when designing hardware from the ground up. Additionally, programmers can use their exist-
ing knowledge of programming conventional processing cores as a foundation toward learning
new techniques that build upon conventional cores, instead of having to adopt new programming
paradigms or new languages.
In addition to benefiting from mature software stacks, any modifications made to a con-
ventional processing core can also take advantage of many of the architectural components that
have made cores so effective. Examples of these architectural components are caches, mechanisms
for out-of-order scheduling and speculative execution, and software scheduling mechanisms. By
integrating modifications directly into a processing core, new features can be designed to blend
into these components. For example, adding a new instruction to the existing execution pipeline
automatically enables this instruction to benefit from aggressive instruction scheduling already
present in a conventional core.
However, introducing new compute capability, such as new arithmetic units, into existing
processing cores means being burdened by many of the design restrictions that these cores al-
ready exert on arithmetic unit design. For example, out-of-order processing benefits considerably
from short latency instructions, as long latency instructions can cause pipeline stalls. Conven-
tional cores are also fundamentally bound, both in terms of performance and efficiency, by the
infrastructure necessary to execute instructions. As a result, conventional cores cannot be as ef-
ficient at performing a particular task as a hardware structure that is more specialized to that
purpose [26]. Figure 3.1 illustrates this point, showing that the energy cost of executing an in-
struction is much greater than the energy that is required to perform the arithmetic computation
(e.g., energy devoted to integer and floating point arithmetic). The rest of the energy is spent to
implement the infrastructure internal to the processing core that is used to perform tasks such as
scheduling instructions, fetch and decode, extracting instruction level parallelism, etc. Figure 3.1
shows only the comparison of structures internal to the processing core itself, and excludes exter-
nal components such as memory systems and networks. These are burdens that are ever present
in conventional processing cores, and they represent the architectural cost of generality and pro-
grammability. This can be contrasted against the energy proportions shown in Figure 3.2, which
show the energy saving when the compute engine is customized for a particular application, in-
stead of a general-purpose design. The difference in energy cost devoted to computation is pri-
marily the result of relaxing the design requirements of functional units, so that functional units
operate only at precisions that are necessary and are designed to emphasize energy efficiency per
computation, and potentially exhibit deeper pipelines and longer latencies than would be tolerable
when couched inside a conventional core.
This chapter will cover the following topics related to customization of processing cores:
Figure 3.2: Energy cost of subcomponents in a conventional compute core as a proportion of the total
energy consumed by the core. This shows the energy savings attainable if computation is performed
in an energy-optimal ASIC. Results are for a Nehalem era 4-core Intel Xeon CPU. Memory includes
L1 cache energy only. Taken from [26].
• Core Fusion: Architectures that enable one “big” core to act as if it were really many
“small cores,” and vice versa, to dynamically adapt to different amounts of thread-level or
instruction-level parallelism.
Figure 3.3: A 4-core core fusion processor bundle with components added to support merging of
cores. Adapted from [74].
Figure 3.4: Comparison of performance between various processors of issue widths and 6-issue
merged core fusion. Taken from [74].
Figure 3.6: How a BERET module fits into a processor pipeline. Taken from [66].
Figure 3.7: Internal architecture and use of BERET modules. (a) Shows internal BERET architecture
along with communication mechanisms with SEBs, (b) shows approximate overhead associated with
different stages of a BERET invocation, (c) shows a configuration stored in a SEB. Taken from [66].
A BERET engine contains a set of subgraph execution blocks (SEBs) and configuration memory to hold trace configurations. Each trace config-
uration holds a small set of pre-decoded instructions that are tightly scheduled. A user program
uses two instructions to interact with configurable instructions: 1) a command to program the
configuration memory, and 2) a command to invoke a stored trace. To simplify integration into
the rest of the core pipeline, a BERET engine is restricted to working on a subset of registers that
would normally be available, thus limiting the number of inputs and outputs, and is restricted
from issuing memory accesses directly.
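In rough pseudocode, the two-command interface might be used as follows; the function names, trace encoding, and register handling here are hypothetical stand-ins, not the actual BERET interface in [66]:

    # Hypothetical sketch of driving a BERET-style engine; beret_configure() and
    # beret_invoke() stand in for the two commands described above, and the trace
    # encoding and register handling are invented for illustration.
    CONFIG_MEM = {}

    class Seb:
        """Toy stand-in for a subgraph execution block: a straight-line op sequence."""
        def __init__(self, ops):
            self.ops = ops
        def execute(self, regs):
            for fn in self.ops:
                regs = fn(regs)
            return regs

    def beret_configure(slot, seb):
        # Program one entry of the configuration memory with a pre-decoded subgraph.
        CONFIG_MEM[slot] = seb

    def beret_invoke(slot, live_in_regs):
        # Invoke a stored trace; only a restricted register subset is visible to it.
        return CONFIG_MEM[slot].execute(live_in_regs)

    # Compile time: a hot subgraph is encoded as a SEB configuration...
    beret_configure(0, Seb([lambda r: {**r, "t": r["a"] * r["b"]},
                            lambda r: {**r, "c": r["t"] + r["a"]}]))
    # ...run time: the core invokes the stored trace instead of re-fetching and
    # re-decoding the equivalent instruction sequence on every iteration.
    print(beret_invoke(0, {"a": 3, "b": 4}))   # {'a': 3, 'b': 4, 't': 12, 'c': 15}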
Figure 3.8 shows the process of transforming a program to use a BERET engine, which
is a process performed statically at compile time. A program is first separated into “hot code,”
which is frequently executed code where the majority of time is spent, and “cold code,” which is
the large volume of code that is executed infrequently (Figure 3.8.a). The hot code region is then
broken up into communicating sub-regions, on the granularity of a basic block (Figure 3.8.b).
These sub-regions are then converted into BERET SEB configurations, each of which describes
a small program segment with no control dependencies (Figure 3.8.c) which can be loaded and
invoked at runtime. The hot code region is then expressed in terms of dependent BERET calls,
and possibly some flow control being executed within the unmodified portions of the processing
core. At run time, these configurations are loaded into the SEB units, and the hot code is executed
by invoking the BERET engine (Figure 3.8.d). Tasks like dependency tracking need only be performed
when transitioning from one BERET call to another, and can be performed by the mechanisms
that normally perform instruction dependency tracking within the unmodified processing core.
Figure 3.8: Program transformation process to map software to a BERET module. A hot region is
first broken down into chunks, with each chunk stored as a SEB configuration. The core dependency
tracking is used to schedule inter-SEB activities. Taken from [66].
This simple architecture allows the BERET engine to perform the majority of the work associ-
ated with executing a compute kernel, and leaves management of control structures and memory
operations for the general-purpose portion of the core.
While a configurable architecture like BERET is not going to execute a particular func-
tionality as efficiently or as fast as a dedicated compute engine, the configurable nature of the
engine allows for high utilization while executing hot code, without being restricted to particular
types of operations. It also accomplishes the primary goal of customized instruction sets, which
allows resource utilization to be further tilted toward the execution stage of the pipeline and away
from other stages, and increases the amount of work performed per instruction.
CHAPTER 4
Loosely Coupled Compute Engines
• Loosely Coupled Accelerators: Coarse-grain compute engines that act autonomously from
processing cores.
A significant shortcoming of LCAs in general is that they are implemented as ASICs, and
are thus of fixed functionality. This restricts LCAs to only being reasonably used in instances in
which an algorithm is (1) mature enough that it is unlikely to change in the near future and (2)
important enough that there is justification for creating a specialized processor. While there are a
set of workloads that meet these criteria, such as encryption and decryption for web servers, work-
loads that are worth the up-front cost associated with designing a special-purpose processor are
infrequent. While tapeout costs are dropping for processors, and the tools for processor author-
ship are improving, it is still difficult to economically justify the inclusion of LCAs in commodity
processors in most cases.
Architectures like that found in the WSP are highly customized for a particular workload
or set of workloads. While this allows them to achieve performance and efficiency far beyond
that which would normally be found in a conventional processor for these workloads, they are
indistinguishable from a resource-constrained conventional processor for workloads other than
those for which they are customized.
Figure 4.2: A sample equation to act as a driver for further discussion. (a) Shows the full computation
to be performed, (b) shows the computation broken into work to be performed by individual compute
engines.
Figure 4.3: An example CGRA, inspired by [118]. (a) Shows the interconnection of multiple
compute engines, (b) shows the internal structure of a single compute engine.
Figure 4.4(a) shows the nodes of the compute graph, with Figure 4.4(b) showing a potential mapping of these nodes to compute en-
gines such that our desired communication pattern can be achieved. At run time, we can select
from any arrangement of compute engine such that the pattern shown in Figure 4.4(b) can fit,
and so that the compute engines are not otherwise occupied with other tasks. One such allocation
of our sample 3x4 CGRA is shown in Figure 4.4(c), though there are four possible configurations
to choose from (shifting the pattern to the right, up, or to the upper right). At compile time the
pattern shown in Figure 4.4(b) would be generated, while the allocation shown in Figure 4.4(c)
would be performed at run time. This example also illustrates the importance of homogeneity in
the CGRA's layout, as an unusable compute engine anywhere in the center of the sample CGRA
would eliminate all possible mapping opportunities.
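Counting the placements in this example is straightforward: the mapped pattern occupies a 2x3 block of compute engines, and a 2x3 block fits into a 3x4 grid in exactly four positions, assuming every engine is free and interchangeable. A minimal sketch:

    # Enumerate where a 2x3 communication pattern can sit on a 3x4 CGRA, assuming
    # every compute engine is free and identical (as in the example above).
    grid_rows, grid_cols = 3, 4
    pat_rows, pat_cols   = 2, 3        # the A-B-C / D-E pattern of Figure 4.4(b)

    placements = [(r, c)
                  for r in range(grid_rows - pat_rows + 1)
                  for c in range(grid_cols - pat_cols + 1)]
    print(len(placements), placements)   # 4 placements: (0, 0), (0, 1), (1, 0), (1, 1)
    # The two center engines (row 1, columns 1 and 2) are covered by every placement,
    # so a single busy or faulty engine there removes all mapping options.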
Figure 4.4: The procedure of mapping a sample program to a statically mapped sample CGRA ar-
chitecture. (a) Shows the broken up program to be mapped with node labels, (b) shows a sample com-
munication pattern that can be used to implement this computation, (c) shows a sample allocation of
this mapping to a sample 3x4 CGRA with neighbor links.
Figure 4.5: A CGRA architecture designed for mapping at runtime with a packet switched network.
(a) Shows how the compiler & mapping system views the CGRA (as fully connected), with (b) illus-
trating the actual topological layout.
Figure 4.6: A diagram of a CHARM processor, featuring cores and ABBs bundled into islands. Taken
from [26].
4.4.3 CHARM
An example CGRA architecture is CHARM [26], shown in Figure 4.6, which focuses on virtu-
alization and rapid scheduling. CHARM introduces a hardware resource management mecha-
nism called an accelerator block composer (ABC), which manages work done by a series of small
compute engines called accelerator building blocks (ABBs) that are distributed throughout the
processor in a series of islands. There is a single ABC on a chip, and this device has control over a
large number of ABBs distributed across these islands. The ABC acts as a gate-
way through which a processor can interact with accelerators. Internal to each island, there is a
DMA that choreographs data transfer in and out of the island, an amount of scratchpad memory
(SPM) for use as ABB buffers, an internal network to facilitate intra-island communication, and
a network interface to enable inter-island communication. An island is illustrated in Figure 4.7.
A conventional core invokes an accelerator by writing a configuration to normal shared
memory that describes a communicating acyclic graph of ABBs. The ABC then schedules this
graph of ABBs among free resources with the objective of maximizing performance for the newly
instantiated accelerator, and assigns to each involved ABB a portion of work to perform. To
further boost performance, the ABC continues to schedule additional instances of the accelerator
until it either runs out of resources, or runs out of work to assign. When all the work is done, the
ABC signals to the calling core that work has been completed.
Using the ABC, the hardware scheduling mechanism CHARM is able to recruit a large
volume of compute resources to participate in any task that has ample parallelism.
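The invocation flow can be summarized in a short behavioral sketch; the descriptor fields, ABB type names, and scheduling policy below are hypothetical stand-ins for the shared-memory interface described above:

    # Hypothetical sketch of the CHARM invocation flow described above; the descriptor
    # fields and function names are illustrative, not the interface defined in [26].
    task = {
        "abb_graph": [("gaussian", "gradient"), ("gradient", "threshold")],  # ABB dependence edges
        "work_items": 1024,
    }

    def abc_schedule(task, free_abbs):
        # The core has written `task` to shared memory; the ABC maps the ABB graph onto
        # free ABBs and keeps replicating it while resources and work remain.
        needed = {abb for edge in task["abb_graph"] for abb in edge}
        copies = min(free_abbs[t] for t in needed)       # how many full graphs fit
        for t in needed:
            free_abbs[t] -= copies
        per_copy = -(-task["work_items"] // copies)      # ceiling division
        work = [min(per_copy, task["work_items"] - i * per_copy) for i in range(copies)]
        return work                    # the ABC signals the core once all slices complete

    free = {"gaussian": 2, "gradient": 2, "threshold": 3}
    print(abc_schedule(task, free))    # [512, 512]: two instances, work split between them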
Figure 4.7: An island of ABBs and associated internal structures. This example island features a wide
unidirectional ring for internal connectivity, and uses the normal processor interconnect for inter-island
communication. Taken from [26].
Because the compute engines of a CGRA support general operations (e.g., floating point operations, integer arithmetic, etc.), the mapping of a program region to a CGRA is
relatively straightforward. In a practical sense, the process of mapping a program to a CGRA
is identical to that of mapping custom instructions described in Section 3.4.4, with the added
expectation that the CGRA will likely be used to target computations that operate over large
volumes of data.
CHAPTER 5
On-Chip Memory
Customization
5.1 INTRODUCTION
On-chip memory provides local memory for computational units, such as general-purpose processors and
accelerators, enabling more efficient data access. Instead of directly fetching data from off-chip
DRAM, on-chip memory can provide both short access latency and high bandwidth to hide
DRAM access latency. Data blocks with high locality can be cached in on-chip memory.
In this chapter, we will first introduce different types of on-chip memories and then discuss the
customization techniques for different types of on-chip memory system designs.
Performance
Both caches and buffers are used to hide the access latencies to off-chip DRAM and can provide
reasonable performance gain. For applications with complicated memory access patterns or access
patterns that cannot be predicted at compile time, caches are better choices. Many applications
that run on top of general-purpose cores have this kind of behavior. Buffers are usually used for
special-purpose cores and accelerators, which usually run applications or computation kernels
with predictable memory access patterns. It is common for an accelerator to request multiple data
elements per computation. In order to fetch multiple data elements simultaneously with the same
fixed latency, designers usually provide multiple individual buffers to perform banking during the
accelerator design time. e throughput of the accelerator can thus be significantly improved.
Although the internal banking provided in a cache can interleave simultaneous accesses, it does
not guarantee that all of the accesses can be interleaved successfully without conflicts. Therefore,
buffers are superior in this case, especially when the access patterns are known at the design time
of the accelerator.
Another advantage of a buffer is the predictable access latency. The access latency when
using caches is difficult to predict since the cache blocks are implicitly managed by general replacement
policies. A cache miss can occur during runtime, which triggers the cache controller to send a
request to the memory controller to retrieve a data block. The access latency is unpredictable
since the block may be at different levels of the cache hierarchy and there may be contention for
memory hierarchy resources. In contrast, buffers can provide absolutely predictable performance
since the data blocks are explicitly managed by software or hardware. Buffers can guarantee that
critical performance targets are met. With compiler optimizations, buffers can further achieve
better data reuse and save next-level memory accesses.
Figure 5.1: A 4-way set-associative cache using selective cache ways. Taken from [1].
connected transistors [135]. With the support of this circuit-level technique, a cache can be con-
figured at a certain granularity dynamically.
Figure 5.2 demonstrates the architectural design of the DRI i-cache. The DRI i-cache can
dynamically record the miss rate within a fixed-length interval. This is achieved through a miss
counter, a miss-bound, and an end-of-interval flag. The miss counter records the number of misses
in an interval; this is used to measure the cache demand of the DRI i-cache. The miss-bound is a
value preset to be the upper bound of misses allowed in an interval. If the miss counter is larger
than the miss bound, the DRI i-cache is upsized. Otherwise, the DRI i-cache is downsized. The cache size can
be changed by a factor of two by altering the number of bits used in the index. To prevent the
cache from thrashing and downsizing to a prohibitively small size, a size-bound is provided to
specify the minimum cache size. A DRI i-cache can reduce the overall energy-delay product by
62% while increasing the execution time by at most 4%.
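The resizing policy amounts to a simple interval-based control loop. The sketch below is a behavioral approximation (the bounds and sizes are illustrative, and the real design adjusts the index mask rather than a size variable):

    # Sketch of the DRI i-cache resizing decision at the end of each interval.
    # Sizes change by powers of two (in hardware, by masking index bits); a size-bound
    # keeps the cache from shrinking below a minimum capacity. Values are illustrative.
    FULL_SIZE_KB  = 64
    SIZE_BOUND_KB = 8
    MISS_BOUND    = 1000      # preset upper bound on misses per interval

    def end_of_interval(size_kb, miss_counter):
        if miss_counter > MISS_BOUND:     # too many misses: the cache is too small
            size_kb = min(FULL_SIZE_KB, size_kb * 2)
        else:                             # demand is low: shrink to cut leakage
            size_kb = max(SIZE_BOUND_KB, size_kb // 2)
        return size_kb                    # the miss counter is reset for the next interval

    print(end_of_interval(32, 1500))   # 64: upsize after a miss-heavy interval
    print(end_of_interval(32,  200))   # 16: downsize after a quiet interval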
Figure 5.4: Fraction of time cached data are “dead.” Taken from [77].
Each cache line is associated with a counter that is periodically incremented and is reset whenever
the line is accessed. If the counter saturates, it means that the cache line is unlikely to be accessed
again and thus can be powered off. However, the decay interval can be in the range of tens of thousands of
cycles. The counter design can thus be impractical, since a full per-line counter would require a large number of bits.
Therefore, the authors proposed a hierarchical counter design where a single global counter
is used to provide the ticks for smaller cache-line counters, as shown in Figure 5.5. The local two-
bit counter is reset when the corresponding cache line is accessed. To avoid the possibility of a
burst of write-backs when the global tick signal is generated, the tick signal is cascaded from one
local counter to the next with a one-cycle latency. The proposed cache decay idea can reduce the
L1 leakage energy by 5x for the SPEC CPU2000 with negligible performance degradation [77].
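A behavioral sketch of the hierarchical counter scheme, with a shared global counter ticking two-bit local counters that are reset on access and mark a line dead on saturation (tick period and policy details are illustrative):

    # Behavioral sketch of cache decay with hierarchical counters.
    GLOBAL_TICK_PERIOD = 8192     # cycles between global ticks (illustrative value)
    LOCAL_MAX = 3                 # 2-bit saturating counter per cache line

    class DecayLine:
        def __init__(self):
            self.local = 0
            self.powered_off = False

        def on_access(self):      # any access resets the small local counter
            self.local = 0
            self.powered_off = False

        def on_global_tick(self): # tick cascaded from the single global counter
            if self.local < LOCAL_MAX:
                self.local += 1
            else:                 # several idle intervals: the line is likely dead
                self.powered_off = True

    line = DecayLine()
    for _ in range(5):
        line.on_global_tick()
    print(line.powered_off)       # True: the line decayed and can be gated off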
In conclusion, the fine-grain strategies have higher flexibility and can have better leakage
reduction within a performance degradation limit compared to coarse-grain strategies. However,
the control logic and circuit-level design may introduce overhead in area consumption and
routing resources. This further complicates the implementation of fine-grain strategies. These
trade-offs need to be considered by designers when choosing a cache customization strategy.
In [33], the authors provide an optimal microarchitecture that minimizes the number of
buffers and the size of each buffer for stencil computation. The details are covered in Chapter 6
and in [33].
Figure 5.8: Associativity-based partitioning for reconfigurable caches. Taken from [113].
For a direct-mapped cache, the authors use the overlapped wide-tag partitioning scheme,
as shown in Figure 5.9. The additional tag bits are used to indicate the partitions. The number
of partitions is limited to powers of two for simpler decoding. The results show that IPC
improvements range from 1.04x to 1.20x for eight media processing benchmarks.
Figure 5.9: Overlapped wide-tag partitioning for reconfigurable caches. Taken from [113].
These designs statically partition the cache and scratchpads without adapting to the run-time cache behavior. Since cache sets are not
uniformly utilized [110], this uniform mapping of SPM blocks onto cache blocks may create hot
cache sets at run time, which will increase the conflict miss rate and degrade the performance.
Figure 5.10 shows the cache set utilization statistics for a hybrid cache. Each column represents
a set in the cache, while each row represents one million cycles of time. A darker point means a
hotter cache set. We see that the cache set utilizations vary a lot for different applications. For a
given application, the utilization still varies over time. A good adaption technique is required for
the development of a hybrid cache.
Figure 5.10: Non-uniformed cache sets utilization in a hybrid cache. Taken from [32].
Two challenges are encountered in an efficient hybrid cache design. The first challenge is how to balance cache set utilization when SPMs are allocated in the cache. The second challenge is how to efficiently find SPM blocks when they are frequently remapped to different
cache blocks. Therefore, hardware support is required. The software only needs to deal with a logically contiguous SPM.
The authors in [32] proposed the adaptive hybrid cache (AH-Cache) to address these challenges. First, the lookup of the SPM location is hidden in the execution (EX) stage of the processor pipeline, and a software interface as clean as that of a non-adaptive hybrid cache is provided. Second, a victim tag buffer, similar to the missing tag [137], is used to assess cache set utilization by sharing the tag array, resulting in no storage overhead. Third, an adaptive mapping scheme is proposed for fast adaptation to the cache behavior without the circular bouncing effect, in which allocated SPM blocks keep bouncing between several hot cache sets and incur energy and performance overheads.
Figure 5.11 shows an example of SPM management in AH-Cache. The system software
is provided with two system APIs to specify the scratchpad base address and size. As shown in
Figure 5.11(b), spm_pos sets the SPM base address register as the address of the first element of
array amplitude, and spm_size sets the SPM size register as the size of the array amplitude and
state. Note that these system APIs do not impact the ISA since they use regular instructions for
register value assignment.
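A minimal usage sketch of these two APIs is shown below, assuming they simply write the SPM base-address and size registers; the exact prototypes are not shown here, so the declarations are hypothetical.

extern void spm_pos(void *base);        /* assumed prototype: set SPM base register */
extern void spm_size(unsigned bytes);   /* assumed prototype: set SPM size register */

#define N 1024                          /* illustrative array length */
float amplitude[N];
float state[N];

void configure_spm(void)
{
    /* Map the region starting at 'amplitude' onto the SPM, covering both
       arrays (this assumes the two arrays are laid out back to back).    */
    spm_pos(&amplitude[0]);
    spm_size(sizeof(amplitude) + sizeof(state));
}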
As shown in Figure 5.11(d)(e), the partition between cache and SPM in AH-Cache is at cache-block granularity. If the requested SPM size is not a multiple of the cache block size, it is rounded up to the next block-sized multiple. The mapping of SPM blocks onto cache blocks is stored in an SPM mapping lookup table (SMLT). The number of entries in the SMLT is the maximum number of cache blocks that can be configured as SPM. Since AH-Cache must keep at least one cache block in each cache set to maintain cache functionality, the maximum SPM size on an M-way, N-set set-associative cache is (M-1)*N blocks. Each SMLT entry contains (1) a valid bit indicating whether this SPM block falls into the actual SPM space, since the requested SPM size may be smaller than the maximum; and (2) a set index and a way index that locate the cache block onto which the SPM block is mapped.
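A possible C view of an SMLT entry and its lookup is sketched below; the field widths and block size are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

/* Sketch of an SMLT entry (layout is ours). For an M-way, N-set cache,
   at most (M-1)*N entries exist.                                        */
typedef struct {
    bool     valid;      /* does this SPM block fall in the requested SPM space? */
    uint16_t set_index;  /* cache set holding this SPM block                     */
    uint8_t  way_index;  /* cache way holding this SPM block                     */
} smlt_entry_t;

#define BLOCK_BYTES 64   /* assumed cache block size */

/* Translate an SPM byte offset into the (set, way) of the backing cache block. */
static bool smlt_lookup(const smlt_entry_t *smlt, uint32_t spm_offset,
                        uint16_t *set, uint8_t *way)
{
    uint32_t block = spm_offset / BLOCK_BYTES;   /* which SPM block           */
    if (!smlt[block].valid)
        return false;                            /* outside the requested SPM */
    *set = smlt[block].set_index;
    *way = smlt[block].way_index;
    return true;
}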
AH-Cache needs an additional step to use the low-order bits of the virtual address to look up the SMLT, which lengthens the pipeline critical path. To solve this problem, inspired by the zero-cycle load idea [3], the address checking and SMLT lookup are performed in parallel with the virtual address calculation of the memory operation in a pipelined architecture, as shown in Figure 5.12.
In AH-Cache, low-demand sets accommodate more SPM blocks than high-demand sets, as shown in Figure 5.11(e). Miss rate alone cannot identify a high-demand cache set: for streaming applications with little locality, or for applications hopelessly thrashing the cache, the miss rate is high yet there is little benefit in giving the set more cache blocks. AH-Cache instead uses a victim tag buffer (VTB) to capture the demand of each set. This is similar to the miss tag introduced in [137], but with no memory overhead. The details of the VTB management can be found in [32].
Figure 5.11: (a) Original code, (b) transformed code for AH-Cache, (c) memory space view of SPM
in AH-Cache, (d) SPM blocks, (e) SPM mapping in AH-cache, (f ) SPM mapping lookup table
(SMLT). Taken from [32].
We call the blocks bouncing around different sets “floating blocks.” AH-Cache uses a float-
ing block holder queue and a specific reinsertion bit table to handle the circular bouncing problem
and perform adaptive block mapping. Readers can refer to [32] for more details.
Overall, the AH-Cache can reduce the cache miss rates by up to 52% and provide around
20% energy-delay product gain over prior work.
Figure 5.12: SPM mapping lookup and access in AH-Cache. Taken from [32].
A buffer in BiC can be identified by a set index and a way index and is allocated at the
cache line granularity. Buffer allocation is performed by allocating contiguous cache lines within
the same cache way across cache sets. This avoids the starvation problem that would arise if a buffer occupied too many ways of the same set. Moreover, distinct buffers must not overlap in cache lines.
Figure 5.14 shows the implementation of BiC. An additional buffer/cache (B/C) sticky bit is added to each cache line: it is set to 1 if the line is used by a buffer and to 0 if the line is used as cache. The replacement logic never selects cache lines whose B/C sticky bit is 1 as replacement candidates. For a buffer operation, the B/C sticky bit can be used to disable tag comparison; instead, the way_sel signal selects the data in the data array. Therefore, the dynamic power of tag comparison is eliminated.
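The effect of the sticky bit on victim selection can be sketched as follows; the structure and the simple replacement policy are illustrative simplifications, not the exact BiC implementation.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     sticky;     /* 1: line holds buffer data, 0: regular cache line */
    bool     valid;
    uint32_t tag;
    /* ... LRU state, data, etc. ... */
} bic_line_t;

/* Pick a victim way in a set, never evicting buffer lines. Returns -1 only
   if every way in the set is pinned as buffer, which the allocation rules
   prevent by spreading a buffer across sets within one way.               */
static int pick_victim(const bic_line_t set[], int num_ways)
{
    for (int w = 0; w < num_ways; w++)
        if (!set[w].sticky && !set[w].valid)
            return w;            /* prefer an invalid, non-buffer way        */
    for (int w = 0; w < num_ways; w++)
        if (!set[w].sticky)
            return w;            /* otherwise any non-buffer way (LRU omitted) */
    return -1;
}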
BiC eliminates the need to build a 215KB dedicated SRAM for accelerators at the cost of only 3% extra area added to the baseline L2 cache, while cache miss rates increase by no more than 0.3%.
Buffer-in-NUCA
The buffer-in-NUCA architecture (BiN) [29] extends the BiC work from allocating buffers in a centralized cache to a non-uniform cache architecture (NUCA) with distributed cache banks. In NUCA there are multiple physically distributed memory banks in the L2 cache or LLC. Figure 5.15(a) shows the architectural overview of BiN, which builds on the accelerator-rich architecture [28] discussed in Section 4.2. ABM stands for the accelerator and BiN manager, a centralized controller that manages the accelerators and the BiN resources. The interactions between the core, ABM, accelerators, and L2 cache banks are described in Figure 5.15(b).
In BiN, the authors aim to solve the following two problems: (1) how to dynamically assign
buffer sizes to accelerators that can best utilize buffers to reduce the off-chip bandwidth demand,
and (2) how to limit the buffer fragmentation during allocation.
In general, the off-chip memory bandwidth demand can be reduced by increasing the sizes of accelerator buffers, as discussed in [42], because larger buffers cover longer data reuse distances. The trade-off between buffer size and bandwidth demand can be depicted as a curve, called the BB-Curve. In BiN, the ABM collects the buffer requests in
a short fixed-time interval and then performs a global allocation for the collected requests. An optimal algorithm that dynamically allocates the buffers is proposed to guarantee optimality within each interval. The details of the algorithm can be found in [29].
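The allocation algorithm itself is given in [29]; the sketch below only conveys the underlying idea of spending capacity where it buys the largest bandwidth reduction, using a simple greedy rule over the collected BB-Curves. The data structures and the greedy rule are ours, not the algorithm of [29].

#include <stddef.h>

/* One point on an accelerator's BB-Curve: using 'size' bytes of buffer yields
   'bandwidth' bytes/s of off-chip demand. Points are assumed to be sorted by
   strictly increasing size and decreasing bandwidth.                          */
typedef struct { size_t size; double bandwidth; } bb_point_t;

typedef struct {
    const bb_point_t *curve;
    int               num_points;
    int               chosen;     /* index of the currently chosen point */
} bb_request_t;

/* Greedy sketch: repeatedly upgrade the request whose next BB-Curve point
   gives the best bandwidth saving per extra byte, until capacity runs out. */
static void allocate_buffers(bb_request_t *req, int n, size_t capacity)
{
    size_t used = 0;
    for (int i = 0; i < n; i++) {          /* start everyone at the smallest point */
        req[i].chosen = 0;
        used += req[i].curve[0].size;
    }
    for (;;) {
        int best = -1; double best_gain = 0.0;
        for (int i = 0; i < n; i++) {
            int c = req[i].chosen;
            if (c + 1 >= req[i].num_points) continue;
            size_t extra = req[i].curve[c + 1].size - req[i].curve[c].size;
            if (used + extra > capacity) continue;
            double gain = (req[i].curve[c].bandwidth - req[i].curve[c + 1].bandwidth)
                          / (double)extra;
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        if (best < 0) break;               /* no affordable upgrade left */
        used += req[best].curve[req[best].chosen + 1].size
              - req[best].curve[req[best].chosen].size;
        req[best].chosen++;
    }
}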
To simplify buffer allocation and the location decoding of buffer accesses, prior work allocates buffers in a physically contiguous fashion. This may introduce fragmentation, especially when many accelerators request buffer resources. In BiN, the authors propose paged buffer allocation, borrowing the idea from virtual memory, to provide flexible buffer allocation at page granularity. The page size can differ from buffer to buffer and adapts to the buffer size required by each accelerator. Figure 5.16 demonstrates an example of buffer allocation for three accelerators. The buffer allocator in ABM first selects nearby L2 banks for allocation. To reduce page fragments, BiN allows the last page of a buffer to be smaller than its other pages; this does not affect the page table lookup. Therefore, the maximum page fragment for any buffer is smaller than the minimum page size.
The two major hardware components supporting BiN are (1) the buffer allocator module in the ABM, and (2) the buffer page table and address generation logic in each accelerator. Figure 5.17 shows the design of the buffer allocator module in the ABM. For each L2 bank, the buffer allocation status of each cache way needs to be recorded; for an L2 cache with N ways, an (N-1)-entry table tracks the allocation status of each bank. At most N-1 ways can be allocated to buffers in order to prevent starvation. The maximum number of buffer requests that can be handled in one fixed-time interval is set to a given number (eight in this example); therefore, eight SRAM tables are reused to record the BB-Curve points.
For each accelerator, a local buffer page table is required to perform address translation, as shown in Figure 5.18. For a 2MB L2 cache with 32 banks and 64-byte cache lines, a 5-bit cache
Figure 5.16: BiN: an example of the paged buffer allocation. Taken from [29].
Figure 5.17: BiN: the buffer allocator module in ABM. Taken from [29].
bank ID and a 10-bit cache block ID are required. Compared to BiC, BiN can further improve
performance and energy by 35% and 29%, respectively.
Figure 5.18: BiN: the block address generation with the page table in an accelerator. Taken from
[29].
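For the configuration above (2MB, 32 banks, 64-byte lines), a buffer access can be translated roughly as in the sketch below; the page-table layout and the page size are assumptions for illustration.

#include <stdint.h>

/* Illustrative buffer-page entry: a 5-bit bank ID and a 10-bit block ID
   identify the first block of the page within its bank.                 */
typedef struct {
    uint8_t  bank_id;    /* 5 bits  */
    uint16_t block_id;   /* 10 bits */
} bin_page_entry_t;

#define LINE_BYTES      64
#define BLOCKS_PER_PAGE 16   /* assumed page size for this particular buffer */

/* Translate a buffer byte offset into (bank, block, byte-in-line). */
static void bin_translate(const bin_page_entry_t *page_table, uint32_t offset,
                          uint8_t *bank, uint16_t *block, uint8_t *byte)
{
    uint32_t line  = offset / LINE_BYTES;
    uint32_t page  = line / BLOCKS_PER_PAGE;    /* index into the page table  */
    uint32_t in_pg = line % BLOCKS_PER_PAGE;    /* block offset within a page */

    *bank  = page_table[page].bank_id;
    *block = (uint16_t)(page_table[page].block_id + in_pg);
    *byte  = (uint8_t)(offset % LINE_BYTES);
}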
Table 5.1 shows a brief comparison of SRAM, STT-RAM, and PRAM technologies. The
exact access time and dynamic power depend on the cache size and the peripheral circuit im-
plementation. In sum, SRAM suffers from high leakage and low density while providing great
endurance, i.e., cell lifetime, while STT-RAM and PRAM provide high density and low leak-
age at the cost of weak endurance. Moreover, STT-RAM outperforms PRAM in terms of en-
durance, access time, and dynamic power, while PRAM has higher density. Based on endurance,
STT-RAM is more suitable for on-chip last-level cache design due to its higher endurance
[14, 16, 17, 47, 76, 121, 131, 132], while PRAM is promising as an alternative for DRAM in the
main memory design due to its higher density [86]. In this section we focus on the discussion of
on-chip memory.
Table 5.1: Comparison among SRAM, STT-RAM, and PRAM. Taken from [16].
In this section we use the term “hybrid cache” to refer to a cache that uses disparate mem-
ory technologies. A hybrid cache can leverage advantages from both SRAM and NVM while
hiding their disadvantages. In general, a hybrid cache provides a larger cache size than that of
the conventional SRAM-based cache by using higher density NVM cells. Moreover, the leakage
consumption is much smaller in a hybrid cache. A hybrid cache can utilize its SRAM cells to hide
the drawbacks of low endurance and high dynamic write energy in NVM cells. With the benefits
of denser and near-zero leakage NVM cells, a hybrid cache is best suited as the LLC.
Hybrid cache architectures were first proposed in [121, 132]. In [132] the authors try to
explore the performance and energy of different types of hybrid cache architectures. The explo-
ration includes disparate NVM technologies, different multilevel configurations, and 2D/3D hy-
brid caches, as shown in Figure 5.19. In this section, we focus on hybrid cache designs similar
to the region-based hybrid cache architecture (RHCA), which considers disparate memory tech-
nologies in the same cache level. Readers who are interested in the other design styles can refer
to [132] for more details.
Figure 5.19: Exploration of different hybrid cache architecture configurations. Taken from [132].
Figure 5.20 shows an example of a hybrid cache design presented in [16]. First, a hybrid cache contains data arrays built with disparate technologies. Second, the tag array and data array are accessed sequentially (i.e., the data array is accessed after the tag array); such serialized tag/data access is already widely adopted in modern large lower-level caches to reduce dynamic energy. Third, the tag array is fully implemented with SRAM cells to avoid the long access latencies that NVM cells would introduce. Fourth, a hybrid cache can be partitioned at cache-way granularity [16, 17, 76, 131, 132]; the NVM region contains more cache ways, i.e., more cache capacity, than the SRAM region, since NVM is denser. Finally, for performance reasons, a hybrid cache with disparate technologies is usually deployed as an L2 cache or LLC, but not as an L1 cache. This is because the read and write latency of an NVM block is usually longer than that of an SRAM block, which would cause significant performance overhead in an L1 cache; in an L2 cache or LLC, however, the access latency of a hybrid cache can be hidden by the SRAM-based L1 cache. For example, when a miss occurs on an STT-RAM line in the LLC, a request is issued to the memory controller to fetch the data back, and the fetched data can be forwarded directly to the upper-level caches (L1 or L2) without waiting for the whole write to the STT-RAM block to complete.
However, the baseline hybrid cache can still suffer from the low endurance and high dy-
namic write energy arising from NVM cells. In this section we will discuss the customization
strategies to leverage the NVM benefits while hiding the drawbacks of NVM cells. Similar to the
Figure 5.20: Hybrid cache with disparate memory technologies. Taken from [16].
customization techniques discussed in Section 5.2, the customization techniques used in a hybrid
cache can be summarized into coarse-grain and fine-grain techniques as well. First, we discussed
the coarse-grain techniques, such as the selective cache ways [1] and DRI i-caches [107] in Sec-
tion 5.2. Similarly, more sophisticated dynamic reconfiguration strategies can be performed at a
coarse-grain level for a hybrid cache to reduce leakage [16]. Second, for the fine-grain techniques,
we discussed the cache decay idea [77] used for SRAM caches in Section 5.2. However, the hybrid
caches with disparate memory technologies require more sophisticated techniques to handle both
the SRAM and NVM cache blocks. For example, techniques such as adaptive block placement
and block migration between SRAM and NVM cache ways are proposed to hide two drawbacks of
NVM blocks: the high dynamic write energy and the low endurance [16, 17, 76, 131, 132]. Note
that the dynamic reconfiguration techniques are orthogonal to the fine-grain block placement and
migration techniques. Therefore, the coarse-grain technique and the fine-grain techniques can be
applied together to optimize performance, energy, and endurance simultaneously.
5.5.1 COARSE-GRAIN CUSTOMIZATION STRATEGIES
Based on the research proposed in [16, 17, 76, 131, 132], leakage consumption still accounts for
a significant percentage of the total energy consumption (> 30%) in a hybrid cache even if STT-
RAM cells are deployed. Therefore, researchers have explored dynamic reconfiguration techniques
for hybrid caches to further reduce leakage. In Section 5.2 we introduced several reconfiguration
techniques for SRAM-based caches, such as selective cache ways [1], Gated-Vdd [107], and cache
decay [77]. In this section we introduce the dynamically reconfigurable hybrid cache (RHC)
[16], which provides effective methods to reconfigure a hybrid cache.
The RHC architecture shown in Figure 5.20 can be dynamically reconfigured at way granularity based on the cache demand. Figure 5.21 illustrates the power-gating design adopted in RHC to perform dynamic reconfiguration. A centralized power management unit (PMU) sends sleep/wakeup signals to power each SRAM or NVM way on or off. The power-gating circuits of each way in the SRAM tag/data arrays are implemented with NMOS sleep transistors to minimize leakage; in this design, the stacking effect of three NMOS transistors from the bitline to GND substantially reduces leakage [107]. In RHC, the SRAM cells in the same cache way are connected to a shared virtual GND, while the virtual GNDs of different cache ways are disconnected. This ensures that the behavior of powered-on cache ways is not influenced by the powering-off process in other ways.
Figure 5.21: PMU and the power-gating design of RHC. Taken from [16].
For the dynamic reconfiguration strategy, the authors addressed the following key questions: (1) how to measure cache demand accurately, (2) how to make power-off decisions without cache thrashing, and (3) how to deal with hybrid data arrays. Figure 5.22 shows the potential hit counter scheme used in [16]. Cache demand is measured by potential hit counters, which count the potential hits in the SRAM or STT-RAM data arrays. A potential hit is a hit that would have occurred if the powered-off cache way had been on; it is detected by comparing the tag of the current access with the tags in the powered-off cache ways. The potential hit idea is similar to the missing tags [32] and victim tags [137]. To handle the hybrid data arrays, RHC provides separate potential hit counters for the SRAM and STT-RAM data arrays, and the power-on and power-off thresholds differ. When the value of a potential hit counter crosses its threshold within a given time period, the whole cache way is powered on or off accordingly.
To alleviate the cache thrashing that can arise from powering off whole cache ways, the authors in [16] proposed an asymmetric power-on/power-off strategy: the cache is powered on faster than it is powered off, which is achieved using the following two rules. First, within a given time period, only one cache way can be powered off, whereas multiple cache ways can be powered on in the next period when the measured cache demand is high. Second, to further reduce thrashing, the power-off action is made slower; for example, it can be triggered only when low cache demand lasts for a given number of consecutive periods, e.g., 10 periods.
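The following C sketch puts the potential-hit counters and the asymmetric rules together for one region (SRAM or STT-RAM). The thresholds, the number of consecutive low-demand periods, and the "turn on all ways" simplification are assumptions, not the exact RHC policy.

#include <stdint.h>

typedef struct {
    uint32_t potential_hits;   /* hits that would have occurred in powered-off ways */
    uint32_t on_threshold;     /* per-region thresholds (SRAM and STT-RAM differ)   */
    uint32_t off_threshold;
    uint32_t low_demand_periods;
    uint32_t ways_on, ways_total;
} rhc_region_t;

#define OFF_CONSECUTIVE_PERIODS 10   /* power off only after sustained low demand */

/* Called at the end of each measurement period for one region. */
static void rhc_end_of_period(rhc_region_t *r)
{
    if (r->potential_hits > r->on_threshold) {
        /* High demand: power on aggressively (possibly several ways). */
        r->ways_on = r->ways_total;          /* simplified: turn on all ways */
        r->low_demand_periods = 0;
    } else if (r->potential_hits < r->off_threshold) {
        /* Low demand: power off conservatively, at most one way, and only
           after demand has stayed low for several consecutive periods.    */
        if (++r->low_demand_periods >= OFF_CONSECUTIVE_PERIODS && r->ways_on > 1) {
            r->ways_on--;
            r->low_demand_periods = 0;
        }
    } else {
        r->low_demand_periods = 0;
    }
    r->potential_hits = 0;                   /* start the next period */
}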
According to [16], the proposed RHC achieves average energy savings of 63%, 48%, and 25% over a non-reconfigurable SRAM-based cache, a non-reconfigurable hybrid cache, and a reconfigurable SRAM-based cache, respectively, while maintaining system performance (at most 4% overhead) for a wide range of workloads.
Figure 5.23: Block migration scheme in the region-based hybrid cache architecture (RHCA). Taken
from [132].
However, the authors in [132] did not distinguish between read and write policies for STT-RAM cache lines. Writes to STT-RAM lines are harmful due to the high dynamic write energy and low endurance of STT-RAM cells. The authors in [76] tried to improve the endurance of a hybrid cache by proposing two management policies: intra-set remapping and inter-set remapping. Figure 5.24 shows the hybrid cache architecture proposed in this work.
The intra-set remapping can be divided into two types of migrations: (1) data migration between SRAM and STT-RAM lines, and (2) data migration among STT-RAM lines in the same set. The former moves write-intensive blocks from STT-RAM lines to SRAM lines, while the latter evens out the write intensity across the STT-RAM lines of a set.
A line saturation counter (LSC) monitors the recent write intensity of each line and is incremented on every write access. When the LSC of an STT-RAM line saturates, the cache controller tries to find a victim with the lowest LSC value among the SRAM lines. Even with these migrations to SRAM lines, the writes to STT-RAM lines remain non-uniform. The authors therefore add a wear-level saturation counter (WSC) to each STT-RAM line to record its accumulated write intensity. When the WSC of a line saturates, the cache controller looks for the line with the minimum WSC in the same set as the migration target; if the difference between the two lines exceeds a threshold, the migration is performed.
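A rough sketch of these two intra-set policies is shown below; counter widths, thresholds, and the migration placeholders are assumptions (see [76] for the actual design).

#include <stdint.h>
#include <stdbool.h>

#define LSC_MAX         15   /* line saturation counter limit       */
#define WSC_MAX         63   /* wear-level saturation counter limit */
#define WSC_DIFF_THRESH 16   /* minimum imbalance before migrating  */

typedef struct {
    bool    is_sram;
    uint8_t lsc;             /* recent write intensity              */
    uint8_t wsc;             /* accumulated wear (STT-RAM lines)    */
} hline_t;

/* Called on a write hit to way 'hit_way' of a set with 'ways' lines. */
static void on_write(hline_t *set, int ways, int hit_way)
{
    hline_t *l = &set[hit_way];
    if (l->lsc < LSC_MAX) l->lsc++;
    if (!l->is_sram && l->wsc < WSC_MAX) l->wsc++;

    if (!l->is_sram && l->lsc == LSC_MAX) {
        /* Write-intensive STT-RAM line: pick the SRAM line with the lowest
           LSC in this set as the swap victim (data movement not modeled). */
        int victim = -1;
        for (int w = 0; w < ways; w++)
            if (set[w].is_sram && (victim < 0 || set[w].lsc < set[victim].lsc))
                victim = w;
        /* migrate(hit_way, victim); */
        l->lsc = 0;
    }
    if (!l->is_sram && l->wsc == WSC_MAX) {
        /* Wear leveling among the STT-RAM lines of the same set. */
        int target = -1;
        for (int w = 0; w < ways; w++)
            if (!set[w].is_sram && w != hit_way &&
                (target < 0 || set[w].wsc < set[target].wsc))
                target = w;
        if (target >= 0 && l->wsc - set[target].wsc > WSC_DIFF_THRESH) {
            /* migrate(hit_way, target); */
        }
    }
}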
Figure 5.24: Pure dynamic migration scheme: intra-set and inter-set migrations. Taken from [76].
The authors further proposed inter-set migration to address write non-uniformity across cache sets. Eight cache sets that differ only in the three most significant bits of their tags form a merge group, and inter-set migration is performed within the same merge group. The STT-RAM saturation counter (TSC) measures the write intensity of the STT-RAM lines in a set, while the SRAM saturation counter (SSC) does the same for the SRAM lines. The merge destination (MD) builds a bidirectional link between a target-set (hot set) and victim-set (cold set) pair. An STT-RAM line can be migrated to an SRAM line in the same
group by using the TSC, SSC, and MD. The details of the data replacement and cache line search algorithms can be found in [76]. Compared to the baseline configuration, the work in [76] achieves a 49x improvement in endurance and more than a 50% energy reduction on the PARSEC benchmarks.
5.9x compared to pure static and pure dynamic schemes, respectively. Furthermore, the system
energy can be reduced by 17% compared to a pure dynamic scheme.
Figure 5.25: An example for illustrating read-range and depth-range. Taken from [131].
should be migrated to STT-RAM lines. Otherwise, the block is dead and should be evicted from
the LLC.
Figure 5.26: The distribution of access pattern for each type of LLC write access. Taken from [131].
Zero-depth-range core-write accesses should remain in their original cache line to avoid read misses and block migrations. Immediate-depth-range core-write accesses are write-intensive accesses with write-burst behavior, so it is preferable to place them in SRAM lines. Distant-depth-range core-write accesses should also remain in the original cache line to minimize migration overhead.
Zero-read-range blocks account for 61.2% of the demand-write blocks. They are known in the literature as "dead-on-arrival" blocks and are never referenced again before being evicted. It is unnecessary to place zero-read-range blocks in the LLC, so these blocks should bypass the cache (assuming a non-inclusive cache). For immediate-read-range and distant-read-range demand-write blocks, the authors suggest placing them in the STT-RAM ways to utilize the large capacity of the STT-RAM portion and to reduce the pressure on the SRAM portion. The reader can refer to [131] for the required architectural support and the policy flowchart for dynamic placement and migration. According to [131], the hybrid LLC outperforms an SRAM-based LLC by 8.0% on average for single-thread workloads and by 20.5% for multicore workloads, and reduces LLC power consumption by 18.9% and 19.3%, respectively.
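The placement rules summarized above can be condensed into a small decision function; the enum names below are ours, and the full policy with migration support is described in [131].

/* Illustrative placement decision for the hybrid LLC described above. */
typedef enum { RANGE_ZERO, RANGE_IMMEDIATE, RANGE_DISTANT } range_t;
typedef enum { PLACE_BYPASS, PLACE_SRAM, PLACE_STTRAM, PLACE_STAY } placement_t;

/* Incoming block written by a demand miss (fill from memory). */
static placement_t place_demand_write(range_t read_range)
{
    if (read_range == RANGE_ZERO)
        return PLACE_BYPASS;   /* dead-on-arrival: skip the LLC (non-inclusive) */
    return PLACE_STTRAM;       /* use the large STT-RAM portion for reused data */
}

/* Block written back by the core (core-write). */
static placement_t place_core_write(range_t depth_range)
{
    if (depth_range == RANGE_IMMEDIATE)
        return PLACE_SRAM;     /* write bursts go to SRAM lines                 */
    return PLACE_STAY;         /* zero/distant depth-range: stay in place       */
}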
CHAPTER 6
Interconnect Customization
6.1 INTRODUCTION
Having discussed the customization of computing units and memory systems, we now turn to the communication infrastructure between them. This is a key component, since interconnect latency and bandwidth directly determine whether the computing units and the memory system can achieve their designed peak performance. A suboptimal interconnect design can leave the customized computing units and memory system underutilized, wasting chip area and energy. Fortunately, the interconnect also has high potential for improvement through customization. We observe that the major difference among applications usually lies in their data access patterns, while the computing functions they need are quite similar (such as addition and multiplication). Different data access patterns lead to different optimizations of the interconnect topologies and routing policies.
An interconnect infrastructure can be customized in three aspects. First, the interconnect topology can be optimized at chip design time based on analysis of the target applications. Second, the routing policy can be optimized during execution based on compile-time and runtime information about the running applications. Third, driven by emerging technologies, the underlying interconnect medium can be optimized by matching its physical properties to the application properties. The following sections of this chapter offer a detailed discussion.
Figure 6.1: (a) Shortcut routing channels overlaid with conventional mesh interconnects, (b) adaptive
shortcuts for 1Hotspot trace. Taken from [10].
Figure 6.1(a) demonstrates a conventional mesh topology with a set of overlaid shortcut
routing channels. Here, this work constrains the number of shortcut-enabled routers to half of
the total routers (50 routers). In this figure the routers appear to have a small diagonal connection
to the set of shortcut routing channels, which is represented as a single thick line winding through
the mesh.
To reconfigure the set of shortcuts dynamically for each application (or per workload),
application communication statistics can be introduced into the cost equation. Intuitively, the
goal is to accelerate communication on paths that are most frequently used by the application,
operating under the assumption that these paths are most critical to application performance. To
identify these paths, this work needs to rely on information that can be readily collected by event
counters in the network. The metric used to guide the selection is inter-router communication frequency: from a given router X to another router Y, the communication frequency is measured as the number of messages sent from X to Y. To determine the maximum benefit of this approach, the work assumes that this profile is available for the applications to be run. The target of the shortcut selection algorithm is then to minimize Σ_{x,y} F_{x,y} · W_{x,y}, where F_{x,y} is the total number of messages sent from router x to router y, and W_{x,y} is the length of the shortest path between x and y. This can be solved using the heuristic approach in [10]. An example of application-specific shortcut selection is shown for the 1Hotspot trace in Figure 6.1(b). Experimental results in [10] show that adaptive shortcut insertion can enable a 65% NoC power savings while maintaining comparable performance.
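The heuristic of [10] is not reproduced here; the fragment below only states the objective being minimized, with a note on how a greedy selector could use it. The matrix sizes assume the 10 x 10 mesh implied by the 50-of-100 shortcut budget above.

#include <stdint.h>

#define NR 100   /* number of routers in the mesh */

/* Traffic-weighted hop count: F[x][y] is the message count from router x to
   router y, and W[x][y] is the shortest-path length given the currently
   selected shortcuts (computing W, e.g., with BFS, is omitted here).        */
static uint64_t weighted_cost(const uint32_t F[NR][NR], const uint32_t W[NR][NR])
{
    uint64_t cost = 0;
    for (int x = 0; x < NR; x++)
        for (int y = 0; y < NR; y++)
            cost += (uint64_t)F[x][y] * W[x][y];
    return cost;
}

/* A greedy selector would evaluate each candidate shortcut, recompute W with
   that shortcut added, keep the candidate with the lowest weighted_cost(),
   and repeat until the shortcut budget is exhausted.                         */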
Figure 6.2: Difference between memory sharing among general-purpose CPU cores and among
accelerators. (a) A simple interconnect for CPU cores, (b) demanding interconnects for accelerators.
In contrast, accelerators may run >100x faster than CPUs [67], and each accelerator needs to perform several loads/stores every clock cycle. The interconnects between accelerators and shared memories need to be high-speed and high-bandwidth, and they must contain many conflict-free data channels to prevent accelerators from being starved for data, as shown in Fig. 6.2(b). An accelerator needs at least n ports if it wants to fetch n data elements every cycle, and these n ports need to be connected to n memory banks via n conflict-free data paths in the interconnect.
Another problem with conventional interconnect designs is that arbitration among requesting accelerators is performed on every data access; NoC designs even perform multiple arbitrations for a single access as the packet traverses multiple routers. Since accelerators implement computation efficiently but cannot reduce the number of necessary data accesses, the extra energy consumed by interconnect arbitration on each access becomes a major concern. Per-access arbitration also leads to large and unpredictable access latencies. Since accelerators aggressively schedule many operations (computation and data accesses) into every time slot [36], any late response to a data access will stall many operations and lead to significant performance loss. Many accelerator designs prefer small, fixed latencies to preserve their scheduled performance, and because of this many accelerators [89] have to give up memory sharing.
The work in [40] organizes the interconnect as a configurable crossbar, as shown in Fig. 6.3. The crossbar is configured only upon accelerator launch. Each memory bank is configured to connect to only one accelerator (only one of the switches attached to that memory bank is turned on). When accelerators are running, they access their connected memory banks just like private memories, and no further arbitration is performed on their data paths. Fig. 6.3 shows that acc 1, which contains three data ports (i.e., demands three data accesses every cycle), is configured to connect to memory banks 1–3. Note that there can be other
Figure 6.3: Interconnects designed as a configurable crossbar between accelerators and shared memories to keep data access cost small. Taken from [40]. X: switch turned on.
design choices for the configurable network between accelerators and shared memories. However, the crossbar design places the least logic (only a single switch) on the path from an accelerator port to a memory bank and thus helps minimize data access latency. It achieves an access latency of only one clock cycle in an FPGA prototype, so accesses to shared memories have exactly the same timing as accesses to the accelerators' private memories.
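A launch-time configuration step in this spirit can be sketched as follows; the sparse-switch matrix, the sizes, and the first-fit rule are illustrative assumptions rather than the allocator of [40].

#include <stdbool.h>

#define NPORTS 16
#define NBANKS 32

static bool bank_taken[NBANKS];

/* present[p][b] marks whether a switch physically exists between accelerator
   port p and memory bank b; on[p][b] is the configuration chosen at launch.
   Connect each port of a launching accelerator to one free bank reachable
   through an existing switch. Returns false if routing fails.               */
static bool configure_accelerator(const bool present[NPORTS][NBANKS],
                                  bool on[NPORTS][NBANKS],
                                  const int ports[], int num_ports)
{
    for (int i = 0; i < num_ports; i++) {
        int p = ports[i];
        bool routed = false;
        for (int b = 0; b < NBANKS && !routed; b++) {
            if (present[p][b] && !bank_taken[b]) {
                on[p][b] = true;          /* turn on exactly one switch       */
                bank_taken[b] = true;     /* each bank serves a single port   */
                routed = true;
            }
        }
        if (!routed)
            return false;                 /* a real allocator would backtrack */
    }
    return true;
}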
The primary goal of the crossbar design is that, for any set of accelerators that are powered on in an accelerator-rich platform and require t memory banks in total, a feasible configuration of the crossbar can be found to route the t data ports of the accelerators to t memory banks. A full crossbar that connects every accelerator port to every memory bank provides a trivial solution, but it is extremely area-consuming. One would like to find a sparsely populated crossbar (e.g., Fig. 6.3) that still achieves high routability, i.e., the ability to route any allowed combination of active accelerator ports to distinct memory banks. The crossbar customization in [40] exploits the following observations:
• An accelerator contains multiple data ports, and the interconnect design should treat ports belonging to the same accelerator differently from ports belonging to different accelerators. A two-step optimization can therefore be used instead of optimizing all the ports of all accelerators globally in a single procedure: many unnecessary connections associated with each individual accelerator are identified and removed before all the accelerators are optimized together.
• Due to the power budget in dark silicon, only a limited number of accelerators will be powered on at a time in an accelerator-rich platform. The interconnect can be partially populated to just fit the data access demand allowed by the power budget.
• Accelerators are heterogeneous. Accelerators belonging to the same application domain have a higher chance of being turned on or off together. This kind of information can be used to customize the interconnect design, removing potential data path conflicts and using fewer transistors to achieve the same efficiency.
Experimental results in [40] show that the crossbar customization for accelerators can achieve
15x area savings and 10x performance improvement compared to conventional NoCs that were
optimized for CPU cores.
Figure 6.4: Examples to illustrate ACES routing optimization. Taken from [37]. (Continues.)
south-last routing, the use of the shortcut is severely restricted. Figure 6.4(f ) shows the acyclic
CDG generated by ACES, where only the channel dependency edges that are never or rarely used
are removed. It should be noted that reachability should be guaranteed while breaking cycles in
the ASCDG; i.e., for each pair of communicating nodes of the application, there is at least one
directed path from the source node to the destination node.
Experimental results in [37] show that ACES can either reduce the NoC power by 11%–35% while maintaining approximately the same network performance, or improve the network performance by 10%–36% with a slight NoC power overhead (5%–7%) on a wide range of examples.
Figure 6.4: (Continued.) Examples to illustrate ACES routing optimization. Taken from [37].
void denoise2D( float A[768][1024],
                float B[768][1024] )
{
  for ( int i = 1; i < 767; i++ )
    for ( int j = 1; j < 1023; j++ )
      B[i][j] =
        pow(A[i][j] - A[i][j-1], 2) +
        pow(A[i][j] - A[i][j+1], 2) +
        pow(A[i][j] - A[i-1][j], 2) +
        pow(A[i][j] - A[i+1][j], 2);
}
Listing 6.1: Example C code of a typical stencil computation (5-point stencil window in the kernel
‘DENOISE’ in medical imaging [38]).
Listing 6.1 shows an example stencil computation in the kernel ‘DENOISE’ in medical imaging
[38].
Figure 6.5: Iteration domain of the example stencil computation in Listing 6.1. Taken from [33].
Its grid is a 768 × 1024 rectangle, and its stencil window contains five points, as shown in Fig. 6.5. Five data elements need to be accessed in each iteration, and many data elements are accessed repeatedly across iterations. For example, A[2][2] is accessed five times, when (i, j) is (1, 2), (2, 1), (2, 2), (2, 3), or (3, 2). This leads to high on-chip memory port contention and off-chip traffic, especially when the stencil window is large (e.g., after loop fusion of stencil applications for computation reduction, as proposed in [97]). Therefore, during the hardware development of a stencil application, a large portion of the engineering effort is spent on data reuse and memory partitioning optimization.
Figure 6.6: The example circuit structure of the memory system generated for array A in the stencil
computation of Listing 6.1. Taken from [33].
The work in [33] uses a chain-structured interconnect, as illustrated in the example in Fig. 6.6, which is generated for the stencil computation in Listing 6.1. Suppose the stencil window contains n points (n = 5 in the example of Listing 6.1). The memory system then contains n - 1 data reuse FIFOs as well as n data path splitters and n data filters, connected as shown in Fig. 6.6. The data reuse FIFOs provide the same storage as conventional data reuse buffers, while the data path splitters and filters act as memory controllers and data interconnects. The n - 1 buffers and n routers are the theoretical lower bound on the module count needed to satisfy n data accesses in the stencil window every clock cycle. After dataflow synthesis, the data required by each data access port of the computation kernel is provided by a FIFO, which simultaneously receives data from its preceding FIFO. Interconnect contention is completely eliminated. Experimental results show 25–66% area savings while maintaining the same network performance.
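To make the data-reuse idea concrete in software, the sketch below emulates the chain for the n = 5 case of Listing 6.1: the streamed input passes through a sliding window whose depth (2 × 1024 + 1 elements) equals the reuse distance between A[i-1][j] and A[i+1][j], so every element of A is read from the array exactly once. This is only an illustration of the dataflow, not the circuit generated by [33].

#include <math.h>

#define H 768
#define W 1024

void denoise2D_stream(float A[H][W], float B[H][W])
{
    static float win[2 * W + 1];   /* total reuse-buffer storage */
    int filled = 0;

    for (int t = 0; t < H * W; t++) {
        /* Shift the window and push the next streamed element A[t/W][t%W]
           (hardware uses FIFOs instead of an explicit shift loop).        */
        for (int k = 2 * W; k > 0; k--)
            win[k] = win[k - 1];
        win[0] = A[t / W][t % W];
        if (filled < 2 * W + 1) { filled++; continue; }   /* fill the window first */

        /* The newest element is A[i+1][j]; the window center is A[i][j]. */
        int i = t / W - 1, j = t % W;
        if (i < 1 || i > H - 2 || j < 1 || j > W - 2) continue;

        float c = win[W];                      /* A[i][j]   */
        B[i][j] = powf(c - win[W + 1], 2) +    /* A[i][j-1] */
                  powf(c - win[W - 1], 2) +    /* A[i][j+1] */
                  powf(c - win[2 * W], 2) +    /* A[i-1][j] */
                  powf(c - win[0],     2);     /* A[i+1][j] */
    }
}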
Figure 6.7: Illustration of NVM-based switches taken from [41]. (a) Hysteresis characteristic of a
two-terminal NVM device, (b) function as a routing switch in place of a pass transistor and its six-
transistor SRAM cell [13, 59, 60, 101, 122].
Driven by emerging non-volatile memories (NVMs), conventional routing switches can also be replaced by NVM-based switches, as shown in Figure 6.7(b). This use is enabled by a common property of these emerging NVMs: the connection between the two terminals of such a device can be programmed to turn on or off, as shown in Figure 6.7(a). By applying specific programming voltages, the resistance between the two terminals can be switched between the "on" state and the "off" state, and thanks to nonvolatility the programmed resistance is retained both under operating voltages and without a supply voltage. This use of NVM saves the area of the SRAM cells and the pass transistors that build conventional routing switches. An example provided in [41] uses an interconnect architecture based on resistive RAMs (RRAMs).
In [41], the programmable interconnects are composed of three disjoint structures:
• Transistor-less programmable interconnects
• A programming grid
• An on-demand buffering architecture
This composition takes the routing buffers out of the programmable interconnects and puts them in a separate structure. The transistor-less programmable interconnects correspond to the SRAM-based configuration bits and MUX-based routing switches of conventional programmable interconnects. They are built from RRAMs and metal wires alone and are placed over the CMOS transistors, as shown in Fig. 6.8.
Figure 6.8: Switch blocks and connection blocks in transistor-less programmable interconnects are
placed over logic blocks in the same die according to existing RRAM fabrication structures [65, 117,
124, 129]. Taken from [41].
In the transistor-less programmable interconnects, RRAMs and metal wires are stacked over the CMOS transistors. The layout is very different from that of conventional programmable interconnects and is subject to very tight space constraints. This work provides an RRAM-friendly layout design that satisfies these constraints and at the same time fits into the footprint of the CMOS transistors below. The programming transistors in the programming grid are heavily shared among RRAMs via the transistor-less programmable interconnects. The on-demand buffering architecture provides the opportunity to allocate buffers in the interconnect during the implementation phase, allowing application information to be used for better buffer allocation. Note that the feasibility of the disjoint structures, and of all the improvements mentioned above, rests on the use of RRAMs as programmable switches. Simulation results show that the RRAM-based programmable interconnects achieve a 96% smaller footprint, 55% higher performance, and 79% lower power consumption.
CHAPTER 7
Concluding Remarks
Customizable SoC processors have tremendous promise for meeting the future scaling require-
ments of application writers while adapting to the scaling challenges anticipated in future tech-
nology nodes. Customization can benefit all levels of CSoC design, including cores, memory,
and interconnect.
Customized cores and compute engines can dramatically reduce power dissipation by re-
ducing or even eliminating unnecessary non-computational sub-components. Preliminary work
in this area has demonstrated cores that can be customized with specialized instructions or com-
ponents. Further work has shown that accelerator-rich designs can outperform general-purpose
processor designs, but can still retain some adaptability and flexibility to enhance design longevity
through either compile-time reconfiguration or runtime composition. Future work in this area
may explore migrating computation closer to memory, building customized cores and accelera-
tors in or near main memory or disk. Such embedded accelerators can help not only to reduce communication latency and power, but also to customize components external to the SoC.
Customized on-chip memory includes the specialization of memory resources to an application's memory requirements, from reducing energy-hungry associative ways to careful software-directed placement of key memory blocks to reduce cache misses. Preliminary work has begun to
expand upon the conventional memory resources available to the CSoC with emerging memory
technology that features various power and performance trade-offs. Future work may provide a
set of different memory resources that can be flexibly mapped to an application depending on
resource requirements. For example, memory with low leakage power can be used for higher ca-
pacity read-only memory.
Interconnect customization allows adaptation of the on-chip network to the communi-
cation patterns of a particular application. While preliminary work has demonstrated gains in
topology customization and routing customization, there is still tremendous potential in alterna-
tive interconnects that can provide high-bandwidth, extremely low latency communication across
large CSoCs. Such interconnect has potential to accelerate critical communication primitives in
parallel code, such as locks, mutexes, and barriers. With faster synchronization and communica-
tion, we may find that more applications can efficiently leverage the growing number of cores and
accelerators on CSoCs. Moreover, with growing computational power from customized cores,
alternative interconnects may hold the key to supplying sufficient bandwidth to feed these high
performing cores.
Most of these techniques were evaluated using simulators. To further validate the CSoC
concept, recent work in [15, 18] prototypes a real ARA on the Xilinx Zynq-7000 SoC [133] with
four medical imaging accelerators. The Zynq SoC is composed of a dual-core ARM Cortex-A9 and an FPGA fabric, which can be used to realize the accelerators, interconnects, and on-chip shared memories of a CSoC. Table 7.1 shows the performance and power results of the denoise application on the ARA prototype and on state-of-the-art processors [18]. The prototype achieves 7.44x and 2.22x better energy efficiency than the state-of-the-art Xeon and ARM processors, respectively. As reported in [85], the power gap between FPGA and ASIC is around 12X; if the ARA were implemented as an ASIC, a 24 to 84X energy saving over Xeon processors would be expected. This is an encouraging step toward further validating the effectiveness of a CSoC.
Table 7.1: Performance and power comparison of (1) ARM Cortex-A9, (2) Intel Xeon (Haswell), and (3) ARA
Bibliography
[1] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In Proceed-
ings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, MI-
CRO 32, pages 248–259, Washington, DC, USA, 1999. IEEE Computer Society. DOI:
10.1109/MICRO.1999.809463. 41, 43, 44, 60, 61
[2] K. Atasu, O. Mencer, W. Luk, C. Ozturan, and G. Dundar. Fast custom instruction
identification by convex subgraph enumeration. In Application-Specific Systems, Architec-
tures and Processors, 2008. ASAP 2008. International Conference on, pages 1–6. IEEE, 2008.
DOI: 10.1109/ASAP.2008.4580145. 20, 23
[3] T. M. Austin and G. S. Sohi. Zero-cycle loads: Microarchitecture support for reducing
load latency. In Proceedings of the 28th Annual International Symposium on Microarchitecture,
MICRO 28, pages 82–92, Los Alamitos, CA, USA, 1995. IEEE Computer Society Press.
DOI: 10.1109/MICRO.1995.476815. 52
[4] A. Baniasadi and A. Moshovos. Instruction flow-based front-end throttling for
power-aware high-performance processors. In Proceedings of the 2001 international
symposium on Low power electronics and design, pages 16–21. ACM, 2001. DOI:
10.1145/383082.383088. 16
[5] C. F. Batten. Simplified vector-thread architectures for flexible and efficient data-parallel ac-
celerators. PhD thesis, Massachusetts Institute of Technology, 2010. 20
[6] C. J. Beckmann and C. D. Polychronopoulos. Fast barrier synchronization hardware. In
Proceedings of the 1990 ACM/IEEE conference on Supercomputing, pages 180–189. IEEE
Computer Society Press, 1990. DOI: 10.1109/SUPERC.1990.130019. 19
[7] S. Borkar and A. A. Chien. The future of microprocessors. Communications of the ACM,
54(5):67–77, May 2011. DOI: 10.1145/1941487.1941507. 1, 2
[8] A. Buyuktosunoglu, T. Karkhanis, D. H. Albonesi, and P. Bose. Energy effi-
cient co-adaptive instruction fetch and issue. In Computer Architecture, 2003. Pro-
ceedings. 30th Annual International Symposium on, pages 147–156. IEEE, 2003. DOI:
10.1145/871656.859636. 16
[9] C. Bienia et al. The PARSEC benchmark suite: Characterization and architec-
tural implications. Technical Report TR-811-08, Princeton University, 2008. DOI:
10.1145/1454115.1454128. 2
[10] M.-C. F. Chang, J. Cong, A. Kaplan, C. Liu, M. Naik, J. Premkumar, G. Reinman,
E. Socher, and S.-W. Tam. Power reduction of CMP communication networks via RF-
interconnects. 2008 41st IEEE/ACM International Symposium on Microarchitecture, pages
376–387, Nov. 2008. DOI: 10.1109/MICRO.2008.4771806. 69, 71, 72, 75
[11] M. F. Chang, J. Cong, A. Kaplan, M. Naik, G. Reinman, E. Socher, and S.-W. Tam.
CMP network-on-chip overlaid with multi-band RF-interconnect. 2008 IEEE 14th Inter-
national Symposium on High Performance Computer Architecture, pages 191–202, Feb. 2008.
DOI: 10.1109/HPCA.2008.4658639. 81
[12] C. Chen, W. S. Lee, R. Parsa, S. Chong, J. Provine, J. Watt, R. T. Howe, H. P. Wong, and
S. Mitra. Nano-Electro-Mechanical Relays for FPGA Routing : Experimental Demon-
stration and a Design Technique. In Design, Automation and Test in Europe Conference and
Exhibition (DATE), 2012. DOI: 10.1109/DATE.2012.6176703. 42
[14] Y. Chen, W.-F. Wong, H. Li, and C.-K. Koh. Processor caches built using multi-level
spin-transfer torque ram cells. In Low Power Electronics and Design (ISLPED) 2011 Inter-
national Symposium on, pages 73–78, Aug 2011. DOI: 10.1109/ISLPED.2011.5993610.
58
[15] Y.-T. Chen, J. Cong, M. Ghodrat, M. Huang, C. Liu, B. Xiao, and Y. Zou.
Accelerator-rich cmps: From concept to real hardware. In Computer Design
(ICCD), 2013 IEEE 31st International Conference on, pages 169–176, Oct 2013. DOI:
10.1109/ICCD.2013.6657039. 86
[16] Y.-T. Chen, J. Cong, H. Huang, B. Liu, C. Liu, M. Potkonjak, and G. Reinman. Dynam-
ically reconfigurable hybrid cache: An energy-efficient last-level cache design. In Proceed-
ings of the Conference on Design, Automation and Test in Europe, DATE ’12, pages 45–50,
San Jose, CA, USA, 2012. EDA Consortium. DOI: 10.1109/DATE.2012.6176431. 42,
58, 59, 60, 61, 62, 65
[17] Y.-T. Chen, J. Cong, H. Huang, C. Liu, R. Prabhakar, and G. Reinman. Static and
dynamic co-optimizations for blocks mapping in hybrid caches. In Proceedings of the 2012
ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED ’12,
pages 237–242, New York, NY, USA, 2012. ACM. DOI: 10.1145/2333660.2333717. 42,
58, 59, 60, 61, 63, 65, 66
[18] Y.-T. Chen, J. Cong, and B. Xiao. Aracompiler: a prototyping flow and evaluation frame-
work for accelerator-rich architectures. In Performance Analysis of Systems and Software
(ISPASS), 2015 IEEE International Symposium on, pages 157–158, March 2015. DOI:
10.1109/ISPASS.2015.7095795. 86
[19] E. Chi, A. M. Salem, R. I. Bahar, and R. Weiss. Combining software and hardware
monitoring for improved power and performance tuning. In Interaction Between Compilers
and Computer Architectures, 2003. INTERACT-7 2003. Proceedings. Seventh Workshop on,
pages 57–64. IEEE, 2003. DOI: 10.1109/INTERA.2003.1192356. 16
[21] Y. K. Choi, J. Cong, and D. Wu. Fpga implementation of em algorithm for 3d ct re-
construction. In Proceedings of the 2014 IEEE 22Nd International Symposium on Field-
Programmable Custom Computing Machines, FCCM ’14, pages 157–160, Washington, DC,
USA, 2014. IEEE Computer Society. DOI: 10.1109/FCCM.2014.48. 46
[23] E. S. Chung, J. D. Davis, and J. Lee. Linqits: Big data on little clients. In Proceedings of
the 40th Annual International Symposium on Computer Architecture, pages 261–272. ACM,
2013. DOI: 10.1145/2485922.2485945. 30
[24] N. T. Clark, H. Zhong, and S. A. Mahlke. Automated custom instruction generation for
domain-specific processor acceleration. Computers, IEEE Transactions on, 54(10):1258–
1270, 2005. DOI: 10.1109/TC.2005.156. 20, 23
[25] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-specific instruction generation for
configurable processor architectures. In Proceedings of the 2004 ACM/SIGDA 12th inter-
national symposium on Field programmable gate arrays, pages 183–189. ACM, 2004. DOI:
10.1145/968280.968307. 20, 23
[29] J. Cong, M. A. Ghodrat, M. Gill, C. Liu, and G. Reinman. Bin: A buffer-in-nuca scheme
for accelerator-rich cmps. In Proceedings of the 2012 ACM/IEEE International Symposium
on Low Power Electronics and Design, ISLPED ’12, pages 225–230, New York, NY, USA,
2012. ACM. DOI: 10.1145/2333660.2333715. 42, 46, 49, 55, 56, 57, 58
[30] J. Cong, M. Gill, Y. Hao, G. Reinman, and B. Yuan. On-chip interconnection network for
accelerator-rich architectures. In Proceedings of the 52th Annual Design Automation Con-
ference, DAC ’15, New York, NY, USA, 2015. ACM. DOI: 10.1145/2744769.2744879.
70
[33] J. Cong, P. Li, B. Xiao, and P. Zhang. An Optimal Microarchitecture for Stencil Com-
putation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers. In
Proceedings of the 51st Annual Design Automation Conference (DAC '14), pages 1–6, 2014.
DOI: 10.1145/2593069.2593090. 46, 49, 75, 79, 80
[34] J. Cong, P. Li, B. Xiao, and P. Zhang. An Optimal Microarchitecture for Stencil
Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers.
Technical report, Computer Science Department, UCLA, TR140009, 2014. DOI:
10.1145/2593069.2593090. 78
[40] J. Cong and B. Xiao. Optimization of Interconnects Between Accelerators and Shared
Memories in Dark Silicon. In International Conference on Computer-Aided Design (IC-
CAD), 2013. DOI: 10.1109/ICCAD.2013.6691182. 73, 74, 75
[41] J. Cong and B. Xiao. FPGA-RPI: A Novel FPGA Architecture With RRAM-Based
Programmable Interconnects. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 22(4):864–877, Apr. 2014. DOI: 10.1109/TVLSI.2013.2259512. 82, 83
[42] J. Cong, P. Zhang, and Y. Zou. Combined loop transformation and hierarchy allocation
for data reuse optimization. In Proceedings of the International Conference on Computer-
Aided Design, ICCAD ’11, pages 185–192, Piscataway, NJ, USA, 2011. IEEE Press. DOI:
10.1109/ICCAD.2011.6105324. 55
[43] H. Cook, K. Asanović, and D. A. Patterson. Virtual local stores: Enabling software-
managed memory hierarchies in mainstream computing environments. Technical Report
UCB/EECS-2009-131, EECS Department, University of California, Berkeley, Sep 2009.
42
[44] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub) graph isomorphism algorithm
for matching large graphs. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
26(10):1367–1372, 2004. DOI: 10.1109/TPAMI.2004.75. 23
[59] P.-E. Gaillardon, M. Haykel Ben-Jamaa, G. Betti Beneventi, F. Clermidy, and L. Perniola.
Emerging memory technologies for reconfigurable routing in FPGA architecture. In Inter-
national Conference on Electronics, Circuits and Systems (ICECS), pages 62–65, Dec. 2010.
DOI: 10.1109/ICECS.2010.5724454. 82
[64] P. Greenhalgh. Big. little processing with arm cortex-a15 & cortex-a7. ARM White Paper,
2011. 17
[65] W. Guan, S. Long, Q. Liu, M. Liu, and W. Wang. Nonpolar Nonvolatile Resistive
Switching in Cu Doped ZrO2 . IEEE Electron Device Letters, 29(5):434–437, May 2008.
DOI: 10.1109/LED.2008.919602. 83
[66] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August. Bundled execution of recur-
ring traces for energy-efficient general purpose processing. In Proceedings of the 44th An-
nual IEEE/ACM International Symposium on Microarchitecture, pages 12–23. ACM, 2011.
DOI: 10.1145/2155620.2155623. 19, 21, 22, 23
[67] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richard-
son, C. Kozyrakis, and M. Horowitz. Understanding sources of inefficiency in general-
purpose chips. International Symposium on Computer Architecture, page 37, 2010. DOI:
10.1145/1816038.1815968. 2, 73
[68] T. Hayes, O. Palomar, O. Unsal, A. Cristal, and M. Valero. Vector extensions for
decision support dbms acceleration. In Microarchitecture (MICRO), 2012 45th Annual
IEEE/ACM International Symposium on, pages 166–176. IEEE, 2012. DOI: 10.1109/MI-
CRO.2012.24. 20
[69] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational joins
on graphics processors. In Proceedings of the 2008 ACM SIGMOD international conference
on Management of data, pages 511–524. ACM, 2008. DOI: 10.1145/1376616.1376670.
87
[70] T. Henretty, J. Holewinski, N. Sedaghati, L.-N. Pouchet, A. Rountev, and P. Sadayappan.
Stencil Domain Specific Language (SDSL) User Guide 0.2.1 draft. Technical report, OSU
TR OSU-CISRC-4/13-TR09, 2013. 78
[71] H. P. Hofstee. Power efficient processor architecture and the cell processor. In High-
Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on,
pages 258–262. IEEE, 2005. DOI: 10.1109/HPCA.2005.26. 17
[72] IBM. Power8 coherent accelerator processor interface (CAPI). 86
[73] Intel. Intel QuickAssist acceleration technology for embedded systems. 86
[74] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: accommodating soft-
ware diversity in chip multiprocessors. In ACM SIGARCH Computer Architecture News,
volume 35, pages 186–197. ACM, 2007. DOI: 10.1145/1273440.1250686. 17, 18
[75] I. Issenin, E. Brockmeyer, M. Miranda, and N. Dutt. Drdu: A data reuse analysis tech-
nique for efficient scratch-pad memory management. ACM Trans. Des. Autom. Electron.
Syst., 12(2), Apr. 2007. DOI: 10.1145/1230800.1230807. 41
[76] A. Jadidi, M. Arjomand, and H. Sarbazi-Azad. High-endurance and performance-
efficient design of hybrid cache architectures through adaptive line replacement. In Low
Power Electronics and Design (ISLPED) 2011 International Symposium on, pages 79–84,
Aug 2011. DOI: 10.1109/ISLPED.2011.5993611. 42, 58, 59, 60, 61, 63, 64, 65
[77] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to
reduce cache leakage power. In Proceedings of the 28th Annual International Symposium
on Computer Architecture, ISCA ’01, pages 240–251, New York, NY, USA, 2001. ACM.
DOI: 10.1109/ISCA.2001.937453. 41, 43, 45, 46, 47, 60, 61
[78] S. Kaxiras and M. Martonosi. Computer Architecture Techniques for Power-
Efficiency. Morgan and Claypool Publishers, 1st edition, 2008. DOI:
10.2200/S00119ED1V01Y200805CAC004. 43
[79] Y. Kim, G.-S. Byun, A. Tang, C.-P. Jou, H.-H. Hsieh, G. Reinman, J. Cong, and
M. Chang. An 8gb/s/pin 4pj/b/pin single-t-line dual (base+rf ) band simultaneous bidirec-
tional mobile memory i/o interface with inter-channel interference suppression. In Solid-
State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pages
50–52. IEEE, 2012. DOI: 10.1109/ISSCC.2012.6176874. 87
[80] T. Kluter, P. Brisk, P. Ienne, and E. Charbon. Way stealing: Cache-assisted auto-
matic instruction set extensions. In Proceedings of the 46th Annual Design Automa-
tion Conference, DAC ’09, pages 31–36, New York, NY, USA, 2009. ACM. DOI:
10.1145/1629911.1629923. 42
[81] M. Koester, M. Porrmann, and H. Kalte. Task placement for heterogeneous reconfig-
urable architectures. In Field-Programmable Technology, 2005. Proceedings. 2005 IEEE
International Conference on, pages 43–50. IEEE, 2005. 32
[83] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha. Express virtual channels: Towards the
ideal interconnection fabric. In Proceedings of the 34th Annual International Symposium
on Computer Architecture, ISCA ’07, pages 150–161, New York, NY, USA, 2007. ACM.
DOI: 10.1145/1273440.1250681. 70
[85] I. Kuon and J. Rose. Measuring the Gap Between FPGAs and ASICs. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 26(2):203–215, Feb. 2007.
DOI: 10.1109/TCAD.2006.884574. 86
[86] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase change memory as a
scalable dram alternative. In Proceedings of the 36th Annual International Symposium on
Computer Architecture, ISCA ’09, pages 2–13, New York, NY, USA, 2009. ACM. DOI:
10.1145/1555815.1555758. 58
[87] P. Li, Y. Wang, P. Zhang, G. Luo, T. Wang, and J. Cong. Memory partitioning
and scheduling co-optimization in behavioral synthesis. In International Conference on
Computer-Aided Design, pages 488–495, 2012. DOI: 10.1145/2429384.2429484. 78
[88] M. Lipson. Guiding, modulating, and emitting light on Silicon-challenges and op-
portunities. Journal of Lightwave Technology, 23(12):4222–4238, Dec. 2005. DOI:
10.1109/JLT.2005.858225. 81
[89] M. J. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store: A shared
memory framework for accelerator-based systems. ACM Trans. Archit. Code Optim.,
8(4):48:1–48:22, Jan. 2012. DOI: 10.1145/2086696.2086727. 3, 41, 42, 46, 47, 48, 73
[90] J. D. C. Maia, G. A. Urquiza Carvalho, C. P. Mangueira Jr, S. R. Santana, L. A. F.
Cabral, and G. B. Rocha. Gpu linear algebra libraries and gpgpu programming for accel-
erating mopac semiempirical quantum chemistry calculations. Journal of Chemical Theory
and Computation, 8(9):3072–3081, 2012. DOI: 10.1021/ct3004645. 87
[91] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, and B. Hutchings. A reconfigurable
arithmetic array for multimedia applications. In International Symposium on FPGAs, pages
135–143, 1999. DOI: 10.1145/296399.296444. 3
[92] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. Exploiting loop-level
parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In
Computers and Digital Techniques, IEE Proceedings-, volume 150, pages 255–61. IET, 2003.
DOI: 10.1049/ip-cdt:20030833. 30
[93] A. Meyerson and B. Tagiku. Approximation, Randomization, and Combinatorial Optimiza-
tion. Algorithms and Techniques, volume 5687 of Lecture Notes in Computer Science. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2009. 69
[94] E. Mirsky and A. Dehon. MATRIX: a reconfigurable computing architecture with config-
urable instruction distribution and deployable resources. In IEEE Symposium on FPGAs for
Custom Computing Machines, pages 157–166, 1996. DOI: 10.1109/FPGA.1996.564808.
3
[95] R. K. Montoye, E. Hokenek, and S. L. Runyon. Design of the ibm risc system/6000
floating-point execution unit. IBM Journal of Research and Development, 34(1):59–70, 1990.
DOI: 10.1147/rd.341.0059. 19, 20
[96] C. A. Moritz, M. I. Frank, and S. Amarasinghe. Flexcache: A framework for flexible
compiler generated data caching, 2001. DOI: 10.1007/3-540-44570-6_9. 42
[97] A. A. Nacci, V. Rana, F. Bruschi, D. Sciuto, I. Beretta, and D. Atienza. A high-level
synthesis flow for the implementation of iterative stencil loop algorithms on FPGA devices.
In Design Automation Conference, page 1, 2013. DOI: 10.1145/2463209.2488797. 79
[98] U. Nawathe, M. Hassan, L. Warriner, K. Yen, B. Upputuri, D. Greenhill, A. Kumar,
and H. Park. An 8-core, 64-thread, 64-bit, power efficient sparc soc (niagara 2). ISSCC,
http://www.opensparc.net/pubs/preszo/07/n2isscc.pdf, 2007. DOI:
10.1145/1231996.1232000. 26
[99] U. Ogras and R. Marculescu. Energy- and Performance-Driven NoC Communication
Architecture Synthesis Using a Decomposition Approach. In Design, Automation and Test
in Europe, number 9097, pages 352–357, 2005. DOI: 10.1109/DATE.2005.137. 69, 70
[100] U. Ogras and R. Marculescu. “It’s a small world after all”: NoC performance optimization
via long-range link insertion. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 14(7):693–706, July 2006. DOI: 10.1109/TVLSI.2006.878263. 69, 76
[101] S. Onkaraiah, P.-e. Gaillardon, M. Reyboz, F. Clermidy, J.-m. Portal, M. Bocquet,
and C. Muller. Using OxRRAM memories for improving communications of recon-
figurable FPGA architectures. In International Symposium on Nanoscale Architectures
(NANOARCH), pages 65–69, June 2011. DOI: 10.1109/NANOARCH.2011.5941485.
82
[102] J. Ouyang, S. Lin, Z. Hou, P. Wang, Y. Wang, and G. Sun. Active ssd design for energy-
efficiency improvement of web-scale data analysis. In Proceedings of the International Sym-
posium on Low Power Electronics and Design, pages 286–291. IEEE Press, 2013. DOI:
10.1109/ISLPED.2013.6629310. 87
[103] M. Palesi, R. Holsmark, S. Kumar, and V. Catania. Application Specific Routing Al-
gorithms for Networks on Chip. IEEE Transactions on Parallel and Distributed Systems,
20(3):316–330, Mar. 2009. DOI: 10.1109/TPDS.2008.106. 75, 76
[104] H. Park, Y. Park, and S. Mahlke. Polymorphic pipeline array: a flexible multicore accel-
erator with virtualized execution for mobile multimedia applications. In Proceedings of the
42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 370–380.
ACM, 2009. DOI: 10.1145/1669112.1669160. 30, 32
[105] D. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Har-
vey, P. Harvey, H. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masub-
uchi, M. Pham, J. Pille, S. Posluszny, M. Riley, D. Stasiak, M. Suzuoki, O. Takahashi,
J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa. Overview of the architecture, circuit
design, and physical implementation of a first-generation cell processor. Solid-State Cir-
cuits, IEEE Journal of, 41(1):179–196, Jan 2006. DOI: 10.1109/JSSC.2005.859896. 39,
42
[106] A. Pinto, L. Carloni, and A. Sangiovanni-Vincentelli. Efficient synthesis of networks
on chip. In Proceedings 21st International Conference on Computer Design, pages 146–150,
2003. DOI: 10.1109/ICCD.2003.1240887. 69, 70
[107] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-vdd: A circuit
technique to reduce leakage in deep-submicron cache memories. In Proceedings of the 2000
International Symposium on Low Power Electronics and Design, ISLPED ’00, pages 90–95,
New York, NY, USA, 2000. ACM. DOI: 10.1145/344166.344526. 41, 43, 45, 60, 61
[110] M. K. Qureshi, D. Thompson, and Y. N. Patt. The v-way cache: Demand based associa-
tivity via global replacement. In Proceedings of the 32Nd Annual International Symposium
on Computer Architecture, ISCA ’05, pages 544–555, Washington, DC, USA, 2005. IEEE
Computer Society. DOI: 10.1145/1080695.1070015. 51
[113] P. Ranganathan, S. Adve, and N. P. Jouppi. Reconfigurable caches and their application
to media processing. In Proceedings of the 27th Annual International Symposium on Com-
puter Architecture, ISCA ’00, pages 214–224, New York, NY, USA, 2000. ACM. DOI:
10.1145/342001.339685. 42, 49, 50, 51
[115] R. Rodrigues, A. Annamalai, I. Koren, S. Kundu, and O. Khan. Performance per watt
benefits of dynamic core morphing in asymmetric multicores. In Parallel Architectures and
Compilation Techniques (PACT), 2011 International Conference on, pages 121–130. IEEE,
2011. DOI: 10.1109/PACT.2011.18. 17
[116] P. Schaumont and I. Verbauwhede. Domain-specific codesign for embedded security.
Computer, 36(4):68–74, Apr. 2003. DOI: 10.1109/MC.2003.1193231. 2
[117] S.-S. Sheu, P.-C. Chiang, W.-P. Lin, H.-Y. Lee, P.-S. Chen, T.-Y. Wu, F. T. Chen, K.-L.
Su, M.-J. Kao, and K.-H. Cheng. A 5ns Fast Write Multi-Level Non-Volatile 1 K bits
RRAM Memory with Advance Write Scheme. In VLSI Circuits, Symposium on, pages
82–83, 2009. 83
[118] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho.
Morphosys: an integrated reconfigurable system for data-parallel and computation-
intensive applications. Computers, IEEE Transactions on, 49(5):465–481, 2000. DOI:
10.1109/12.859540. 3, 32, 33
[119] A. Solomatnikov, A. Firoozshahian, W. Qadeer, O. Shacham, K. Kelley, Z. Asgar,
M. Wachs, R. Hameed, and M. Horowitz. Chip multi-processor generator. In Proceed-
ings of the 44th annual conference on Design automation - DAC ’07, page 262, 2007. DOI:
10.1145/1278480.1278544. 2
[120] D. Starobinski, M. Karpovsky, and L. Zakrevski. Application of network calculus
to general topologies using turn-prohibition. IEEE/ACM Transactions on Networking,
11(3):411–421, June 2003. DOI: 10.1109/TNET.2003.813040. 75
[121] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel architecture of the 3d stacked
mram l2 cache for cmps. In High Performance Computer Architecture, 2009. HPCA 2009.
IEEE 15th International Symposium on, pages 239–249, Feb 2009. DOI:
10.1109/HPCA.2009.4798259. 42, 58, 59
[122] S. Tanachutiwat, M. Liu, and W. Wang. FPGA Based on Integration of CMOS and
RRAM. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(11):2023–
2032, Nov. 2011. DOI: 10.1109/TVLSI.2010.2063444. 82
[123] D. Tarjan, M. Boyer, and K. Skadron. Federation: Repurposing scalar cores for out-of-
order instruction issue. In Proceedings of the 45th annual Design Automation Conference,
pages 772–775. ACM, 2008. DOI: 10.1145/1391469.1391666. 17
[124] K. Tsunoda, K. Kinoshita, H. Noshiro, Y. Yamazaki, T. Iizuka, Y. Ito, A. Takahashi,
A. Okano, Y. Sato, T. Fukano, M. Aoki, and Y. Sugiyama. Low Power and High Speed
Switching of Ti-doped NiO ReRAM under the Unipolar Voltage Source of less than
3V. In International Electron Devices Meeting (IEDM), pages 767–770, Dec. 2007. DOI:
10.1109/IEDM.2007.4419060. 83
[125] O. S. Unsal, C. M. Krishna, and C. A. Moritz. Cool-fetch: Compiler-enabled power-
aware fetch throttling. Computer Architecture Letters, 1(1):5–5, 2002. DOI:
10.1109/LCA.2002.3. 16
[126] D. Vantrease, N. Binkert, R. Schreiber, and M. H. Lipasti. Light speed arbitra-
tion and flow control for nanophotonic interconnects. Proceedings of the 42nd Annual
IEEE/ACM International Symposium on Microarchitecture - Micro-42, page 304, 2009.
DOI: 10.1145/1669112.1669152. 69
[129] C.-H. Wang, Y.-H. Tsai, K.-C. Lin, M.-F. Chang, Y.-C. King, C.-J. Lin, S.-S. Sheu,
Y.-S. Chen, H.-Y. Lee, F. T. Chen, and M.-J. Tsai. Three-Dimensional 4F2 ReRAM
Cell with CMOS Logic Compatible Process. In IEDM Technical Digest, pages 664–667,
2010. DOI: 10.1109/IEDM.2010.5703446. 83
[130] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong. Memory partitioning for multidimen-
sional arrays in high-level synthesis. In Design Automation Conference, page 1, 2013. DOI:
10.1145/2463209.2488748. 78
[131] Z. Wang, D. Jimenez, C. Xu, G. Sun, and Y. Xie. Adaptive placement and migration pol-
icy for an stt-ram-based hybrid cache. In High Performance Computer Architecture (HPCA),
2014 IEEE 20th International Symposium on, pages 13–24, Feb 2014. DOI:
10.1109/HPCA.2014.6835933. 42, 58, 59, 60, 61, 63, 66, 67
[132] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie. Hybrid cache architec-
ture with disparate memory technologies. In Proceedings of the 36th Annual International
Symposium on Computer Architecture, ISCA ’09, pages 34–45, New York, NY, USA, 2009.
ACM. DOI: 10.1145/1555815.1555761. 42, 43, 58, 59, 60, 61, 63
[134] S.-H. Yang, B. Falsafi, M. D. Powell, and T. N. Vijaykumar. Exploiting choice in re-
sizable cache design to optimize deep-submicron processor energy-delay. In Proceedings
of the 8th International Symposium on High-Performance Computer Architecture, HPCA ’02,
pages 151–, Washington, DC, USA, 2002. IEEE Computer Society. DOI:
10.1109/HPCA.2002.995706. 43
[135] Y. Ye, S. Borkar, and V. De. A new technique for standby leakage reduction in high-
performance circuits. In VLSI Circuits, 1998. Digest of Technical Papers. 1998 Symposium
on, pages 40–41, June 1998. DOI: 10.1109/VLSIC.1998.687996. 44
[136] P. Yu and T. Mitra. Scalable custom instructions identification for instruction-set
extensible processors. In Proceedings of the 2004 international conference on Compil-
ers, architecture, and synthesis for embedded systems, pages 69–78. ACM, 2004. DOI:
10.1145/1023833.1023844. 19, 20, 23
[137] M. Zhang and K. Asanović. Fine-grain cam-tag cache resizing using miss tags. In Pro-
ceedings of the 2002 International Symposium on Low Power Electronics and Design, ISLPED
’02, pages 130–135, New York, NY, USA, 2002. ACM. DOI: 10.1109/LPE.2002.146725.
52, 62, 65
Authors’ Biographies
YU-TING CHEN
Yu-Ting Chen is a Ph.D. candidate in the Computer Science Department at the University of
California, Los Angeles. He received a B.S. degree in computer science, a B.A. degree in eco-
nomics, and an M.S. degree in computer science from National Tsing Hua University, Hsinchu,
Taiwan, R.O.C., in 2005 and 2007, respectively. He worked at TSMC as a summer intern in
2005 and at Intel Labs as a summer intern in 2013. His research interests include computer ar-
chitecture, cluster computing, and bioinformatics in DNA sequencing technologies.
JASON CONG
Jason Cong received his B.S. degree in computer science from Peking University in 1985, and
his M.S. and Ph.D. degrees in computer science from the University of Illinois at Urbana–
Champaign in 1987 and 1990, respectively. Currently, he is a Chancellor’s Professor in the Com-
puter Science Department, with a joint appointment from the Electrical Engineering Depart-
ment, at the University of California, Los Angeles. He is the director of the Center for Domain-
Specific Computing (CDSC), co-director of the UCLA/Peking University Joint Research Insti-
tute in Science and Engineering, and director of the VLSI Architecture, Synthesis, and Technol-
ogy (VAST) Laboratory. He also served as the chair of the UCLA Computer Science Department
from 2005–2008. Dr. Cong’s research interests include synthesis of VLSI circuits and systems,
programmable systems, novel computer architectures, nano-systems, and highly scalable algo-
rithms. He has over 400 publications in these areas, including 10 best paper awards, two 10-Year
Most Influential Paper Awards (from ICCAD’14 and ASPDAC’15), and the 2011 ACM/IEEE
A. Richard Newton Technical Impact Award in Electronic Design Automation. He was elected
an IEEE Fellow in 2000 and an ACM Fellow in 2008. He is the recipient of the 2010 IEEE
Circuits and System (CAS) Society Technical Achievement Award “for seminal contributions
to electronic design automation, especially in FPGA synthesis, VLSI interconnect optimization,
and physical design automation.”
MICHAEL GILL
Michael Gill received a B.S. degree in computer science from California Polytechnic University,
Pomona, and an M.S. and a Ph.D. in computer science from the University of California, Los
Angeles. His research is primarily focused on high-performance architectures, and the interaction
between these architectures and compilers, run time systems, and operating systems.
GLENN REINMAN
Glenn Reinman received his B.S. in computer science and engineering from the Massachusetts
Institute of Technology in 1996. He earned his M.S. and Ph.D. in computer science from the
University of California, San Diego, in 1999 and 2001, respectively. He is currently a professor
in the Computer Science Department at the University of California, Los Angeles.
BINGJUN XIAO
Bingjun Xiao received a B.S. degree in microelectronics from Peking University, Beijing, China,
in 2010. He received an M.S. degree and a Ph.D. degree in electrical engineering from UCLA in
2012 and 2015, respectively. His research interests include machine learning, cluster computing,
and data flow optimization.