FPGA-Accelerated Simulation of Computer Systems

Synthesis Lectures on Computer Architecture
Series ISSN: 1935-3235
Series Editor: Mark D. Hill, University of Wisconsin

Hari Angepat, The University of Texas and Microsoft
Derek Chiou, Microsoft and The University of Texas
Eric S. Chung, Microsoft
James C. Hoe, Carnegie Mellon University

To date, the most common form of computer system simulator is software running on standard computers. One promising approach to improving simulation performance is to apply hardware, specifically reconfigurable hardware in the form of field programmable gate arrays (FPGAs). This manuscript describes various approaches to using FPGAs to accelerate software-implemented simulation of computer systems and selected simulators that incorporate those techniques. More precisely, we describe a simulation architecture taxonomy that incorporates a simulation architecture specifically designed for FPGA-accelerated simulation, survey the state-of-the-art in FPGA-accelerated simulation, and describe in detail selected instances of the described techniques.

ABOUT SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis Digital Library of Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of important research and development topics, published quickly, in digital and print formats. For more information visit www.morganclaypool.com

MORGAN & CLAYPOOL PUBLISHERS
ISBN: 978-1-62705-213-9
Mark D. Hill, Series Editor
FPGA-Accelerated Simulation
of Computer Systems
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.
Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus and Ravi Rajwar
2006
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00586ED1V01Y201407CAC029
Lecture #29
Series Editors: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Print 1935-3235 Electronic 1935-3243
FPGA-Accelerated Simulation
of Computer Systems
Hari Angepat
The University of Texas and Microsoft
Derek Chiou
Microsoft and The University of Texas
Eric S. Chung
Microsoft
James C. Hoe
Carnegie Mellon University
Morgan & Claypool Publishers
ABSTRACT
To date, the most common form of computer system simulator is software running on standard
computers. One promising approach to improving simulation performance is to apply hardware,
specifically reconfigurable hardware in the form of field programmable gate arrays (FPGAs).
This manuscript describes various approaches to using FPGAs to accelerate software-implemented
simulation of computer systems and selected simulators that incorporate those techniques.
More precisely, we describe a simulation architecture taxonomy that incorporates a simulation
architecture specifically designed for FPGA-accelerated simulation, survey the state-of-the-art
in FPGA-accelerated simulation, and describe in detail selected instances of the described
techniques.
KEYWORDS
simulation, cycle-accurate, functional, timing, FPGA accelerated
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Host vs. Target Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Why are Fast, Accurate Simulators of Computer Targets Needed? . . . . . . . . . . . 2
1.4 Harnessing FPGAs for Simulation Not Prototyping . . . . . . . . . . . . . . . . . . . . . . 3
1.5 The Rest of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Simulator Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Uses of Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Desired Simulator Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Performance Simulation Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Simulator Design Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Simulator Partitioning for Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 Spatial Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.2 Temporal Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.3 Functional/Timing Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.4 Hybrid Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Functional/Timing Simulation Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6.1 Monolithic Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6.2 Timing-Directed Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.3 Functional-First Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.4 Timing-First Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6.5 Speculative Functional-First . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Simulation Events and Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.1 Centralized Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7.2 Decentralized Event Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Accelerating Computer System Simulators with FPGAs . . . . . . . . . . . . . . . . . . 25
3.1 Exploiting Target Partitioning on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Accelerating Traditional Simulator Architectures with FPGAs . . . . . . . . . . . . . 26
3.2.1 Accelerating Monolithic Simulators with FPGAs . . . . . . . . . . . . . . . . . . 26
3.2.2 Accelerating Timing-Directed Simulators with FPGAs . . . . . . . . . . . . . 26
3.2.3 Accelerating Functional-First Simulators with FPGAs . . . . . . . . . . . . . . 27
3.2.4 Accelerating Timing-First Simulators with FPGAs . . . . . . . . . . . . . . . . 27
3.2.5 Accelerating Speculative Functional-First with FPGAs . . . . . . . . . . . . . 27
3.2.6 Accelerating Combined Simulator Architectures with FPGAs . . . . . . . . 28
3.3 Managing Time Through Simulation Event Synchronization in an
FPGA-Accelerated Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Centralized Barrier Synchronization in an FPGA-Accelerated
Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Decentralized Barrier Synchronization in an FPGA-Accelerated
Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 FPGA Simulator Programmability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Case Study: FPGA-Accelerated Simulation Technologies (FAST) . . . . . . . . . . 30
4 Simulation Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Full-System and Multiprocessor Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Hierarchical Simulation with Transplanting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.1 Hierarchical Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.2 Transplanting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.3 Hierarchical Transplanting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Virtualized Simulation of Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.1 Time-multiplexed Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.2 Virtualizing Memory Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Case Study: the ProtoFlex Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.1 ProtoFlex Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.2 BlueSPARC Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.4 Hierarchical Simulation and Virtualization in a Performance
Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Authors’ Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Preface
The slow speed of accurate computer system simulators is a significant bottleneck in the study
of computer architectures and the systems built around them. Since computers use considerable
hardware parallelism to obtain the performance they achieve, it is difficult for simulators to track
the performance of such systems without utilizing hardware parallelism as well. The significant
capabilities and the flexibility of FPGAs make them an ideal vehicle for accelerating and address-
ing the challenges of computer system simulation.
This book presents the current state-of-the-art in the use of Field Programmable Gate
Arrays (FPGAs) to improve the speed of accurate computer system simulators. The described
techniques and the related work are the result of active research in this area by the authors and
others over the last ten years.
Chapters 3 and 4 present solutions that address the major challenges in building fast and
accurate FPGA-accelerated simulators without undue implementation effort and cost. Chap-
ter 3 describes how FPGA acceleration can be applied to different simulator architectures, while
Chapter 4 describes how virtualization, via simulator multithreading and transplanting simula-
tion activity back to software, can be used to further extend the performance and capabilities of
FPGA-accelerated simulators.
Chapter 2 describes simulator architectures that apply not only to FPGA-accelerated sim-
ulators, but also to pure software simulators and pure FPGA simulators. Four of the simulator
architectures (monolithic, functional-first, timing-directed, and timing-first) are well known in
the literature, while the fifth, speculative functional-first (Section 2.6.5), was an architecture that
we developed specifically to be accelerated by FPGAs. In addition to a survey of related work in
Chapter 5, the Appendix provides a brief introduction to FPGA technologies.
Acknowledgments
The authors would like to thank all of those who have contributed to the FPGA-Accelerated Sim-
ulation Technologies (FAST) project and the ProtoFlex FPGA-accelerated simulation project.
The FAST project started in 2005 at The University of Texas at Austin and continues today.
In addition to the authors (Hari Angepat and Derek Chiou), Dam Sunwoo, Nikhil A. Patil, Yi
Yuan, Gene Y. Wu, Lauren Guckert, Mike Thomson, Joonsoo Kim, William H. Reinhart, D.
Eric Johnson, Zheng Xu, and Jebediah Keefe have contributed to the development of FAST
concepts and artifacts. Funding for FAST was provided in part by grants from the Department
of Energy (DE-FG02-05ER25686), the National Science Foundation (CSR-1111766, CCF-
0917158, CNS-0747438, CNS-0615352, and CCF-0541416) and gifts of funds, hardware, and
software from AMD, Bluespec, IBM, Intel, Freescale, Synopsys, VMWare, and Xilinx.
The ProtoFlex project was research carried out at Carnegie Mellon University between 2005
and 2011. In addition to the authors (Eric S. Chung and James C. Hoe), Michael K. Papamichael,
Eriko Nurvitadhi, Babak Falsafi and Ken Mai contributed to the development of the ProtoFlex
concept and simulator. Funding for the ProtoFlex project was provided in part by grants from the
C2S2 Marco Center, NSF CCF-0811702, NSF CNS-0509356, and SUN Microsystems. The
ProtoFlex project also received software and hardware donations from Xilinx and Bluespec.
The FAST and ProtoFlex projects were part of the RAMP (Research Accelerator for Mul-
tiple Processors) multi-university community effort to investigate the application of FPGAs to
the emulation and simulation of complete, large-scale multiprocessor/multicore systems. Other
participants of the RAMP project included Krste Asanovic (UC Berkeley), Christos Kozyrakis
(Stanford University), Shih-Lien Lu (Intel), Mark Oskin (University of Washington), Dave A.
Patterson (UC Berkeley), and John Wawrzynek (UC Berkeley). Several different styles of FPGA-
accelerated simulation came out of the RAMP effort and are discussed in Chapter 5.
CHAPTER 1
Introduction
1.1 OVERVIEW
All scientific and engineering disciplines are supported by models that enable the experimenta-
tion, prediction, and testing of hypotheses and designs. Modeling can be performed in a variety of
ways, ranging from mathematical equations (for example, using differential equations to model
circuits), to miniatures (for example, using a scale model of a car in a wind tunnel to estimate
drag), and to software-implemented algorithms (for example, using financial models to predict
stock prices). Simulation, which is often considered a form of modeling itself, applies a model
repetitively to predict behavior over simulated time. For example, one can simulate climate change
over one hundred years, or simulate the lifecycle of a star.
The technique of simulation is commonly used to guide the design of future computer
systems, or to better understand the behaviors of existing ones. Computer system simulation
enables the prediction of many different behaviors without the explicit need to build the system
itself. The most commonly predicted behavior of a computer system is performance, often in
terms of the number of cycles it takes to execute a sequence of instructions. Two other commonly
predicted behaviors are energy/power consumption and reliability in the presence of faults.
Computer system simulators are typically implemented in software due to the need for flex-
ibility and visibility. As computers have grown faster over time, their ability to simulate natural
phenomena, such as the weather, has grown faster as well. Though our understanding may im-
prove, the inherent complexity of the lifecycle of a star, overall, does not increase over time. The
inherent complexity of computer systems, however, continues to advance rapidly over time—in
fact, at a rate faster than the growth in computer performance. Thus, the performance of computer
simulation is ever decreasing relative to the next-generation computer being simulated. This phe-
nomenon is known as the simulation wall. Paradoxically, as computer designers have improved
virtually all other fields’ simulation capabilities, they have simultaneously reduced the ability to
simulate their next-generation designs.
CHAPTER 2
Simulator Background
2.1 USES OF COMPUTER SIMULATION
In computer systems research and design, simulation studies are used when the target system does
not exist or when a given design is being studied. A simulator implements target behavior in a
manner that is simpler than building the target—otherwise, one would just construct the target.
Even when the target system is available, a simulator offers increased controllability, flexibility,
and observability. Simulators, however, have notable disadvantages compared to the target, such
as being slower, not accurately modeling the target, or not predicting all the behaviors of the
target.
There are two major forms of computer system simulators: functional simulators and per-
formance simulators. Functional simulation predicts the behavior of the target with little or no
concern for timing accuracy. Functional behaviors include the execution of CPU instructions and
the activities of peripheral devices such as a network card sending packets or a DMA transfer from
disk. The modeling of micro-architectural state that affects performance, such as cache tags, is not
considered a part of functional simulation. Functional simulation is used for a variety of purposes,
ranging from: (1) prototyping software development before a machine is built, (2) providing pre-
liminary performance modeling and tuning, (3) collecting traces for performance modeling, to
(4) generating a reference execution to check that a performance simulator executes correctly.
A performance simulator predicts the performance and timing behaviors of the target. Per-
formance simulation is used for a variety of purposes, ranging from evaluating micro-architectural
proposals, to studying performance bottlenecks in existing systems, to comparing machines to
decide which to procure, to enabling the tuning of compilers. Since there are many timing-
dependent behaviors that a practitioner may want to predict, such as resiliency or power con-
sumption, “performance” simulation refers to the simulation of any or all of the non-functional
aspects of the target. The output of a performance simulator can assume a variety of forms—the
most common example is the aggregate Instructions Per Cycle (IPC) of the target running a
specific workload. There are, however, other possibilities such as a cycle-by-cycle accounting of
the number of instructions fetched or issued, or even a cycle-by-cycle accounting of processor
resources consumed by every in-flight instruction (e.g., re-order buffer).
Performance can be simulated in a variety of ways. For example, for a simple microcoded
machine with a fixed latency for every instruction, the performance can be accurately predicted
using a closed-form mathematical function that accepts the count of each instruction type as an argu-
ment. For nearly all other machines, however, a cycle-accurate simulator must model the target’s
micro-architecture in great detail. The standard way to achieve this is to build a model of every
performance-impacting component, connect such component models together as they would be
in the target system, and simulate those models in tandem.
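For the microcoded-machine example, one simple form such a closed-form function could take (an assumed illustration, not a formula given in the text) is

    total cycles = Σ_i (N_i × L_i)

where N_i is the dynamic count of instructions of type i and L_i is that instruction type's fixed latency.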
Functional simulation is typically much faster than performance simulation. Fast functional
simulators run at roughly the same speed as the host machine, but can incur large slowdowns when
instrumentation is introduced. Accurate performance simulators are generally at least four orders
of magnitude slower than their targets. Note that this distinction does not apply to sampling-
based simulation methodologies that only simulate the computer system accurately for a subset of
time or instructions and extrapolate results from the samples. Sampling-based simulators still
require an accurate performance simulator.
Figure 2.2: Simulator architectures. Functional model components are in light green, timing model
components in dark green, monolithic components in gray.
2.6.2 TIMING-DIRECTED SIMULATORS
Timing-directed simulators were developed to address the design complexity of monolithic sim-
ulators and to promote code reuse.
Timing-directed simulators are factored into a timing model and a functional model. The
functional model performs the actual tasks associated with fetching an instruction, decoding an
instruction, renaming registers, and actually executing the instruction. When the timing model
determines that some functionality should be performed, it calls the appropriate functional model
to perform that function and to return the result (if the timing model depends on that result to
proceed). Thus, the functionality is performed at the correct target time, as it would be in a monolithic
simulator. However, the functional model is implemented separately and can be reused along with
multiple timing models.
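As a rough illustration of this division of labor, the sketch below uses hypothetical C++ interfaces (FunctionalModel, TimingModel, and their methods are illustrative names, not code from any simulator discussed here): the timing model advances target time and calls into a separate functional model only when it needs the architectural result.

#include <cstdint>
#include <iostream>

// Functional model: knows how to actually execute instructions,
// but knows nothing about target timing. (Toy stand-in behavior.)
struct FunctionalModel {
    uint64_t pc = 0;
    int64_t regs[32] = {0};

    int64_t executeNext() {                             // fetch+decode+execute one instruction
        int64_t opcode = static_cast<int64_t>(pc % 7);  // toy "decode"
        regs[1] += opcode;                              // toy "execute"
        pc += 4;
        return regs[1];
    }
};

// Timing model: models only latencies and occupancy; it invokes the
// functional model when the architectural result is needed.
struct TimingModel {
    FunctionalModel& fm;
    uint64_t targetCycle = 0;
    uint64_t aluLatency;                      // fixed-latency ALU: model the delay only

    TimingModel(FunctionalModel& f, uint64_t lat) : fm(f), aluLatency(lat) {}

    void stepOneInstruction() {
        int64_t result = fm.executeNext();    // blocking call into the functional model
        targetCycle += aluLatency;            // the timing model adds only delay
        std::cout << "cycle " << targetCycle << ": result = " << result << "\n";
    }
};

int main() {
    FunctionalModel fm;
    TimingModel tm(fm, /*aluLatency=*/1);
    for (int i = 0; i < 5; ++i) tm.stepOneInstruction();
}

The same FunctionalModel could be paired with a different TimingModel (say, one modeling a multi-cycle ALU) without changing the functional code, which is the reuse benefit described above.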
At a high level, a timing model only needs to model activity that impacts timing and,
therefore, does not need to model activity that only impacts functionality. For example, given a
fixed latency integer ALU, one only needs to model its latency in the timing model, rather than
modeling the functionality of the ALU. One simple way to model latency is to delay the output
by the desired latency. Another example is the fact that simulated caches do not require actual
data. A third example is that instructions do not need to be decoded (since they have already
been decoded by the functional model). Thus, timing-directed timing models (and many timing
models in general) appear to be aggressively stripped down targets, with only the performance
skeleton remaining.
To be accurate, the timing model must capture a tremendous amount of tightly connected,
parallel activity. In fact, this is the reason why fast microprocessors must be implemented in hard-
ware. If there were a way to efficiently simulate an aggressive microprocessor target on a multi-
processor host, one could likely use those techniques to make a faster processor. Thus, as the
complexity of the target computer system grows, the timing model gets progressively slower. The
timing model is the bottleneck for both a functional-first simulator (described later in this section)
and a timing-directed simulator.
The Intel Asim [17] simulator is a timing-directed simulator that utilizes a functional model
to perform decode, execute, memory operations, kill, and commit. PTLSim [51] is a timing-
directed simulator that has a functional model that performs instruction operations, but does not
actually update state, which is left to the timing model. The M5 “execute-in-execute” simulator is
a timing-directed simulator that functionally performs the entire instruction when the instruction
is executed in the timing model.
Depending on (i) the level of accuracy desired, (ii) the target, and (iii) the decision as to how
functionality is partitioned, the functional model of a timing-directed simulator often reflects the
target system at least to some degree. For example, an Asim functional model provides infinite
register renaming to enable simulation of targets with register renaming. Thus, it is possible that
a target might require a specialized functional model to accommodate it.
The timing model and the functional model in a timing-directed simulator are very tightly
coupled with bi-directional communication occurring several times per target cycle. As a result,
exploiting timing-directed partitioning to parallelize simulators is unlikely to result in speedups.
For example, when simulating a target with an idealized pipeline with one instruction committed
per cycle (IPC=1), there will be an average of one set of blocking calls between the timing model
and functional model every target cycle. The blocking calls sequentialize the computation, limiting
parallelism. Thus, if 100 Million Instructions Per Second (MIPS) of simulation performance is
desired, assuming a minimal one call into the functional model per instruction, an interaction
occurs every 10ns. The communication latency between any CPU and any off-chip component
by itself, not counting any time to perform the functional model, will be significantly longer
than 10ns. Thus, it is not surprising that we are not aware of any software-hosted simulators
parallelized on timing-directed boundaries. Instead, this partitioning is intended purely for reuse
and complexity mitigation.
Simulating target time and events is another aspect of simulator organization. Software is not
naturally parallel but hardware is. Software must execute each concurrent operation sequentially,
but in the correct order to correctly read and update state. One way to simulate concurrent op-
erations is using simulator events. An event is a piece of code that models a particular operation
that might execute concurrently with other events in the target. Each event is stamped with a
time that indicates the target time when it will execute. As events are created or recycled, they are
stored in an event wheel or queue. Events can be created for a future time. Events are dequeued
and executed from the event queue in target time order.
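A minimal sketch of such an event-driven loop is shown below, using a priority queue ordered by target time (all names and the toy workload are illustrative only):

#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// An event: a piece of work stamped with the target time at which it runs.
struct Event {
    uint64_t targetTime;
    std::function<void()> action;
};

// Order the queue so that the event with the earliest target time pops first.
struct LaterEvent {
    bool operator()(const Event& a, const Event& b) const {
        return a.targetTime > b.targetTime;
    }
};

int main() {
    std::priority_queue<Event, std::vector<Event>, LaterEvent> eventQueue;
    uint64_t now = 0;

    // An event may enqueue further events at future target times, e.g., the
    // next pipeline stage of the same instruction one target cycle later.
    std::function<void(uint64_t, int)> scheduleStage = [&](uint64_t t, int stage) {
        eventQueue.push({t, [&, t, stage] {
            std::cout << "t=" << t << ": simulate stage " << stage << "\n";
            if (stage < 7) scheduleStage(t + 1, stage + 1);
        }});
    };
    scheduleStage(0, 0);

    // Dequeue and execute events in target-time order; target time jumps
    // directly to the next scheduled event, skipping idle target cycles.
    while (!eventQueue.empty()) {
        Event e = eventQueue.top();
        eventQueue.pop();
        now = e.targetTime;
        e.action();
    }
    std::cout << "final target time: " << now << "\n";
}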
It is possible that multiple events are for the same target time. For example, the first event
may be the producer of the data of a pipeline register implemented with a master-slave flip-flop
and the second event may be the consumer of the data of that pipeline register. Thus, both events
should execute at the same target time. Depending on how the simulator is written, however, the
order in which those two events are executed in the simulator may or may not matter.
If, for example, the pipeline register is implemented as two variables, each representing a
latch in a master-slave flip-flop, either the first event or the second event can execute first, since
the first event will write into the first variable and the second event will read from the second
variable. There is no possibility of the second event reading the data written by the first event
on that same cycle. Such an approach, however, requires the first variable be copied to the second
variable after both events have completed executing, thus requiring another simulator event to
occur after target events have completed.
As an alternative, the pipeline register could be implemented as a single variable. In that
case, the second event must be executed before the first, ensuring that the second event executes
with the pipeline register’s value from the previous cycle. In general, the events must be sorted in
reverse pipeline order, where the end of the pipeline executes first and the front of the pipeline
executes last. If, however, the pipeline has a loop, sorting is impossible. In that case, at least one
pipeline register must be simulated by double variables.
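The two choices can be contrasted directly. In the illustrative sketch below (hypothetical names, not taken from any particular simulator), the double-variable register tolerates either event order at the cost of an extra end-of-cycle copy event, while the single-variable register is correct only if the consumer event runs before the producer event:

#include <iostream>

// Double-variable register: models a master-slave flip-flop. The producer
// writes 'master', the consumer reads 'slave'; within a cycle the order does
// not matter, but an extra end-of-cycle event must copy master -> slave.
struct DoubleVarReg {
    int master = 0, slave = 0;
    void producerEvent(int v) { master = v; }
    int  consumerEvent() const { return slave; }
    void endOfCycleEvent() { slave = master; }
};

// Single-variable register: no copy event is needed, but the consumer event
// must run before the producer event in the same cycle (reverse pipeline
// order) so that it reads the previous cycle's value.
struct SingleVarReg {
    int value = 0;
    int  consumerEvent() const { return value; }
    void producerEvent(int v) { value = v; }
};

int main() {
    DoubleVarReg d;
    SingleVarReg s;
    for (int cycle = 1; cycle <= 3; ++cycle) {
        d.producerEvent(cycle * 10);   // producer may run first...
        int dv = d.consumerEvent();    // ...consumer still sees last cycle's value
        d.endOfCycleEvent();           // extra simulator event at end of cycle

        int sv = s.consumerEvent();    // consumer must run first
        s.producerEvent(cycle * 10);   // then the producer overwrites the value

        std::cout << "cycle " << cycle << ": double=" << dv
                  << " single=" << sv << "\n";
    }
}

Both registers produce the same sequence of consumed values; they differ only in the event ordering and the extra copy event they require.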
Event queues provide the capability to “skip” target time when there is no other activity.
For example, if one were simulating a HEP-like eight-stage multithreaded microprocessor that
only executes an instruction from a thread every eight cycles and were running only one thread,
there is no point in simulating the seven stages that are not active at any given target time. One would
only need to simulate the one active stage out of eight at any given target time. One event-based
simulator strategy would, as it executes each event, enqueue an event to simulate each successive
stage into the next target time.
An alternative to event-based simulation is cycle-by-cycle simulation that executes every
component/event every cycle. Such a scheme may seem less efficient, since there may be times
when a particular component has nothing to do, but doing so eliminates the event queue over-
heads. There are cases where a cycle-by-cycle simulator is faster than an event-driven simulator,
especially if the events are appropriately statically scheduled to eliminate overheads [21].
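A corresponding cycle-by-cycle loop is even simpler (a minimal illustrative sketch with hypothetical names): every component model is evaluated every target cycle in a fixed, statically scheduled order.

#include <iostream>
#include <vector>

// A component model that is evaluated once per target cycle, whether or not
// it has useful work to do in that cycle.
struct Component {
    const char* name;
    void tick(unsigned cycle) const {
        std::cout << "cycle " << cycle << ": evaluate " << name << "\n";
    }
};

int main() {
    std::vector<Component> pipeline = {{"fetch"}, {"decode"}, {"execute"}};
    for (unsigned cycle = 0; cycle < 3; ++cycle)
        for (const Component& c : pipeline)   // fixed static schedule
            c.tick(cycle);                    // no event-queue overhead
}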
Maintaining synchronization between timing events requires simulating a consistent tar-
get clock. Maintaining synchronization between functional events, on the other hand, requires
simulating an instruction interleaving that adheres to a desired target cycle interleaving.
In monolithic or timing-directed simulator designs, it is important to decide how to sim-
ulate a consistent target clock as all events are timing events. In a functional-first simulator, how
functional instruction execution is interleaved across multiple target cores can be chosen to bal-
ance a tradeoff between accuracy and efficiency. For example, simulation efficiency might be very
high if one target core’s instructions are completely executed before another target core’s instruc-
tions are completely executed. However, since one target core’s instruction execution can affect
another target core’s instruction execution, executing one to completion before the other can result
in inaccurate results. Fine-grain interleaving may more closely model the actual target instruction
interleaving but reduce the overall simulation efficiency.
CHAPTER 4
Simulation Virtualization
As discussed in Chapter 1, there is no strict requirement that a structural correspondence exists
between the target system and what is actually implemented on an FPGA. Given this relaxed
demand for structural fidelity, a well-engineered FPGA-accelerated simulator should achieve,
in comparison to a structurally accurate prototype, a much higher simulation rate (measured in
instruction count or other architecturally visible metrics) and require lower design effort and fewer
logic resources.
This chapter discusses the significant benefits that can arise from harnessing FPGAs not
as a hardware prototyping substrate but as a virtualizable compute resource for executing and
accelerating simulations. In particular, this section examines two key virtualization techniques
developed and utilized by the ProtoFlex project [14] for accelerating full-system and multiprocessor
simulations.
The first virtualization technique is Hierarchical Simulation with Transplanting, for simplify-
ing the construction of an FPGA-accelerated full-system simulator. In Hierarchical Simulation,
one accelerates in FPGAs only the subset of the most frequently encountered behaviors (e.g.,
ALU and load/store instructions) and relies on a reference software simulator to support simula-
tions of rare and complex behaviors (e.g., system-level instructions and I/O devices).
The second technique is time-multiplexed virtualization of multiple processor contexts onto
fewer high-performance multiple-context simulation engines. Simulation virtualization decou-
ples the required complexity and scale of the physical hardware on FPGAs from the complexity
and scale of the target multiprocessor system. Unlike a direct prototype, the scale of the accel-
eration hardware on the FPGA host is an engineering decision that can be set judiciously in
accordance with the desired level of simulation throughput. Before delving into the details of the
two virtualization techniques, the next section first explains the background and requirements of
full-system and multiprocessor simulations.
4.2.2 TRANSPLANTING
A complex component such as a processor encompasses a small set of frequent behaviors (ADDs,
LOADs, TLB/cache accesses, etc.) and a much more extensive set of complicated and, fortunately,
often also rare behaviors (privileged instructions, MMU activities, etc.). Assigning the complete
set of processor behaviors statically to either the software simulation or FPGA simulation host
would result in either the simulation being too slow or the FPGA development being too compli-
cated. These conflicting goals can be reconciled by supporting transplantable components, which
Figure 4.1: Partitioning a simulated target system across FPGA and software simulation in the
ProtoFlex simulator.
can be re-assigned to the FPGA host or software simulation dynamically at runtime during hybrid
simulation.
Continuing with the processor example, the FPGA would implement only the most frequently
encountered subset of the instruction set. When this partially implemented proces-
sor encounters an unimplemented behavior (e.g., a page table walk following a TLB miss), the
FPGA-hosted processor component is suspended and its processor state is transplanted (that is,
copied) to its corresponding software-simulated processor model in the reference simulator. The
software-simulated processor model, which supports the complete set of behaviors, is activated
to carry out the unimplemented behavior. Afterward, the processor state is transplanted back to
the FPGA-hosted processor model to resume accelerated execution of common case behaviors.
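A rough sketch of this control flow is shown below; the names, state layout, and the toy rule for what counts as an unsupported instruction are entirely hypothetical (this is not the ProtoFlex implementation):

#include <cstdint>
#include <iostream>

// Architectural state that is copied ("transplanted") between hosts.
struct ArchState {
    uint64_t pc = 0;
    uint64_t regs[32] = {0};
};

// Stand-in for the FPGA-hosted partial processor: it executes common-case
// instructions and reports when it encounters an unimplemented behavior.
bool fpgaExecuteOne(ArchState& s) {
    bool supported = (s.pc % 1000 != 0);   // toy rule for an "unsupported" instruction
    if (supported) { s.regs[1] += 1; s.pc += 4; }
    return supported;
}

// Stand-in for the full-featured reference software simulator.
void softwareExecuteOne(ArchState& s) {
    s.regs[2] += 1;                        // handle the rare/complex behavior
    s.pc += 4;
}

int main() {
    ArchState state;                       // normally lives on the FPGA host
    for (int i = 0; i < 5000; ++i) {
        if (!fpgaExecuteOne(state)) {
            // Unsupported behavior: suspend FPGA execution, transplant (copy)
            // the state to the software model, execute there, copy it back,
            // and resume accelerated execution on the FPGA.
            ArchState transplanted = state;
            softwareExecuteOne(transplanted);
            state = transplanted;
        }
    }
    std::cout << "common-case instructions: " << state.regs[1]
              << ", transplanted instructions: " << state.regs[2] << "\n";
}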
T_by_FPGA is the time required to execute one instruction on the FPGA host, or the time to deter-
mine that it is an unsupported instruction. T_by_txplant is the time required to execute one instruction
on the software host, including the transplant latency. R_miss is the fraction of dynamic instruc-
tions that is not supported by the FPGA host. In the example scenario above, T_by_FPGA = 10 nsec,
T_by_txplant = 1 msec, and R_miss = 0.00001.
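From these definitions, the effective per-instruction simulation time follows the familiar cache-style miss-penalty form; with the example values above it works out to

    T_effective = T_by_FPGA + R_miss × T_by_txplant
                = 10 nsec + 0.00001 × 1 msec
                = 10 nsec + 10 nsec = 20 nsec

so even though only one instruction in 100,000 requires a transplant, the millisecond-scale transplant cost doubles the average per-instruction simulation time.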
The equation above should be strongly reminiscent of the effective memory access time
through a cache. This interesting parallel points to a simple, yet effective solution. Just as computer
architects would introduce more levels of cache hierarchies to bridge the gap between processor
and DRAM speed (as opposed to building bigger caches or building faster DRAMs), one can sim-
ilarly introduce a hierarchy of intermediate software transplant hosts with staggered, increasing
instruction coverage and performance costs. For example, today’s FPGAs can support embedded
processors realized as either soft- or hard-logic cores, which can execute a software simulation
kernel covering the entire set of processor behaviors. The simulation on the embedded processor is still slow
relative to the FPGA-hosted instructions, but incurs much less cost than a full transplant to the
full-system software simulator. At the same time, when writing a software simulation kernel, it
is much easier to capture enough, if not all, of the processor behaviors to achieve sufficiently
high dynamic instruction coverage to reduce the number of times one needs to pay the full cost
of transplanting to the external software host. If all of the processor behaviors are captured by
the software simulation kernel running on the embedded processor core, the reference software
simulator is relegated to providing simulation support of the I/O subsystem only.
To complete the analogy with hierarchical caches, the effective average instruction execu-
tion time of two levels can be expressed as
    T_effective = T_by_FPGA + R_miss_FPGA × T_by_txplant_effective
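Here T_by_txplant_effective is itself the effective cost of the next transplant level. With hypothetical symbols for the intermediate (microtransplant) host, it expands as

    T_by_txplant_effective = T_by_microtxplant + R_miss_micro × T_by_full_txplant

so a cheap embedded-processor host absorbs most of the misses, and only the small fraction R_miss_micro of instructions still pays the full cost of transplanting to the external software simulator.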
Figure 4.2: Large-scale multiprocessor simulation using a small number of multiple-context inter-
leaved engines.
Table 4.2: Assignment of target behavior to simulation host (FPGA, microtransplant, full-transplant)
Figure 4.3: Allocating components for hierarchical simulation in the BlueSPARC simulator.
Table 4.2 shows how the various UltraSPARC III behaviors are assigned to the three hosting options—FPGA,
embedded PowerPC microtransplant, and PC full-transplant. These assignment decisions were made
based on rigorous dynamic instruction profiling of various applications simulated in Simics. With
hierarchical transplanting, 99.95% of the dynamic instructions are executed in hardware on
the FPGA while the remainder is carried out in the microtransplant kernel (running on the em-
bedded PowerPC) and software full-system PC host.
a steady-state execution (where database transactions are committing steadily). As shown next,
the workloads’ characteristics have a large effect on the throughput of both the Simics and the
ProtoFlex simulator.
When Simics is invoked with the default “fast” option, it achieves tens of MIPS in simu-
lation throughput. However, there is roughly a factor of 10x reduction in simulation throughput
when Simics is enabled with trace callbacks for instrumentation [31], such as memory address
tracing. The two columns in Table 4.3 labeled Simics-fast and Simics-trace report Simics through-
put for the simulated 16-way SMP server. Simics simulations were run on a Linux PC workstation
with a 2.0 GHz Core 2 Duo and 8 GBytes of memory. The performance most relevant to archi-
tecture research activities is represented by the Simics-trace column. The simulation
throughput of the ProtoFlex simulator is reported in the left-most column of Table 4.3. For these
measurements, the BlueSPARC engine is clocked at 90 MHz. The ProtoFlex simulator achieves
speed comparable to Simics-fast on the SPECINT and Oracle-TPCC workloads. In compari-
son to the more relevant Simics-trace performance, the speedup is more dramatic, on average 38x
faster.
CHAPTER 5
Categorizing FPGA-based
Simulators
To summarize, there are three high-level orthogonal characteristics of FPGA-accelerated simu-
lators: (1) simulator architecture, (2) the partitioning between FPGA and software host, and (3)
providing virtualization support within the simulator to better utilize host resources in supporting
simulation of targets. To review, the five simulator architectures described in Chapter 3 are
monolithic, timing-directed, functional-first, timing-first, and speculative functional-first. Parti-
tioning between the FPGA and software host refers to partitioning between a software functional
model and an FPGA timing model or an FPGA-based common-case functional model and a
complete software functional model. Virtualization refers to techniques such as multithreading
that enable multiple target components to share the same host resources.
Designing an FPGA-based simulator requires selecting a number of points in the design
space, ranging from the simulator architecture to the particular optimization strategies used to
cope with the restrictions of hardware-based accelerated simulation. Table 5.1 summarizes a few
of the common FPGA-based simulator artifacts along with the selected choices in terms of sim-
ulator architecture, partition mapping, synchronization schemes, and optimization techniques.
5.2.1 PROTOFLEX
As discussed in the last chapter, the ProtoFlex simulator was developed at Carnegie Mellon Uni-
versity to support FPGA-accelerated functional simulation of full-system, large-scale multipro-
cessor systems [14]. The ProtoFlex functional model targets a 64-bit UltraSPARC III ISA (com-
pliant with the commercially available software-based full-system simulator model from Sim-
ics [26]) and is capable of booting commercial operating systems such as Solaris 10 and running
commercial workloads (with no available source code) such as Oracle TPC-C. ProtoFlex was
the first system to introduce hierarchical simulation and host multithreading as techniques for
reducing the complexity of simulator development and for virtualizing finite hardware resources.
The ProtoFlex simulator is available at [36] and targets the XUPV5-LX110T platform, a widely
available and low-cost commodity FPGA platform.
5.2.2 HASIM
The HAsim project was developed at MIT and Intel and employs host multithreading,
hierarchical simulation, and timing-directed simulation with a functional/timing partition. HAsim
currently supports the Alpha ISA and has been used to target a Nallatech ACP accelerator with
a Xilinx Virtex 5 LX330T FPGA connected to Intel’s Front-Side Bus protocol. HAsim has been
used to simulate a detailed 4x4 multicore with 64-bit Alpha out-of-order processors on a single
FPGA. HAsim is available for download at [20].
5.2.3 RAMP GOLD
RAMP Gold is a simulator of a 64-core 32-bit SPARC V8 target developed at UC Berkeley.
The first implementation was done on a Xilinx XUPV5 board. RAMP Gold employs host mul-
tithreading and a functional-first simulator architecture and is capable of booting Linux. The RAMP Gold
package is available at [37] and includes CPU, cache, and DRAM timing models. RAMP Gold
simulators were aggregated together onto 24 Xilinx FPGAs in the Diablo project that has been
used to reproduce effects-at-scale such as TCP Incast [45].
CHAPTER 6
Conclusion
This book describes techniques for practical and efficient simulation of computer systems using
FPGAs. There is a distinction between using FPGAs as a vehicle for simulation and the use
of FPGAs for prototyping. FPGA-accelerated simulation implies that a significant portion of
the simulation is implemented in software, and that at least part of the simulator is structurally
different than the target.
This manuscript surveys simulator architectures and describes how different simulator ar-
chitectures have been accelerated with FPGAs. One simulator architecture in particular, specu-
lative functional-first, was designed from the ground up to enable FPGA acceleration of perfor-
mance simulators. Though SFF is not limited to software-based functional models and FPGA-
based timing models, SFF provides many advantages, including completeness, reduced FPGA
resources, and the ability to tolerate latency. FAST-UP was the first implementation of an SFF
simulator; it simulated a dual-issue, branch-predicted, out-of-order x86-based computer in suf-
ficient detail to boot both Linux and Windows and to run interactive Microsoft Word and
YouTube on Internet Explorer. The FPGA-based timing model was the bottleneck. FAST-MP
leverages both SFF and FPGA multithreading to build a 256-core target with a branch-
predicted seven-stage pipeline that is also intended to boot Linux and Windows while running
arbitrary off-the-shelf software.
This book also describes hierarchical simulation that implements commonly used func-
tionality on the FPGA, and less commonly used functionality in software. In addition, FPGA
virtualization enables the mapping of multiple virtual components, such as a CPU, onto a sin-
gle physical execution engine. ProtoFlex’s BlueSPARC leverages both techniques to provide an
FPGA-accelerated, full-system functional model that is capable of functionally simulating sixteen
64-bit UltraSPARC V9 cores at 90 MHz on a single FPGA coupled to a microprocessor.
Because SFF is intended for performance simulation and because its functional model and
timing model can both be parallelized, it occupies a different point in the simulator space than hi-
erarchical simulation described in this book. Rather than accelerating functionality on the FPGA,
it places all of the functionality in software, where it can run very quickly due to a fast baseline
simulator, and most of the timing is carried out in the FPGA. ProtoFlex/BlueSPARC accelerates
common functional instructions in the FPGA, and transplanting to software is only necessary to
provide full functionality. In both cases, however, the optimal host platform/simulator is used to
maximize performance.
In conclusion, FPGA-accelerated simulators are highly performant while providing many
of the desirable simulator benefits including accuracy, completeness, and usability. FPGA-
accelerated simulators’ main challenge is ease of programmability. The overall promise of FPGA-
accelerated simulators, however, is a compelling reason to continue researching the area.
APPENDIX A
Field Programmable Gate Arrays
Figure A.1: A high-level depiction of an Altera Stratix V ALM. There are a pair of six-input LUTs
(though they share inputs), a pair of adders, and four registers. Figure used with permission from
Altera.
Figure A.2: A detailed depiction of an Altera Stratix V ALM. Note that each six-input LUT is
implemented as a four-input LUT, a pair of three-input LUTs, and muxes. Figure used with permission
from Altera.
Figure A.3: The different possible configurations of an Altera ALM. Figure used with permission
from Altera.
An Altera BRAM can be configured in the following configurations, all dual-ported: 512x32b, 512x40b,
1Kx16b, 1Kx20b, 2Kx8b, 2Kx10b, 4Kx4b, 4Kx5b, 8Kx2b, and 16Kx1b.
Bibliography
[1] H. Angepat, D. Sunwoo, and D. Chiou. RAMP-White: An FPGA-Based Coherent
Shared Memory Parallel Computer Emulator. In 8th Annual Austin CAS Conference, Mar.
2007. 26
[2] K. Barr, R. Matas-Navarro, C. Weaver, T. Juan, and J. Emer. Simulating a chip multipro-
cessor with a symmetric multiprocessor. Boston area ARChitecture Workshop, Jan. 2005.
11
[3] F. Bellard. QEMU, a Fast and Portable Dynamic Translator. In USENIX 2005 Annual
Technical Conference, FREENIX Track, pages 41–46, 2005. 30, 32
[7] J. Chen, M. Annavaram, and M. Dubois. SlackSim: A Platform for Parallel Simula-
tions of CMPs on CMPs. SIGARCH Comput. Archit. News, 37(2):20–29, 2009. DOI:
10.1145/1577129.1577134. 3
[12] Chisel. 29
[13] E. S. Chung and J. C. Hoe. High-Level Design and Validation of the BlueSPARC Multithreaded
Processor. Trans. Comp.-Aided Des. Integ. Cir. Sys., 29(10):1459–1470, Oct. 2010. DOI:
10.1109/TCAD.2010.2057870. 30, 39
[16] J. Donald and M. Martonosi. An Efficient, Practical Parallelization Methodology for Mul-
ticore Architecture Simulation. Computer Architecture Letters, July 2006. DOI: 10.1109/L-
CA.2006.14. 11
[49] A. Waterman, Z. Tan, R. Avizienis, Y. Lee, D. Patterson, and K. Asanovic. RAMP Gold
- Architecture and Timing Model. RAMP Retreat, Austin, TX, June 2009. 13, 14, 25, 26,
29
[50] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. SMARTS: Accelerating Mi-
croarchitecture Simulation via Rigorous Statistical Sampling. In Proceedings of the 30th
Annual International Symposium on Computer Architecture, pages 84–95. IEEE, 2003. DOI:
10.1145/871656.859629. 13
[51] M. T. Yourst. PTLSim: A Cycle Accurate Full System x86-64 Microarchitectural Simu-
lator. In Proceedings of ISPASS, Jan. 2007. DOI: 10.1109/ISPASS.2007.363733. 17
Authors’ Biographies
HARI ANGEPAT
Hari Angepat is a Ph.D. candidate at The University of Texas at Austin. He holds a B.Eng. in
Computer Engineering from McGill University and an M.S. in Computer Engineering from
The University of Texas at Austin. Hari is interested in developing domain-specific FPGA mi-
croarchitectures and productivity tools to enable widespread adoption of hardware acceleration.
Between 2008–2012, Hari led the FAST-MP project that enabled accurate functional-first sim-
ulation of multiprocessor systems on FPGAs. For more information, please visit
http://hari.angepat.com.
DEREK CHIOU
Derek Chiou is a Principal Architect at Microsoft where he leads a team working on FPGAs for
data center applications. He is also an Associate Professor at The University of Texas at Austin
where his research areas are FPGA acceleration, high performance computer simulation, rapid
system design, computer architecture, parallel computing, Internet router architecture, and net-
work processors. In a past life, Dr. Chiou was a system architect and led the performance model-
ing team at Avici Systems, a manufacturer of terabit core routers. Dr. Chiou received his Ph.D.,
S.M., and S.B. degrees in Electrical Engineering and Computer Science from MIT. For more
information on Dr. Chiou and his research, please visit
http://www.ece.utexas.edu/~derek.
ERIC S. CHUNG
Eric S. Chung is a Researcher at Microsoft Research in Redmond. Eric is interested in prototyping
and developing productive ways to harness massively parallel hardware systems that incorporate
specialized hardware such as FPGAs. Eric received his Ph.D. in 2011 from Carnegie Mellon
University and was the recipient of the Microsoft Research Fellowship award in 2009. His paper
on CoRAM, a memory abstraction for programming FPGAs more effectively, received the best
paper award in FPGA’11. Between 2005 and 2011, Eric led the ProtoFlex project that enabled
practical FPGA-accelerated simulation of full-system multiprocessors. For more information,
please visit http://research.microsoft.com/en-us/people/erchung.
JAMES C. HOE
James C. Hoe is Professor of Electrical and Computer Engineering at Carnegie Mellon Univer-
sity. He received his Ph.D. in EECS from Massachusetts Institute of Technology in 2000 (S.M.,
1994). He received his B.S. in EECS from UC Berkeley in 1992. He is a Fellow of IEEE. Dr.
Hoe is interested in many aspects of computer architecture and digital hardware design, including
the specific areas of FPGA architecture for computing; digital signal processing hardware; and
high-level hardware design and synthesis. He was a contributor to RAMP (Research Accelera-
tor for Multiple Processors). He worked on the ProtoFlex FPGA-accelerated simulation project
between 2005 and 2011 with Eric S. Chung, Michael K. Papamichael, Eriko Nurvitadhi, Babak
Falsafi, and Ken Mai. Earlier, he worked on the SMARTS sampling simulation project. For more
information on Dr. Hoe and his research, please visit http://www.ece.cmu.edu/~jhoe.