FPGA-Accelerated Simulation of Computer Systems

The document discusses the use of field programmable gate arrays (FPGAs) to enhance the performance of software-based simulations of computer systems. It outlines various techniques for implementing FPGA-accelerated simulations and provides a taxonomy of simulation architectures specifically designed for this purpose. The manuscript is part of the Synthesis Lectures on Computer Architecture series, which publishes concise works on important topics in the field.


Series ISSN: 1935-3235

SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE
Series Editor: Mark D. Hill, University of Wisconsin

MORGAN & CLAYPOOL PUBLISHERS

FPGA-Accelerated Simulation of Computer Systems

Hari Angepat, The University of Texas and Microsoft
Derek Chiou, Microsoft and The University of Texas
Eric S. Chung, Microsoft
James C. Hoe, Carnegie Mellon University

To date, the most common simulators of computer systems are software-based, running on standard computers. One promising approach to improve simulation performance is to apply hardware, specifically reconfigurable hardware in the form of field programmable gate arrays (FPGAs). This manuscript describes various approaches of using FPGAs to accelerate software-implemented simulation of computer systems and selected simulators that incorporate those techniques. More precisely, we describe a simulation architecture taxonomy that incorporates a simulation architecture specifically designed for FPGA-accelerated simulation, survey the state-of-the-art in FPGA-accelerated simulation, and describe in detail selected instances of the described techniques.

ABOUT SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis Digital Library of Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of important research and development topics, published quickly, in digital and print formats. For more information visit www.morganclaypool.com

ISBN: 978-1-62705-213-9
www.morganclaypool.com
FPGA-Accelerated Simulation
of Computer Systems
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.

FPGA-Accelerated Simulation of Computer Systems


Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe
2014

A Primer on Hardware Prefetching


Babak Falsafi and Thomas F. Wenisch
2014

On-Chip Photonic Interconnects: A Computer Architect’s Perspective


Christopher J. Nitta, Matthew K. Farrens, and Venkatesh Akella
2013

Optimization and Mathematical Modeling in Computer Architecture


Tony Nowatzki, Michael Ferris, Karthikeyan Sankaralingam, Cristian Estan, Nilay Vaish, and David
Wood
2013

Security Basics for Computer Architects


Ruby B. Lee
2013

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale


Machines, Second edition
Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle
2013
Shared-Memory Synchronization
Michael L. Scott
2013

Resilient Architecture Design for Voltage Variation


Vijay Janapa Reddi and Meeta Sharma Gupta
2013

Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013

Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012

Automatic Parallelization: An Overview of Fundamental Compiler Techniques


Samuel P. Midkiff
2012

Phase Change Memory: From Devices to Systems


Moinuddin K. Qureshi, Sudhanva Gurumurthi, and Bipin Rajendran
2011

Multi-Core Cache Hierarchies


Rajeev Balasubramonian, Norman P. Jouppi, and Naveen Muralimanohar
2011

A Primer on Memory Consistency and Cache Coherence


Daniel J. Sorin, Mark D. Hill, and David A. Wood
2011

Dynamic Binary Modification: Tools, Techniques, and Applications


Kim Hazelwood
2011

Quantum Computing for Computer Architects, Second Edition


Tzvetan S. Metodi, Arvin I. Faruque, and Frederic T. Chong
2011

High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities


Dennis Abts and John Kim
2011

Processor Microarchitecture: An Implementation Perspective


Antonio González, Fernando Latorre, and Grigorios Magklis
2010
Transactional Memory, 2nd edition
Tim Harris, James Larus, and Ravi Rajwar
2010

Computer Architecture Performance Evaluation Methods


Lieven Eeckhout
2010

Introduction to Reconfigurable Supercomputing


Marco Lanzagorta, Stephen Bique, and Robert Rosenberg
2009

On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009

The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009

Fault Tolerant Computer Architecture


Daniel J. Sorin
2009

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale


Machines
Luiz André Barroso and Urs Hölzle
2009

Computer Architecture Techniques for Power-Efficiency


Stefanos Kaxiras and Margaret Martonosi
2008

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency


Kunle Olukotun, Lance Hammond, and James Laudon
2007

Transactional Memory
James R. Larus and Ravi Rajwar
2006

Quantum Computing for Computer Architects


Tzvetan S. Metodi and Frederic T. Chong
2006
Copyright © 2014 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.

FPGA-Accelerated Simulation of Computer Systems


Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe
www.morganclaypool.com

ISBN: 9781627052139 paperback


ISBN: 9781627052146 ebook

DOI 10.2200/S00586ED1V01Y201407CAC029

A Publication in the Morgan & Claypool Publishers series


SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE

Lecture #29
Series Editors: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Print 1935-3235 Electronic 1935-3243
FPGA-Accelerated Simulation
of Computer Systems

Hari Angepat
The University of Texas and Microsoft

Derek Chiou
Microsoft and e University of Texas

Eric S. Chung
Microsoft

James C. Hoe
Carnegie Mellon University

SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #29

M&C Morgan & Claypool Publishers
ABSTRACT
To date, the most common simulators of computer systems are software-based, running on
standard computers. One promising approach to improve simulation performance is to apply
hardware, specifically reconfigurable hardware in the form of field programmable gate arrays
(FPGAs). This manuscript describes various approaches of using FPGAs to accelerate
software-implemented simulation of computer systems and selected simulators that incorporate
those techniques. More precisely, we describe a simulation architecture taxonomy that
incorporates a simulation architecture specifically designed for FPGA-accelerated simulation,
survey the state-of-the-art in FPGA-accelerated simulation, and describe in detail selected
instances of the described techniques.

KEYWORDS
simulation, cycle-accurate, functional, timing, FPGA accelerated

Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Host vs. Target Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Why are Fast, Accurate Simulators of Computer Targets Needed? . . . . . . . . . . . 2
1.4 Harnessing FPGAs for Simulation Not Prototyping . . . . . . . . . . . . . . . . . . . . . . 3
1.5 The Rest of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Simulator Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Uses of Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Desired Simulator Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Performance Simulation Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Simulator Design Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Simulator Partitioning for Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 Spatial Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.2 Temporal Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.3 Functional/Timing Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.4 Hybrid Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Functional/Timing Simulation Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6.1 Monolithic Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6.2 Timing-Directed Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.3 Functional-First Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.4 Timing-First Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6.5 Speculative Functional-First . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Simulation Events and Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.1 Centralized Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7.2 Decentralized Event Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Accelerating Computer System Simulators with FPGAs . . . . . . . . . . . . . . . . . . 25
3.1 Exploiting Target Partitioning on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Accelerating Traditional Simulator Architectures with FPGAs . . . . . . . . . . . . . 26
3.2.1 Accelerating Monolithic Simulators with FPGAs . . . . . . . . . . . . . . . . . . 26
3.2.2 Accelerating Timing-Directed Simulators with FPGAs . . . . . . . . . . . . . 26
3.2.3 Accelerating Functional-First Simulators with FPGAs . . . . . . . . . . . . . . 27
3.2.4 Accelerating Timing-First Simulators with FPGAs . . . . . . . . . . . . . . . . 27
3.2.5 Accelerating Speculative Functional-First with FPGAs . . . . . . . . . . . . . 27
3.2.6 Accelerating Combined Simulator Architectures with FPGAs . . . . . . . . 28
3.3 Managing Time Through Simulation Event Synchronization in an
FPGA-Accelerated Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Centralized Barrier Synchronization in an FPGA-Accelerated
Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Decentralized Barrier Synchronization in an FPGA-Accelerated
Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 FPGA Simulator Programmability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Case Study: FPGA-Accelerated Simulation Technologies (FAST) . . . . . . . . . . 30

4 Simulation Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Full-System and Multiprocessor Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Hierarchical Simulation with Transplanting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.1 Hierarchical Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.2 Transplanting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.3 Hierarchical Transplanting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Virtualized Simulation of Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.1 Time-multiplexed Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.2 Virtualizing Memory Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Case Study: the ProtoFlex Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.1 ProtoFlex Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.2 BlueSPARC Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.4 Hierarchical Simulation and Virtualization in a Performance
Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5 Categorizing FPGA-based Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


5.1 FAME Classifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Open-Sourced FPGA-Based Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.1 ProtoFlex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.2 HAsim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.3 RAMP Gold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

A Field Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


A.1 Programmable Logic Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Embedded SRAM Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.3 Hard “Macros” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Authors’ Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

Preface
The slow speed of accurate computer system simulators is a significant bottleneck in the study
of computer architectures and the systems built around them. Since computers use considerable
hardware parallelism to obtain the performance they achieve, it is difficult for simulators to track
the performance of such systems without utilizing hardware parallelism as well. The significant
capabilities and the flexibility of FPGAs make them an ideal vehicle for accelerating and
addressing the challenges of computer system simulation.
This book presents the current state-of-the-art in the use of Field Programmable Gate
Arrays (FPGAs) to improve the speed of accurate computer system simulators. The described
techniques and the related work are the result of active research in this area by the authors and
others over the last ten years.
Chapters 3 and 4 present solutions that address the major challenges in building fast and
accurate FPGA-accelerated simulators without undue implementation effort and cost.
Chapter 3 describes how FPGA acceleration can be applied to different simulator architectures,
while Chapter 4 describes how virtualization, via simulator multithreading and transplanting
simulation activity back to software, can be used to further extend the performance and
capabilities of FPGA-accelerated simulators.
Chapter 2 describes simulator architectures that apply not only to FPGA-accelerated
simulators, but also to pure software simulators and pure FPGA simulators. Four of the
simulator architectures (monolithic, functional-first, timing-directed, and timing-first) are well
known in the literature, while the fifth, speculative functional-first (Section 2.6.5), is an
architecture that we developed specifically to be accelerated by FPGAs. In addition to a survey
of related work in Chapter 5, the Appendix provides a brief introduction to FPGA technologies.

Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe


August 2014

Acknowledgments
The authors would like to thank all of those who have contributed to the FPGA-Accelerated
Simulation Technologies (FAST) project and the ProtoFlex FPGA-accelerated simulation project.
The FAST project started in 2005 at The University of Texas at Austin and continues today.
In addition to the authors (Hari Angepat and Derek Chiou), Dam Sunwoo, Nikhil A. Patil, Yi
Yuan, Gene Y. Wu, Lauren Guckert, Mike Thomson, Joonsoo Kim, William H. Reinhart, D.
Eric Johnson, Zheng Xu, and Jebediah Keefe have contributed to the development of FAST
concepts and artifacts. Funding for FAST was provided in part by grants from the Department
of Energy (DE-FG02-05ER25686), the National Science Foundation (CSR-1111766,
CCF-0917158, CNS-0747438, CNS-0615352, and CCF-0541416) and gifts of funds, hardware,
and software from AMD, Bluespec, IBM, Intel, Freescale, Synopsys, VMWare, and Xilinx.
The ProtoFlex project was research carried out at Carnegie Mellon University between 2005
and 2011. In addition to the authors (Eric S. Chung and James C. Hoe), Michael K. Papamichael,
Eriko Nurvitadhi, Babak Falsafi, and Ken Mai contributed to the development of the ProtoFlex
concept and simulator. Funding for the ProtoFlex project was provided in part by grants from the
C2S2 Marco Center, NSF CCF-0811702, NSF CNS-0509356, and Sun Microsystems. The
ProtoFlex project also received software and hardware donations from Xilinx and Bluespec.
The FAST and ProtoFlex projects were part of the RAMP (Research Accelerator for
Multiple Processors) multi-university community effort to investigate the application of FPGAs
to the emulation and simulation of complete, large-scale multiprocessor/multicore systems.
Other participants of the RAMP project included Krste Asanovic (UC Berkeley), Christos
Kozyrakis (Stanford University), Shih-Lien Lu (Intel), Mark Oskin (University of Washington),
Dave A. Patterson (UC Berkeley), and John Wawrzynek (UC Berkeley). Several different styles
of FPGA-accelerated simulation came out of the RAMP effort and are discussed in Chapter 5.

Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe


August 2014

CHAPTER 1

Introduction
1.1 OVERVIEW
All scientific and engineering disciplines are supported by models that enable the experimenta-
tion, prediction, and testing of hypotheses and designs. Modeling can be performed in a variety of
ways, ranging from mathematical equations (for example, using differential equations to model
circuits), to miniatures (for example, using a scale model of a car in a wind tunnel to estimate
drag), and to software-implemented algorithms (for example, using financial models to predict
stock prices). Simulation, which is often considered a form of modeling itself, applies a model
repetitively to predict behavior over simulated time. For example, one can simulate climate change
over one hundred years, or simulate the lifecycle of a star.
The technique of simulation is commonly used to guide the design of future computer
systems, or to better understand the behaviors of existing ones. Computer system simulation
enables the prediction of many different behaviors without the explicit need to build the system
itself. The most commonly predicted behavior of a computer system is performance, often in
terms of the number of cycles it takes to execute a sequence of instructions. Two other commonly
predicted behaviors are energy/power consumption and reliability in the presence of faults.
Computer system simulators are typically implemented in software due to the need for
flexibility and visibility. As computers have grown faster over time, their ability to simulate
natural phenomena, such as the weather, has grown faster as well. Though our understanding may
improve, the inherent complexity of the lifecycle of a star, overall, does not increase over time.
The inherent complexity of computer systems, however, continues to advance rapidly over
time—in fact, at a rate faster than the growth in computer performance. Thus, the performance
of computer simulation is ever decreasing relative to the next-generation computer being
simulated. This phenomenon is known as the simulation wall. Paradoxically, as computer
designers have improved virtually all other fields’ simulation capabilities, they have
simultaneously reduced the ability to simulate their next-generation designs.

1.2 HOST VS. TARGET TERMINOLOGY


When computers are used to simulate other computers, differentiated terms are necessary to avoid
confusion. Thus, in this manuscript, the computer system being simulated will be referred to as the
target system. The term target-correct behavior will refer to how the target system would behave
in an actual implementation. For example, if the target mispredicts a branch, the instructions that
are fetched by the target are considered wrong-path instructions—but are target-correct.
A simulator is executed on a host platform. For example, when a simulator is implemented
in software, the computer that runs the software is deemed to be the actual host. It is also possible
to implement the simulator—or portions of the simulator—directly in custom hardware, in which
case the custom hardware would be considered the host.

1.3 WHY ARE FAST, ACCURATE SIMULATORS OF COMPUTER TARGETS NEEDED?
Fast and accurate simulators provide a vehicle for the rapid exploration of microprocessor
designs. Today, most low-hanging microprocessor improvements have already been implemented,
forcing architects to consider more complex mechanisms with an ever-decreasing likelihood of
commensurate returns. In addition, many new ideas are based on learning algorithms that require
long training periods before the mechanisms become effective. Thus, a simulator must not only
be sufficiently detailed to accurately evaluate the proposed architecture, it must offer sufficient
performance to run long enough to warm up such proposed mechanisms, in order for the user to
arrive at the correct conclusions. Since microprocessor designs routinely cost hundreds of millions
of dollars, the ability to accurately evaluate a proposed microprocessor can save substantial costs
that otherwise would be wasted on designs that do not offer improvements.
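As a hypothetical illustration of why such learning mechanisms force long simulation runs, consider a textbook 2-bit saturating-counter branch predictor (a classic structure used here only as an example, not one proposed in this book). Measured before it has warmed up, its accuracy badly understates its steady-state behavior:

```python
# A 2-bit saturating counter predicts "taken" when its state is >= 2.
# Starting cold (state 0, strongly not-taken), its accuracy on a
# mostly-taken branch is poor until training saturates the counter.

def run_predictor(outcomes, state=0):
    """Return prediction accuracy over a sequence of branch outcomes."""
    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction == taken:
            correct += 1
        # Train the counter, saturating between 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

always_taken = [True] * 1000
cold_accuracy = run_predictor(always_taken[:4])  # cold start: 0.5
warm_accuracy = run_predictor(always_taken)      # long run: 0.998
# Measuring only the cold window would misjudge the predictor's quality.
```

A simulator that is too slow to run past the cold window would report the 0.5, not the 0.998, and lead the architect to the wrong conclusion.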
The advent of multicore processors and increased emphasis on parallelism has led to an
increasingly diverse set of computer architecture ideas under study. This trend has created an ever
higher demand for more simulation performance. With binary compatibility no longer a strict
requirement in new architectures, a breakthrough concept needs a correctly matched software base
to demonstrate its full potential. Fast simulators are critical for the development of software in
conjunction with the design and development of the underlying hardware. Software developers
(in applications, compilers, and operating systems) are unwilling to devote serious development
effort until a fast execution platform is available to develop on. It would be desirable if a simulator
were sufficiently fast for software developers. Unfortunately, software simulators, especially those
that are performance-accurate, are generally too slow for interactive use, delaying serious software
development until after the target is available, which, in turn, serializes hardware and software
development.
Outside of computer architecture research, fast and accurate simulators can facilitate the
development, debugging, and performance tuning of both multi-threaded and single-threaded
applications. Understanding cache interference can be challenging even for single-threaded
applications, but is especially hard for multi-threaded applications. Even when real hardware
exists, developing multi-threaded applications can be challenging due to inherent non-determinism
and the lack of observability in the underlying execution platform. For example, current Intel
processors only allow the monitoring of up to four performance counters at any given time.
Additional counters can only be obtained through multiple runs—each run with a different set of
specified counters. Thus, cycle-by-cycle information is virtually impossible to obtain from such
processors.
If software-based simulators were sufficiently fast, there would be no need to consider
alternatives to software-based simulation hosts. Unfortunately, today’s cycle-accurate simulators
of uniprocessor targets achieve on the order of 1 Kilo-Instructions Per Second (KIPS) to
300 KIPS, depending on the implementation. Such simulators often compromise accuracy for
performance. Furthermore, there is no widespread use of a parallelized simulator of unicore
targets despite numerous attempts. The problem is further compounded by the proliferation of
multicore targets, which increases simulator computation at least linearly with the number of
target cores. There have been attempts to parallelize simulators of parallel targets, such as
Graphite [28], SlackSim [7], and ZSim [39], but all compromise accuracy in unpredictable ways
to achieve scalability.
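The practical consequence of KIPS-range simulation speeds follows from simple arithmetic (the 2 GHz, roughly one-instruction-per-cycle target below is a hypothetical example, not a system from the text):

```python
# Wall-clock time needed to simulate one second of target execution
# at software cycle-accurate simulator speeds.

target_ips = 2e9  # hypothetical 2 GHz target retiring ~1 instruction/cycle

for sim_kips in (1, 300):
    sim_ips = sim_kips * 1e3
    seconds = target_ips / sim_ips
    print(f"{sim_kips:>4} KIPS: {seconds / 3600:.1f} hours "
          f"per simulated target-second")
# 1 KIPS -> ~555 hours; 300 KIPS -> ~1.9 hours per simulated second.
```

Even at the fast end of the range, one simulated second costs hours of wall-clock time, which is why interactive software development on such simulators is impractical.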

1.4 HARNESSING FPGAS FOR SIMULATION NOT PROTOTYPING
Most, if not all, computer systems today leverage hardware-level parallelism extensively to achieve
high performance. Accurately simulating the high levels of parallelism seen in current and future
computer systems on commodity, tightly-coupled multicore processors (with limited threads) is
inherently slow. A fast and cost-effective alternative to software is to apply a programmable
accelerator that directly matches the hardware-level parallelism required in accurate computer
system simulation.
Field Programmable Gate Arrays (FPGAs) are programmable devices made up of hundreds
of thousands (if not millions) of small interconnected lookup tables that can be used to realize
arbitrary logic functions. Unlike a hardwired application-specific integrated circuit (ASIC), where
all of the logic is cast permanently in silicon, FPGA-based hardware can be easily iterated upon
in an incremental design-debug cycle similar to software development (see Appendix A for more
details on FPGAs).
Because of their programmability, FPGAs enable hardware to be implemented and maintained by a smaller group of people, at lower cost and in less time than would be required to produce a
dedicated integrated circuit. The price for using FPGAs is reduced logic capacity and logic speed
relative to native integrated circuit implementations, roughly by a factor of 10 [23] for each. Even
so, FPGAs have been successfully deployed in roles ranging from “glue” logic that attaches ASICs together to accelerators for a wide range of applications. For the right application, a good FPGA-based implementation
can be multiple orders-of-magnitude faster than software.
A typical but naive way to harness an FPGA for computer system simulation is to imple-
ment a structurally accurate model of a given target system in FPGAs. For example, to model
a 16-core multiprocessor, one could instantiate 16 separate processor cores that are identical¹ to
the target cores, and connect them together with a network-on-chip (NoC) identical to the tar-
get NoC. Such implementations are referred to as prototypes, rather than simulators, because they
¹At least, at the register transfer-level (RTL) but transformations may need to be made to accommodate for the fact that
FPGAs have different underlying structures than are possible in ASICs.
4 1. INTRODUCTION
reflect a one-to-one mapping of the target micro-architecture onto an FPGA. While prototyping
small systems made up of simple cores is feasible in today’s FPGA technologies, using FPGAs
to directly prototype larger and/or more complex cores and systems becomes a herculean effort
when the implementation and integration effort is not that different (modulo the physical design)
of building the target itself.
It is our belief that, given today’s target architectures and FPGAs, one should view FPGAs
as a vehicle for simulation, not prototyping. The goal of simulation (as opposed to prototyping)
is to mimic the target system behavior at the desired level of completeness, accuracy, detail, and
speed. How the simulation is actually carried out under-the-hood is of little concern to the user.
When building a simulator using FPGAs, what is actually implemented on the FPGAs need not
resemble the target system in any way, structurally or physically.
In fact, there are many good reasons not to mimic the target system. For example, an
FPGA-accelerated simulator may take advantage of simplifications that make it easier for con-
struction, such as a constant memory latency, or implementing cache tags only (and not the data
storage) of the cache. In the first case, accuracy is compromised to make the simulator imple-
mentation simpler. In the second case, accuracy is not necessarily compromised if the simulator
produces sufficiently accurate results even though not every component is perfectly modeled. In
general, “shortcuts” used in software-based simulators can often benefit FPGA-based simulation as well.
The natural next step is to recognize that the simulation host is not necessarily bound to
a complete software or FPGA implementation. A practitioner could mix hardware and software
hosts to generate a faster-than-software-only “good enough” simulator with as little additional
development time and cost as possible over building a software-only simulator. Thus, the term
“FPGA-accelerated simulators” is used to indicate hybrid simulators where FPGAs are used to
accelerate specific components of the simulator, not necessarily the entire simulator.
Properly constructed FPGA-accelerated simulators are faster than software-only simula-
tors and can, in fact, be faster than a prototype of the target on an FPGA. Because such simula-
tors are FPGA-accelerated, rather than FPGA-implemented, software can be used to implement
components of the simulator that are otherwise inconvenient and/or unnecessary to implement
on the FPGA. As a result, FPGA-accelerated simulators are not only easier to implement, but they also
provide more functionality, including full-system support, than would otherwise be possible.
1.5 THE REST OF THE BOOK
Chapter 2 gives a background overview of computer simulation. Chapter 3 presents key concepts of FPGA-accelerated performance simulation. Chapter 4 presents hierarchical simulation
and virtualization techniques. Chapter 5 summarizes the current landscape of FPGA-accelerated
simulators. Chapter 6 offers final concluding remarks.
To provide the reader with more context, this book describes two case studies: (1) the FAST
approach [9–11] for studying the performance of uniprocessor and multiprocessor targets using
FPGAs, and (2) the ProtoFlex approach [15] for accelerating functional-only full-system sim-
ulation, which can be used standalone or as a component in the construction of performance
simulators. Both FAST and ProtoFlex heavily transform and virtualize the model of the tar-
get system to simplify the efforts needed to achieve simulation completeness and, in the case of
FAST, also the timing accuracy on an FPGA-based simulation substrate. Both approaches make
appropriate use of software and FPGA hosts together to achieve the best of both worlds.
CHAPTER 2
Simulator Background
2.1 USES OF COMPUTER SIMULATION
In computer systems research and design, simulation studies are used when the target system does
not exist or when a given design is being studied. A simulator implements target behavior in a
manner that is simpler than building the target—otherwise, one would just construct the target.
Even when the target system is available, a simulator offers increased controllability, flexibility,
and observability. Simulators, however, have notable disadvantages compared to the target, such
as being slower, not accurately modeling the target, or not predicting all the behaviors of the
target.
There are two major forms of computer system simulators: functional simulators and performance simulators. Functional simulation predicts the behavior of the target with little or no
concern for timing accuracy. Functional behaviors include the execution of CPU instructions and
the activities of peripheral devices such as a network card sending packets or a DMA transfer from
disk. e modeling of micro-architectural state that affects performance, such as cache tags, is not
considered a part of functional simulation. Functional simulation is used for a variety of purposes,
including: (1) prototyping software development before a machine is built, (2) providing preliminary performance modeling and tuning, (3) collecting traces for performance modeling, and
(4) generating a reference execution to check that a performance simulator executes correctly.
A performance simulator predicts the performance and timing behaviors of the target. Per-
formance simulation is used for a variety of purposes, ranging from evaluating micro-architectural
proposals, to studying performance bottlenecks in existing systems, to comparing machines to
decide which to procure, to enabling the tuning of compilers. Since there are many timing-
dependent behaviors that a practitioner may want to predict, such as resiliency or power con-
sumption, “performance” simulation refers to the simulation of any or all of the non-functional
aspects of the target. The output of a performance simulator can assume a variety of forms—the
most common example is the aggregate Instructions Per Cycle (IPC) of the target running a
specific workload. ere are, however, other possibilities such as a cycle-by-cycle accounting of
the number of instructions fetched or issued, or even a cycle-by-cycle accounting of processor
resources consumed by every in-flight instruction (e.g., re-order buffer).
Performance can be simulated in a variety of ways. For example, for a simple microcoded
machine with a fixed latency for every instruction, the performance can be accurately predicted
using a closed-form mathematical function that accepts the count of each instruction type as an argument. For nearly all other machines, however, a cycle-accurate simulator must model the target’s
micro-architecture in great detail. The standard way to achieve this is to build a model of every
performance-impacting component, connect such component models together as they would be
in the target system, and simulate those models in tandem.
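The closed-form case above can be sketched as follows; the instruction classes and latencies are illustrative assumptions, not parameters of any real machine:

```python
# Hedged sketch of a closed-form performance model for a simple microcoded
# machine in which every instruction class has a fixed latency. The latencies
# below are made-up values for illustration only.
FIXED_LATENCY = {"add": 2, "load": 5, "store": 4, "branch": 3}

def predicted_cycles(instruction_counts):
    """Total target cycles as a weighted sum of per-class instruction counts."""
    return sum(FIXED_LATENCY[op] * n for op, n in instruction_counts.items())

counts = {"add": 1000, "load": 400, "store": 200, "branch": 150}
cycles = predicted_cycles(counts)       # 5250 cycles for this instruction mix
ipc = sum(counts.values()) / cycles     # predicted IPC for the whole workload
```

For any machine with overlap, contention, or variable latencies, this weighted sum breaks down, which is why detailed simulation is needed.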
Functional simulation is typically much faster than performance simulation. Fast functional
simulators run at roughly the same speed as the host machine, but can incur large slowdowns when
instrumentation is introduced. Accurate performance simulators are generally at least four orders
of magnitude slower than their targets. Note that this distinction does not apply to sampling-based simulation methodologies, which simulate the computer system accurately for only a subset of
time or instructions and extrapolate results from the samples. Sampling-based simulators still
require an accurate performance simulator.
2.2 DESIRED SIMULATOR CHARACTERISTICS
Simulators have five key intertwined characteristics: speed, accuracy, flexibility, completeness, and
usability. Simulator design and development is a continuous tradeoff between those characteris-
tics. We briefly describe them below.
Speed. e speed of functional and performance simulators is often measured in the num-
ber of target instructions executed per wall clock second. One notable exception is an analytical
performance model that does not execute instructions, though one could measure its speed in
terms of the amount of time it takes to evaluate the model.
Accuracy. A perfectly accurate functional simulator is one that updates the processor ar-
chitectural state after each instruction is executed as the target would in a real system, assuming
atomic and in-order instruction semantics. Generally, if a functional simulator is inaccurate, there
is a set of programs that it cannot execute correctly or execute at all.
Performance simulators vary greatly in accuracy. Some performance simulators are mathe-
matical models that are often less accurate than simulators that account for more temporal detail.
Other performance simulators are cycle-accurate, implying that the simulator output is accurate
to a single cycle. If the output of the performance simulator is a single IPC for the entire run,
that IPC is equal to the total number of instructions executed divided by the total number of cy-
cles simulated. Likewise, if the output is a cycle-by-cycle accounting of each micro-architectural
resource used in that cycle, a cycle-accurate simulator can output the exact resources used by the
target at each cycle.
It is very difficult to achieve true cycle accuracy. Many simulators claim to be cycle accurate
but are, instead, cycle-level simulators that model performance at the resolution of a target cycle,
but are generally not perfectly accurate to the target.
Flexibility. e flexibility of a simulator is the ease at which the simulator can be changed
to evaluate different target functionalities. Maximizing flexibility is desirable to enable rapid and
productive exploration of different targets.
One of the most widely used computer system simulators today is gem5 [19], which began
as a combination of the University of Michigan M5 simulator and the University of Wisconsin
GEMS simulator. Gem5 embodies the work of many developers over multiple years and exposes
many parameters, making it flexible across a range of targets, even allowing for quick changes
to the ISA.
Completeness. e completeness of a simulator refers to how much of the entire system
can be modeled. Ideally, the entire system can be simulated as needed. For example, to simulate
a smartphone, one may want to mimic not only the hardware of the phone and the software
running on top of it, but also the user touching the screen and the cellular network that the phone is
communicating with. Of course, achieving completeness may be difficult but is clearly desirable,
all other parameters being equal.
Usability. e usability of a simulator characterizes the amount of effort required to use
the simulator for a specific purpose. A usable simulator enables, rather than impedes, the desired
exploration and evaluation. Usability is dependent on the desired use cases; however, it incorporates many attributes, including the ease with which the desired applications, runtime systems, and
operating systems can be run on the system, how easy it is to sweep parameters, and so on. The
usability of a single simulator can vary across usage cases. For example, a functional-only
simulator may be very usable to count the number of instructions executed across a range of ap-
plications, but may be very unusable if accurate performance numbers are required. A simulator
may not support operating systems, instead requiring applications to be compiled with libraries that mimic
the assumed operating system calls. Such a simulator is less than optimally usable for applications
that do not require complicated operating system functionality, and is not usable if the applica-
tions do require complex operating system functionality. A simulator that is extremely accurate
for a particular micro-architecture may not be usable for a functional-only usage case, due to its
slow speed, or for a different micro-architecture unless it has support for easily modifying the
micro-architecture to the desired target micro-architecture.
2.3 PERFORMANCE SIMULATION ACCURACY
In a performance simulator, accuracy is most often quantified in terms of average percentage
difference from the performance predicted by a more accurate reference point. Averages ignore
outliers that often illuminate the successes or failures of the target computer system.
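As a small illustration (with made-up per-benchmark IPC numbers), the metric can be computed as follows, reporting the worst case alongside the average precisely because the average hides outliers:

```python
# Hedged sketch of the conventional accuracy metric: per-benchmark percentage
# difference from a more accurate reference, with both the average and the
# outlier-revealing maximum. The numbers are hypothetical.
def pct_errors(predicted, reference):
    """Per-benchmark percentage difference from a more accurate reference."""
    return [abs(p - r) / r * 100.0 for p, r in zip(predicted, reference)]

errs = pct_errors(predicted=[1.10, 0.52, 2.00], reference=[1.00, 0.50, 1.60])
average_error = sum(errs) / len(errs)   # the commonly reported number (13%)
worst_case = max(errs)                  # the outlier the average hides (25%)
```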
It is impossible to determine how new applications/runtime systems/OSes might behave
on a simulator that has only been calibrated against existing applications/runtime systems/OSes.
However, there are many possible places where inaccuracies could mean the difference between
a profitable and an unsuccessful product. It is very difficult to tell just how inaccurate a simulator is
without an accurate reference point, be it a truly accurate simulator or the real target, to calibrate
against.
Any simulator used to make real decisions must offer sufficient accuracy for its uses. It is
often difficult to tell how accurate a simulator needs to be to generate meaningful results. When
a company invests billions of dollars to design and build a microprocessor, the stakes are much
higher compared to generating results for an academic paper. However, if academic paper results
are not sufficiently accurate to guide decisions, those papers may become irrelevant or mislead the
reader.
In addition, even if the level of accuracy needed is known, it is often impossible to determine
just how accurate a simulator is. There are two notable exceptions: pure functional simulation and
true cycle-accurate simulation. If one only needs to be able to run software on a simulator, one can
test a simulator for its functional correctness. A perfectly accurate simulator is easy to define, but
virtually impossible to build. Even if the full register transfer level (RTL) of a target is available
to be simulated, in real systems non-determinism is difficult to avoid due to issues such as clock
crossings and human-scale interactions.
Because there is no way to bound the error for the simulation of an arbitrary target, most
simulator developers compare their simulation results to either a more detailed simulator, which is
presumably more accurate, or actual hardware. A functional simulator is the easiest simulator to
verify, since there are often multiple, verified correct implementations of the functionality. For
example, one could use an existing computer system that supports the same ISA to serve as a
functional reference point.
There are at least two forms of simulation inaccuracy. The first form is abstraction error, where a simplification in the simulation model compared to the target
creates error. For example, some versions of Simplescalar used a constant to model DRAM latency when, in reality, DRAM latency varies depending on several parameters including the access pattern (i.e., row buffer locality) and refresh rates. The second form of simulation inaccuracy
is simulator implementation bugs. For example, a practitioner may forget to include the use of
a parameter in the simulator, making that parameter irrelevant. Either error can be misleading or,
even worse, can lead to the wrong conclusions. Both errors are, in general, non-trivial to find, as
there is generally no good reference to compare against.
2.4 SIMULATOR DESIGN TRADEOFF
Simulator design and development is a continuous tradeoff between the five desired characteristics. For example, if accuracy does not matter (e.g., always predict an IPC of 1), it is trivial to build
an infinitely fast simulator that is infinitely flexible and very usable.
The speed of a simulator can have a first-order effect on the accuracy of the overall simulation. Modern computer systems incorporate many learning structures, such as branch predictors,
cache replacement algorithms, prefetching units, OS page allocation algorithms, and TCP/IP
windows, that require many cycles to warm up. As different programming languages with differ-
ent execution paradigms become more prevalent, they have very different effects on the execution
paths and instruction caching. As novel applications are introduced, they can sometimes stress
systems in different ways than stressed before. Operating systems can consume 75% or more of
the execution time for many modern applications, making accurately simulating them very important. The simulation of many billions of cycles may be needed to accurately observe the overall
system behavior of many interacting systems, each with a significant amount of state.
Simulation speed also affects simulator flexibility and usability. If a simulator is already fast
enough, one can take shortcuts that trade simulation speed for flexibility. Likewise, a fast simulator
enables more experiments to be run, increasing usability. The effect speed has on flexibility and
usability is not limited to these examples.
There appears to be no upper limit to the usefulness of speed in a simulator. Simulators
significantly faster than real-time targets would still be very useful, enabling the study of many
alternative configurations. The faster the simulator, the larger the design space that can be explored.
Even the performance of real systems can vary, run-to-run, by tens of percent and, therefore, the
ability to do many runs with slightly different initial states is important.
However, simplifications are typically made to enable fast turnaround time, and to improve
flexibility and usability. For example, not running the OS drastically simplifies the simulator by
avoiding such complications as privileged instructions and support for device I/O. Another exam-
ple is to run a limited data set or run a benchmark program rather than a real set of applications.
Depending on the practitioner, there are varied levels of demand on the accuracy and us-
ability/completeness of the simulated functionality and timing. Moreover, the user may wish to
uncover different levels of detail to understand the inner workings of the system, and may have
different requirements of the speed of the simulation relative to real-time. At one extreme, if
the user requires total knowledge of every wire in real-time, there is little alternative to building
an appropriately instrumented target, as no simulator will run fast enough. However, if the user
relaxes their requirements (in even just one tradeoff dimension), many shortcuts become possi-
ble in the construction of the simulator. Developers of software-based computer simulators have
long taken advantage of a diverse range of shortcuts in simplifying their efforts while nevertheless
constructing simulators with “good enough” accuracy.
2.5 SIMULATOR PARTITIONING FOR PARALLELIZATION
Simulators can be partitioned in a variety of ways that improve performance through paralleliza-
tion. In this section, we describe three basic orthogonal partitioning schemes.
2.5.1 SPATIAL PARTITIONING
A natural way to partition a simulator is at the spatial boundaries of the target system, enabling
the simulator to exploit parallelism that has already been exploited in the target, presumably to
improve target performance. For example, core-level partitioning (Figure 2.1 (i)) partitions the
simulator at the target cores/shared cache level, while module-level partitioning (Figure 2.1 (ii))
further partitions the simulator into structural modules (target core fetch/decode/rename/etc.,
target cache banks).
Spatial partitioning has been studied in previous work [2, 8, 16, 29] in the realm of software
to minimize bandwidth and to tolerate latencies, enabling efficient mapping onto commodity processors. Though superlinear speedups were achieved in limited cases when compared to a
sequential host running the same number of target cores, due to the additional host cores providing additional host cache capacity [8], most artifacts achieved 50% efficiency given unicore host
speeds around 100 KIPS. Penry et al. [35] studied automatically parallelizing Liberty simulators,
including those of a unicore target, and also achieved roughly 50% efficiency.

Figure 2.1: Simulator partitioning at different boundaries.
In contrast to software-parallelized results, using core-level partitioning on an FPGA-based
simulator can achieve linear speedup due to the low cost of synchronization on an FPGA [49].
Core-level partitioning, however, often cannot extract sufficient parallelism to provide large sim-
ulation performance improvements, since the simulation is still bounded by the time required to
simulate a single target cycle on a single target core. By decomposing at the module level, it is possible to reduce this bottleneck to the time required to compute a single target module cycle [33].
However, simply mapping each core or module to its own dedicated set of FPGA resources
can consume significant area and incur losses in efficiency. Instead, many FPGA simulators employ a form of virtualization, using time-multiplexing to map a set of cores or modules onto a
single FPGA computation resource. This approach, known as multi-threading, is detailed further
in Section 4.3.1.
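As a software analogy for this time-multiplexing (a hedged sketch, not any particular simulator's design), per-core contexts can be rotated through one shared model pipeline:

```python
# Sketch of host multi-threading: one shared model pipeline is time-multiplexed
# across N target cores. Per-core architectural state lives in a context table
# (the software analog of a context SRAM on an FPGA), and cores are serviced
# in round-robin order.
class MultithreadedCoreModel:
    def __init__(self, num_target_cores):
        self.contexts = [{"pc": 0, "cycles": 0} for _ in range(num_target_cores)]
        self.next_core = 0  # round-robin thread-select pointer

    def step(self):
        """Advance one target core by one target cycle on the shared pipeline."""
        ctx = self.contexts[self.next_core]
        ctx["pc"] += 1       # stand-in for fetch/decode/execute of one instruction
        ctx["cycles"] += 1
        self.next_core = (self.next_core + 1) % len(self.contexts)

model = MultithreadedCoreModel(4)
for _ in range(8):           # 8 host steps advance each of the 4 cores 2 cycles
    model.step()
```

The tradeoff this sketch illustrates: one set of model resources is amortized over all cores, at the cost of each target core advancing only once every N host steps.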
2.5.2 TEMPORAL PARTITIONING
Another approach to extracting parallelism is to split a single performance simulation into a set of
simulations, each simulating a different chunk of target time, but not necessarily simulating all of
target time of interest or at the same accuracy. One example of temporal partitioning is statistical
sampling, such as [40] and [50], that are often used in software-based simulators to reduce the
amount of time required for detailed simulation. Using a functional simulator to fast-forward to
different points during program execution, it is possible to extract temporal parallelism. e key
disadvantage in such approaches is the inaccuracy created by the fast-forward process and the need
to run the performance simulator for significant periods of time to warm up micro-architectural
state after the fast-forward period. Furthermore, when simulating multiprocessor targets, fast-
forwarding often uses a long instruction quantum to interleave the execution of target cores for
efficiency purposes. This coarse-grained interleaving can distort the execution of a multithreaded
program, resulting in performance inaccuracies.
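The sampling flow described above can be sketched as follows; the function and its parameters are hypothetical simplifications:

```python
# Hedged sketch of statistical sampling: functionally fast-forward between
# samples, run a detailed warm-up whose results are discarded, then measure a
# short detailed window; overall IPC is extrapolated from the measured windows.
def sampled_ipc(program_len, period, warmup, window, detailed_ipc_at):
    samples = []
    pos = 0
    while pos + period <= program_len:
        pos += period - warmup - window       # cheap functional fast-forward
        pos += warmup                         # detailed warm-up (discarded)
        samples.append(detailed_ipc_at(pos))  # detailed measurement window
        pos += window
    return sum(samples) / len(samples)        # extrapolate from the samples

# Toy use: a stand-in detailed model that always reports an IPC of 1.5.
estimate = sampled_ipc(1_000_000, 100_000, 5_000, 1_000, lambda pos: 1.5)
```

The inaccuracy the text describes lives in the fast-forward step: if the cheap functional phase distorts micro-architectural or multithreaded state, the detailed windows measure a subtly wrong machine.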
As FPGA-accelerated simulators can be designed to solve many of the problems asso-
ciated with sampling simply by the increase in simulation speed, extracting temporal parallelism
from an FPGA-accelerated simulator is often realized differently than in software. Some FPGA-
accelerated simulators attempt to solve the problems introduced by sampling (coarse-grained execution and lengthy warm-up periods) by mapping these activities onto the FPGA [14]. While
temporal parallelism is not extracted directly by the FPGA simulator itself, by interleaving functional execution on a single-instruction basis and maintaining constant warm-up of key micro-architectural structures, the FPGA simulator can enable temporal parallelism to be extracted in software.
Another approach to temporal decomposition in FPGA-accelerated simulators is to exploit
the parallelism between multiple independent experiment trials. As the target cycles in different
experiments are independent, it is possible to simulate multiple target cycles from different trials
at the same time on an FPGA simulator [49]. Such an approach is useful when the target system
may only contain a few target cores, allowing multiple small targets to be aggregated into a single
simulation job on the FPGA.
2.5.3 FUNCTIONAL/TIMING PARTITIONING
It is also possible to partition a simulator into a functional partition/model and a timing parti-
tion/model. e functional partition simulates the functionality of the target system, while the
timing partition predicts the performance (and/or power, temperature, reliability, etc.) of the tar-
get. In a software context, such a partitioning is traditionally used to promote reuse [4, 17] but
not for parallelism. Functionality changes slowly, as it is exposed as a contract to the entire soft-
ware stack via the ISA, while the micro-architecture, described in the timing model, changes
frequently. us, change is mostly isolated in the timing model between successive architectural
refinements while the same functional model can be designed and verified once and then reused
with infrequent modifications.
For example, by partitioning along functional/timing at the core-level, the ISA behavior
and the micro-architectural timing become separate simulation entities (Figure 2.1 (iii)) that can
then be parallelized. Conceptually, the functional and timing partitions are connected by an ab-
stract trace buffer that contains a trace consisting of multiple trace buffer entries generated by the
functional partition, each containing information about its instruction such as the opcode, source
and destination register names, instruction pointer, data addresses, and so on.
A timing partition uses trace information to accurately predict when activity occurs in the
target. For example, a timing model uses functionally-generated addresses to determine if a load
hits in the cache. The timing model also uses functionally generated source and destination register names to determine register dependencies. As long as the functional information is exactly
what the actual target would have generated, a correct timing model can model the target per-
fectly accurately. In general, however, a functional model requires timing information to generate
a target-correct trace. For example, the functional partition must know when a branch is mis-
predicted and when it is resolved to generate the correct wrong path instructions. Likewise, the
functional partition must know when a load is performed in relationship to a store to the same
location to return the target correct value to the load.
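A minimal sketch of the abstract trace buffer might look like the following; the field names are illustrative, not taken from any particular simulator:

```python
# Hedged sketch of the trace-buffer interface between a functional partition
# (producer) and a timing partition (consumer). Field names are assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple
from collections import deque

@dataclass
class TraceEntry:
    """One functionally executed instruction, as seen by the timing partition."""
    opcode: str
    src_regs: Tuple[str, ...]
    dst_reg: Optional[str]
    pc: int
    mem_addr: Optional[int] = None   # data address for loads/stores

trace_buffer = deque()               # the abstract buffer linking the partitions

# Functional partition: execute an instruction and record what it did.
trace_buffer.append(TraceEntry("load", ("r1",), "r2", pc=0x400, mem_addr=0x1000))

# Timing partition: model timing from the trace, e.g., a tags-only cache
# lookup on the functionally generated address (no data storage needed).
cache_tags = {0x1000 >> 6}           # one resident 64-byte line (illustrative)

entry = trace_buffer.popleft()
hit = entry.mem_addr is not None and (entry.mem_addr >> 6) in cache_tags
```

Note how the timing side never touches data values, only addresses and register names, which is exactly what lets the cache model keep tags without data.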
In the context of FPGA-based simulators, partitioning along the functional/timing bound-
ary allows for a variety of mapping choices. By selecting which parts of the simulation are mapped
to the FPGA and how these components interact, different simulator organizations can be
constructed. Chapter 3 explores the design space of simulators along this dimension in more detail.
2.5.4 HYBRID PARTITIONING
The three forms of partitioning are often employed in concert with each other. For example, spatial
partitioning can be first applied at the core-level, then functional partitioning can be applied to
extract the ISA behavior/micro-architecture timing components. The micro-architecture can then
be further spatially partitioned to extract timing module-level parallelism.
2.6 FUNCTIONAL/TIMING SIMULATION ARCHITECTURES
As discussed previously, simulators are distinct from prototypes and, therefore, have different
architectures than their targets. While each of the three dimensions of parallelization may be
explored with different simulator architectures, the functional/timing dimension is a defining
characteristic of processor simulation. The five basic functional-timing simulator architectures
shown in Figure 2.2 have been categorized in the literature [10, 27] as (i) monolithic simula-
tors (sometimes called integrated simulators), (ii) timing-directed simulators, (iii) functional-first
simulators, (iv) timing-first simulators, and (v) speculative functional-first simulators.
2.6.1 MONOLITHIC SIMULATORS
A monolithic simulator combines target functionality and target performance prediction in a
monolithic piece of code. A monolithic simulator is, in fact, one form of an implementation
of the target and thus might be considered a prototype, but it is likely to have been structured
differently or simplified in some way from the target.
Some monolithic simulators compute every event that happens on each cycle for each component. Thus, each component of the functionality of the target is performed at the correct relative target time compared to all of the other components of the target. For example, executing an
ADD instruction on an out-of-order processor requires several steps, many of which are timing
dependent. e ADD instruction must be fetched from instruction memory, decoded, register-
renamed, dispatched to a reservation station, wait for operands, issued to the ALU (if necessary),
completed, written back and sent to the other reservation stations, and retired in order. A mono-
lithic simulator does not separate functionality from timing; thus, the fetch occurs at the correct
target time, ensuring that a store to the instruction address in the case of self-modifying code
occurs at the correct moment in target time.
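A toy sketch of this combined style (far simpler than any real monolithic simulator; the three-operand ADD format is an assumption):

```python
# Toy monolithic simulator: functionality and timing advance together, one
# target cycle per loop iteration, so every fetch and execute happens at its
# correct relative target time (e.g., self-modifying code is seen on time).
def run_monolithic(program, cycles):
    state = {"pc": 0, "regs": [0, 1, 2, 3], "cycle": 0}
    for _ in range(cycles):
        op = program[state["pc"]]                 # fetch at the correct cycle
        if op[0] == "add":                        # ("add", rd, ra, rb)
            _, rd, ra, rb = op
            state["regs"][rd] = state["regs"][ra] + state["regs"][rb]
        state["pc"] = (state["pc"] + 1) % len(program)
        state["cycle"] += 1
    return state

final = run_monolithic([("add", 0, 1, 2), ("nop",)], cycles=2)  # r0 = r1 + r2
```

Because there is no separate functional model, every behavior, functional or timing, must be implemented at the right cycle, which is the source of the design complexity discussed below.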
A sufficiently detailed register transfer level (RTL) style description of a microprocessor,
whether it is written in Verilog, C, or any other language, is a common example of a monolithic
simulator. In such a simulator, every register in the design is instantiated and correctly written on
each target cycle.
Despite the potential for high levels of accuracy, monolithic simulators are difficult to write
and modify, since they are detailed and require each operation to be carried out at the correct
target time, even though it might not be necessary to perform the operation at the correct target
time to be either functionally or timing accurate.
Figure 2.2: Simulator architectures. Functional model components are in light green, timing model components in dark green, monolithic components in gray.
2.6.2 TIMING-DIRECTED SIMULATORS
Timing-directed simulators were developed to address the design complexity of monolithic sim-
ulators and to promote code reuse.
Timing-directed simulators are factored into a timing model and a functional model. The
functional model performs the actual tasks associated with fetching an instruction, decoding an
instruction, renaming registers, and actually executing the instruction. When the timing model
determines that some functionality should be performed, it calls the appropriate functional model
to perform that function and to return the result (if the timing model depends on that result to
proceed). us, the functionality is performed at the correct target time as it would in a monolithic
simulator. However, the functional model is implemented separately and can be reused along with
multiple timing models.
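To make this division of labor concrete, the call pattern can be sketched in software as follows. The timing model decides *when* work happens and how many cycles it costs; the functional model performs the actual work on demand. All class and method names here are illustrative, not taken from any particular simulator.

```python
# Sketch of timing-directed coupling: the timing model determines when
# functionality should occur and calls into the functional model, which
# performs the work and returns the result. Illustrative only.

class FunctionalModel:
    """Executes instructions architecturally, on demand."""
    def __init__(self, program):
        self.program = program          # list of (op, dst, src1, src2)
        self.regs = [0] * 8
        self.pc = 0

    def fetch_decode(self):
        inst = self.program[self.pc]
        self.pc += 1
        return inst

    def execute(self, inst):
        op, dst, a, b = inst
        if op == "add":
            self.regs[dst] = self.regs[a] + self.regs[b]
        elif op == "li":                # load-immediate, for illustration
            self.regs[dst] = a
        return self.regs[dst]

class TimingModel:
    """Models only latency; invokes the functional model at the right target time."""
    ALU_LATENCY = 2                     # assumed fixed-latency ALU

    def __init__(self, fm):
        self.fm = fm
        self.cycle = 0

    def run_one_instruction(self):
        inst = self.fm.fetch_decode()   # functionality performed at fetch time
        self.cycle += 1                 # fetch/decode modeled as one cycle
        result = self.fm.execute(inst)  # blocking call into the functional model
        self.cycle += self.ALU_LATENCY  # timing model accounts for the latency
        return result
```

Note that each target instruction involves blocking calls between the two models, which is exactly the tight coupling discussed later in this section.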
At a high level, a timing model only needs to model activity that impacts timing and,
therefore, does not need to model activity that only impacts functionality. For example, given a
fixed latency integer ALU, one only needs to model its latency in the timing model, rather than
modeling the functionality of the ALU. One simple way to model latency is to delay the output
by the desired latency. Another example is the fact that simulated caches do not require actual
data. A third example is that instructions do not need to be decoded (since they have already
been decoded by the functional model). Thus, timing-directed timing models (and many timing
models in general) appear to be aggressively stripped down targets, with only the performance
skeleton remaining.
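The delayed-output approach to latency modeling mentioned above can be sketched as a shift queue whose depth equals the target latency: the timing model never computes the value, it simply carries a token through the queue. The class name and interface are illustrative.

```python
from collections import deque

# Sketch of modeling a fixed-latency unit purely by delaying its output.
# A token (e.g., an instruction tag) emerges `latency` target cycles after
# it was inserted; no actual ALU functionality is modeled. Illustrative.

class DelayPipe:
    def __init__(self, latency):
        # One slot per cycle of latency, pre-filled with nulls.
        self.stages = deque([None] * latency, maxlen=latency)

    def cycle(self, token_in):
        """Advance one target cycle; return the token inserted `latency` cycles ago."""
        token_out = self.stages[0]     # oldest token, about to be evicted
        self.stages.append(token_in)   # maxlen evicts the element just read
        return token_out
```

With a latency of 2, a token inserted on cycle N reappears on cycle N+2, which is all the timing model needs to know about the unit.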
To be accurate, the timing model must capture a tremendous amount of tightly connected,
parallel activity. In fact, this is the reason why fast microprocessors must be implemented in hard-
ware. If there were a way to efficiently simulate an aggressive microprocessor target on a multi-
processor host, one could likely use those techniques to make a faster processor. Thus, as the
complexity of the target computer system grows, the timing model gets progressively slower. The
timing model is the bottleneck for both a functional-first simulator (described later in this section)
and a timing-directed simulator.
The Intel Asim [17] simulator is a timing-directed simulator that utilizes a functional model
to perform decode, execute, memory operations, kill, and commit. PTLSim [51] is a timing-
directed simulator that has a functional model that performs instruction operations, but does not
actually update state, which is left to the timing model. The M5 “execute-in-execute” simulator is
a timing-directed simulator that functionally performs the entire instruction when the instruction
is executed in the timing model.
Depending on (i) the level of accuracy desired, (ii) the target, and (iii) the decision as to how
functionality is partitioned, the functional model of a timing-directed simulator often reflects the
target system at least to some degree. For example, an Asim functional model provides infinite
register renaming to enable simulation of targets with register renaming. Thus, it is possible that
a target might require a specialized functional model to accommodate it.
The timing model and the functional model in a timing-directed simulator are very tightly
coupled with bi-directional communication occurring several times per target cycle. As a result,
exploiting timing-directed partitioning to parallelize simulators is unlikely to result in speedups.
For example, when simulating a target with an idealized pipeline with one instruction committed
per cycle (IPC=1), there will be an average of one set of blocking calls between the timing model
and functional model every target cycle. e blocking calls sequentialize the computation, limiting
parallelism. us, if 100 Million Instructions Per Second (MIPS) of simulation performance is
desired, assuming a minimal one call into the functional model per instruction, an interaction
occurs every 10ns. e communication latency between any CPU and any off-chip component
by itself, not counting any time to perform the functional model, will be significantly longer
than 10ns. us, it is not surprising that we are not aware of any software-hosted simulators
parallelized on timing-directed boundaries. Instead, this partitioning is intended purely for reuse
and complexity mitigation.
2.6.3 FUNCTIONAL-FIRST SIMULATORS


Executing instructions in a processor could potentially interact with other instructions. For exam-
ple, if the microprocessor has multiple heterogeneous decoders, the decoder selected for a partic-
ular instruction depends on other instructions being decoded at roughly the same time. However,
in many such cases, such interactions do not have an effect on functionality and, therefore, can
be safely ignored without compromising functional accuracy.
Functional-first simulators were designed around the assumption that timing does not af-
fect functionality. For simulator implementation convenience, a functional-first simulator sep-
arates functionality from timing as does a timing-directed simulator. Thus, a functional model
only needs to be developed once and can be reused across a variety of timing models of different
targets with varying accuracy. Unlike a timing-directed simulator, the functionality is executed
before the timing, thus accounting for the name “functional-first.” Doing so further simplifies
the simulator. e functional model executes the program at an architectural level, executing in-
structions and modifying architectural state. As it does so, it generates an instruction trace that
contains information such as the instruction address, the instruction itself, the opcode, source
registers, destination register(s), data addresses, and so on, that the timing model needs to pre-
dict performance. Thus, a functional model executes ahead of the timing model, feeding the timing
model an instruction trace providing all the necessary information the timing model needs.
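A minimal sketch of such trace generation is shown below. The functional model executes architecturally and emits one trace record per instruction carrying the fields the timing model needs; the record fields mirror those listed above, but the names themselves are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Sketch of a functional model emitting a trace for a downstream timing
# model, as in functional-first simulation. Names are illustrative.

@dataclass
class TraceEntry:
    inst_addr: int
    opcode: str
    src_regs: Tuple[int, int]
    dst_reg: int
    data_addr: Optional[int]      # for loads/stores, else None

def functional_execute(program, regs, mem):
    """Execute architecturally and emit one TraceEntry per instruction."""
    trace = []
    for addr, (op, dst, a, b) in enumerate(program):
        data_addr = None
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "ld":
            data_addr = regs[a] + b            # base + offset addressing
            regs[dst] = mem.get(data_addr, 0)
        trace.append(TraceEntry(addr, op, (a, b), dst, data_addr))
    return trace
```

The timing model later walks `trace` to predict performance without re-executing anything; the trace could equally well be written to disk and piped in later.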
The functional model does not need to run at the same time as the timing model. It is com-
mon to generate and store the instruction trace on disk and pipe the traces from disk through
the timing model. Many functional-first simulators have been written including versions of Sim-
plescalar [24], SESC [38], and Graphite [28].
With multiprocessor targets, functional execution requires some form of instruction inter-
leaving, even if it is simply based on the performance characteristics of the host machine. As a
result, a simulator designed as a functional-only simulator of a multiprocessor target is effectively
a functional-first simulator attached to some form of implicit timing model. A simple form of
an implicit timing model is an idealized single-cycle pipeline model that allows for deterministic
interleaving at a single-instruction granularity, which was done in the ProtoFlex simulator.

2.6.4 TIMING-FIRST SIMULATORS


Timing-first simulators [27] address the difficulty of creating a complete performance simulator
that accurately models all target functionality. A timing-first simulator includes a performance
simulator that executes target functionality. Thus, it can be considered a monolithic simulator,
or even a timing-directed simulator. However, it does not need to implement all functionality, just
the most commonly used functionality. As each instruction is retired, a separate, known-to-be-
correct, and generally complete functional simulator executes the same instruction within its own
copy of the architectural state. e architectural updates from the performance simulator are com-
pared to the architectural updates from the known correct functional simulator. If the updates are
the same, no further action is required and execution continues. If not, the performance simula-
tor’s pipeline is flushed, its architectural state is forced to be the same as the functional simulator,
and execution continues from that point.
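The retire-time check-and-recover step can be sketched as follows, simplified to a single register update per retired instruction. All structures and names are illustrative, not from any particular timing-first simulator.

```python
# Sketch of the timing-first check at retirement: the performance
# simulator's architectural update is compared against a trusted
# functional simulator; a mismatch forces a resync. Illustrative only.

def retire_and_check(perf_state, func_sim_step, perf_update):
    """
    perf_state:    architectural register state of the performance simulator
    func_sim_step: callable returning the known-correct update (reg, value)
    perf_update:   (reg, value) the performance simulator wants to commit
    Returns True if the performance simulator had to be corrected.
    """
    golden = func_sim_step()        # functional simulator retires the same instruction
    if perf_update == golden:
        reg, val = perf_update
        perf_state[reg] = val
        return False                # agreement: continue normally
    # Divergence: (modeled) pipeline flush, then copy the golden state in.
    reg, val = golden
    perf_state[reg] = val
    return True
```

In a real timing-first simulator the correction would also flush the modeled pipeline and restart fetch, which is the source of the simulation error discussed below.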
Thus, the performance simulator component of a timing-first simulator does not need to
implement the full instruction set and peripheral functionality, nor does the implementation need
to be absolutely correct. It relies on the functional simulator to detect and correct omissions and
errors. is capability reduces the complexity of designing a full-system simulator.
One downside to timing-first is that every instruction that is being modeled accurately
must be executed by both the performance model and the functional model. In addition, every
omission/error detected and corrected introduces simulation error due to the pipeline flush and
restart. Also, due to the use of a functional simulator for verification, which executes instructions
in order, a timing-first simulator is limited to targets that only support sequential consistency. A
non-sequentially consistent memory ordering would be flagged as incorrect, even though it might
be target-correct in a non-sequentially consistent memory model.
2.6.5 SPECULATIVE FUNCTIONAL-FIRST


While functional-first simulation makes the assumption that functionality is divorced from tim-
ing, they are actually related. For example, branch mispredicts and resolves, the relative ordering
of multiprocessor load-stores, and atomic read-modify-write ordering can all directly affect the
functionality of the processor model. In such cases, a functional-first simulator is inaccurate. This
is precisely the reason why a timing-directed simulator must break up a monolithic functional
model into a series of micro-functional steps in order to tightly control the interleaving and ap-
plication of side-effects for each of these steps.
Speculative functional-first (SFF) simulation [9, 10] is derived from functional-first sim-
ulation, but addresses the functional-first accuracy problem while providing opportunities for
parallelization and FPGA-acceleration. It is based on two observations. The first is that micro-
functional synchronization enforced by a timing model is only necessary when the instruction
stream and the implied execution produced by the functional model differs from what the target
would have produced. An observable race, such as when one target core loads from the same loca-
tion that another target core is storing to, is an example of where the functional model differs from
the target. e second observation is that scalable target applications have few observable races
in the functional model, while all applications, scalable or not, will have very frequent observable
races in the timing model since timing model modules generally communicate every cycle.
SFF’s performance approaches that of functional-first, but preserves the correctness of
timing-directed simulation. Like a functional-first simulator, an SFF simulator’s functional model
executes and initially populates the trace without timing model input. The timing model reads
information from the trace as it needs it. However, rather than assuming that the information is
correct, the timing model (i) detects when the trace information has diverged from being target
correct and (ii) corrects the diverged trace by providing sufficient information to the functional
model to enable it to regenerate the trace so it is target correct.
Divergence between the functional-execution and timed performance simulation is de-
tected by propagating values generated by the functional model through the timing model into a
set of timing model oracles. Functional values for potentially inconsistent values (e.g., branch tar-
gets and memory load/store values) are provided as part of the functional trace. When the timing
model executes a store, the functional value is stored into the timing model memory oracle. When
the timing model executes a load, the load is performed from the timing model memory oracle
and compared against the functional value. If the functional value differs from the target value,
the functional model is target incorrect and must be corrected before the timing model can con-
tinue. Since the timing model is only responsible for propagating the values, not generating them,
the normal division of work between functionality and timing of a functional-first simulator is
preserved.
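The store/load oracle mechanism described above can be sketched as follows. The class and its interface are assumptions for illustration, not the actual SFF implementation.

```python
# Sketch of an SFF timing-model memory oracle: stores propagate the
# functional value into the oracle at the store's *timing-model* position;
# loads check the trace's functional value against the oracle and report
# divergence. Illustrative only.

class MemoryOracle:
    def __init__(self):
        self.mem = {}

    def store(self, addr, functional_value):
        # Timing model executes a store: deposit the functional value.
        self.mem[addr] = functional_value

    def load(self, addr, functional_value):
        """Timing model executes a load. Returns (target_value, diverged)."""
        target_value = self.mem.get(addr, 0)
        return target_value, target_value != functional_value
```

For example, if one target core's store to an address is ordered (in target time) before another core's load, but the functional model happened to execute the load first and saw the old value, the load check flags a divergence and the functional model must be corrected.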
Correcting for functional execution requires communicating back the point of divergence
and the corrected value that should have been used during the functional execution. The func-
tional model then rolls back its execution (using a standard checkpoint-replay scheme), corrects
its execution at the point of divergence, then continues execution. In the case of an incorrect
functional load, the functional model is rolled back to the load instruction, the value is corrected,
and the following instructions are replayed with that new value up to, but not including the first
unexecuted dynamic instruction.
Multiple rollbacks may occur and their effects, therefore, should accumulate. Multiple roll-
backs can be supported through the use of a replay log that contains an entry for every possible
unretired load. On the execution of a dynamic load instruction, the corresponding entry is popu-
lated with the functional load value. When rollback is initiated due to divergence, the appropriate
entry in the replay log is overwritten with the corrected load value. As instructions are replayed,
the load instructions use the load values from the replay log, rather than re-execute the loads
against the memory. New trace entries are generated that replace the original trace entries gen-
erated during execution or a previous replay. Once the last dynamic instruction to be originally
executed is replayed, the simulator transitions to executing instead of replaying.
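A sketch of such a load replay log follows; the structure and method names are illustrative, not taken from the SFF papers.

```python
# Sketch of a replay log supporting accumulated rollbacks: each unretired
# dynamic load has an entry holding its value; a correction overwrites the
# entry, and replay reads values from the log instead of re-executing the
# load against memory. Illustrative only.

class ReplayLog:
    def __init__(self):
        self.entries = {}                      # dynamic-load id -> load value

    def record(self, load_id, value):
        self.entries[load_id] = value          # populated at first execution

    def correct(self, load_id, target_value):
        self.entries[load_id] = target_value   # rollback overwrites the entry

    def replay_value(self, load_id):
        return self.entries[load_id]           # replay reads the log, not memory

    def retire(self, load_id):
        del self.entries[load_id]              # commit reclaims rollback resources
```

Because corrections simply overwrite log entries, the effects of multiple rollbacks accumulate naturally: a later rollback replays through values fixed by earlier ones.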
SFF simulators enable accurate simulation whenever the functional model may differ
from the target and, therefore, are much more general than simply correcting memory order. For
example, a functional model generally does not generate wrong path instructions. By providing
a branch predictor oracle (which is simply a model of the target branch predictor) in the timing
model, the timing model can then compare the instruction pointer generated by the functional
model with the predicted branch. When a divergence is detected, the functional model is rolled
back and the branch is taken as the target would have. On branch misprediction correction, the
branch divergence is detected again and the functional model is corrected again. Nested branches
are handled by a branch target log, conceptually identical to the load replay log, that is replayed
when instructions are replayed after a correction, before an unexecuted dynamic instruction is
reached.
Speculative memory and arbitrary memory models can be easily simulated in the same
way. us, speculative functional execution with timed performance correction allows an SFF
simulator to overcome the correctness limitations of a normal functional-first while maintaining
the basic instruction-at-a-time execution model.
An example of a speculative functional-first simulator handling branch misprediction is

shown in Figure 2.3. e instruction pointer I2 , pointing to the instruction BR z L1, is a mis-
speculated branch. At time T D 1, the functional model has already executed five instructions.
e first two instructions are in the timing model and the last three are still waiting in the trace
buffer. In the first stage (fetch), I2 is detected to have been mispredicted by comparing the in-
struction pointer produced by the functional model and inside the trace with the branch predictor
predicted instruction pointer in the timing model. e timing model notifies the functional model
that I2 is misspeculated and to continue executing from I4 (either a “*” or a dark border is used
to indicate a wrong-path but target-correct instruction.)
e timing model is stalled until the first wrong path instruction, I4 , arrives. m cycles
later (T D 1 C m), two wrong-path instructions, I4 and I5 , have been sent to the trace buffer,
overwriting the target incorrect instructions I3 , I4 , and I5 . e timing model fetches I4 on
the next cycle and I5 on the cycle after that, feeding each into the pipeline. e timing model
resolves the branch at T D 3 C m, notifying the functional model. e functional model then
rolls back and regenerates I3 , I4 , and I5 , overwriting (shown by the black trace buffer entries)
the wrong path instructions by time T D 3 C m C n. e next cycle, the timing model fetches
the next instruction for its Fetch unit and commits I1 by advancing the commit pointer. As the
timing model commits, it notifies the functional model, allowing the functional model to reclaim
rollback resources.

Figure 2.3: Illustrating the handling of misspeculation in SFF.

2.7 SIMULATION EVENTS AND SYNCHRONIZATION

Simulating target time and events is another aspect of simulator organization. Software is not
naturally parallel but hardware is. Software must execute each concurrent operation sequentially,
but in the correct order to correctly read and update state. One way to simulate concurrent op-
erations is using simulator events. An event is a piece of code that models a particular operation
that might execute concurrently with other events in the target. Each event is stamped with a
time that indicates the target time when it will execute. As events are created or recycled, they are
stored in an event wheel or queue. Events can be created for a future time. Events are dequeued
and executed from the event queue in target time order.
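A minimal event queue along these lines can be built from a heap ordered by target time; the class and its interface are illustrative.

```python
import heapq

# Sketch of an event queue ordered by target time: events may be scheduled
# for future target times and are dequeued in time order, so target cycles
# with no activity are skipped entirely. Illustrative only.

class EventQueue:
    def __init__(self):
        self.heap = []
        self.seq = 0       # tie-breaker so same-time events pop in schedule order
        self.now = 0       # current target time

    def schedule(self, time, action):
        heapq.heappush(self.heap, (time, self.seq, action))
        self.seq += 1

    def run(self):
        executed = []
        while self.heap:
            self.now, _, action = heapq.heappop(self.heap)
            action(self)                 # an event may schedule future events
            executed.append(self.now)
        return executed
```

As in the multithreaded-pipeline example below, an event that handles one active stage can enqueue the successor stage at a later target time, and all idle intervening cycles are never simulated.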
It is possible that multiple events are for the same target time. For example, the first event
may be the producer of the data of a pipeline register implemented with a master-slave flip-flop
and the second event may be the consumer of the data of a pipeline register. Thus, both events
should execute at the same target time. Depending on how the simulator is written, however, the
order in which those two events are executed in the simulator may or may not matter.
If, for example, the pipeline register is implemented as two variables, each representing a
latch in a master-slave flip-flop, either the first event or the second event can execute first, since
the first event will write into the first variable and the second event will read from the second
variable. ere is not the possibility of the second event reading the data written by the first event
on that same cycle. Such an approach, however, requires the first variable be copied to the second
variable after both events have completed executing, thus requiring another simulator event to
occur after target events have completed.
As an alternative, the pipeline register could be implemented as a single variable. In that
case, the second event must be executed before the first, ensuring that the second event executes
with the pipeline register’s value from the previous cycle. In general, the events must be sorted in
reverse pipeline order, where the end of the pipeline executes first and the front of the pipeline
executes first. If, however, the pipeline has a loop, sorting is impossible. In that case, at least one
pipeline register must be simulated by double variables.
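The two-variable scheme can be sketched as follows; the explicit clock-edge copy corresponds to the additional simulator event described above, and the names are illustrative.

```python
# Sketch of a simulated pipeline register using two variables (one per
# latch of a master-slave flip-flop). Producer and consumer events may
# execute in either order within a target cycle, because the producer
# writes the master and the consumer reads the slave. Illustrative only.

class TwoVarPipeReg:
    def __init__(self):
        self.master = None      # written by the producer event
        self.slave = None       # read by the consumer event

    def write(self, v):         # producer event, any time during the cycle
        self.master = v

    def read(self):             # consumer event: always sees last cycle's value
        return self.slave

    def clock_edge(self):       # extra simulator event after all target events
        self.slave = self.master
```

With a single variable instead, `write` and `read` would share storage, and the consumer event would have to be scheduled before the producer to read the previous cycle's value.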
Event queues provide the capability to “skip” target time when there is no other activity.
For example, if one was simulating a HEP-like eight stage multithreaded microprocessor that
only executes an instruction from a thread every eight cycles and was only running one thread,
there is no point to simulate seven stages that are not active at any given target time. One would
only need to simulate the one active stage out of eight at any given target time. One event-based
simulator strategy would, as it executes each event, enqueue an event to simulate each successive
stage into the next target time.
An alternative to event-based simulation is cycle-by-cycle simulation that executes every
component/event every cycle. Such a scheme may seem less efficient, since there may be times
when a particular component has nothing to do, but doing so eliminates the event queue over-
heads. There are cases where a cycle-by-cycle simulator is faster than an event-driven simulator,
especially if the events are appropriately statically scheduled to eliminate overheads [21].
Maintaining synchronization between timing events requires simulating a consistent tar-
get clock. Maintaining synchronization between functional events on the other hand, requires
simulating an instruction interleaving that adheres to a desired target cycle interleaving.
In monolithic or timing-directed simulator designs, it is important to decide how to sim-
ulate a consistent target clock as all events are timing events. In a functional-first simulator, how
functional instruction execution is interleaved across multiple target cores can be chosen to bal-
ance a tradeoff between accuracy and efficiency. For example, simulation efficiency might be very
high if one target core’s instructions are completely executed before another target core’s instruc-
tions are completely executed. However, since one target core’s instruction execution can affect
another target core’s instruction execution, executing one to completion before the other can result
in inaccurate results. Fine-grain interleaving may more closely model the actual target instruction
interleaving but reduce the overall simulation efficiency.

2.7.1 CENTRALIZED SYNCHRONIZATION


Centralized synchronization uses a centralized component to synchronize across multiple events.
On a sequential host, centralized synchronization between events is naturally provided by a
correctly implemented event queue and/or cycle-by-cycle simulation.
Centralized synchronization has also been used in parallelized simulators [8]. An example
is a centralized component running on a single core that other components can call to perform
the synchronization activity. Doing so, however, limits the amount of parallelism to the amount
of throughput that the centralized synchronizer supports.
The implication of a centralized barrier synchronization is that everything from one target
cycle (or sub-cycle) must be complete before the next target cycle (or sub-cycle) starts. Doing so
simplifies the simulator, at the potential cost of performance, especially in a parallelized simulator.
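On a sequential host the barrier property reduces to a simple loop, which makes it easy to see in a sketch; the names are illustrative.

```python
# Sketch of centralized barrier synchronization: the central synchronizer
# does not start target cycle N+1 until every component has completed its
# work for target cycle N. On a sequential host this is just a loop; a
# parallel host would wait at a real barrier here. Illustrative only.

class CycleBarrier:
    def __init__(self, components):
        self.components = components    # callables, one per target component
        self.cycle = 0

    def step(self):
        # Every component finishes the current target cycle...
        results = [c(self.cycle) for c in self.components]
        # ...before the central synchronizer advances target time.
        self.cycle += 1
        return results
```

The simplification (and the cost) is visible here: a fast component that finishes cycle N early still waits for the slowest one before any component may begin cycle N+1.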

2.7.2 DECENTRALIZED EVENT SYNCHRONIZATION


One can also implement synchronization in a decentralized way, where synchronization is imple-
mented on a per-connection basis, rather than in a centralized component. What that implies is
that different parts of the simulator can be active at the same time, providing more opportunities
to parallelize the simulator.
For example, A-Ports [34], used in Asim, or FAST-Connectors [11] can provide decen-
tralized synchronization. Each simulator module may have multiple inputs, each represented by
the endpoint of an A-Port that is connected to that module. e module waits for inputs on all
of its A-Ports before it executes. A-Ports require activity every target cycle, even if it is a null
input, which keeps each input on the module synchronized with the others. e latency of such
ports/connectors can be set, along with the bandwidth, enabling them to model a wider interface
(say, a one-, two-, or four-issue processor) using the same code.
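A software sketch of such a latency-annotated port follows, pre-filled with null messages so that latency is modeled by the initial queue occupancy. This is an illustrative reconstruction of the idea, not the actual A-Port or FAST-Connector code.

```python
from collections import deque

# Sketch of a decentralized synchronization port: every target cycle
# carries exactly one message (possibly null), and pre-filling the queue
# with nulls makes the receiver see data `latency` target cycles after it
# was sent. A module fires once all of its input ports hold a message for
# its current target cycle; no global barrier is needed. Illustrative only.

class Port:
    def __init__(self, latency):
        # Pre-fill with nulls so reads are legal from target cycle 0.
        self.fifo = deque([None] * latency)

    def send(self, msg):            # producer: exactly one call per target cycle
        self.fifo.append(msg)

    def recv(self):                 # consumer: one call per target cycle; a null
        return self.fifo.popleft()  # message still advances target time
```

Because even idle cycles carry a null message, each endpoint can tell locally which target cycle its neighbor has reached, which is what allows different parts of the simulator to be active at different target times.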

CHAPTER 3

Accelerating Computer System Simulators with FPGAs
FPGA-based simulators have been proposed to speed up simulation by parallelizing the various
activities of a simulator and mapping these activities to execution resources on the FPGA fabric.
Thus, the first task in developing an FPGA-accelerated simulator is to determine how it should
be partitioned for parallelization, and which components should be run on the FPGA and which
should be run in software.
Because FPGAs are hardware, they are quite good at exploiting simulator parallelism, es-
pecially when simulating hardware targets. One can execute events in parallel in hardware. Of
course, an FPGA often contains fewer resources than the target, resulting in one of two possibili-
ties: either not all of the parallelism of the target is exploited in the simulator or multiple FPGAs
are used to provide additional resources so that all of the parallelism of the target can be directly
exploited in the simulator. Multiple FPGAs, themselves, require partitioning.
This chapter describes how FPGAs can be applied to accelerating simulators of computer
system targets, focusing on how the partitioning of standard simulator architectures can be lever-
aged to partition simulators for FPGA acceleration. It concludes with a case study example of
an FPGA-Accelerated Simulation Technologies (FAST) simulator, a speculative functional-first
FPGA-based simulator.

3.1 EXPLOITING TARGET PARTITIONING ON FPGAS


A computer system target is naturally partitioned. A simulator can be partitioned on those target
boundaries. For example, one could parallelize a two-core processor by mapping one target core
onto a host core and the other target core onto another host core. Using core-level partitioning on
an FPGA-based simulator can achieve linear speedup due to the low cost of synchronization on
an FPGA [49]. is core-level partitioning is often insufficient in extracting sufficient parallelism
as the detail of the core increases, since the simulation is still bounded by the time required to sim-
ulate a single target cycle on a single target core. By decomposing at the module-level it is possible
to reduce this bottleneck to the time required to compute a single target module cycle [33].
However, simply mapping each core or module to its own dedicated set of FPGA resources
can consume significant FPGA area and, therefore, reduce computation efficiency. Instead, many
FPGA simulators employ a form of virtualization, using time-multiplexing to map a set of cores
or modules onto a single FPGA computation resource. This approach, known as multithreading,
is detailed further in Section 4.3.1.
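The time-multiplexing idea can be sketched in software as a strict round-robin over target-core contexts: one physical pipeline (here, one step function) advances a different target core each host cycle. The function and names are illustrative.

```python
# Sketch of host multithreading: one execution resource is time-multiplexed
# over many target cores, round-robin, so each host cycle advances one
# target core by one target cycle. Illustrative only.

def multiplexed_run(core_states, step_core, host_cycles):
    """
    core_states: per-target-core state objects (one pipeline context each)
    step_core:   advances one target core's state by one target cycle
    """
    n = len(core_states)
    for host_cycle in range(host_cycles):
        target_core = host_cycle % n        # strict round-robin selection
        step_core(core_states[target_core])
    # After every n host cycles, each target core has advanced one target cycle.
```

On an FPGA the same idea lets one physical pipeline hold many target contexts in on-chip memory, trading simulation rate per target core for FPGA area.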

3.2 ACCELERATING TRADITIONAL SIMULATOR ARCHITECTURES WITH FPGAS
Rather than partitioning only on target boundaries, one could use the partitioning found in
traditional simulator architectures to accelerate with FPGAs, leveraging the extensive simulator
design experience in the literature.

3.2.1 ACCELERATING MONOLITHIC SIMULATORS WITH FPGAS


A monolithic simulator is either not partitioned, or partitioned on module boundaries (see pre-
vious section). Because they are not partitioned into functional/timing components, it becomes
difficult to accelerate only parts of the simulator in the FPGA and leave other parts in software. Thus,
monolithic simulators are oftentimes close to prototypes. In such cases, an FPGA implementa-
tion would also be close to a prototype and, therefore, likely to be difficult to implement in the
case of a realistic target. In a realistic target, the number of resources would be large, the intercon-
nection rich, and many multi-ported memories present, all of which consume a significant number
of FPGA resources.
It is possible that performance prediction is somewhat abstracted. For example, one could
make DRAM latency a constant. However, because full target functionality is implemented in
a monolithic simulator and, therefore, difficult to reuse, the implementation costs are generally
high. In addition, prototypes are often not sufficiently flexible to enable exploration, but tend
to only model their particular target micro-architecture and slight variations. Versions of Lib-
erty [35], RAMP-Red [30], RAMP-Blue [22], and RAMP-White [1] are examples of mono-
lithic simulators implemented on FPGAs.

3.2.2 ACCELERATING TIMING-DIRECTED SIMULATORS WITH FPGAS


There are many interactions between the functional model and timing model in an accurate
timing-directed simulator. Thus, implementing a fast timing-directed simulator in an FPGA re-
quires both the functional model and the timing model to be implemented on the FPGA to
minimize communication costs. Implementing the functional partition in the FPGA consumes
resources that scale with the complexity of the modeled instruction set, making simpler ISAs
more attractive to implement on an FPGA. Implementing a full x86-64 ISA on an FPGA, for
example, would consume significantly more FPGA resources than a Sparc V8 ISA. A functional
model becomes, in essence, a monolithic simulator of the ISA that is itself quite complex.
The Intel/MIT HAsim simulator [33] is an FPGA-based timing-directed simulator that is
effectively an FPGA implementation of an Asim simulator. RAMP-Gold [49] is another timing-
directed simulator implemented on an FPGA. It has a functional model that is separate from the
timing model and split into several components that each can be individually directed to process
by the timing model. Both the functional model and the timing model are implemented on the
FPGA to minimize the communication overhead. Simulating many target structures requires a
tremendous number of resources, which the FPGA is unlikely to have. Both HAsim and RAMP-
Gold use transplant and multithreading technologies, both discussed in Chapter 4, to provide full-
system capabilities and to support larger targets, respectively.

3.2.3 ACCELERATING FUNCTIONAL-FIRST SIMULATORS WITH FPGAS


The same reasons to implement a timing model on an FPGA apply to a functional-first simula-
tor as they do to a timing-directed simulator: there is a tremendous amount of tightly coupled
parallel activity that is exactly what hardware fundamentally implements. Thus, one could imple-
ment a functional-first timing model on an FPGA and feed it with an instruction trace that is
either dynamically generated by a functional simulator, virtual machine, or even a microprocessor
modified to generate a trace, or stored on a disk and piped to the timing model. ReSim [18] is an
example where an FPGA was used to implement the timing model of a functional-first simulator,
running at roughly 28MHz, which is considerably faster than an accurate software-based timing
model.

3.2.4 ACCELERATING TIMING-FIRST SIMULATORS WITH FPGAS


To the best of our knowledge, no true timing-first simulators have been implemented on or ac-
celerated with an FPGA. Certainly it could be done, even with an FPGA-based performance
model and a software-based functional model. However, the FPGA-based performance model
becomes, in essence, a monolithic simulator. The performance simulator would somehow push its
architectural state changes to the functional simulator for comparison. Instead of pushing all of
the changes, a checksum/hash could be generated from the architectural state in order to improve
checking performance.
A timing-first simulator would be limited by the speed of the slowest component. In addition,
such a strategy would still have the limitations of timing-first simulation, including the inability
to model arbitrary memory models and the inaccuracies introduced by any error or omission
from the performance simulator. For these reasons, a timing-first simulator does not appear to be
amenable to FPGA acceleration.

3.2.5 ACCELERATING SPECULATIVE FUNCTIONAL-FIRST WITH FPGAS


Speculative functional-first (SFF) can be used in a software simulator to promote reuse by allowing
a potentially very complex functional simulator to be used for accurate performance prediction.
Doing so, however, would not improve performance.
SFF was originally conceived and designed to be accelerated on an FPGA, specifically
running the functional model in software and the timing model on an FPGA. Since round-trip
interaction between the two models occurs only when the timing model detects a divergence that
needs to be corrected, the expectation is that such round-trips occur infrequently and, there-
fore, the functional and timing models can be implemented on hosts that are relatively far away
from each other. Thus, the communication latency between an FPGA-based timing model and
a software-based functional model can be tolerated.
Since the functional model can run in software, it can be derived from an existing simulator.
One could even start with a full system simulator that can boot operating systems and run standard
applications. us, it offers an alternative to the transplant method described in the next chapter.

3.2.6 ACCELERATING COMBINED SIMULATOR ARCHITECTURES WITH FPGAS

Similar to timing-first (which effectively combines a monolithic simulator with a
functional-only simulator), it is possible to combine the approaches described above. For example, the ProtoFlex
simulator [14] is a SMARTS simulator that uses fast functional simulation to warm the caches
and branch predictor and periodically runs a detailed performance model using those warmed-up
caches and branch predictor. The fast functional model is implemented on an FPGA while
the performance model is implemented in software. This approach preserves the benefits of software
simulation while enabling more accurate fast-forwarding and sampling due to the ability
of the FPGA to enforce fine-grain instruction interleaving and constantly warm up key
microarchitectural resources.

3.3 MANAGING TIME THROUGH SIMULATION EVENT SYNCHRONIZATION IN AN FPGA-ACCELERATED SIMULATOR

Simulator synchronization ensures that different target components are at the same target
time. There are two basic strategies when designing a synchronization network across an FPGA:
centralized and decentralized. Centralized schemes are simple, easy to verify, and offer relatively
high performance when the number of events is limited. When the number of events grows larger,
decentralized schemes that exploit the spatial organization of the target itself allow better timing
and resource utilization.

3.3.1 CENTRALIZED BARRIER SYNCHRONIZATION IN AN FPGA-ACCELERATED SIMULATOR

A centralized synchronization scheme tuned for an FPGA can leverage the organization of the
FPGA along with deterministic execution delays to simplify the design. For example, to design
a barrier-synchronization scheme that enforces single target-cycle synchronization between multiple
concurrent target cores, we can select a strict round-robin scheduler when scheduling a target
core onto a given FPGA resource with deterministic execution delay. As FPGA computation
resources often have fixed latencies, implementing synchronization through round-robin scheduling
is relatively easy, but it limits overall throughput to the latency of the slowest
component.
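The scheme above can be sketched in a few lines of software; the core step functions and host latencies below are invented for illustration, and the model reflects only the barrier semantics, not any FPGA implementation detail:

```python
# A minimal software model of single target-cycle barrier synchronization.
# Each "core" is a step function paired with a fixed host latency (in host
# cycles). All cores advance exactly one target cycle per barrier round, and
# the round's host cost is bounded by the slowest component.

def simulate(cores, target_cycles):
    """Advance every core exactly one target cycle per barrier round."""
    host_cycles = 0
    for _ in range(target_cycles):
        round_cost = 0
        for step, latency in cores:
            step()                       # simulate this core's target cycle
            round_cost = max(round_cost, latency)
        host_cycles += round_cost        # barrier waits for the slowest core
    return host_cycles

counts = [0, 0, 0]

def make_core(i):
    def step():
        counts[i] += 1
    return step

cores = [(make_core(0), 3), (make_core(1), 5), (make_core(2), 2)]
host = simulate(cores, 10)
# Every core advanced 10 target cycles; the slowest core (latency 5)
# sets the pace, so 10 target cycles cost 50 host cycles here.
```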
On the other hand, if we must tolerate variable latencies across multiple simulator
components, a more generic centralized scheduler can be created that tracks the completion of these
target cores. For example, within a single host pipeline of the RAMP Gold simulator [49],
instructions from multiple simulated cores are synchronized on every committed instruction. As
multiple simulated cores share the same host pipeline, enforcing this functional synchronization
is lightweight. The primary downside of these centralized approaches is the performance overhead
incurred when the number of components/events grows large.

3.3.2 DECENTRALIZED BARRIER SYNCHRONIZATION IN AN FPGA-ACCELERATED SIMULATOR

Alternatively, it is possible to use a decentralized scheme, where events are synchronized only
in relation to the subset of events required to compute a new event. These distributed schemes
use timed ports similar to the software ports found in the Asim simulator. Each port represents
a communication link in the target design, carrying a set of timed messages between simulator
components. Ports are uni-directional, point-to-point links whose primary purpose is to ensure
that messages generated at higher target cycles are not consumed too early at lower target cycles. If,
for example, at least one token is pushed through a port every cycle, one can synchronize a component
with two inputs by simply waiting for both inputs to have tokens from the next target time before
proceeding to that time.
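A minimal software model of such a timed port follows, assuming the simplified convention that every producer pushes exactly one token per target cycle; real port designs add latency and bandwidth parameters, and the class and method names here are illustrative only:

```python
from collections import deque

class Port:
    """Uni-directional, point-to-point timed port carrying one token per cycle."""

    def __init__(self):
        self.tokens = deque()            # (target_cycle, payload) pairs in order

    def push(self, cycle, payload=None):
        self.tokens.append((cycle, payload))

    def pop(self, cycle):
        """Conceptually blocks until a token for `cycle` is available."""
        assert self.tokens and self.tokens[0][0] == cycle, "not yet synchronized"
        return self.tokens.popleft()[1]

a, b = Port(), Port()
for cycle in range(3):                   # producers emit one token per cycle
    a.push(cycle, f"a{cycle}")
    b.push(cycle, None)                  # "nothing happened" still sends a token

outputs = []
for cycle in range(3):                   # the consumer advances one target cycle
    outputs.append((a.pop(cycle), b.pop(cycle)))   # only when both inputs arrive
```

Because even an idle producer emits a token each cycle, the consumer can safely advance to target cycle N as soon as both input ports hold a cycle-N token, with no global barrier.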
The HAsim and FAST port designs [10, 33] are examples of this form of distributed
synchronization of performance events. These schemes allow greater flexibility and scalability because the
total number of ports in any given module is generally much smaller than the total number of
ports/events in the entire target model itself.

3.4 FPGA SIMULATOR PROGRAMMABILITY


FPGA-accelerated simulators offer many advantages in terms of speed and scale over software,
but may require end users to implement a significant portion of their simulation models using
hardware description languages (HDLs) such as Verilog or VHDL. Conventional Verilog or
VHDL can be tedious and error prone to write, limiting the rate at which end users can introduce
changes to their functional and timing models.
Today, there are a growing number of options for bridging the gap between programming
in high-level software versus programmable hardware. For example, Chisel is an open-source
hardware description language based on the Scala functional programming language [12] that
could also be used to improve the agility of FPGA-based simulation users. Commercial FPGA
tools have also begun supporting high-level synthesis, such as OpenCL-to-gates (Altera [32])
and C-to-gates (Vivado HLS from Xilinx [48]).
As another example, a significant portion of the ProtoFlex full-system functional simulator
was developed in just under one man-year using BSV [13]. The ProtoFlex functional engine
was a non-trivial implementation effort targeting a 16-way multithreaded UltraSPARC III processor.
In BSV, the use of guarded atomic actions allowed the simulator implementation to be
abstracted at a high level closer to the design specification without limiting the ability to create
high-quality implementations. Other FPGA-accelerated simulation projects such as FAST [10]
and HAsim [33] have also adopted BSV.

3.5 CASE STUDY: FPGA-ACCELERATED SIMULATION TECHNOLOGIES (FAST)

The FPGA Accelerated Simulation Technologies (FAST) project first proposed the speculative
functional-first simulator architecture and developed the first SFF simulators. The first FAST
simulator, FAST-UP, simulated a unicore, two-issue, out-of-order x86-based computer with eight-way
32KB L1 instruction and data caches and a 256KB shared L2 cache. It supported 64 ROB
entries, 16 shared reservation stations, 16 load/store queue entries, a four-way, 8K BTB gshare
branch predictor, and up to four nested outstanding branches.
FAST-UP’s timing model was structurally partitioned using parameterized timing-model
components. Common components were provided in libraries and instantiated as needed.
Components were connected together with FAST Connectors [11], each of which provided configurable
latency, bandwidth, and throughput, and a shared buffer. Timing was controlled in a distributed
way through the connectors.
The functional model was a heavily modified QEMU [3] that introduced instruction
trace generation, checkpoint, and rollback. It was able to boot both Linux and Windows,
run interactive Microsoft Word and YouTube on Internet Explorer [10, 43], and was designed
to incorporate fast and accurate power models as well [44].
FAST-UP ran on a DRC Computer development platform that contained a dual-socket
motherboard, where one socket contained an AMD Opteron 275 and the other socket contained
a Virtex-4 LX200 FPGA. The CPU and FPGA communicated over HyperTransport. The timing
model was untuned, using over 30 host cycles per target cycle, and was the simulator
bottleneck, resulting in roughly 1 MIPS–3 MIPS, depending on branch prediction accuracy.
The current version of FAST, FAST-MP, has a highly tuned multithreaded timing model
(see Chapter 4) that runs at 100 MHz, supporting up to 100 MIPS. Thus, it uses centralized
synchronization. It runs on a single Virtex-6 240 FPGA connected to the host CPU via PCIe. The
functional model is a parallelized version of QEMU with all of the hooks necessary to
functionally and performance-accurately simulate a multicore x86 system, including I/O. It has been
parallelized with 90%+ efficiency and runs on a six-core Intel processor. The aggregate functional
performance is about 25 MIPS per host core, including all support for tracing, checkpoint, and
rollback. The functional performance can be efficiently shared among all of the target cores.

CHAPTER 4

Simulation Virtualization
As discussed in Chapter 1, there is no strict requirement that a structural correspondence exists
between the target system and what is actually implemented on an FPGA. Given this relaxed
demand for structural fidelity, a well-engineered FPGA-accelerated simulator should achieve,
in comparison to a structurally accurate prototype, a much higher simulation rate (measured in
instructions per second or other architecturally visible metrics) while incurring lower design effort
and consuming fewer logic resources.
This chapter discusses the significant benefits that can arise from harnessing FPGAs not
as a hardware prototyping substrate but as a virtualizable compute resource for executing and
accelerating simulations. In particular, it examines two key virtualization techniques
developed and utilized by the ProtoFlex project [14] for accelerating full-system and multiprocessor
simulations.
The first virtualization technique is Hierarchical Simulation with Transplanting, which simplifies
the construction of an FPGA-accelerated full-system simulator. In Hierarchical Simulation,
one accelerates in FPGAs only the subset of the most frequently encountered behaviors (e.g.,
ALU and load/store instructions) and relies on a reference software simulator to support simulation
of rare and complex behaviors (e.g., system-level instructions and I/O devices.)
The second technique is time-multiplexed virtualization of multiple processor contexts onto
fewer high-performance multiple-context simulation engines. Simulation virtualization decouples
the required complexity and scale of the physical hardware on FPGAs from the complexity
and scale of the target multiprocessor system. Unlike a direct prototype, the scale of the
acceleration hardware on the FPGA host is an engineering decision that can be set judiciously in
accordance with the desired level of simulation throughput. Before delving into the details of the
two virtualization techniques, the next section first explains the background and requirements of
full-system and multiprocessor simulations.

4.1 FULL-SYSTEM AND MULTIPROCESSOR SIMULATION


In addition to modeling processors and memory, full-system simulators model a complete system,
including system-dependent behaviors, I/O, and peripherals. The intent is to model a system
to a sufficient degree of architectural-level fidelity such that real-world software (e.g., operating
systems and commercial workloads) can run without modification or re-compilation. This category
of simulation is important when exploring architectural-level features with system-wide effects,
such as OLTP transaction processing, that cannot be studied or demonstrated with simplified,
I/O-less, user-level benchmarking. The importance of full-system simulation is underscored by the
large number of available software-based full-system simulators (e.g., Simics [26], QEMU [3],
SimNow [41], etc.), many of which can be augmented with performance models.
Developing a full-system simulator is a complex endeavor due to the completeness of the
model that must be captured. However, the extensiveness of what needs to be incorporated into
the simulator does not directly impact simulation speed. It should not be surprising that, in a
software-based full-system simulator, most of the simulation time is consumed by emulating
instruction execution and memory accesses rather than I/O, due to the rarity of I/O events incurred by
the devices relative to the rate of CPU processing events. In a given period of simulated time, the
number of simulated events stemming from instruction execution and related memory accesses
dwarfs everything else that occurs in a full-system simulation. Consequently, implementing all
peripheral and I/O subsystems in an FPGA, which increases implementation effort enormously,
contributes little to simulation speed. It is important to note that many I/O devices can already
be simulated faster than real time using software (e.g., DiskSim [5].)
Using techniques such as binary re-writing and native execution, a single-threaded,
software-based full-system simulation of a uniprocessor target system can run close to the speed
of the real system. However, once such a simulator is instrumented (e.g., modeling a functional,
trace-based cache hierarchy), it can incur slowdowns of 10X or more (as presented in [14, 31].)
When simulating a multiprocessor system, the slowdown of a single-threaded software-based
simulator grows at least linearly with the number of simulated processors. It may seem natural
to port the multiprocessor target simulation to a multiprocessor host to offset this slowdown,
but parallelizing multiprocessor software simulators is far from a solved problem. The foremost
challenge is that the scalability of distributed parallel simulation is limited if the simulated
target components interact at a granularity (frequency and latency) below the communication
granularity of the underlying host system [25, 35]. If nothing is done, the host communication
latency introduces artificially large communication latency (in target simulated time), leading to
unrealistic timing or interleaving of simulated target events. On the other hand, accounting for
the effect of the host communication latency requires undesirable performance-robbing stalls
between dependent simulation events.
Hardware-based acceleration using FPGAs offers an alternative for speeding up multiprocessor
simulation. The higher simulation rates from a single FPGA forestall the need to distribute
the simulation. When scaling to distributed simulations, the hardware-level interactions allow
for better-proportioned simulation speed and communication delay. In the rest of this section,
software-based or software-only simulation will refer specifically to single-threaded simulator
execution.
4.2 HIERARCHICAL SIMULATION WITH TRANSPLANTING
Hierarchical simulation with transplanting is motivated by the observation that the behaviors
encountered dynamically are, in the great majority, a small subset of the total system’s reachable
set of behaviors. It is this small subset of behaviors that primarily determines overall simulation
performance. To improve simulation performance while minimizing hardware development,
one should apply FPGA acceleration only to the components that exhibit the most frequently
encountered behaviors. In hierarchical simulation, one should start from an existing software-based,
full-system simulator that covers the total set of behaviors. Next, based on profiling,
one chooses what is necessary to accelerate in the FPGA to achieve the acceleration goal. It
is not unreasonable to assume that a software-based full-system simulator exists as a starting point
because, if such a simulator did not exist, an FPGA-accelerated simulation project should begin
by creating one as a first step. Not only is this the most tractable way to capture and debug the
wide range of behaviors necessary, but it is also a crucial enabler in validating the FPGA-captured
behaviors later on.

4.2.1 HIERARCHICAL SIMULATION


Figure 4.1 illustrates, at the conceptual level, the difference between a software-only and a
hierarchical approach to full-system multiprocessor simulation. In hierarchical simulation, software
and FPGA hosts are used concurrently to support the simulation of different parts of the full
system. In this still-simplified view of Hierarchical Simulation, all components are either hosted
in hardware on the FPGA or simulated by the reference software simulator; specifically, the main
memory and processors are implemented in the FPGA while the remaining components (e.g.,
disk storage and network interfaces, etc.) are retained in the reference software simulator. Both
the hardware-hosted and software-simulated components are advanced concurrently to model
the progress of the complete target system. On one hand, when a processor invokes an I/O device
(e.g., using memory-mapped I/O or DMA), the processor model simulated on the FPGA is in
fact interacting with the software-simulated device in the software simulator. On the other hand,
when a software-simulated DMA-capable I/O device accesses memory, the device accesses the
DRAM memory modules on the FPGA host platform.

4.2.2 TRANSPLANTING
Figure 4.1: Partitioning a simulated target system across FPGA and software simulation in the
ProtoFlex simulator.

A complex component such as a processor encompasses a small set of frequent behaviors (ADDs,
LOADs, TLB/cache accesses, etc.) and a much more extensive set of complicated and, fortunately,
often also rare behaviors (privileged instructions, MMU activities, etc.). Assigning the complete
set of processor behaviors statically to either the software simulation or FPGA simulation host
would result in either the simulation being too slow or the FPGA development being too complicated.
These conflicting goals can be reconciled by supporting transplantable components, which
can be re-assigned to the FPGA host or to software simulation dynamically at runtime during hybrid
simulation.
Continuing with the processor example, the FPGA would implement only the subset of
the most frequently encountered instructions. When this partially implemented processor
encounters an unimplemented behavior (e.g., a page table walk following a TLB miss), the
FPGA-hosted processor component is suspended and its processor state is transplanted (that is,
copied) to its corresponding software-simulated processor model in the reference simulator. The
software-simulated processor model, which supports the complete set of behaviors, is activated
to carry out the unimplemented behavior. Afterward, the processor state is transplanted back to
the FPGA-hosted processor model to resume accelerated execution of common-case behaviors.
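A toy software model of this transplant loop follows; the instruction names, state fields, and the division of supported behaviors are all invented for illustration:

```python
# "FPGA" engine: implements only the common instructions. Anything else
# suspends the context and hands its state to a complete (but slow)
# reference simulator, then copies the state back.

FPGA_SUPPORTED = {"add", "load", "store"}

def reference_step(state, instr):
    """Complete reference model: handles every behavior, including rare ones."""
    if instr == "add":
        state["acc"] += 1
    elif instr == "priv_op":             # a rare, privileged behavior
        state["priv_count"] = state.get("priv_count", 0) + 1
    return state

def fpga_step(state, instr):
    """Fast partial model: raises on anything unimplemented."""
    if instr not in FPGA_SUPPORTED:
        raise NotImplementedError(instr)
    if instr == "add":
        state["acc"] += 1
    return state

def run(program):
    state, transplants = {"acc": 0}, 0
    for instr in program:
        try:
            state = fpga_step(state, instr)
        except NotImplementedError:
            transplants += 1             # copy state out, execute, copy back
            state = reference_step(dict(state), instr)
    return state, transplants

state, transplants = run(["add", "add", "priv_op", "add"])
```

As long as the rare behaviors stay rare, the transplant count (and hence the time spent in the slow path) stays negligible relative to total instructions executed.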

4.2.3 HIERARCHICAL TRANSPLANTING


A full transplant from the FPGA to the full-system simulation host can incur a high cost, from
microseconds (PCIe latency) to milliseconds (Ethernet latency), depending on how the software
host and the FPGA host are interconnected. Such a high transplant cost would force even relatively
rare behaviors to be implemented in the FPGA (to drive down the frequency of required
transplants.)
Consider a scenario with a 100 MHz FPGA host capable of simulating one instruction
per cycle—in other words, 100 MIPS—for the supported instructions. Further assume 99.999%
dynamic instruction coverage, that is, only one transplant is required per 100,000 instructions
executed. If the transplant latency is one millisecond, the average execution time per 100,000
instructions becomes two milliseconds, halving the throughput to 50 MIPS. The effective average
instruction execution time can be expressed as

T_effective = T_by_FPGA + R_miss × T_by_txplant

T_by_FPGA is the time required to execute one instruction on the FPGA host, or the time to
determine that it is an unsupported instruction. T_by_txplant is the time required to execute one
instruction by the software host, including the transplant latency. R_miss is the fraction of dynamic
instructions that is not supported by the FPGA host. In the example scenario above,
T_by_FPGA = 10 ns, T_by_txplant = 1 ms, and R_miss = 0.00001.
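Plugging the example numbers into the equation confirms the halved throughput:

```python
# Numeric check of the single-level transplant example from the text.
t_fpga = 10e-9        # 10 ns per instruction on the FPGA host (100 MIPS)
t_txplant = 1e-3      # 1 ms per transplanted instruction (software host)
r_miss = 1e-5         # one transplant per 100,000 instructions

t_effective = t_fpga + r_miss * t_txplant   # seconds per instruction
mips = 1.0 / t_effective / 1e6
# t_effective is 20 ns, i.e., throughput drops from 100 MIPS to 50 MIPS.
```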
The equation above should be strongly reminiscent of the effective memory access time
through a cache. This interesting parallel points to a simple yet effective solution. Just as computer
architects introduce more levels of cache hierarchy to bridge the gap between processor
and DRAM speed (as opposed to building bigger caches or faster DRAMs), one can similarly
introduce a hierarchy of intermediate software transplant hosts with staggered, increasing
instruction coverage and performance costs. For example, today’s FPGAs can support embedded
processors, realized as either soft- or hard-logic cores, which can execute a software simulation
kernel covering the complete set of processor behaviors. Simulation on the embedded processor is still slow
relative to the FPGA-hosted instructions but incurs much less cost than a full transplant to the
full-system software simulator. At the same time, when writing a software simulation kernel, it
is much easier to capture enough, if not all, of the processor behaviors to achieve a sufficiently
high dynamic instruction coverage to reduce the number of times one needs to pay the full cost
of transplanting to the external software host. If all of the processor behaviors are captured by
the software simulation kernel running on the embedded processor core, the reference software
simulator is relegated to providing simulation support for the I/O subsystem only.
To complete the analogy with hierarchical caches, the effective average instruction execution
time over two levels can be expressed as

T_effective = T_by_FPGA + R_miss_FPGA × T_by_txplant_effective

T_by_txplant_effective = T_by_μtxplant + R_miss_μtxplant(filtered) × T_by_txplant

In the above, μtxplant (microtransplant) refers to the intermediate transplant to an intermediate
embedded simulation kernel on the FPGA. Suppose the average execution time of an instruction
by the embedded kernel, T_by_μtxplant, is 10 μs, or 1,000 times slower than T_by_FPGA.
If we suppose the embedded software simulation kernel misses only one
in 1,000,000 instructions, the filter miss rate at the embedded software simulation kernel is
thus 10%, and T_by_txplant_effective = 0.11 ms. The resulting overall T_effective with microtransplanting is
11.1 ns, or only 11% more than if everything were executed by the FPGA. Keep in mind that reducing
R_miss from one in 100,000 to one in 1,000,000 would require a disproportionate increase in
the completeness of the processor modeling—one practically would have to implement the entire
processor at that point. This is much more easily done in an embedded software-simulation
kernel than by trying to capture the processor model completely in the FPGA. Even if one were to
undertake the herculean effort (in terms of both design time and logic resources) of completing
the processor modeling entirely in the FPGA, it would only result in a 1% performance gain over
hierarchical transplanting in the example scenario.
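The two-level numbers can be checked the same way:

```python
# Numeric check of the two-level (microtransplant) example from the text.
t_fpga = 10e-9            # 10 ns per FPGA-hosted instruction
t_utxplant = 10e-6        # 10 us per instruction in the embedded kernel
t_txplant = 1e-3          # 1 ms per full transplant to the software host
r_miss_fpga = 1e-5        # one in 100,000 instructions misses the FPGA
r_miss_filtered = 0.1     # of those, 10% also miss the embedded kernel

t_txplant_eff = t_utxplant + r_miss_filtered * t_txplant   # 0.11 ms
t_effective = t_fpga + r_miss_fpga * t_txplant_eff         # 11.1 ns
```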

4.3 VIRTUALIZED SIMULATION OF MULTIPROCESSORS


As stated earlier, a simple but impractical approach to constructing an N-way multiprocessor
simulator in an FPGA is to replicate N cores and integrate them together with a large-scale
interconnection substrate. Although this meets the requirement of simulating a large-scale system,
the development effort and required logic resources would be prohibitive when N is more than just
a handful. The advantage of this large hardware implementation effort, of course, is the aggregate
simulation throughput of N cores. The question is: if one were willing to tolerate less performance,
while still achieving an orders-of-magnitude gain over conventional software-based simulation, can
one trade the excess performance for a significantly reduced hardware implementation effort?

4.3.1 TIME-MULTIPLEXED VIRTUALIZATION


Time-multiplexed virtualization offers a performance-driven approach that trades excess
simulation performance for a more tractable hardware development effort and cost. In
time-multiplexed virtualization, as the name implies, a single resource is used to simulate multiple
virtual copies in a time-multiplexed fashion. This technique is especially useful in supporting the
simulation of a multiprocessor target, where the multiple processor contexts can be readily mapped
onto a single, fast multiple-context engine. This virtualization decouples the scale of the target
simulated system from that of the FPGA host platform and the hardware development effort. The
scale of the FPGA simulation platform is only a function of the desired simulation throughput
(i.e., achieved by scaling up the number of engines.) For example, Figure 4.2 illustrates
conceptually a large-scale multiprocessor simulator where multiple simulated processors in a large-scale
target system are shown mapped to share a small number of engines.
Figure 4.2: Large-scale multiprocessor simulation using a small number of multiple-context
interleaved engines.

In Figure 4.2, the multiple-context pipeline is augmented with instruction-interleaved
multithreading support, in which an instruction is issued from a rotating set of processor contexts on
each cycle. An interleaved pipeline enjoys the same implementation advantages as multithreaded
pipelines found in the CDC Cyber [47] and HEP [42]. With enough available processor contexts
to keep the engine occupied, it is possible to design a deep pipeline without the ill effects
of data-dependent stalls. Moreover, a context blocked by long-latency events, such as accesses to
main memory or a transplant operation, can be taken out of the scheduler to allow other contexts
to do useful work.
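A toy model of such an interleaved engine follows (illustrative only, not the actual BlueSPARC design): one pipeline issues from a rotating set of ready contexts, and a blocked context leaves the rotation until its long-latency event completes.

```python
from collections import deque

class InterleavedEngine:
    """Single pipeline time-multiplexed across many processor contexts."""

    def __init__(self, num_contexts):
        self.ready = deque(range(num_contexts))   # round-robin rotation
        self.blocked = {}                         # context -> cycles remaining
        self.retired = [0] * num_contexts

    def block(self, ctx, cycles):
        """Remove a context pending a long-latency event (memory, transplant)."""
        self.ready.remove(ctx)
        self.blocked[ctx] = cycles

    def tick(self):
        # Re-admit contexts whose long-latency event has completed.
        for ctx in [c for c, n in self.blocked.items() if n <= 1]:
            del self.blocked[ctx]
            self.ready.append(ctx)
        self.blocked = {c: n - 1 for c, n in self.blocked.items()}
        if self.ready:                            # issue from the next context
            ctx = self.ready[0]
            self.ready.rotate(-1)
            self.retired[ctx] += 1

engine = InterleavedEngine(4)
engine.block(2, cycles=8)                         # context 2 waits on memory
for _ in range(12):
    engine.tick()
# Contexts 0, 1, and 3 keep the pipeline fully utilized while 2 is blocked.
```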
In the structurally accurate prototyping approach, besides the logic resources needed to
instantiate N copies of the processor cores, substantial resources must also be devoted to a
high-performance, high-endpoint interconnect. For a shared-memory multiprocessor, this may even
entail deploying a high-performance cache-coherent shared-memory hierarchy so that the cores
can execute concurrently (even though the goal is only architectural simulation.) When many
target processor contexts are multiplexed onto a single simulation engine, this complexity is
automatically reduced. The multiple target processor contexts sharing a pipeline automatically see
coherent shared memory through the common cache.
In theory, one could achieve greater scale in the target system by increasing the degree of
interleaving. However, adding contexts beyond a certain limit would prohibitively
degrade per-CPU simulation throughput. A second dimension of scaling is to grow the number of
interleaved engines (and hence FPGAs.) Scaling across both dimensions (virtual interleaving and
physical replication) introduces new complexities. At a basic level, new infrastructure is needed for
communication and memory sharing between the multiple distributed simulation engines. The
consolidation provided by time-multiplexing still helps by reducing the number of simulated host
pipelines connected through distributed shared memory, a number that should be much smaller
than the number of simulated target nodes.

4.3.2 VIRTUALIZING MEMORY CAPACITY


When simulating a large-scale multiprocessor system, a large number of target processors can be
mapped onto affordable FPGA logic resources via time-multiplexing, provided the commensurate
slowdown is acceptable. The same time-multiplexing approach, unfortunately, is not applicable
when virtualizing the required amount of DRAM capacity.
While actual total capacity cannot be shortchanged, it is possible to use hierarchical memory
techniques to achieve the appearance of a larger DRAM using a slower, higher-capacity backing
store. Specifically, memory nodes in the host system could be used as a cache of a larger
backing disk store that contains the full contents of the target system’s main memory. This is
not unlike software simulators modeling memory systems of much larger capacity than available
on the workstation host by means of standard OS demand-paging from a slower backing disk
store. Keep in mind that the DRAM and disks of the host run at native speed while the simulated
target processors run at an order-of-magnitude slowdown. This helps absorb the effects of
memory virtualization even when the simulation experiences poor locality of reference. Akin to
processor virtualization, memory virtualization also enables a trade-off between the resources
expended in the host memory and the desired level of simulation performance.
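A minimal sketch of this memory virtualization follows, with host DRAM modeled as an LRU page cache over a backing store; the page size, capacities, and class names are illustrative only:

```python
from collections import OrderedDict

PAGE = 4096

class VirtualMemory:
    """Host DRAM as an LRU cache of pages backed by a larger, slower store."""

    def __init__(self, dram_pages, backing):
        self.dram = OrderedDict()        # page number -> page bytes (LRU order)
        self.capacity = dram_pages
        self.backing = backing           # full target-memory image ("disk")
        self.faults = 0

    def _page(self, addr):
        pno = addr // PAGE
        if pno not in self.dram:         # page fault: fetch from backing store
            self.faults += 1
            if len(self.dram) >= self.capacity:
                evicted, data = self.dram.popitem(last=False)
                self.backing[evicted] = data          # write back on eviction
            self.dram[pno] = self.backing.get(pno, bytearray(PAGE))
        self.dram.move_to_end(pno)       # mark most-recently used
        return self.dram[pno]

    def read(self, addr):
        return self._page(addr)[addr % PAGE]

    def write(self, addr, value):
        self._page(addr)[addr % PAGE] = value

vm = VirtualMemory(dram_pages=2, backing={})
vm.write(0, 7)                           # page 0
vm.write(PAGE, 8)                        # page 1
vm.write(2 * PAGE, 9)                    # page 2 evicts page 0 to backing store
value = vm.read(0)                       # page 0 is refetched, data intact
```

Because the simulated processors already run an order of magnitude slower than native, the relative cost of these page faults is correspondingly diluted.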

4.4 CASE STUDY: THE PROTOFLEX SIMULATOR


The ideas presented in this chapter have been realized in the ProtoFlex full-system architectural
simulator modeled after the SunFire 3800 server [14]. The ProtoFlex simulator uses Simics [26]
running on a standard PC workstation as the reference simulator and incorporates a single Xilinx
XC2VP70 FPGA for acceleration. This simulator faithfully models a 16-way symmetric
multiprocessing (SMP) UltraSPARC III server to such a degree that it is capable of booting Solaris
8 and running commercial workloads such as Oracle On-Line Transaction Processing (OLTP.)
At the time of this work, the FPGA acceleration resulted in a 49x speedup over the reference
Simics simulator, a contemporary state-of-the-art software-only simulator. By decoupling the
complexity of the target system from what must be implemented in the FPGA, the complete
FPGA-accelerated simulation system was developed by one graduate student in just a little over
one year.

4.4.1 PROTOFLEX DESIGN OVERVIEW


The design of the ProtoFlex FPGA-accelerated architectural simulator has three objectives. The
first objective is to simulate large-scale multiprocessor systems with an acceptable slowdown
(<100x.) The second objective is to model full-system fidelity for executing realistic workloads,
including unmodified operating systems. The third objective is to lower the development effort and cost
to a level justifiable in an academic computer architecture research setting. Very explicitly, it was
never a goal to capture the accurate structure or sub-instruction-granularity timing of the target
multiprocessor system. From these goals, it follows quite naturally to use the FPGA as a virtualizable
resource for simulation execution and not for implementing the simulated target.
Figure 4.3 shows a high-level block diagram of how the functionality of the target 16-way
SMP server is mapped onto the software simulation and FPGA hosts. The main memory
system is hosted directly by DRAM modules on the FPGA host. All 16 target processors
are mapped onto a single multi-context BlueSPARC simulation engine contained on one Xilinx
Virtex-II Pro XC2VP70 FPGA [13]. The interleaved BlueSPARC pipeline is capable of transplanting
any one of the 16 processor contexts to the software simulator (while the remaining contexts
continue unimpeded) on encountering an unimplemented UltraSPARC III behavior. In addition
to the interleaved pipeline, the PPC405 processor embedded in the FPGA serves as the
microtransplant host (Section 4.2.3). The reference Simics simulator running on a PC workstation,
connected to the FPGA host by Ethernet, provides the third hosting option, for the target
system’s I/O subsystem. The ProtoFlex simulator leverages the built-in API of Simics to issue
I/O accesses to simulated devices such as disks.

4.4.2 BLUESPARC PIPELINE


The FPGA-acceleration portion of the Hierarchical ProtoFlex simulator is hosted on a Berkeley
Emulation Engine 2 (BEE2) FPGA platform [6]. One Xilinx Virtex-II Pro XC2VP70 FPGA is
used to implement the BlueSPARC 16-context simulation pipeline (Figure 4.4) [13]. The BlueSPARC
engine is a 14-stage, instruction-interleaved pipeline that supports the multithreaded execution
of up to 16 UltraSPARC III processor contexts (Section 4.3.1). Table 4.1 summarizes
the most salient characteristics of the BlueSPARC pipeline. The maximum retirement rate of the
BlueSPARC pipeline is nominally one instruction per cycle, which, in combination with its clock
frequency, dictates the ProtoFlex simulator’s peak simulation throughput.
The design of the BlueSPARC engine is optimized first and foremost to: (1) ensure correctness,
(2) maximize maintainability for future design exploration, and (3) minimize effort. In many
cases, the design of the BlueSPARC engine allowed the designer to forgo complex performance
optimizations in favor of a simpler, more maintainable design. Recall that hierarchical simulation
allows the designer to omit rare behaviors from the FPGA implementation. Table 4.2 summarizes
40 4. SIMULATION VIRTUALIZATION

Table 4.1: BlueSPARC pipeline characteristics

Table 4.2: Assignment of target behavior to simulation host (FPGA, microtransplant, full-transplant)
4.4. CASE STUDY: THE PROTOFLEX SIMULATOR 41

Figure 4.3: Allocating components for hierarchical simulation in the BlueSPARC simulator.

how the various UltraSPARC III behaviors are assigned to the three hosting options: FPGA, embedded PowerPC microtransplant, and PC full-transplant. These assignment decisions were made based on rigorous dynamic instruction profiling of various applications simulated in Simics. With hierarchical transplanting, over 99.95% of the dynamic instructions are executed in hardware on the FPGA, while the remainder is carried out in the microtransplant kernel (running on the embedded PowerPC) or on the software full-system PC host.
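A back-of-the-envelope model shows why the hardware coverage must be this high. The 99.95% figure comes from the text; the per-transplant cost below is an assumed round number, not a measured value:

```python
# Illustrative Amdahl-style estimate of how transplants dilute throughput.
# Assumptions: 1 host cycle per FPGA-executed instruction, and an assumed
# 10,000-cycle round trip for each transplanted instruction.
hw_fraction = 0.9995              # fraction executed on the FPGA (from text)
hw_cost_cycles = 1
transplant_cost_cycles = 10_000   # assumed round-trip cost

avg_cycles = (hw_fraction * hw_cost_cycles
              + (1 - hw_fraction) * transplant_cost_cycles)
slowdown = avg_cycles / hw_cost_cycles
print(round(slowdown, 1))  # 6.0: even a 0.05% transplant rate costs ~6x
```

The point of the sketch: with slow transplants, shaving the software-hosted fraction from 0.05% to 0.005% matters far more than micro-optimizing the FPGA pipeline.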

4.4.3 PERFORMANCE EVALUATION


This section presents the performance evaluation of the ProtoFlex simulator using software workloads comprising five SPECINT 2000 benchmarks (crafty, gcc, vortex, parser, bzip2) and an On-Line Transaction Processing (OLTP) benchmark. For the SPECINT workloads, 16 copies of the program are executed concurrently; each experiment measures simulation throughput over 100 billion aggregate instructions (after an initialization phase). For OLTP, the simulated server runs the Oracle 10g Enterprise Database Server configured with 100 warehouses (10 GB), 16 clients, and a 1.4 GB SGA. Each experiment measures throughput for 100 billion aggregate instructions in
a steady-state execution (where database transactions are committing steadily). As shown next, the workloads' characteristics have a large effect on the throughput of both the Simics and ProtoFlex simulators.

Table 4.3: Performance comparison
When Simics is invoked with the default "fast" option, it achieves tens of MIPS in simulation throughput. However, there is roughly a factor-of-10 reduction in simulation throughput when Simics is run with trace callbacks for instrumentation [31], such as memory address tracing. The two columns in Table 4.3 labeled Simics-fast and Simics-trace report Simics throughput for the simulated 16-way SMP server. Simics simulations were run on a Linux PC workstation with a 2.0 GHz Core 2 Duo and 8 GBytes of memory. The performance most relevant to architecture research activities is represented by the Simics-trace column. The simulation throughput of the ProtoFlex simulator is reported in the left-most column of Table 4.3. For these measurements, the BlueSPARC engine is clocked at 90 MHz. The ProtoFlex simulator achieves speed comparable to Simics-fast on the SPECINT and Oracle-TPCC workloads. In comparison to the more relevant Simics-trace performance, the speedup is more dramatic: on average 38x faster.
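As a sanity check, the headline numbers can be reproduced with simple arithmetic. The 90 MHz clock and one-instruction-per-cycle peak come from the text; the Simics-trace throughput below is an assumed figure chosen only to illustrate the ~38x relationship (Table 4.3 holds the measured values):

```python
# Upper bound on ProtoFlex throughput: the interleaved pipeline retires at
# most one target instruction per host cycle, regardless of how many of the
# 16 contexts are active.
clock_hz = 90_000_000             # BlueSPARC clock frequency (from text)
peak_mips = clock_hz * 1 / 1e6    # nominal 1 instruction per cycle
assert peak_mips == 90.0

# If an instrumented software simulator sustained ~2.4 MIPS (an assumed,
# illustrative Simics-trace figure), the speedup would be roughly:
simics_trace_mips = 2.4
speedup = peak_mips / simics_trace_mips
print(round(speedup))  # ~38x
```

Note that achieved throughput also depends on pipeline stalls and transplants, which this bound ignores.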

4.4.4 HIERARCHICAL SIMULATION AND VIRTUALIZATION IN A PERFORMANCE SIMULATOR
Though we discuss hierarchical simulation and virtualization in the context of ProtoFlex, which is a functional-only simulator, the techniques apply to performance simulators as well.
To address the issue of functional model complexity on an FPGA, FPGA-based timing-directed simulators that intend to be fairly complete can incorporate the transplant capabilities pioneered by ProtoFlex. For example, HAsim [33] adopted both transplanting and multithreading to provide full-system functionality and target scalability. Whenever an instruction that is not

Figure 4.4: BlueSPARC interleaved engine pipeline.


implemented in the timing-directed simulator is encountered, HAsim flushes the pipeline and transplants the processor state to a software functional model that executes that instruction and returns the updated state to the timing-directed simulator. Because the instruction is executed in a software functional model rather than within the timing model, however, some inaccuracy is introduced with each transplant.
In addition, the software functional model must have access to the state of the simulation, including memory values, requiring the FPGA and the CPU to share memory to some degree. That can be achieved by having true shared memory in the underlying host, or by moving data values explicitly between the CPU and FPGA as needed. Several FPGA platforms already allow the CPU and FPGA to share memory: the Intel/Nallatech ACP system, which places an FPGA into an Intel front-side bus socket, allowing requests on the bus to be snooped by the FPGA; the DRC/XtremeData systems, which place an FPGA into a HyperTransport socket; the Xilinx Zynq platform, which enables the FPGA to access the embedded ARM cores' cache; the Intel QuickPath Interconnect, which enables an FPGA to attach to the Intel QPI interconnect; and the IBM Power8 CAPI interface, which provides a coherent interface between the FPGA and the Power8 processor.

CHAPTER 5

Categorizing FPGA-based Simulators
To summarize, there are three high-level, orthogonal characteristics of FPGA-accelerated simulators: (1) the simulator architecture, (2) the partitioning between the FPGA and the software host, and (3) virtualization support within the simulator to better utilize host resources when simulating targets. To review, the five simulator architectures described in Chapter 3 are monolithic, timing-directed, functional-first, timing-first, and speculative functional-first. Partitioning between the FPGA and software host refers either to partitioning between a software functional model and an FPGA timing model, or to partitioning between an FPGA-based common-case functional model and a complete software functional model. Virtualization refers to techniques such as multithreading that enable multiple target components to share the same host resources.
Designing an FPGA-based simulator requires selecting a number of points in the design space, ranging from the simulator architecture to the particular optimization strategies used to cope with the restrictions of hardware-based accelerated simulation. Table 5.1 summarizes a few of the common FPGA-based simulator artifacts along with their chosen simulator architecture, partition mapping, synchronization schemes, and optimization techniques.

5.1 FAME CLASSIFICATIONS


The FAME [46] classification also defines three characteristics: direct versus decoupled, RTL versus abstract machine, and single-threaded versus multithreaded. Direct versus decoupled indicates whether one host cycle is used to simulate a single target cycle (direct) or multiple host cycles are used to simulate a single target cycle (decoupled). RTL versus abstract machine is what we call prototype versus simulator. (We do not consider the prototype version to be a true simulator.) Single-threaded versus multithreaded refers to the use of multithreading techniques at the simulator level to tolerate host latencies, such as access to host DRAM. Multithreaded implies decoupled, since one cannot use a single host cycle per target cycle if one is multithreading the simulator.
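The three FAME axes can be captured in a small data model that also enforces the multithreaded-implies-decoupled constraint; the class and field names are an illustrative encoding, not the FAME paper's own level numbering:

```python
# Hypothetical encoding of the three binary FAME characteristics.
from dataclasses import dataclass

@dataclass(frozen=True)
class FameClass:
    decoupled: bool       # multiple host cycles per target cycle
    abstract: bool        # abstract machine (simulator) vs. RTL (prototype)
    multithreaded: bool   # host multithreading to tolerate host latencies

    def __post_init__(self):
        # Multithreading is tied to decoupling: a multithreaded host cannot
        # finish one target cycle in every host cycle.
        if self.multithreaded and not self.decoupled:
            raise ValueError("multithreaded simulators must be decoupled")

# Example: a RAMP Gold-style simulator is decoupled, abstract, multithreaded.
ramp_gold = FameClass(decoupled=True, abstract=True, multithreaded=True)
```

Encoding the constraint in the constructor makes the illegal corner of the design space unrepresentable, mirroring the argument in the text.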

5.2 OPEN-SOURCED FPGA-BASED SIMULATORS


To close this chapter, this section briefly describes several open-sourced FPGA-based simulators that employ the simulation techniques covered in this manuscript.
Table 5.1: Summary of various FPGA simulator artifacts in terms of basic simulator architecture and
key characteristics of partition-mapping, synchronization, and FPGA optimization techniques

5.2.1 PROTOFLEX
As discussed in the last chapter, the ProtoFlex simulator was developed at Carnegie Mellon University to support FPGA-accelerated functional simulation of full-system, large-scale multiprocessor systems [14]. The ProtoFlex functional model targets the 64-bit UltraSPARC III ISA (compliant with the commercially available software-based full-system simulator model from Simics [26]) and is capable of booting commercial operating systems such as Solaris 10 and running commercial workloads (with no available source code) such as Oracle TPC-C. ProtoFlex was the first system to introduce hierarchical simulation and host multithreading as techniques for reducing the complexity of simulator development and for virtualizing finite hardware resources. The ProtoFlex simulator is available at [36] and targets the XUPV5-LX110T, a widely available and low-cost commodity FPGA platform.

5.2.2 HASIM
The HAsim project was developed at MIT and Intel and employs host multithreading, hierarchical simulation, and timing-directed simulation with a functional/timing partition. HAsim currently supports the Alpha ISA and has been used to target a Nallatech ACP accelerator with a Xilinx Virtex 5 LX330T FPGA connected to Intel's Front-Side Bus. HAsim has been used to simulate a detailed 4x4 multicore with 64-bit Alpha out-of-order processors on a single FPGA. HAsim is available for download at [20].
5.2.3 RAMP GOLD
RAMP Gold is a simulator of a 64-core 32-bit SPARC V8 target developed at UC Berkeley. The first implementation was done on a Xilinx XUPV5 board. RAMP Gold employs host multithreading and a functional-first simulator architecture, and is capable of booting Linux. The RAMP Gold package is available at [37] and includes CPU, cache, and DRAM timing models. RAMP Gold simulators were aggregated onto 24 Xilinx FPGAs in the Diablo project, which has been used to reproduce effects-at-scale such as TCP Incast [45].

CHAPTER 6

Conclusion
This book describes techniques for practical and efficient simulation of computer systems using FPGAs. There is a distinction between using FPGAs as a vehicle for simulation and using FPGAs for prototyping. FPGA-accelerated simulation implies that a significant portion of the simulation is implemented in software, and that at least part of the simulator is structurally different from the target.
This manuscript surveys simulator architectures and describes how different simulator architectures have been accelerated with FPGAs. One simulator architecture in particular, speculative functional-first (SFF), was designed from the ground up to enable FPGA acceleration of performance simulators. Though SFF is not limited to software-based functional models and FPGA-based timing models, SFF provides many advantages, including completeness, reduced FPGA resources, and the ability to tolerate latency. FAST-UP was the first implementation of an SFF simulator; it simulated a dual-issue, branch-predicted, out-of-order x86-based computer in sufficient detail to boot both Linux and Windows and to run interactive Microsoft Word and YouTube on Internet Explorer. The FPGA-based timing model was the bottleneck. FAST-MP leverages both SFF and FPGA multithreading to build a 256-core target with a branch-predicted, seven-stage pipeline that is also intended to boot Linux and Windows while running arbitrary off-the-shelf software.
This book also describes hierarchical simulation, which implements commonly used functionality on the FPGA and less commonly used functionality in software. In addition, FPGA virtualization enables the mapping of multiple virtual components, such as CPUs, onto a single physical execution engine. ProtoFlex's BlueSPARC leverages both techniques to provide an FPGA-accelerated, full-system functional model that is capable of functionally simulating sixteen 64-bit UltraSPARC V9 cores at 90 MHz on a single FPGA coupled to a microprocessor.
Because SFF is intended for performance simulation and because its functional model and timing model can both be parallelized, it occupies a different point in the simulator space than the hierarchical simulation described in this book. Rather than accelerating functionality on the FPGA, SFF places all of the functionality in software, where it can run very quickly on a fast baseline simulator, while most of the timing is carried out in the FPGA. ProtoFlex/BlueSPARC, in contrast, accelerates common functional instructions in the FPGA, and transplanting to software is necessary only to provide full functionality. In both cases, however, each part of the simulation runs on the host best suited to it to maximize performance.
In conclusion, FPGA-accelerated simulators are highly performant while providing many of the desirable simulator benefits, including accuracy, completeness, and usability. Their main challenge is ease of programming. The overall promise of FPGA-accelerated simulators, however, is a compelling reason to continue researching the area.

APPENDIX A

Field Programmable Gate Arrays
Field Programmable Gate Arrays (FPGAs) are VLSI devices that contain large numbers of programmable logic elements, registers, and memories, along with configurable routing that connects the outputs of components to the inputs of other components as specified by the programmer. These powerful devices can be used for a wide range of applications, including applications that have traditionally been thought possible to implement efficiently only on a general-purpose microprocessor. This appendix briefly describes the internal structure of an FPGA. Note that FPGA architecture is fairly specific to each manufacturer.

A.1 PROGRAMMABLE LOGIC ELEMENTS


Programmable logic elements are implemented as small memories called lookup tables (LUTs). Any arbitrary two-input gate can be implemented in a four-bit memory with two inputs, which specify the address, and one output. Current FPGAs from Xilinx and Altera, the two largest FPGA manufacturers, use larger-input LUTs that are much more powerful than a two-input LUT. For example, a four-input mux can be implemented in a single six-input LUT.
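The LUT-as-memory idea is easy to demonstrate in software. This sketch (with hypothetical helper names) builds a two-input XOR from a 4-bit table, and a 4-to-1 mux (four data inputs plus two select inputs, i.e., six inputs) from a 64-entry table:

```python
# A k-input LUT is just a 2^k-bit memory indexed by the packed inputs.

def make_lut(truth_table):
    """truth_table: list of 2^k output bits, indexed by inputs packed MSB-first."""
    def lut(*inputs):
        index = 0
        for bit in inputs:          # pack the inputs into an address
            index = (index << 1) | bit
        return truth_table[index]
    return lut

# Two-input XOR in a 4-bit LUT: outputs for inputs 00, 01, 10, 11.
xor2 = make_lut([0, 1, 1, 0])
assert xor2(1, 0) == 1 and xor2(1, 1) == 0

# 4-to-1 mux as a 64-entry (six-input) LUT; inputs are (s1, s0, d0, d1, d2, d3).
mux_table = [0] * 64
for addr in range(64):
    s1, s0 = (addr >> 5) & 1, (addr >> 4) & 1
    data = [(addr >> (3 - i)) & 1 for i in range(4)]
    mux_table[addr] = data[(s1 << 1) | s0]
mux4 = make_lut(mux_table)
assert mux4(1, 0, 0, 0, 1, 0) == 1   # select = 2 -> passes d2
```

The synthesis tool's job is essentially to fill in such truth tables for every LUT and wire them together.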
LUTs are bundled with other functionality, such as adders, registers, and the ability to select among different clocks, to form a larger block known as an Adaptive Logic Module (ALM) at Altera or a slice at Xilinx. Figure A.1 and Figure A.2 are high-level and detailed drawings of an Altera ALM. Note that the interconnect is quite rich, enabling a huge number of different possible configurations of a single ALM (Figure A.3). The underlying architectures, and how designs are mapped to them, are much of the secret sauce of FPGA vendors.
LUTs can also be used as memory. LUT memory, called Memory Logic Array Blocks by Altera and distributed memory by Xilinx, can be specified in a variety of depths, widths, and numbers of ports. Structures around the LUTs, including the ALM/slice infrastructure, make LUT memory more efficient.

A.2 EMBEDDED SRAM BLOCKS


Figure A.1: A high-level depiction of an Altera Stratix V ALM. There are a pair of six-input LUTs (though they share inputs), a pair of adders, and four registers. Figure used with permission from Altera.

Figure A.2: A detailed depiction of an Altera Stratix V ALM. Note that each six-input LUT is implemented as a four-input LUT, a pair of three-input LUTs, and muxes. Figure used with permission from Altera.

Figure A.3: The different possible configurations of an Altera ALM. Figure used with permission from Altera.

FPGAs also contain block memories (BRAMs) that are generally two-ported SRAMs. BRAMs from Altera are currently 20 Kbits, while BRAMs from Xilinx are 36 Kbits. A BRAM can generally be configured width-wise up to its native size. For example, the Altera BRAM can be configured in the following geometries, all dual-ported: 512x32b, 512x40b, 1Kx16b, 1Kx20b, 2Kx8b, 2Kx10b, 4Kx4b, 4Kx5b, 8Kx2b, and 16Kx1b.
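A quick check confirms that each listed geometry's depth-width product fits the 20 Kbit block: the x5/x10/x20/x40 widths use all 20,480 bits, while the power-of-two widths leave a quarter of the block unused.

```python
# Verify the listed Altera BRAM geometries against the 20 Kbit block size.
configs = [(512, 32), (512, 40), (1024, 16), (1024, 20), (2048, 8),
           (2048, 10), (4096, 4), (4096, 5), (8192, 2), (16384, 1)]
for depth, width in configs:
    bits = depth * width
    assert bits <= 20 * 1024, (depth, width)
    # Every geometry is either exactly 16 Kbits or exactly 20 Kbits.
    assert bits in (16 * 1024, 20 * 1024)
```

This is why pairs of widths are listed for each depth: the wider member of each pair exposes the block's extra bits.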

A.3 HARD “MACROS”


Modern FPGAs also contain Digital Signal Processing blocks that provide wide arithmetic func-
tions such as adders and multipliers as well as more specialized blocks such as FIR filters. Current
high-end FPGAs have thousands of such DSP blocks.
Many modern FPGAs also contain embedded ARM cores. Currently shipping parts have dual ARM Cortex-A9 cores, while FPGAs shipping within a year will have multiple 64-bit ARM cores running at 1.5 GHz and up. FPGAs have become capable System-on-a-Chip (SoC) devices in their own right, in which the ARM cores can be tightly integrated with custom hardware implemented within the FPGA's reconfigurable fabric.
Modern FPGAs contain extensive routing resources that can be thought of as a statically configured network (routing fabric): a huge number of statically configured switches arranged in a mesh-like network, with both local and more global routing resources. The connection points between component inputs and outputs and the routing fabric are also configurable.

Bibliography
[1] H. Angepat, D. Sunwoo, and D. Chiou. RAMP-White: An FPGA-Based Coherent
Shared Memory Parallel Computer Emulator. In 8th Annual Austin CAS Conference, Mar.
2007. 26

[2] K. Barr, R. Matas-Navarro, C. Weaver, T. Juan, and J. Emer. Simulating a chip multipro-
cessor with a symmetric multiprocessor. Boston area ARChitecture Workshop, Jan. 2005.
11

[3] F. Bellard. QEMU, a Fast and Portable Dynamic Translator. In USENIX 2005 Annual
Technical Conference, FREENIX Track, pages 41–46, 2005. 30, 32

[4] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-Oriented Full-System Simulation using M5. In Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads, Feb. 2003. 14

[5] J. S. Bucy, J. Schindler, S. W. Schlosser, and G. R. Ganger. The DiskSim Simulation Environment Version 4.0 Reference Manual (CMU-PDL-08-101). Technical report, CMU, 2008. 32

[6] C. Chang, J. Wawrzynek, and R. W. Brodersen. BEE2: A High-End Reconfigurable


Computing System. IEEE Design and Test of Computers, 22(2):114–125, 2005. DOI:
10.1109/MDT.2005.30. 39

[7] J. Chen, M. Annavaram, and M. Dubois. Slacksim: a platform for parallel simula-
tions of cmps on cmps. SIGARCH Comput. Archit. News, 37(2):20–29, 2009. DOI:
10.1145/1577129.1577134. 3

[8] M. Chidester and A. George. Parallel simulation of chip-multiprocessor architectures. ACM


Trans. Model. Comput. Simul., 12(3):176–200, 2002. DOI: 10.1145/643114.643116. 11, 13,
24

[9] D. Chiou, H. Angepat, N. A. Patil, and D. Sunwoo. Accurate Functional-First Multi-


core Simulators. Computer Architecture Letters, 8(2):64–67, July 2009. DOI: 10.1109/L-
CA.2009.44. 4, 19

[10] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. H. Reinhart, D. E. Johnson, J. Keefe,


and H. Angepat. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System,
Cycle-Accurate Simulators. In Proceedings of MICRO, pages 249–261, Dec. 2007. DOI:
10.1109/MICRO.2007.36. 15, 19, 29, 30

[11] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. H. Reinhart, D. E. Johnson, and Z. Xu.


The FAST Methodology for High-Speed SoC/Computer Simulation. In Proceedings of
International Conference on Computer-Aided Design (ICCAD), pages 295–302, Nov. 2007.
DOI: 10.1145/1326073.1326133. 4, 24, 30

[12] Chisel. 29

[13] E. S. Chung and J. C. Hoe. High-level design and validation of the bluesparc multithreaded
processor. Trans. Comp.-Aided Des. Integ. Cir. Sys., 29(10):1459–1470, Oct. 2010. DOI:
10.1109/TCAD.2010.2057870. 30, 39

[14] E. S. Chung, E. Nurvitadhi, J. C. Hoe, B. Falsafi, and K. Mai. A Complexity-Effective


Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs. In
Proceedings of International Symposium on Field Programmable Gate Arrays, Feb. 2008. DOI:
10.1145/1344671.1344684. 13, 28, 31, 32, 38, 46

[15] E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, K. Mai, and B. Fal-


safi. Protoflex: Towards scalable, full-system multiprocessor simulations using fp-
gas. ACM Trans. Reconfigurable Technol. Syst., 2(2):15:1–15:32, June 2009. DOI:
10.1145/1534916.1534925. 5

[16] J. Donald and M. Martonosi. An Efficient, Practical Parallelization Methodology for Mul-
ticore Architecture Simulation. Computer Architecture Letters, July 2006. DOI: 10.1109/L-
CA.2006.14. 11

[17] J. Emer, P. Ahuja, E. Borch, A. Klauser, C. K. Luk, S. Manne, S. S. Mukherjee, H. Patil,


S. Wallace, N. Binkert, R. Espasa, and T. Juan. Asim: A performance model framework.
Computer, 35(2):68–76, 2002. DOI: 10.1109/2.982918. 14, 17

[18] S. Fytraki and D. Pnevmatikatos. ReSiM, A Trace-Driven, Reconfigurable ILP Processor


Simulator. In DATE, pages 536–541, 2009. DOI: 10.1109/DATE.2009.5090722. 27

[19] The gem5 Main Page. http://www.m5sim.org. 8

[20] HAsim download page. http://asim.csail.mit.edu/redmine/projects/hasim/wiki/HAsim. 46

[21] R. Krashinsky. Microprocessor Energy Characterization and Optimization through Fast,


Accurate, and Flexible Simulation. Master's thesis, Massachusetts Institute of Technology,
Cambridge, MA, 2001. 23
[22] A. Krasnov, A. Schultz, J. Wawrzynek, G. Gibeling, and P.-Y. Droz. RAMP Blue: A message-passing manycore system in FPGAs. In Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pages 54–61. IEEE, 2007. DOI: 10.1109/FPL.2007.4380625. 26
[23] I. Kuon and J. Rose. Measuring the gap between FPGAs and ASICs. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 26(2):203–215, 2007. DOI: 10.1109/TCAD.2006.884574. 3
[24] E. Larson, T. Austin, and D. Ernst. SimpleScalar: An Infrastructure for Computer System
Modeling. Computer, 35(2):59–67, Feb. 2002. DOI: 10.1109/2.982917. 18
[25] U. Legedza and W. E. Weihl. Reducing synchronization overhead in parallel sim-
ulation. In Proceedings of the Tenth Workshop on Parallel and Distributed Simulation,
PADS ’96, pages 86–95, Washington, DC, USA, 1996. IEEE Computer Society. DOI:
10.1109/PADS.1996.761566. 32
[26] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, Feb. 2002. DOI: 10.1109/2.982916. 32, 38, 46
[27] C. J. Mauer, M. D. Hill, and D. A. Wood. Full-system timing-first simulation. In ACM
SIGMETRICS International Conference on Measurement and Modeling of Computer Systems,
pages 108–116, 2002. DOI: 10.1145/511399.511349. 15, 19
[28] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald III, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. Graphite: A Distributed Parallel Simulator for Multicores. In HPCA, Jan. 2010. DOI: 10.1109/HPCA.2010.5416635. 3, 18
[29] S. S. Mukherjee, S. K. Reinhardt, B. Falsafi, M. Litzkow, S. Huss-Lederman, M. D. Hill, J. R. Larus, and D. A. Wood. Wisconsin Wind Tunnel II: A fast and portable parallel architecture simulator. In PAID, June 1997. DOI: 10.1109/4434.895100. 11
[30] N. Njoroge, J. Casper, S. Wee, Y. Teslyar, D. Ge, C. Kozyrakis, and K. Olukotun. At-
las: a chip-multiprocessor with transactional memory support. In Proceedings of the confer-
ence on Design, automation and test in Europe, pages 3–8. EDA Consortium, 2007. DOI:
10.1109/DATE.2007.364558. 26
[31] F. Nussbaum, A. Fedorova, and C. Small. An overview of the Sam CMT simula-
tor kit. Technical Report TR-2004-133, Sun Microsystems Research Labs, Feb. 2004. DOI:
10.1145/1344671.1344684. 32, 42
[32] Achieve power-efficient acceleration with OpenCL on Altera FPGAs. http://www.altera.com/products/software/opencl/opencl-index.html. 29
[33] M. Pellauer, M. Adler, M. Kinsy, A. Parashar, and J. Emer. HAsim: FPGA-Based High-
Detail Multicore Simulation Using Time-Division Multiplexing. In Proceedings of HPCA-
17, Feb. 2011. DOI: 10.1109/HPCA.2011.5749747. 13, 25, 26, 29, 30, 42
[34] M. Pellauer, M. Vijayaraghavan, M. Adler, Arvind, and J. Emer. A-ports: An efficient
abstraction for cycle-accurate performance models on fpgas. In Proceedings of the 16th In-
ternational ACM/SIGDA Symposium on Field Programmable Gate Arrays, FPGA ’08, pages
87–96, New York, NY, USA, 2008. ACM. DOI: 10.1145/1344671.1344685. 24
[35] D. A. Penry, D. Fay, D. Hodgdon, R. Wells, G. Schelle, D. I. August, and D. Connors.
Exploiting Parallelism and Structure to Accelerate the Simulation of Chip Multi-processors.
In 12th International Symposium on High-Performance Computer Architecture, pages 27–38,
Feb. 2006. DOI: 10.1109/HPCA.2006.1598110. 13, 26, 32
[36] ProtoFlex download page. http://www.ece.cmu.edu/~protoflex/doku.php?id=documentation:userguide. 46
[37] RAMP Gold download page. https://sites.google.com/site/rampgold/file-cabinet. 47
[38] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi, P. Sack, K. Strauss,
and P. Montesinos. SESC simulator, Jan. 2005. http://sesc.sourceforge.net. 18
[39] D. Sanchez and C. Kozyrakis. Zsim: fast and accurate microarchitectural simulation of
thousand-core systems. In International Symposium on Computer Architecture, pages 475–
486. ACM, 2013. DOI: 10.1145/2508148.2485963. 3
[40] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing
large scale program behavior. In ASPLOS-X, pages 45–57. ACM Press, 2002. DOI:
10.1145/605397.605403. 13
[41] SimNow webpage. http://developer.amd.com/simnow.aspx. 32
[42] B. Smith. The architecture of HEP. In The Architecture of HEP on Parallel MIMD Computation: HEP Supercomputer and Its Applications, pages 41–55, Cambridge, MA, USA, 1985. Massachusetts Institute of Technology. 36
[43] D. Sunwoo, J. Kim, and D. Chiou. QUICK: A Flexible Full-System Functional Model. In
Proceedings of ISPASS, pages 249–258, Apr. 2009. DOI: 10.1109/ISPASS.2009.4919656.
30
[44] D. Sunwoo, G. Y. Wu, N. A. Patil, and D. Chiou. PrEsto: An FPGA-Accelerated
Power Estimation Methodology for Complex Systems. In e 20th International Confer-
ence on Field Programmable Logic and Applications, Milano, pages 310–317, Aug. 2010. DOI:
10.1109/FPL.2010.69. 30
[45] Z. Tan. Using FPGAs to Simulate Novel Datacenter Network Architectures At Scale. PhD
thesis, EECS Department, University of California, Berkeley, Jun 2013. 47
[46] Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanović, and D. Patterson. A Case for FAME:
FPGA Architecture Model Execution. In International Symposium on Computer Architecture,
June 2010. DOI: 10.1145/1816038.1815999. 45
[47] J. E. ornton. Parallel operation in the control data 6600. Instruction-level parallel proces-
sors, pages 5–12, 1995. DOI: 10.1145/1464039.1464045. 36
[48] Vivado design suite. http://www.xilinx.com/products/design-tools/vivado/index.htm. 29

[49] A. Waterman, Z. Tan, R. Avizienis, Y. Lee, D. Patterson, and K. Asanovic. RAMP Gold
- Architecture and Timing Model. RAMP Retreat, Austin, TX, June 2009. 13, 14, 25, 26,
29
[50] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. Smarts: Accelerating mi-
croarchitecture simulation via rigorous statistical sampling. In Computer Architecture, 2003.
Proceedings. 30th Annual International Symposium on, pages 84–95. IEEE, 2003. DOI:
10.1145/871656.859629. 13
[51] M. T. Yourst. PTLSim: A Cycle Accurate Full System x86-64 Microarchitectural Simu-
lator. In Proceedings of ISPASS, Jan. 2007. DOI: 10.1109/ISPASS.2007.363733. 17

Authors’ Biographies

HARI ANGEPAT
Hari Angepat is a Ph.D. candidate at The University of Texas at Austin. He holds a B.Eng. in Computer Engineering from McGill University and an M.S. in Computer Engineering from The University of Texas at Austin. Hari is interested in developing domain-specific FPGA microarchitectures and productivity tools to enable widespread adoption of hardware acceleration. Between 2008 and 2012, Hari led the FAST-MP project that enabled accurate functional-first simulation of multiprocessor systems on FPGAs. For more information, please visit
http://hari.angepat.com.

DEREK CHIOU
Derek Chiou is a Principal Architect at Microsoft, where he leads a team working on FPGAs for data center applications. He is also an Associate Professor at The University of Texas at Austin, where his research areas are FPGA acceleration, high-performance computer simulation, rapid system design, computer architecture, parallel computing, Internet router architecture, and network processors. In a past life, Dr. Chiou was a system architect and led the performance modeling team at Avici Systems, a manufacturer of terabit core routers. Dr. Chiou received his Ph.D., S.M., and S.B. degrees in Electrical Engineering and Computer Science from MIT. For more information on Dr. Chiou and his research, please visit
http://www.ece.utexas.edu/~derek.

ERIC S. CHUNG
Eric S. Chung is a Researcher at Microsoft Research in Redmond. Eric is interested in prototyping and developing productive ways to harness massively parallel hardware systems that incorporate specialized hardware such as FPGAs. Eric received his Ph.D. in 2011 from Carnegie Mellon University and was the recipient of the Microsoft Research Fellowship award in 2009. His paper on CoRAM, a memory abstraction for programming FPGAs more effectively, received the best paper award at FPGA'11. Between 2005 and 2011, Eric led the ProtoFlex project that enabled practical FPGA-accelerated simulation of full-system multiprocessors. For more information, please visit http://research.microsoft.com/en-us/people/erchung.
JAMES C. HOE
James C. Hoe is Professor of Electrical and Computer Engineering at Carnegie Mellon University. He received his Ph.D. in EECS from Massachusetts Institute of Technology in 2000 (S.M., 1994). He received his B.S. in EECS from UC Berkeley in 1992. He is a Fellow of IEEE. Dr. Hoe is interested in many aspects of computer architecture and digital hardware design, including the specific areas of FPGA architecture for computing; digital signal processing hardware; and high-level hardware design and synthesis. He was a contributor to RAMP (Research Accelerator for Multiple Processors). He worked on the ProtoFlex FPGA-accelerated simulation project between 2005 and 2011 with Eric S. Chung, Michael K. Papamichael, Eriko Nurvitadhi, Babak Falsafi, and Ken Mai. Earlier, he worked on the SMARTS sampling simulation project. For more information on Dr. Hoe and his research, please visit http://www.ece.cmu.edu/~jhoe.
