100% found this document useful (1 vote)

484 views120 pages

Data Processing On FPGAs

FPGA

Uploaded by

HoneyTiwari

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

484 views120 pages

Data Processing On FPGAs

FPGA

Uploaded by

HoneyTiwari

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 120

SSYNTHESIS

YNTHESIS L
LECTURES
ECTURES ON
ON D
DATA
ATA M
MANAGEMENT
ANAGEMENT
Series
Series
SeriesEditor:
Editor:
Editor:M.
M.
M.Tamer
Tamer
Tamerzsu,
zsu,
zsu,University
University
Universityof
of
ofWaterloo
Waterloo
Waterloo

TEUBNER WOODS
WOODS
TEUBNER

Series
Series
SeriesISSN:
ISSN:
ISSN:2153-5418
2153-5418
2153-5418

Data
DataProcessing
Processingon
onFPGAs
FPGAs

Jens
Jens
JensTeubner,
Teubner,
Teubner,Databases
Databases
Databasesand
and
andInformation
Information
InformationSystems
Systems
SystemsGroup,
Group,
Group,Dept.
Dept.
Dept.ofofofComputer
Computer
ComputerScience,
Science,
Science,TU
TU
TUDortmund
Dortmund
Dortmund
Louis
Louis
LouisWoods,
Woods,
Woods,Systems
Systems
SystemsGroup,
Group,
Group,Dept.
Dept.
Dept.of
of
ofComputer
Computer
ComputerScience,
Science,
Science,ETH
ETH
ETHZrich
Zrich
Zrich

DATA PROCESSING
PROCESSING ON
ON FPGAS
FPGAS
DATA

Roughly
Roughly
Roughlyaaadecade
decade
decadeago,
ago,
ago,power
power
powerconsumption
consumption
consumptionand
and
andheat
heat
heatdissipation
dissipation
dissipationconcerns
concerns
concernsforced
forced
forcedthe
the
thesemiconductor
semiconductor
semiconductorindustry
industry
industry
to
to
toradically
radically
radicallychange
change
changeits
its
itscourse,
course,
course,shifting
shifting
shiftingfrom
from
fromsequential
sequential
sequentialto
to
toparallel
parallel
parallelcomputing.
computing.
computing.Unfortunately,
Unfortunately,
Unfortunately,improving
improving
improving
performance
performance
performanceof
of
ofapplications
applications
applicationshas
has
hasnow
now
nowbecome
become
becomemuch
much
muchmore
more
moredifficult
difficult
difficultthan
than
thanin
in
inthe
the
thegood
good
goodold
old
olddays
days
daysof
of
offrequency
frequency
frequency
scaling.
scaling.
scaling.This
This
Thisisisisalso
also
alsoaffecting
affecting
affectingdatabases
databases
databasesand
and
anddata
data
dataprocessing
processing
processingapplications
applications
applicationsin
in
ingeneral,
general,
general,and
and
andhas
has
hasled
led
ledto
to
tothe
the
thepopularity
popularity
popularity
of
of
ofso-called
so-called
so-calleddata
data
dataappliancesspecialized
appliancesspecialized
appliancesspecializeddata
data
dataprocessing
processing
processingengines,
engines,
engines,where
where
wheresoftware
software
softwareand
and
andhardware
hardware
hardwareare
are
aresold
sold
sold
together
together
togetherin
in
inaaaclosed
closed
closedbox.
box.
box.Field
Field
Fieldprogrammable
programmable
programmablegate
gate
gatearrays
arrays
arrays(FPGAs)
(FPGAs)
(FPGAs)increasingly
increasingly
increasinglyplay
play
playan
an
animportant
important
importantrole
role
rolein
in
insuch
such
such
systems.
systems.
systems.FPGAs
FPGAs
FPGAsare
are
areattractive
attractive
attractivebecause
because
becausethe
the
theperformance
performance
performancegains
gains
gainsof
of
ofspecialized
specialized
specializedhardware
hardware
hardwarecan
can
canbe
be
besignificant,
significant,
significant,while
while
while
power
power
powerconsumption
consumption
consumptionisisismuch
much
muchless
less
lessthan
than
thanthat
that
thatof
of
ofcommodity
commodity
commodityprocessors.
processors.
processors.On
On
Onthe
the
theother
other
otherhand,
hand,
hand,FPGAs
FPGAs
FPGAsare
are
areway
way
way
more
more
moreflexible
flexible
flexiblethan
than
thanhard-wired
hard-wired
hard-wiredcircuits
circuits
circuits(ASICs)
(ASICs)
(ASICs)and
and
andcan
can
canbe
be
beintegrated
integrated
integratedinto
into
intocomplex
complex
complexsystems
systems
systemsin
in
inmany
many
manydifferent
different
different
ways,
ways,
ways,e.g.,
e.g.,
e.g.,directly
directly
directlyin
in
inthe
the
thenetwork
network
networkfor
for
foraaahigh-frequency
high-frequency
high-frequencytrading
trading
tradingapplication.
application.
application.This
This
Thisbook
book
bookgives
gives
givesan
an
anintroduction
introduction
introduction
to
to
toFPGA
FPGA
FPGAtechnology
technology
technologytargeted
targeted
targetedat
at
ataaadatabase
database
databaseaudience.
audience.
audience.In
In
Inthe
the
thefirst
first
firstfew
few
fewchapters,
chapters,
chapters,we
we
weexplain
explain
explainin
in
indetail
detail
detailthe
the
theinner
inner
inner
workings
workings
workingsof
of
ofFPGAs.
FPGAs.
FPGAs.Then
Then
Thenwe
we
wediscuss
discuss
discusstechniques
techniques
techniquesand
and
anddesign
design
designpatterns
patterns
patternsthat
that
thathelp
help
helpmapping
mapping
mappingalgorithms
algorithms
algorithmsto
to
toFPGA
FPGA
FPGA
hardware
hardware
hardwareso
so
sothat
that
thatthe
the
theinherent
inherent
inherentparallelism
parallelism
parallelismof
of
ofthese
these
thesedevices
devices
devicescan
can
canbe
be
beleveraged
leveraged
leveragedin
in
inan
an
anoptimal
optimal
optimalway.
way.
way.Finally,
Finally,
Finally,the
the
thebook
book
book
will
will
willillustrate
illustrate
illustrateaaanumber
number
numberof
of
ofconcrete
concrete
concreteexamples
examples
examplesthat
that
thatexploit
exploit
exploitdifferent
different
differentadvantages
advantages
advantagesof
of
ofFPGAs
FPGAs
FPGAsfor
for
fordata
data
dataprocessing.
processing.
processing.

M
Mor
Mor
Morgan
gan
gan&
Cl
Cl
Claypool
aypool
aypool Publishers
Publishers
Publishers
&
&C

Data Processing on
FPGAs
Jens
JensTeubner
Teubner
Louis
Louis Woods
Woods

About
About
AboutSYNTHESIs
SYNTHESIs
SYNTHESIs

&
&
&

Mor
Mor
Morgan
gan
gan

ISBN:
ISBN:
ISBN: 978-1-62705-060-9
978-1-62705-060-9
978-1-62705-060-9

90000
90000
90000

Cl
Cl
Claypool
aypool
aypool Publishers
Publishers
Publishers

w
w
ww
w
ww
w
w...m
m
mooorrrgggaaannnccclllaaayyypppoooooolll...cccooom
m
m

9
9
9781627
781627
781627050609
050609
050609

MOR GAN
GAN &
& CL
CL AYPOOL
AYPOOL
MOR
MOR
GAN
&
CL
AYPOOL

This
This
Thisvolume
volume
volumeisisisaaaprinted
printed
printedversion
version
versionof
of
ofaaawork
work
workthat
that
thatappears
appears
appearsin
in
inthe
the
theSynthesis
Synthesis
Synthesis
Digital
Digital
DigitalLibrary
Library
Libraryof
of
ofEngineering
Engineering
Engineeringand
and
andComputer
Computer
ComputerScience.
Science.
Science.Synthesis
Synthesis
SynthesisLectures
Lectures
Lectures
provide
provide
provideconcise,
concise,
concise,original
original
originalpresentations
presentations
presentationsof
of
ofimportant
important
importantresearch
research
researchand
and
anddevelopment
development
development
topics,
topics,
topics,published
published
publishedquickly,
quickly,
quickly,in
in
indigital
digital
digitaland
and
andprint
print
printformats.
formats.
formats.For
For
Formore
more
moreinformation
information
information
visit
visit
visitwww.morganclaypool.com
www.morganclaypool.com
www.morganclaypool.com

SSYNTHESIS
YNTHESIS L
LECTURES
ECTURES ON
ON D
DATA
ATA M
MANAGEMENT
ANAGEMENT
M.
M.
M.Tamer
Tamer
Tamerzsu,
zsu,
zsu,Series
Series
SeriesEditor
Editor
Editor

Data Processing on FPGAs

Synthesis Lectures on Data

Management
Editor
M. Tamer zsu, University of Waterloo

Synthesis Lectures on Data Management is edited by Tamer zsu of the University of Waterloo.
e series will publish 50- to 125 page publications on topics pertaining to data management. e
scope will largely follow the purview of premier information and computer science conferences, such
as ACM SIGMOD, VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include, but
not are limited to: query languages, database system architectures, transaction management, data
warehousing, XML and databases, data stream systems, wide scale data distribution, multimedia
data management, data mining, and related subjects.

Data Processing on FPGAs

Jens Teubner and Louis Woods
2013

Perspectives on Business Intelligence

Raymond T. Ng, Patricia C. Arocena, Denilson Barbosa, Giuseppe Carenini, Luiz Gomes, Jr.
Stephan Jou, Rock Anthony Leung, Evangelos Milios, Rene J. Miller, John Mylopoulos, Rachel A.
Pottinger, Frank Tompa, and Eric Yu
2013

Semantics Empowered Web 3.0: Managing Enterprise, Social, Sensor, and Cloud-based
Data and Services for Advanced Applications
Amit Sheth and Krishnaprasad irunarayan
2012

Data Management in the Cloud: Challenges and Opportunities

Divyakant Agrawal, Sudipto Das, and Amr El Abbadi
2012

Query Processing over Uncertain Databases

Lei Chen and Xiang Lian
2012

Foundations of Data Quality Management

Wenfei Fan and Floris Geerts
2012

iii

Incomplete Data and Data Dependencies in Relational Databases

Sergio Greco, Cristian Molinaro, and Francesca Spezzano
2012

Business Processes: A Database Perspective

Daniel Deutch and Tova Milo
2012

Data Protection from Insider reats

Elisa Bertino
2012

Deep Web Query Interface Understanding and Integration

Eduard C. Dragut, Weiyi Meng, and Clement T. Yu
2012

P2P Techniques for Decentralized Applications

Esther Pacitti, Reza Akbarinia, and Manal El-Dick
2012

Query Answer Authentication

HweeHwa Pang and Kian-Lee Tan
2012

Declarative Networking
Boon au Loo and Wenchao Zhou
2012

Full-Text (Substring) Indexes in External Memory

Marina Barsky, Ulrike Stege, and Alex omo
2011

Spatial Data Management

Nikos Mamoulis
2011

Database Repairing and Consistent Query Answering

Leopoldo Bertossi
2011

Managing Event Information: Modeling, Retrieval, and Applications

Amarnath Gupta and Ramesh Jain
2011

Fundamentals of Physical Design and Query Compilation

David Toman and Grant Weddell
2011

Methods for Mining and Summarizing Text Conversations

Giuseppe Carenini, Gabriel Murray, and Raymond Ng
2011

Probabilistic Databases
Dan Suciu, Dan Olteanu, Christopher R, and Christoph Koch
2011

Peer-to-Peer Data Management

Karl Aberer
2011

Probabilistic Ranking Techniques in Relational Databases

Ihab F. Ilyas and Mohamed A. Soliman
2011

Uncertain Schema Matching

Avigdor Gal
2011

Fundamentals of Object Databases: Object-Oriented and Object-Relational Design

Suzanne W. Dietrich and Susan D. Urban
2010

Advanced Metasearch Engine Technology

Weiyi Meng and Clement T. Yu
2010

Web Page Recommendation Models: eory and Algorithms

Sule Gndz-gdc
2010

Multidimensional Databases and Data Warehousing

Christian S. Jensen, Torben Bach Pedersen, and Christian omsen
2010

Database Replication
Bettina Kemme, Ricardo Jimenez-Peris, and Marta Patino-Martinez
2010

Relational and XML Data Exchange

Marcelo Arenas, Pablo Barcelo, Leonid Libkin, and Filip Murlak
2010

User-Centered Data Management

Tiziana Catarci, Alan Dix, Stephen Kimani, and Giuseppe Santucci
2010

Data Stream Management

Lukasz Golab and M. Tamer zsu
2010

Access Control in Data Management Systems

Elena Ferrari
2010

An Introduction to Duplicate Detection

Felix Naumann and Melanie Herschel
2010

Privacy-Preserving Data Publishing: An Overview

Raymond Chi-Wing Wong and Ada Wai-Chee Fu
2010

Keyword Search in Databases

Jerey Xu Yu, Lu Qin, and Lijun Chang
2009

Copyright 2013 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any meanselectronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.

Data Processing on FPGAs

Jens Teubner and Louis Woods
www.morganclaypool.com

ISBN: 9781627050609
ISBN: 9781627050616

paperback
ebook

DOI 10.2200/S00514ED1V01Y201306DTM035

A Publication in the Morgan & Claypool Publishers series

SYNTHESIS LECTURES ON DATA MANAGEMENT
Lecture #35
Series Editor: M. Tamer zsu, University of Waterloo
Series ISSN
Synthesis Lectures on Data Management
Print 2153-5418 Electronic 2153-5426

Data Processing on FPGAs

Jens Teubner
Databases and Information Systems Group, Dept. of Computer Science, TU Dortmund

Louis Woods
Systems Group, Dept. of Computer Science, ETH Zrich

SYNTHESIS LECTURES ON DATA MANAGEMENT #35

M
&C

Morgan & cLaypool publishers

ABSTRACT
Roughly a decade ago, power consumption and heat dissipation concerns forced the semiconductor industry to radically change its course, shifting from sequential to parallel computing.
Unfortunately, improving performance of applications has now become much more dicult than
in the good old days of frequency scaling. is is also aecting databases and data processing
applications in general, and has led to the popularity of so-called data appliancesspecialized
data processing engines, where software and hardware are sold together in a closed box. Fieldprogrammable gate arrays (FPGAs) increasingly play an important role in such systems. FPGAs
are attractive because the performance gains of specialized hardware can be signicant, while
power consumption is much less than that of commodity processors. On the other hand, FPGAs
are way more exible than hard-wired circuits (ASICs) and can be integrated into complex systems in many dierent ways, e.g., directly in the network for a high-frequency trading application.
is book gives an introduction to FPGA technology targeted at a database audience. In the rst
few chapters, we explain in detail the inner workings of FPGAs. en we discuss techniques and
design patterns that help mapping algorithms to FPGA hardware so that the inherent parallelism
of these devices can be leveraged in an optimal way. Finally, the book will illustrate a number of
concrete examples that exploit dierent advantages of FPGAs for data processing.

KEYWORDS
FPGA, modern hardware, database, data processing, stream processing, parallel algorithms, pipeline parallelism, programming models

Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1
1.2
1.3
1.4
1.5
1.6
1.7

A Primer in Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1

2.2
2.3

Moores Law and Transistor-Speed Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Memory Wall and Von Neumann Bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Power Wall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Multicore CPUs and GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Specialized Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Field-Programmable Gate Arrays (FPGAs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
FPGAs for Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7.1 Stream Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7.2 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7.3 Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7.4 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Basic Hardware Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.1 Combinational Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Sequential Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Asynchronous sequential logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.4 Synchronous sequential logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Hardware Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Hardware Description Languages (HDLs) . . . . . . . . . . . . . . . . . . . . . . . 12
Circuit Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Logical Design Flow (Synthesis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Physical Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1
3.2

A Brief History of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Look-up Tablese Key to Re-Programmability . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 LUT Representation of a Boolean Function . . . . . . . . . . . . . . . . . . . . . . 18

3.3
3.4

3.5
3.6

3.7

3.8

3.2.2 Internal Architecture of an LUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.3 LUT (Re)programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.4 Alternative Usage of LUTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Elementary Logic Units (Slices/ALMs) . . . . . . . . . . . . . . . . . . . . . . . . . 20
Routing Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.1 Logic Islands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.2 Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
High-Speed I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Auxiliary On-Chip Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.1 Block RAM (BRAM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.2 Digital Signal Processing (DSP) Units . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6.3 Soft and Hard IP-Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
FPGA Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.7.1 FPGA Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.7.2 Dynamic Partial Reconguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Advanced Technology and Future Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.8.1 Die Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.8.2 Heterogeneous Die-Stacked FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.8.3 Time-Multiplexed FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.8.4 High-level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

FPGA Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.1

4.2

4.3
4.4

Re-Build, Parameterize, or Program the Hardware Accelerator? . . . . . . . . . . . . 33

4.1.1 Re-Building Circuits at Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.2 Parameterized Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.3 Instruction Set Processors on top of FPGAs . . . . . . . . . . . . . . . . . . . . . . 37
From Algorithm to Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 Expression ! Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Circuit Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 High-Level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Data-Parallel Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Pipeline-Parallel Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.1 Pipeline Parallelism in Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.2 Pipelining in FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.3 Designing for Pipeline Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.5

Data Stream Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.1

5.2

5.3
5.4

5.5
5.6

Regular Expression Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.1.1 Finite-State Automata for Pattern Matching . . . . . . . . . . . . . . . . . . . . . . 51
5.1.2 Implementing Finite-State Automata in Hardware . . . . . . . . . . . . . . . . . 53
5.1.3 Optimized Circuit Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.1.4 Network Intrusion Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Complex Event Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.1 Stream Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.2 Hardware Partitioner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.3 Best-Eort Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.4 Line-Rate Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Filtering in the Data Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.1 Data Path Architecture in the Real World . . . . . . . . . . . . . . . . . . . . . . . . 61
Data Stream Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4.1 Compositional Query Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.2 Getting Data In and Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Dynamic Query Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5.1 Fast Workload Changes rough Partial Modules . . . . . . . . . . . . . . . . . 68
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Accelerated DB Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1

6.2

4.4.4 Turning a Circuit into a Pipeline-Parallel Circuit . . . . . . . . . . . . . . . . . . 47

Related Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Sort Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1.1 Sorting Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1.2 BRAM-based FIFO Merge Sorter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1.3 External Sorting with a Tree Merge Sorter . . . . . . . . . . . . . . . . . . . . . . . 74
6.1.4 Sorting with Partial Reconguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Skyline Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2.1 Standard Block Nested Loops (BNL) Algorithm . . . . . . . . . . . . . . . . . . 77
6.2.2 Parallel BNL with FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2.3 Performance Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Secure Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

7.1

FPGAs versus CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

7.1.1 Von Neumann Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

xii

7.2
7.3

7.4

7.1.2 Trusted Platform Module (TPM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

FPGAs versus ASICs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Security Properties of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3.1 Bitstream Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3.2 Bitstream Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3.3 Further Security Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
FPGA as Trusted Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.4.1 Fully Homomorphic Encryption with FPGAs . . . . . . . . . . . . . . . . . . . . 86
7.4.2 Hybrid Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.4.3 Trusted Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Commercial FPGA Cards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.1
A.2
A.3

NetFPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Solarares ApplicationOnload Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Fusion I/Os ioDrive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Authors Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

xiii

Preface
System architectures, hardware design, and programmable logic (specically, eld-programmable
gate arrays or FPGAs) are topics generally governed by electrical engineers. Hardware people
are in charge of embracing technological advantages (and turning them into improved performance), preferably without breaking any of the established hardware/software interfaces, such as
instruction sets or execution models.
Conversely, computer scientists and software engineers are responsible for understanding
users problems and satisfying their application and functionality demands. While doing so, they
hardly care how hardware functions underneathmuch as their hardware counterparts are largely
unaware of how their systems are being used for concrete problems.
As time progresses, this traditional separation between hard- and software leaves more
and more potential of modern technology unused. But giving up the separation and building
hardware/software co-designed systems requires that both parties involved understand each others
terminology, problems/limitations, requirements, and expectations.
With this book we want to help work toward this idea of co-designed architectures.
Most importantly, we want to give the software side of the storythe database community in
particulara basic understanding of the involved hardware technology. We want to explain what
FPGAs are, how they can be programmed and used, and which role they could play in a database
context.
is book is intended for students and researchers in the database eld, including those
that have not had much contact with hardware technology in the past, but would love to get
introduced to the eld. At ETH Zrich/TU Dortmund, we have been teaching for several years
a course titled Data Processing on Modern Hardware. e material in this book is one part of
that Master-level course (which further discusses also modern hardware other than FPGAs).
We start the book by highlighting the urgent need from the database perspective to invest
more eort into hardware/software co-design issues (Chapter 1). Chapters 2 and 3 then introduce
the world of electronic circuit design, starting with a high-level view, then looking at FPGAs
specically. Chapter 3 also explains how FPGAs work internally and why they are particularly
attractive at the present time.
In the remaining chapters, we then show how the potential of FPGAs can be turned into
actual systems. First, we give general guidelines how algorithms and systems can be designed
to leverage the potential of FPGAs (Chapter 4). Chapter 5 illustrates a number of examples
that successfully used FPGAs to improve database performance. But FPGAs may also be used to
enable new database functionality, which we discuss in Chapter 7 by example of a database crypto

xiv

PREFACE

co-processor. We conclude in Chapter 8 with a wary look into the future of FPGAs in a database
context.
A short appendix points to dierent avors of FPGA system integration, realized through
dierent plug-ins for commodity systems.
Jens Teubner and Louis Woods
June 2013

CHAPTER

Introduction
For decades, performance of sequential computation continuously improved due to the stunning
evolution of microprocessors, leaving little room for alternatives. However, in the mid-2000s,
power consumption and heat dissipation concerns forced the semiconductor industry to radically
change its course, shifting from sequential to parallel computing. Ever since, software developers
have been struggling to achieve performance gains comparable to those of the past. Specialized
hardware can increase performance signicantly at a reduced energy footprint but lacks the necessary exibility. FPGAs (reprogrammable hardware) are an interesting alternative, which have
similar characteristics to specialized hardware, but can be (re)programmed after manufacturing.
In this chapter, we give an overview of the problems that commodity hardware faces, discuss how
FPGAs dier from such hardware in many respects, and explain why FPGAs are important for
data processing.

1.1

MOORES LAW AND TRANSISTOR-SPEED SCALING

Ever since Intel co-founder Gordon E. Moore stated his famous observation that the number of
transistors on integrated circuits (IC) doubles roughly every two years, this trend (Moores law)
has continued unhalted until the present day. e driving force behind Moores law is the continuous miniaturization of the metal oxide semiconductor (MOS) transistor, the basic building block
of electronic circuits. Transistor dimensions have been shrunk by about 30 % every two years, resulting in an area reduction of 50 %, and hence the doubling of transistors that Moore observed.
Transistor scaling has not only led to more transistors but also to faster transistors (shorter
delay times and accordingly higher frequencies) that consume less energy. e bottom line is that,
in the past, every generation of transistors has enabled circuits with twice as many transistors,
an increased speed of about 40 %, consuming the same amount of energy as the previous generation, despite 50 % more transistors. e theory behind this technology scaling was formulated
by Dennard et al. [1974] and is known as Dennards scaling. ere was a time when Dennards
scaling accurately reected what was happening in the semiconductor industry. Unfortunately,
those times have passed for reasons that we will discuss next.

1.2

MEMORY WALL AND VON NEUMANN BOTTLENECK

Just like CPUs, o-chip dynamic memory (DRAM) has also been riding Moores law but due
to economic reasons with a dierent outcome than CPUs. Whereas memory density has been

1. INTRODUCTION

doubling every two years, access speed has improved at a much slower pace, i.e., today, it takes
several hundred CPU cycles to access o-chip memory. DRAM is being optimized for large
capacity at minimum cost, relying on data locality and caches in the CPU for performance. us,
a signicant gap between processor speed and memory speed has been created over the years, a
phenomenon known as the memory wall.
Furthermore, the majority of computers today are built according to the Von Neumann
model, where data and software programs are stored in the same external memory. us, the bus
between main memory and the CPU is shared between program instructions and workload data,
leading to the so-called Von Neumann bottleneck.
To mitigate the negative eects of both the memory wall and the Von Neumann bottleneck,
CPUs use many of the available transistors to implement all sorts of acceleration techniques to
nonetheless improve performance, e.g., out-of-order execution, branch prediction, pipelining,
and last but not least cache hierarchies. In fact, nowadays a substantial amount of transistors and
die area (up to 50 %) are used for caches in processors.

1.3

POWER WALL

In the past, frequency scaling, as a result of transistor shrinking, was the dominant force that increased performance of commodity processors. However, this trend more or less came to an end
about a decade ago. As already mentioned in the previous section, the advantages of higher clock
speeds are in part negated by the memory wall and Von Neumann bottleneck, but more importantly, power consumption and heat dissipation concerns forced the semiconductor industry to stop
pushing clock frequencies much further.
Higher power consumption produces more heat, and heat is the enemy of electronics. Too
high temperatures may cause an electronic circuit to malfunction or even damage it permanently.
A more subtle consequence of increased temperature is that transistor speed decreases, while
current leakage increases, producing even more heat. erefore, silicon chips have a xed power
budget, which microprocessors started to exeed in the mid-2000s, when frequency scaling hit the
so-called power wall.
A simplied equation that characterizes CPU power consumption (PCP U ) is given below.
We deliberately ignore additional terms such as short circuit and glitch power dissipation, and focus
on the most important components: dynamic power and static power.
PCP U D C Vd2d. fclk C Vdd Ileak

dynamic power

static power

Dynamic power is the power consumed when transistors are switching, i.e., when transistors are changing their state. e parameter characterizes the switching activity, C stands for
capacitance, Vdd for voltage, and fclk corresponds to the clock frequency. Static power, on the

1.4. MULTICORE CPUS AND GPUS

other hand, is the power consumed even when transistors are inactive, because transistors always
leak a certain amount of current (Ileak ).
As transistors became smaller (< 130 nanometers), reality increasingly started to deviate
from Dennards theory, i.e., the reduced voltage of smaller transistor was no longer sucient to
compensate fully for the increased clock speed and the larger number of transistors. For a number
of reasons, voltage scaling could no longer keep up with frequency scaling, leading to excessive
power consumption.
Unfortunately, limiting frequency scaling solved the power consumption issue only temporarily. As transistor geometries shrink, a higher percentage of current is leaked through the
transistor. As a result, static power consumption, which is independent of the clock frequency,
is increased. us, to avoid hitting the power wall again, in the future, an increasing amount of
transistors will need to be powered o, i.e., it will only be possible to use a fraction of all available
transistors at the same time.

1.4

MULTICORE CPUS AND GPUS

As Moores law prevailed but frequency scaling reached physical limits, there was a major shift
in the microprocessor industry toward parallel computing: instead of aiming for ever-increasing
clock frequencies of a single core, multiple identical cores are now placed on the same die. Unfortunately, there are a number of issues with multicore scaling. First of all, performance is now
directly dependant on the degree of parallelism that can be exploited for a given task. Amdahls
law states that if a fraction f of computation is enhanced by a speedup of S , then the overall
speedup is:
1
speedup D
:
.1 f / C fS
In the case of multicores, we can interpret f as the fraction of parallelizable computation (assuming perfect parallelization), and S as the number of cores. us, as the number of cores increases,
so does the pressure to be able to exploit maximum parallelism from a task. However, as Hill
and Marty [2008] observed, a parallel architecture that relies on large amounts of homogeneous,
lean cores is far from optimal to extract the necessary parallelism from a task. Hill and Marty
[2008] suggest that an asymmetric architecture would be better suited, while they see the highest
potential in dynamic techniques that allow cores to work together on sequential execution.
Graphic processors (GPUs), in a sense, are an extreme version of multicore processors. In a
GPU there are hundreds of very lean cores that execute code in lockstep. GPUs have the same
problems with Almdahls law as multicore CPUs. In fact, the more primitive GPU cores and
the way threads are scheduled on them, reduces exibility, making it even more dicult to extract
suitable parallelism from arbitrary applications that would allow an eective mapping to the GPU
architecture.

1. INTRODUCTION

1.5

SPECIALIZED HARDWARE

Dark silicon [Esmaeilzadeh et al., 2011] refers to the underutilization of transistors due to power
consumption constraints and/or inecient parallel hardware architectures that conict with Amdahls law. A promising way to overcome these limitations is a move toward heterogeneous architectures, i.e., where not all cores are equal and tasks are o-loaded to specialized hardware to
both improve performance and save energy. is conclusion is similar to the one size does not
t all concept [Stonebraker et al., 2007] from database research although applied to hardware
architectures.
Instead of mapping a given task to a xed general-purpose hardware architecture, specialized hardware is mapped to the task at hand. Dierent problems require dierent forms of
parallelism, e.g., data parallelism versus pipeline parallelism, coarse-grained parallelism vs. negrained parallelism. Custom hardware allows employing the most eective form of parallelization
that best suits a given task.
Specialized hardware is neither bound to the Von Neumann bottleneck nor does it necessarily
suer from the memory wall. For instance, custom hardware that needs to monitor network data,
e.g., for network intrusion detection or high-frequency trading, can be coupled directly with a
hardware Ethernet controller. us, the slow detour via system bus and main memory is avoided.
Consequently, the need for large caches, branch prediction, etc., dissolves, which saves chip space
and reduces power consumption.
Power consumption of specialized hardware solutions is usually orders of magnitude below
that of general-purpose hardware such as CPUs and GPUs. Knowing exactly what kind of a
problem the hardware is supposed to solve, allows using transistors much more eectively. Also,
due to specialized hardware parallelism and avoidance of the Von Neumann bottleneck, lower clock
frequencies are typically sucient to eciently solve a given task, which further reduces power
consumption. For instance, a circuit that handles 10G Ethernet trac processes 64-bit words at
a clock speed of only 156.25 MHz.
Nevertheless, specialized hardware also has a number of drawbacks, and in the past systems
that were built from custom hardware (e.g. database machines in the 1980s) typically lost the
race against systems based on general-purpose hardware. First of all, building custom hardware
is a dicult and time-consuming process. Second, potential bugs usually cannot be solved after
manufacturing, making testing even more time-consuming, and also increasing the risk associated
with producing specialized hardware. ird, unless the custom hardware is mass-produced, it
is signicantly more expensive than general-purpose hardware. In the past, frequency scaling
improved sequential computation performance to such an extent that in many domains custom
hardware solutions were simply uneconomical.

1.6. FIELD-PROGRAMMABLE GATE ARRAYS (FPGAS)

1.6

FIELD-PROGRAMMABLE GATE ARRAYS (FPGAS)

Between the two extremesgeneral-purpose processors and specialized hardware

reprogrammable hardware is another class of silicon devices, which in a sense combines the
best of both worlds. Field-programmable gate arrays (FPGAs) are the most advanced brood
of this class. FPGAs consist of a plethora of uncommitted hardware resources, which can be
programmed after manufacturing, i.e., in the eld.
A circuit implemented on an FPGA can be thought of as a hardware simulation of a corresponding hard-wired circuit. As such FPGA-based circuits exhibit many of the favorable characteristics of specialized hardware: (i) application-tailored parallelism, (ii) low power-consumption,
(iii) and integration advantages (e.g., to avoid the Von Neumann bottleneck). But the fundamental
dierence compared to specialized hardware is that circuits implemented on an FPGA can be
changed anytime, by a mere update of conguration memory. Of course, there is no free lunch.
Hard-wired circuits consume even less energy, require even fewer transistors to implement a given
task, and can be clocked faster than FPGAs. But oftentimes these are acceptable (or even negligible) concessions, given the various benets of reprogrammability.
First, reprogrammability allows multiple specialized circuits to execute on the same silicon
device in a time-multiplexed manner, i.e., circuits can be loaded when they are needed. Second,
FPGAs drastically reduce time to market and provide the ability to upgrade already deployed
circuits. ird, migration to the next-generation manufacturing process (to benet from smaller
transistors) is seamless, i.e., by buying a next-generation FPGA and migrating the hardware description code, similar to the way software is migrated from one CPU generation to the next.

1.7

FPGAS FOR DATA PROCESSING

Originally, FPGAs were primarily used as glue logic on printed circuit boards (PCBs), and later on
also for rapid prototyping. However, more than two decades of FPGA technology evolution have
allowed FPGAs to emerge in a variety of elds, as a class of customizable hardware accelerators
that address the increasing demands for performance, with a low energy footprint, at aordable
cost. In recent years, increased attention from both academia and industry has been drawn to
using FPGAs for data processing tasks, which is the domain that this book focuses on.

1.7.1 STREAM PROCESSING

FPGAs have several desirable properties for stream processing applications. High-frequency trading is a good example, where high-rate data streams need to be processed in real time, and microsecond latencies determine success or failure. e I/O capabilities of FPGAs, allow for exible
integration, e.g., in the case of high-frequency trading, the FPGAs are inserted directly into the
network, enabling most ecient processing of network trac. Furthermore, the reprogrammability of FPGAs makes it possible to quickly adapt to market changes.

1. INTRODUCTION

Another example illustrating the advantages of FPGAs are network intrusion detection systems (NIDSs) that scan incoming network packets, attempting to detect malicious patterns. Typically there are several hundred patterns formulated as regular expressions, which all need to be
evaluated in real time. e regular expressions can easily be implemented in hardware as nite state
machines (FSMs), which, on an FPGA, can then all be executed in parallel. us, besides the integration and reprogramming capabilities, here the inherent parallelism of FPGAs is exploited to
achieve unprecedented performance.

1.7.2 BIG DATA

Data management, in the advent of big data, becomes an increasingly dicult task using traditional database techniques. For instance, ad-hoc data analytics queries often cannot rely on
indexes, and need to fall back to scanning vast amounts of data. is has led to so-called data
warehouse appliances that combine hardware and software in a single, closed box, allowing appliance vendors to ne-tune software/hardware co-design. IBM/Netezzas appliance [Francisco,
2011], for example, combines multiple FPGAs and CPUs in a large blade server. To increase I/O
bandwidth stored data are highly compressed. During query execution, the FPGAs eciently
decompress and lter data, while the CPUs take care of more complex higher-level operations.
1.7.3 CLOUD COMPUTING
Whereas today FPGAs are still considered exotic for many data processing tasks, cloud computing
could catapult FPGAs to a mainstream data processing technology. In cloud computing, compute
resources are provided as a service to customers, i.e., customers can outsource tasks to the cloud
and only pay for the resources they actually need for a particular job. A major cost factor for
cloud providers is the power consumption of their data centers. us, any technology that can
deliver the same performance and exibility with a lower energy footprint is very attractive for the
cloud. In addition, big cloud providers such as Google, Amazon, and Microsoft denitively have
the economic means to provide a bundle of FPGAs as a service with a high-level programming
interface, making FPGA technology much more accessible to customers than it is today. e
implications of bringing FPGA acceleration into cloud infrastructures has been discussed, e.g.,
by Madhavapeddy and Singh [2011].
1.7.4 SECURITY
So far, we stressed performance and energy consumption as key advantages of FPGAs. However,
security is a third dimension, where FPGAs can provide signicant benets. FPGA conguration
data can be encrypted and authenticated, making it very dicult to tamper with the FPGA conguration itself. ese built-in authentication and encryption mechanisms can be used to create
a root of trust, on top of which secure reprogrammable hardware can be built. In the Von Neumann
model of a conventional processor, data and program are stored in the same memory, making

1.7. FPGAS FOR DATA PROCESSING

buer overow attacks or rootkits possible. In hardware, on the other hand, data and program
can be easily separated, e.g., the program can be implemented in logic (protected by the root of
trust), while data are stored in memory. Based on these ideas Arasu et al. [2013] built the Cipherbase system, which extends Microsofts SQL Server with FPGA-based secure hardware to
achieve high data condentiality and high performance.

In a buer overow attack, a data buers boundary is intentionally overrun such that data are written into adjacent memory.
For instance, if the attacker can guess where the program code is stored, this method can be used to inject malicious code.
A rootkit is software that typically intercepts common API calls, and is designed to provide administrator privileges to a user
without being detected. A buer overow could be exploited to load a rootkit.

CHAPTER

A Primer in Hardware Design

Before we delve into core FPGA technology in Chapter 3, we need to familiarize ourselves with
a few basic concepts of hardware design. As we will see, the process of designing a hard-wired
circuita so-called application specic integrated circuit (ASIC)is not that dierent from implementing the same circuit on an FPGA. In this chapter, we will cover the key components
that make up an integrated circuit, how these circuits are typically designed, and the various
tools required to convert an abstract circuit specication into a physical implementation, ready
for manufacturing.

2.1

BASIC HARDWARE COMPONENTS

e majority of electronic circuits, be it a simple counter or a full-edged microprocessor, are made

up of the following three fundamental ingredients: (i) combinational logic elements, (ii) memory
elements, (iii) and wiring to interconnect logic and memory elements. With the help of a few simple examples, in the following, we will show how these three ingredients are typically combined
to construct circuits that implement a given function.

2.1.1 COMBINATIONAL LOGIC

At the heart of any circuit there are basic logic gates, which can be combined by wiring their input
and output ports together to form more complex combinational logic elements. For instance, on the
left-hand side of Figure 2.1, a circuit known as half adder is constructed from an XOR gate and an
AND gate. is circuit computes the addition of two one-bit numbers A and B , and reports the
result on the output port S . If A and B are both set to one this produces an overow, which is
captured by the AND gate and reported as carry bit on output port C . Together with an additional
OR gate, two half adders can be combined to build a so-called full adder. A full adder has a third
input port (carry-in) and accounts for the carry bit from another full adder, i.e., it adds up three
one-bit numbers. is way, a cascade of full adders can be further combined to construct adders
with a wider word width, e.g., 32-bit adders.
Another example of a very fundamental combinational circuit in hardware design is a multiplexer, illustrated on the right-hand side of Figure 2.1. is 2-to-1 multiplexer has three input
ports: two input signals (i n0 , i n1 ) and a select line (sel ) that determines which of the two input
signals is routed to the output port of the multiplexer. Again, wider multiplexers can be constructed from these basic 2-to-1 multiplexers. Multiplexers enable the evaluation of conditional

2. A PRIMER IN HARDWARE DESIGN

A
B

.
XOR

AND

i n0

AND
Inverter

i n1

out

AND

sel
Figure 2.1: Combining basic logic gates to construct more complex circuits: a half adder (left) and a
two-input multiplexer (right).

expressions, i.e., if-then-else expressions of the form out = (sel ) ? i n1 : i n0 , where sel determines
whether i n1 or i n0 is selected for the output.
Combinational logic is purely driven by the input data, i.e., in the examples in Figure 2.1,
no clock is involved and no explicit synchronization is necessary. Notice that each logic gate has a
xed propagation delay, i.e., the time it takes before the eect of driving input signals is observable
at the output of a gate. Propagation delays result from physical eects, such as signal propagation
times along a signal wire or switching speeds of transistors. Combining multiple gates increases
the overall propagation delay, i.e., the propagation delay of a complex combinational circuit comprises the sum of propagation delays of its gates along the longest path within the circuit, known
as the critical path. e critical path determines the maximum clock speed of sequential circuits,
which we will discuss next.

2.1.2 SEQUENTIAL LOGIC

In contrast to combinational logic, sequential logic has state (memory). In fact, sequential logic is
combinational logic plus memory. While the output of combinational logic depends solely on its
present input, the output of sequential logic is a function of both its present and its past input,
as illustrated in Figure 2.2 (left). Whereas the logic gate is the fundamental building block of
combinational logic, state elements (e.g., ip-ops, latches, etc.) are the basic building blocks of a
sequential circuit.

2.1. BASIC HARDWARE COMPONENTS

combinational
.
logic

out

XNOR

S
Q

state
element

clk
S

XNOR

Figure 2.2: A sequential circuit with a feedback loop (left), the internals of an S-R (NOR) latch
(center), and symbol of a D ip-op (right).

2.1.3 ASYNCHRONOUS SEQUENTIAL LOGIC

One of the most basic one-bit state elements is a so-called SR (set/reset) latch. Internally, it can
be constructed using two cross-coupled NOR gates, as depicted in Figure 2.2 (center). If S and R
are both logic low (i.e., S D 0, R D 0), the feedback loops ensure that Q and Q (the complement
of Q) remain in a constant state. S D 1 and R D 0 forces Q D 1 and Q D 0, whereas S D 0 and
R D 1 does the opposite. S and R are not allowed to be logic high (i.e., S D 1, R D 1) at the same
time since this would cause Q D Q D 0.
Notice that the SR latch is level-sensitive, meaning that its state changes when the input
signals change their voltage levels (e.g., where ve volt corresponds to one state and zero volt to
the other). us, even though a circuit with latches can maintain state, it is still entirely driven
by its inputs, and no form of synchronization exists. erefore, this type of circuitry is called
asynchronous sequential logic. e speed of asynchronous sequential logic is essentially only limited
by the propagation delays of the logic gates used. However, asynchronous logic is very dicult to
get right, with, e.g., race conditions to deal with, which is why nearly all sequential circuits today
are synchronous.
2.1.4 SYNCHRONOUS SEQUENTIAL LOGIC
In a synchronous sequential circuit all memory elements are synchronized by means of a clock signal,
which is generated by an electronic oscillator, and distributed to all memory elements. e clock
signal (clk) periodically alternates between two states, i.e., logic low and logic high, and memory
elements are synchronized to one of the clock edges, i.e. the rising edge (change from 0 to 1) or the
falling edge (change from 1 to 0).
A more sophisticated memory element than the SR latch is required to be able to synchronize to the edge of a clock, e.g., a so-called D ip-op. e symbol that represents a D ip-op is
illustrated on the right-hand side of Figure 2.2. e D ip-op only stores the input value from
the D port at the specied clock edge (rising or falling). After that the outputs (Q and Q) remain unchanged for an entire clock period (cycle). Internally, the edge-sensitivity of D ip-ops

2. A PRIMER IN HARDWARE DESIGN

is implemented using two latches in combination with additional logic gates. Most D ip-ops
allow the D and clk port to be bypassed, forcing the ip-op to set or reset state, via separate
S=R ports.
e main reason for the ubiquitous use of synchronous sequential logic is its simplicity. e
clock frequency determines the length of a clock period and all combinational logic elements are
required to nish their computation within that period. If these conditions are met the behaviour
of the circuit is predictable and reliable. On the ip-side, maximum clock frequency is determined
by the critical path in a circuit, i.e., by the longest combinational path between any two ip-ops.
As a consequence, the potential performance of other faster combinational elements cannot be
maxed out.

2.2

HARDWARE PROGRAMMING

In the early days of electronic circuit design, schematics were the only formal way to represent a
circuit. us, circuits used to be drawn (e.g., as the circuits depicted in Figure 2.1) by hand or
using a computer-aided design (CAD) tool. Today, the most common way to design a circuit is
using an appropriate hardware description language (HDL), which is better suited to capture the
complexity of large circuits, and signicantly increases productivity.

2.2.1 HARDWARE DESCRIPTION LANGUAGES (HDLS)

e two most popular hardware description languages are Verilog and VHDL (both are also
used to program FPGAs). A hardware description language (HDL), at rst glance, resembles
an ordinary programming language such as C. Nevertheless, there are fundamental dierences
between a language designed for generating assembly code to be executed on a microprocessor, and
one that is designed to produce hardware circuits. HDLs are structured programming languages
that (i) capture circuit hierarchy and connectivity, (ii) naturally allow expressing the inherent
parallelism of separate circuit components, and (iii) provide a built-in mechanism for simulating
circuit behavior in software.
Dierent Levels of Abstraction: Structural versus Behavioral Modeling

e fundamental abstraction in any HDL is a module (referred to as entity in VHDL). A module

encapsulates a (sub)circuit and denes an interface to the outside world in terms of input/output
ports. Much like classes in an object-oriented language, modules are dened once and can then
be instantiated multiple times. Several instantiations of modules execute in parallel and can be
connected via wires between their input and output ports.
A Verilog implementation of a 2-to-1 multiplexer (cf. Figure 2.1) is given in Listing 2.1.
e module multiplexer denes an interface with three single-wire input ports (in0, in1, sel)
and one output port (out). Inside the multiplexer module four gates are instantiated (1 inverter,
2 AND gates, and 1 OR gate), and connected using the wires nsel, out0 and out1. Notice
how also the input/output ports of the multiplexer are connected with the instantiated gates.

2.2. HARDWARE PROGRAMMING

Listing 2.1: Structural Verilog (MUX).

1
2
3
4
5
6
7
8
9
10
11
12
13
14

module m u l t i p l e x e r (
i n p u t in0 , i n1 , s e l ,
output out
);
wire n s e l ;
wire out0 ;
wire out1 ;

inverter
andgate
andgate
orgate

inv0 ( sel
and0 ( i n0
and1 ( i n1
o r 0 ( out0

Listing 2.2: Behavioral Verilog (MUX).

1
2
3
4
5
6
7
8

module m u l t i p l e x e r (
i n p u t in 0 , in 1 , s e l ,
output out
);
assign out = s e l ? in1 : in0 ;
endmodule

, nsel );
, nsel , out0 ) ;
, s e l , out1 ) ;
, out1 , o u t ) ;

endmodule

e multiplexer displayed in Listing 2.1 is a structural implementation of a 2-to-1 multiplexer. at is, the multiplexer was built bottom-up by combining multiple instantiations of simpler modules into a single, more complex module. However, often it is benecial to model a
complex system prior to detailed architecture development. erefore, common HDLs also support a top-down method for designing a circuit, known as behavioral modeling. Listing 2.2 shows
the same 2-to-1 multiplexer implemented using behavioral Verilog. Whereas structural modeling
is an imperative technique, exactly dening how a circuit is constructed, behavioral modeling is a
declarative technique, specifying the behavior rather than the architecture of a circuit.
Simulation

Since producing a hardware circuit is a costly and lengthy process, simulation is a crucial tool for
designing hardware circuits economically. Simulation is so fundamental that supporting mechanisms are directly integrated into the HDL.
ere are various levels of granularity at which a circuit can be simulated. e rst step in
the design process of a circuit is usually to verify the behavioral correctness of a circuit. For that
matter, a behavioral model of the circuit is implemented and an appropriate testbench is created
within the HDL. A software simulator can then evaluate the circuit against the test cases specied
in the testbench.
Later in the design process, behavioral components are gradually replaced by structural
ones, and other aspects than logical correctness, e.g., adherence to timing constraints, become
important. HDLs also support this form of simulation, e.g., modules can be annotated with estimated timing information such as propagation delay, etc., and a simulator can check whether a
circuit can sustain a given clock rate.

2. A PRIMER IN HARDWARE DESIGN

route

physical
place

technology
mapping

logic optimization

RTL
synthesis

RTL

behavioral
circuit .
specication

high-level
synthesis

logical

manufacturing

technology
library

RTL

netlist

constraints
Figure 2.3: Design ow: formal circuit specication ! physical circuit.

2.3

CIRCUIT GENERATION

In this section we briey discuss the design ow for producing a physical circuit from a formal
specication, written in some HDL or higher-level language. e main steps are illustrated in
Figure 2.3. Most of these steps are also relevant for FPGA programming, which we will discuss
in the next chapter.

2.3.1 LOGICAL DESIGN FLOW (SYNTHESIS)

e highest level of abstraction in the circuit design ow are purely behavioral specications of
circuits. ese specications are typically not written using a hardware description language. Instead, domain specic languages, as well as standard languages such as C/C++ and SystemC are
commonly used. e output of a high-level synthesizer is typically a so-called register-transfer
level (RTL) description (see below) in HDL code. Notice that high-level synthesis is an active
eld of research, and especially general-purpose high-level synthesis often produces inferior results, compared to hand-crafted HDL code, which is why RTL descriptions (using, for example,
Verilog or VHDL) are still the most common entry point to circuit design.
At the register-transfer level (RTL), a circuit is modeled as a network of storage elements
(ip-ops, RAMs, etc.) with combinational logic elements in between. At this level, combinational elements may still be specied in a behavioral manner, e.g., an arithmetic adder component
may be used, without specifying a concrete implementation of an adder. e RTL abstraction has
two immediate advantages: (i) increased productivity, as certain functionality can be generated,
in contrast to being assembled manually from logic gates, and (ii) portability, as the RTL representation is technology independent, e.g., the same RTL specication can be used to generate
SystemC is an extension of C++ targeted toward software/hardware co-design.

2.3. CIRCUIT GENERATION

an ASIC or program an FPGA. To reach an RTL model of a circuit, manual design and tuning
may be necessary. However, once an RTL model of a circuit exists, all further processing steps
(including RTL synthesis) toward the physical circuit are fully automated.
An RTL synthesizer translates the RTL circuit description into an internal representation
of unoptimized Boolean logic equations. e next step is a technology-independent optimization
of these equations, i.e., redundant logic is automatically removed.
e nal step of the synthesis process is to implement the abstract internal representation of
the design by mapping it to the cells of a technology library (also known as cell library, provided by
the device manufacturer. e cells of a technology library range from basic logic gates (standard
cells) to large and complex megacells with ready-to-use layout (e.g., a DMA controller). e
library consists of cell descriptions that contain information about functionality, area, timing, and
power for each cell. As illustrated in Figure 2.3, the technology mapping process also takes design
constraints into account. ese constraints, e.g. regarding area consumption, timing, and power
consumption, guide the mapping process, and determine which cells are selected when multiple
options exist.

2.3.2 PHYSICAL DESIGN FLOW

e end result of the synthesis process is a so-called netlist. Netlists consist of two fundamental
components: instances and nets. Each time a cell of some target technology library is used, this is
called an instance. Nets, on the other hand, refer to the wires that connect instances.
Based on the netlist, the physical design process begins. In the rst phase the gates of the
netlist are placed on the available two-dimensional space. Initially, structures that should be placed
close together are identieda process known as oorplanning. en, based on the oorplan,
the placer tool assigns a physical location to each gate of the netlist. After initial placement, the
clock tree is inserted and placement decisions are re-evaluated. Multiple iterations are typically
necessary to nd a satisfying placement for all gates.
From the placed-gates netlist and the geometric information about the cells provided by
the technology library, the router tool derives the physical paths of the nets that connect gates,
as well as the power supply lines. Again, multiple iterations are typically necessary, and gates
might be relocated in the routing phase. e fully routed physical netlist is the nal result of
the entire design ow. It consists of the gates of a circuit, their exact placement, and drawn
interconnecting wires. In essence, the circuit design is now ready for fabrication. What follows
is typically a number of physical design verication steps carried out in software before a rst
prototype of the circuit is produced by an IC vendor.

CHAPTER

FPGAs
In this chapter, we give an overview of the technology behind eld-programmable gate arrays (FPGAs). We begin with a brief history of FPGAs before we explain the key concepts that make
(re)programmable hardware possible. We do so in a bottom-up approach, that is, we rst discuss the very basic building blocks of FPGAs, and then gradually zoom out and show how the
various components are combined and interconnected. We then focus on programming FPGAs
and illustrate a typical FPGA design ow, also covering advanced topics such as dynamic partial
reconguration. Finally, to complete the picture of modern FPGAs, we highlight bleeding-edge
technology advances and future FPGA trends.

3.1

A BRIEF HISTORY OF FPGAS

Field-programmable gate arrays (FPGAs) arose from programmable logic devices (PLDs), which
rst appeared in the early 1970s. PLDs could be programmed after manufacturing in the eld.
However, programmability in these devices was limited, i.e., programmable logic was hard-wired
between logic gates.
In 1985, the rst commercially available FPGA (the Xilinx XC2064) was invented. is
device hosted arrays of congurable logic bocks (CLBs) that contained the programmable gates, as
well as a programmable interconnect between the CLBs.
Early FPGAs were usually used as glue logic between other xed hardware components.
However, the tremendous development of FPGAs in the 1990s, made FPGAs an attractive alternative to ASICs for prototyping, small volume production, for products with a short time to
market, or products that require frequent modications.
Today, FPGAs are enhanced with many additional hardware components that are integrated directly into the FPGA fabric such as embedded digital signal processing units (DSP),
network cores, and even full-edged processors, e.g., the ARM Cortex-A9, which is embedded in the Xilinx Zynq-7000 programmable SoC.
In summary, FPGAs have gone through an extreme evolution in the last three decades. Today, FPGAs provide massive parallelism, low power consumption, and high-speed I/O capabilities, which makes them interesting devices for data processing with compute- and data-intensive
workloads.

3. FPGAS

AND

Input

Output

00
01
10
11

0
0
0
1

Input

Output

00
01
10
11

0
1
1
1

Figure 3.1: AND gate (left) and OR gate (right), each represented by a two-input LUT.

3.2

LOOK-UP TABLESTHE KEY TO

RE-PROGRAMMABILITY

In Chapter 2, we saw that the three fundamental ingredients of any circuit are combinational logic
(compute), memory elements (storage), and interconnect (communication). In the following, we
will discuss these aspects in the context of FPGAs. In an ASIC, combinational logic is built from
wiring a number of physical basic logic gates together. In FPGAs, these logic gates are simulated
using multiple instances of a generic element called a look-up tableor simply LUT. As we will
see, LUTs can be (re)programmed after manufacturing, which makes them mainly responsible
for the (re)programmability property of FPGAs.

3.2.1 LUT REPRESENTATION OF A BOOLEAN FUNCTION

An n-input LUT can be used to implement an arbitrary Boolean-valued function with up to n
Boolean arguments. Two simple examples of a two-input AND gate and a two-input OR gate, each
implemented by a two-input LUT, are given in Figure 3.1.
3.2.2 INTERNAL ARCHITECTURE OF AN LUT
An n-input LUT requires 2n bits of SRAM to store the lookup table, and a 2n W 1 multiplexer to
read out a given conguration bittypically implemented as a tree of 2 W 1 multiplexers. 4-input
LUTs used to be the standard, but today 6-input LUTs are more common. For readability, an
example of a 4-input LUT is given in Figure 3.2.
As illustrated in the gure, the Boolean values at the inputs i n0 to i n3 determine which
SRAM bit is forwarded to the output (out ) of the LUT. e LUT in the illustration can be used
to implement any 4-input Boolean expression. However, in practice, LUTs are even more sophisticated, e.g., the 4-input LUT above could also be used to implement two 3-input LUTs. Hence,
typical LUTs today have not only one output port but several to support such congurations.
3.2.3 LUT (RE)PROGRAMMING
Since the conguration of an LUT, i.e., what Boolean function the LUT implements, is stored in
SRAM, programming an LUT boils down to updating the SRAM cells of a given LUT. Inter-

3.2. LOOK-UP TABLESTHE KEY TO RE-PROGRAMMABILITY

i n0

i n1

i n2

i n3
Conf. Bit
We

0
1

D
EN

0
1
0
1

Clk

16 bits of SRAM

0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1

D
EN

0
1
0
1

out

0
1

LUT

Figure 3.2: Internal structure of a 4-input LUT: circuitry for reading (left) and writing (right).

nally, these SRAM cells are organized as shift registers (1-bit wide and 2n -bits deep), as depicted
on the right of Figure 3.2. us, the bits of the conguration bitstream are shifted bit-by-bit into
the LUTs when an FPGA is (re)programmed. us, an LUT can be read (asynchronously) in less
than a cycle but writing to an LUT requires 2n cycles. is reects one of the typical trade-os in
hardware designhere, write performance is traded for a simpler design and, as a result, reduced
chip space consumption.

3.2.4 ALTERNATIVE USAGE OF LUTS

So far, we have described how LUTs can be used to simulate gates, i.e., as combinational logic.
However, since a key component of a LUT is SRAM, LUTs can also be used as memory elements
of a circuit implemented on an FPGA.
When an FPGA is programmed, some LUTs can be congured to be used as so-called
distributed RAM, i.e., a LUT can be congured as a small n 1 LUT RAM. Multiple LUT
RAMs can then be combined with others for deeper and/or wider memories. Distributed RAM
is a good choice for relatively small embedded RAM blocks within a circuit, e.g. to implement
small FIFOs.
LUTs can also be congured to implement yet another fundamental memory element,
namely as shift registers. As illustrated in Figure 3.2 (right), the shift register functionality is
already present in the LUT as part of the write logic. Shift registers are useful, for example, to
build delay lines to compensate for certain latencies of other circuit components.

3. FPGAS

carry-out
4-LUT
D

4-LUT
.

carry
logic

A
B
D

4-LUT

Ci n

XOR

carry
logic

Cout

elementary
logic unit
carry-in
Figure 3.3: Functionality within an elementary logic unit (left) and a full adder constructed by combining a LUT with elements of the carry logic (right).

3.3

FPGA ARCHITECTURE

After discussing the key ideas behind look-up tables (LUTs) in the previous section, we now
focus on how these LUTs are embedded into coarser architectural units and distributed across
the FPGA fabric.

3.3.1 ELEMENTARY LOGIC UNITS (SLICES/ALMS)

A xed number of LUTs are grouped and embedded into a programmable logic component,
which we will call elementary logic unit (Xilinx refers to these units as slices, whereas Altera calls
them adaptive logic modules (ALMs)). e exact architecture of elementary logic units varies among
dierent vendors and even between dierent generations of FPGAs from the same vendor. Nevertheless, we can identify four main structural elements of an elementary logic unit: (i) a number of
LUTs (typically between two and eight), (ii) a proportional number of 1-bit registers, (iii) arithmetic/carry logic, (iv) and several multiplexers.
An example elementary logic unit is illustrated in Figure 3.3, on the left-hand side. For presentation purposes, this elementary logic unit has only two 4-input LUTs with two corresponding
ip-op registers. is architecture can be considered classic (i.e., slices of Xilinx Virtex-4 and
earlier FPGAs are based on this architecture) but modern FPGAs typically have more LUTs per
elementary logic unit.

3.4. ROUTING ARCHITECTURE

Each LUT is paired with a ip-op, i.e., a 1-bit memory element to store the result of
a table look-up. is facilitates pipelined circuit designs, where signals may propagate through
large parts of the FPGA chip while a high clock frequency is maintained. Next to LUT-based
memories these ip-ops are the second type of memory elements present in FPGAs. Whether
a ip-op is used or by-passed is determined by a multiplexer. e multiplexers in the elementary
logic units can be driven by additional SRAM that is also set when an FPGA is programmed.
Finally, FPGAs have fast dedicated wires between neighboring LUTs and their corresponding circuitry. e most common type of such communication channels are carry chains. Carry
chains allow combining multiple LUTs to implement arithmetic functions such as adders and
multipliers. In Figure 3.3 (left), the blue path represents a carry chain (though somewhat simplied, e.g., wires to the ip-op or output multiplexer have been omitted).
Typically, the vertical wires of the carry chain pass through dedicated carry logic that helps
in the construction of particular arithmetic functions. For example, on the right-hand side of
Figure 3.3, a full adder (1-bit adder) is constructed using an XOR gate (implemented by the LUT)
together with another XOR gate, as well as a 2-to-1 multiplexer of the carry logic. Via the vertical
carry wires a cascade of such full adders can be used to create wider adders.

3.4

ROUTING ARCHITECTURE

Direct wires to neighboring elementary logic units (e.g., carry chains) allow combining multiple
units to build more sophisticated circuits such as adders and multipliers. However, modern FPGAs provide enough congurable resources to host an entire system on chip (SoC). But to build
such a complex system a more exible communication mechanism is required to connect dierent
sub-circuits spread out over the FPGA fabric. is communication mechanism is known as the
interconnect.

3.4.1 LOGIC ISLANDS

A small number of elementary logic units (cf. Section 3.3.1) are grouped together into a coarser
grained unit that we refer to as logic island (corresponding to congurable logic blocks (CLBs) or logic
array blocks (LBAs) in the terminology of Xilinx and Altera, respectively). An example, resembling
the CLBs of Virtex-5 and Virtex-6 FPGAs, is given in Figure 3.4, on the left-hand side.
In this example, a logic island consists of two elementary logic units. Each elementary logic
unit has a separate set of wires (e.g., for carry-chains, etc.) running vertically through the chip
and connecting elementary logic units of adjacent logic islands. For more general communication among logic islands the elementary logic units of every logic island are connected to the
interconnect via switch matrix, which we will discuss in just a moment.
On the right-hand side of Figure 3.4, we show how the logic islands are arranged as a twodimensional array on the FPGA. e exible interconnect between the logic islands allows for
arbitrary communication patterns, say, LI00 could be talking to LI21 , which is three hops (wiring

3. FPGAS

switch matrix

carry-out

elementary
logic unit (2)

logic island

elementary
logic unit (1)

.
carry-in

IOB

LI03

LI13

LI23

LI33

IOB

LI02

LI12

LI22

LI32

IOB

LI01

LI11

LI21

LI31

IOB

LI00

LI10

LI20

LI30

IOB

carry-in

Figure 3.4: Internals of a logic island (left) and two-dimensional arrangement of logic islands (LIxy s )
on an FPGA (right), surrounded by I/O blocks (IOBs).

segments) away. e interconnect also makes it possible for logic islands to directly communicate
with special I/O blocks (IOBs) located at the periphery of the FPGA (see Section 3.5).

3.4.2 INTERCONNECT
e interconnect is a congurable routing architecture that allows communication between arbitrary logic islands. It consists of communication channels (bundles of wires) that run horizontally
and vertically across the chip, forming a grid containing a logic island in every grid cell.
As illustrated in Figure 3.5, at the intersection points of the routing channels there are
programmable links that determine how the wires are connected, allowing to connect the outputs
and the inputs of arbitrary logic islands or I/O blocks.
Each wire can be connected to any of the other three wires that come together at the intersection point, i.e., all those physical connections exist but programmable switches determine
which connections are active. In the example, in Figure 3.5, a vertical connection (red) was programmed by setting the SRAM cell of the corresponding switch appropriately. Hence, wires can
be programmed to take left or right turns or continue straight.

3.5

HIGH-SPEED I/O

As mentioned in the previous section, the two-dimensional array of logic islands is surrounded by
a large amount of I/O blocks (IOBs). ese IOBs sit at the periphery of the FPGA and are also

3.5. HIGH-SPEED I/O

programmable
switch matrix and
bundle of lines

programmable link
at intersection point

SRAM

cell

programmable
switch with
memory cell

Figure 3.5: Routing architecture with switch matrix, programmable links at intersection points, and
programmable switches.

connected to the programmable interconnect, allowing the logic islands to communicate with the
outside world (cf. Figure 3.4).
Also the IOBs can be programmed to serve dierent needs and allow the FPGA to communicate with a multitude of other devices. Typically, many I/O standards are supported, with the
two main classes of I/O standards being single-ended (used, e.g., in PCI) and for higher performance dierential (used, e.g., in PCI Express, SATA, 10G Ethernet, etc.). Typically, the IOBs
also contain certain auxiliary hardware such as serial-to-parallel converters or 8b/10b encoders/decoders that are used in a number of communication protocols. In a nutshell, the IOBs can be
programmed to implement the physical layer of many common communication schemes.
High-speed (multi-gigabit) I/O is implemented using extremely fast serial transceivers at
the heart of the IOBs. e fastest transceivers, at the time of writing this book, are the GTH/GTZ
type transceivers of the Virtex-7 HT FPGAs from Xilinx, providing 28 Gb/s serial bandwidth
eachby comparison, SATA Gen 3 requires 6 Gb/s serial bandwidth. e Virtex-7 HT FPGA
ships with sixteen 28 Gb/s and seventy-two 13 Gb/s transceivers. us, aggregate bandwidth of
more than a terabit per second can be achieved with these FPGAs.
A transceiver is an electronic device consisting of both a transmitter and a receiver.

3. FPGAS
logic island

BRAM

DSP unit
Clock A
Address A
DataIn A

DataOut A

Port A

WriteEn A

Clock B
Address B
DataIn B

DataOut B

Port B

WriteEn B

.
Figure 3.6: FPGA layout with interspersed BRAM blocks and DSP units (left), and the (slightly
simplied) interface of a dual-ported BRAM block (right).

3.6

AUXILIARY ON-CHIP COMPONENTS

e logic resources of FPGAs discussed so far are in principle sucient to implement a wide
range of circuits. However, to address high-performance and usability needs of some applications,
FPGA vendors additionally intersperse FPGAs with special silicon components (cf. Figure 3.6)
such as dedicated RAM blocks (BRAM), multipliers and adders (DSP units), and in some cases
even full-edged CPU cores. Hence, Herbordt et al. [2008] observed that the model for FPGAs
has evolved from a bag of gates to a bag of computer parts.

3.6.1 BLOCK RAM (BRAM)

On the left-hand side of Figure 3.6, a common layout of the dierent components within an
FPGA is illustrated. Between the columns of logic islands there are columns of dedicated RAM
blocks typically referred to as Block RAM (BRAM). A single BRAM block can hold a few kilobytes of data (e.g., 4 KiB), and usually an FPGA holds a few hundred BRAMs, which can all be
accessed in parallel. BRAM access is very fast, i.e., a word can be read or written in a single clock
cycle at a clock speed of several hundred megahertz.
Compared to distributed RAM, discussed in Section 3.2.4, BRAMs provide signicantly
higher density but can only be instantiated at a coarser grain, making them the ideal choice to store
larger working set data on-chip. As with distributed RAM, multiple BRAMs can be combined
to form larger memories.

3.6. AUXILIARY ON-CHIP COMPONENTS

On Virtex FPGAs, BRAMs are dual-ported, as depicted on the right-hand side of Figure 3.6. is means that the BRAM can be accessed concurrently by two dierent circuits. e
word width for each port is congurable, i.e., one circuit might choose to access BRAM at a byte
granularity, while another addresses BRAM in four-byte chunks. Each port is driven by a separate clock, i.e., the two circuits accessing the BRAM may run at dierent speeds. Furthermore,
a dual-ported BRAM can be congured to behave as two single-ported BRAMs (each one-half
the original size) or even as FIFO-queues.
BRAMs can be used for clock domain crossing and bus width conversion in an elegant way.
For instance, an Ethernet circuit clocked at 125 MHz could directly write data of received packets
into a BRAM, congured as a FIFO buer, one byte at a time. On the other side, a consuming
circuit with a dierent clock speed, say 200 MHz, could choose to read from that same buer at
a 4-byte granularity.

3.6.2 DIGITAL SIGNAL PROCESSING (DSP) UNITS

FPGAs are attractive for digital signal processing (e.g, digital ltering, Fourier analysis, etc.),
which heavily relies on mathematical operations. However, while adders and multipliers can be
built from the LUTs and ip-ops provided in an FPGA, they can by no means competein
terms of performance, space, and power consuptionwith corresponding circuits in pure silicon.
erefore, FPGA manufacturers also embed dedicated hardware multipliers and adders (between
less than a hundred to a few thousand) in their FPGAs. A nite impulse response (FIR) lter is a
typical DSP application. An example of a FIR lter is given below:
yn D a0 xn C a1 xn

1 C a2 xn

(where xn/yn indicate the lters input/output values at clock cycle n, respectively).
is lter processes a stream of input samples in a sliding window manner, i.e., the last
three input samples are multiplied by some coecients (a0 , a1 , a2 ) and then accumulated. A
typical (fully pipelined) hardware implementation of such as lter is depicted in Figure 3.7. All
three multiplications and additions are computed in parallel and intermediate results are stored in
ip-op registers (. ). Because this combination of multiply and accumulate (MAC) is so frequently
used in DSP applications, a DSP unit, in essence, usually comprises a multiplier and an adder.
As with most other components in FPGAs, the DSP units can also be customized and
combined with adjacent DSP units. For example, a Xilinx DSP48E slice has three input ports
(which are 25 bits, 18 bits, 48 bits wide) and provides a 25 18-bit multiplier in combination
with a pipelined second stage that can be programmed as 48-bit subtractor or adder with optional
accumulation feedback. Hence, these DSP units can be used in a variety of modes, and perform
operations such as multiply, multiply-and-accumulate, multiply-and-add/subtract, three input
addition, wide bus multiplexing, barrel shifting, etc., on wide inputs in only one or two clock
cycles. In the database context, fast multipliers are very useful, e.g., to implement ecient hash
functions.

3. FPGAS

xn .
a2

DSP unit
a1

Figure 3.7: Fully pipelined FIR lter constructed from three DSP units.

3.6.3 SOFT AND HARD IP-CORES

FPGAs are a great tool to create custom hardware but development eort is still signicantly
higher than for writing software. Certain functionality is fairly standardized and used in many
FPGA designs over and over again. us, so-called intellectual property (IP) cores can be instantiated in FPGA designs. In essence, an IP core is a reusable unit of logicfree or commercial
ranging from circuits for simple arithmetic operations to entire microprocessors. A soft IP core
implements the corresponding functionality using standard programmable resources provided by
the FPGA, while a hard IP core refers to a dedicated silicon component embedded in the FPGA
fabric.
BRAMs and DSP units are the simplest form of embedded silicon components on FPGAs. Often FPGA vendors also add more complex hard-wired circuitry to their FPGAs to support common functionality at high performance with minimal chip space consumption. A typical
example is the medium access controller (MAC) core found on many FPGAs, connected to an
Ethernet PHY device on the corresponding FPGA card, providing high-level access to Ethernet frames.
Some FPGAs even incorporate full-edged hard CPU cores. Several older generations
of Xilinxs Virtex FPGAs shipped with embedded PowerPC cores, e.g., the Virtex-5 FX130T
integrates two PowerPC 440 cores (800 MHz). FPGAs of the newer Xilinx Zynq series include
an ARM Cortex-A9 dual-core (1 GHz). Also Altera produces FPGAs with embedded ARM
cores, and Intel in collaboration with Altera designed an embedded processor (Stellarton) that
combines an Atom core with an FPGA in the same package.

3.7

FPGA PROGRAMMING

Having discussed the key ingredients of FPGAs, we now take a closer look at how FPGAs are
programmed. From a high-level perspective, the FPGA design ow is very similar to generating
hard-wired circuits, which we discussed in the previous chapter (Section 2.3). Its the tools proe PHY connects Ethernet circuitry to a physical medium such as optical ber or copper cable.
Please note that here the Atom processor and the FPGA communicate via PCI Express, i.e., this is more of a system-inpackage than a true system on chip (SoC).

mapped

NCD
routed

BIT

iMPACT

NCD

bitgen

NGD

par

NGC

map

HDL

ngdbuild

XST or other

3.7. FPGA PROGRAMMING

FPGA

.
synthesize

implement design

generate
bitstream

program
FPGA

Figure 3.8: FPGA design ow: Xilinx tool chain and intermediate circuit specication formats.

vided by the FPGA vendors that do all the magic of mapping a generic circuit specication onto
FPGA hardware.

3.7.1 FPGA DESIGN FLOW

To illustrate a typical FPGA design ow, we will examine the tools of the Xilinx tool chain,
as well as their intermediate circuit representations. For some of the terminology used in this
section, we refer to Chapter 2, where we discussed fundamental hardware design concepts. e
most important steps and tools of the design ow to produce an FPGA-circuit are depicted in
Figure 3.8.
Synthesis

e entry point to programming FPGAs is the same as for producing hard-wired circuits, i.e.,
typically by using a hardware description language (HDL) such as VHDL or Verilog. e Xilinx synthesizer (XST) turns an HDL specication into a collection of gate-level netlists (native
generic circuit (NGC) format), mapped to a technology library (UNISIM) provided by Xilinx.
However, at this level also third-party synthesizers (e.g., Synplicity) may be used, which typically
store the netlist using an industry-standard EDIF format.
Translate

e tool ngdbuild combines and translates all input netlists and constraints into a single netlist
saved as native generic database (NGD) le. e FPGA designer species constraints in a socalled user constraint le (UCF). Constraints are used to assign special physical elements of the
FPGA (e.g., I/O pins, clocks, etc.) to ports of modules in the design, as well as to specify timing
requirements of the design. Whereas the NGC netlist is based on the UNISIM library for behavioral simulation, the NGD netlist is based on the SIMPRIM library, which also allows timing
simulation.
EDIF stands for electronic design interchange format.

3. FPGAS

Map

e map tool maps the SIMPRIM primitives in an NGD netlist to specic device resources such
as logic islands, I/O blocks, BRAM blocks, etc. e map tool then generates a native circuit
description (NCD) le that describes the circuit, now mapped to physical FPGA components.
Notice that this is an additional step not needed in the classical design ow for generating circuits
(cf. Section 2.3).
Place and Route

Placement and routing is performed by the par tool. e physical elements specied in the NCD
le are placed at precise locations on the FPGA chip and interconnected. While doing so, par takes
timing constraints specied in the user constraint le (UCF) into account. Oftentimes, place and
route (based on simulated annealing algorithms) is the most time consuming step in the design
ow, and multiple iterations may be necessary to comply with all timing constraints. e par tool
takes the mapped NCD le and generates a routed NCD le, which also contains the routing
information.
Bitstream Generation and Device Programming

Now the routed design needs to be loaded onto the FPGA. However, the design must rst be
converted into an FPGA-readable format. is is handled by the bitgen tool, which encodes the
design into a binary, known as bitstream. e bitstream can then be loaded onto the FPGA, e.g.,
via JTAG cable and using the iMPACT tool. As a side note, modern FPGAs often also feature
the possibility to encrypt and authenticate bitstreams to support security-sensitive applications.
e bitstream controls a nite state machine inside the FPGA, which extracts conguration
data from the bitstream and block-wise loads it into the FPGA chip. Xilinx calls these blocks
frames. Each frame is stored in a designated location in the conguration SRAM that directly
relates to a physical site on the FPGA (cf. Figure 3.9) and congures the various congurable
elements on that site, e.g., multiplexers, inverters, dierent types of LUTs, and other conguration
parameters. Once the conguration memory is completely written, the FPGA is programmed and
ready for operation.

3.7.2 DYNAMIC PARTIAL RECONFIGURATION

As noted in the previous section, conguration memory of an FPGA is divided into frames, of
which each corresponds to a specic physical site on the FPGA. ese frames are the smallest
addressable units of device conguration and are distributed across the FPGA, as illustrated on
the left-hand side of Figure 3.9. A frame always groups together a column of elementary blocks
of some type such as logic islands, BRAM blocks, I/O blocks, etc. In earlier FPGAs (e.g., Xilinx
Virtex-2 FPGAs), a frame would span the entire hight of the FPGA chip but nowadays frames are
usually arranged in multiple rows. To give an idea of the granularity of these conguration frames,

3.7. FPGA PROGRAMMING

BRAM frame

DSP frame

row n - 2

row n - 1

row n

logic frame

FPGA
static region
partially reconfigurable
region A

partially reconfigurable
region B

internal configuration access port

partial
partial
partial
partial

bitstream
bitstream
bitstream
bitstream

A1
A2
B1
B2

large (external) storage

Figure 3.9: Conguration regions of an FPGA, comprising multiple elementary blocks of one type,
covered by conguration frames (left), and an FPGA programmed with two partially recongurable
regions (PRRs) and corresponding partial bitstreams stored externally (right).

in Virtex-4, Virtex-5, and Virtex-6 FPGAs a CLB-frame (i.e., for logic island conguration)
spans 16 1, 20 1, and 40 1 CLBs, respectively.
e organization of conguration memory described above is the basis for a technique
known as dynamic partial reconguration. is technique allows parts of an FPGA to be reprogrammed without interrupting other running parts on the same FPGA. To do so, only the frames
of a particular partial reconguration region (PRR) are updated with a new conguration, while
the other frames are left unchanged.
Dynamic partial reconguration enables interesting applications, where specialized modules can be loaded on-demand at run time without occupying precious chip space when they are
inactive. To load a hardware module, a control unit on the FPGA (e.g., a processor) loads conguration data from some external sourcefor example, from o-chip DRAMand sends it to
the so-called Internal Conguration Access Port (ICAP), which is the gateway to the conguration
memory of the FPGA.
e above scenario is illustrated on the right-hand side of Figure 3.9. Notice that xed
partially recongurable regions (PRRs) need to be dened beforehand, i.e., when the static part
of a circuit is designed. ose regions then serve as placeholders, into which a bitstream can be
loaded later on. A partial bitstream can be only loaded into the exact PRR that it was designed for,
Note that Xilinx has a longer history of supporting dynamic partial reconguration in their FPGAs than Altera, which is why
in this section we use Xilinx-specic terminology and focus on the design ow for Xilinx devices.

3. FPGAS

28 nm FPGA die
silicon interposer
.

package substrate
solder balls

Figure 3.10: Xilinxs stacked silicon interconnect (SSI) technology.

e.g., in the example the partial bitstream A1 could not be loaded into the partially recongurable
region B. is can be a limitation, hence partial bitstream relocation is an active research topic
studied, e.g., in the work of Touiza et al. [2012].
Nevertheless, state-of-the-art dynamic partial reconguration already exhibits signicant
benets, for example: (i) time-multiplexed applications may use more circuitry than actually ts
onto a single FPGA, (ii) there is no idle power consumed by the circuits not currently in use,
(iii) and design productivity can be increased since synthesizing smaller partial bitstreams is a lot
faster than full bitstreams synthesis.

3.8

ADVANCED TECHNOLOGY AND FUTURE TRENDS

After having discussed established core FPGA technology, in this section, we look into what
is currently happening at the forefront of FPGA research and innovation. We selected a few
topics ranging from FPGA manufacturing and FPGA architecture to how FPGAs could be programmed in the future.

3.8.1 DIE STACKING

FPGAs have a long history of being at the leading edge of semiconductor technology innovation. As such, Xilinx produced one of the rst devices that introduced so-called die stacking the
Virtex-7 2000T. Die stacking (also referred to as 3D ICs) is a technique to assemble and combine
multiple dies within the same package. e dierence to assembling multiple chips on a printed
circuit board (PCB) is that the dies can be assembled at the same density as a monolithic solution,
leading to better performance and less power consumption.
e problem with large monolithic FPGAs is that at the early stages of a new manufacturing process (shrinking transistors), there are many defective dies. In fact, die yield dramatically
decreases as a function of die size. erefore very large monolithic FPGAs are typically only
manufactured once the process matures. us, one benet of die stacking is that large FPGAs
can be produced early on. e Virtex-7 2000T that was released in 2011 consisted of 6.8 billion
transistors, making it the largest FPGA ever.
Figure 3.10 illustrates Xilinxs stacked silicon interconnect (SSI) technology used for the
Virtex-7 2000T. Four FPGA dies, fabricated using a 28 nm manufacturing process, are soldered

3.8. ADVANCED TECHNOLOGY AND FUTURE TRENDS

side by side on top of the so-called silicon interposer. e interposer is a passive silicon chip
that connects adjacent FPGA dies via tens of thousands of connections, allowing for very high
bandwidth, low latency, and low power consumption. Note that side by side stacking avoids a
number of thermal issues that could result from stacking multiple FPGA dies on top of each
other.

3.8.2 HETEROGENEOUS DIE-STACKED FPGAS

e Virtex-7 2000T, discussed in the previous section, can be thought of as homogeneous since
the four dies are all identical FPGA dies. In 2012, Xilinx took its die stacking technology one
step further and introduced the Virtex-7 H580T, which assembles heterogeneous dies in the same
package. In particular, the Virtex-7 H580T combines two FPGA dies with one transceiver die,
that hosts multiple 28 Gb/s transceivers. us, the analog portions of the transceivers are physically separated from the digital portions of the FPGA, isolating the transceivers from noise and
ensuring very high signal integrity.
At the broader scale, heterogeneous die stacking is about integrating dierent process technologies into the same package. Logic, analog, and memory (DRAM) chips have very dierent
optimization constraints, and are thus produced independently. Integrating these technologies
on the same die would be feasible but not economical. Notice that integrating, for example, a
CPU core into an FPGA die is not as problematic because CPUs and FPGAs are both logic and
use the same process technology. Fortunately, heterogeneous die stacking allows integrating dies
produced using dierent process technologies into the same package in an economically sensible way. In the future, die stacking might bring several new interesting FPGAs, as for example,
FPGAs with integrated DRAM memory.
3.8.3 TIME-MULTIPLEXED FPGAS
Tabula is a semiconductor start-up that introduced a fundamentally new FPGA architecture with
their ABAX product family, in 2010. Existing FPGA architectures suer from the fact that an
increasing amount of die area is used for the interconnect rather than logic elements, as process
technology shrinks. In other words, there would be enough space on the silicon for more logic
elements, but interconnecting them is the problem.
To avoid this underutilization of chip space, Tabula replicates FPGA resources (e.g., LUTs,
ip-ops, multiplexers, and interconnect) eight times and then switches between those independent sets of resourcesso-called foldsat a very high frequency (1.6 GHz). In a sense, eight
smaller FPGAs execute in a time-multiplexed manner, and simulate a larger FPGA, where each
individual fold runs with a clock period of 625 picoseconds, resulting in an overall clock period
of 5 nanoseconds.
Xilinx refers to SSI as being a 2.5D technology. What is meant is that active silicon (FPGA dies) are mounted on passive
silicon (the interposer). By contrast, 3D die stacking refers to active-on-active stacking, e.g, multiple FPGAs on top of each
other, which might be supported in future devices.

3. FPGAS

Tabula refers to this concept as 3-dimensional chips of eight folds, where logic elements
not only connect to adjacent logic elements in the two-dimensional space, as in traditional FPGA
architectures, but also to logic cells in the above fold. is is made possible using transparent
latches in the interconnect, which are controlled by time-multiplexing circuitry, and allow communication between dierent folds.
e ABAX chip can be programmed exactly the same way as one would program a commodity FPGA, i.e., the high-speed reconguration and time-multiplexing is completely hidden
from the programmer. e key advantage of this technology is that it can provide the same amount
of logic resources as large commodity FPGAs, however, at a signicantly lower price, i.e., in the
range of 100-200 USD. Hence, the technology is very promising. As a result, Tabula was ranked
third on the Wall Streets Journals annual Next Big ing list in 2012 [Basich and Maltby,
2012].

3.8.4 HIGH-LEVEL SYNTHESIS

As FPGA technology is developing at a rapid pace, producing chips with an ever increasing
number of programmable resources, the evolution of programming paradigms for FPGAs has
been lagging far behind, i.e., the de facto standard for many years has been to write code at the
RTL level (see Section 2.3.1) using a hardware description language such as VHDL or Verilog.
Programming at this level of abstraction is dicult and error-prone, leading to long design cycles
and a steep learning curve for developers coming from the software world, which is hindering the
wide-spread adoption of FPGA technology.
To make FPGAs accessible to a wider range of developers and to lower time to market,
there have been signicant research eorts to enable the translation of higher-level languages into
hardware circuitsso-called high-level synthesis (HLS) or electronic system level (ESL) synthesis.
Xilinxs new design suite Vivado, for example, supports synthesizing C-based algorithmic circuit
specications. e actual HLS tool is based on Autopilot from AutoESL, a high-level synthesis
vendor that Xilinx purchased in 2011.
By contrast, Altera is pushing for high-level synthesis based on OpenCL, which is a framework for writing parallel programs that execute across a number of dierent platforms such as
multicore CPUs and GPUs. OpenCL is well known for general purpose GPU (GPGPU) programming, where serial routines on the CPU control a heavy parallel workload that is executed
on the GPU. In OpenCL so-called kernels that execute on OpenCL devices, e.g., a GPU, are
specied using the C-based OpenCL language. Whereas a GPU executes such kernels on xed
compute cores, in an FPGA, the compute cores are highly customizable, and it is expected that
often similar performance to that of GPUs can be achieved, however, with signicantly lower
energy consumption. While a HLS compiler makes it easier to specify hardware circuits, testing
the generated circuits is still cumbersome. Since OpenCL is not just a language and a compiler
but also includes an execution framework, it has the potential to eventually also make testing and
debugging more productive.

CHAPTER

FPGA Programming Models

We mainly looked at FPGAs from the hardware technology side so far. Clearly, the use and
programming of FPGAs is considerably dierent to the programming models that software
developers are used to. In this chapter, we will show how entire system designs can be derived
from a given application context, and we will show how those designs can be made to benet
from the intrinsic properties of FPGA hardware (or how work can be partitioned, so solutions
can get the most out of hard- and software).
We will start the chapter (Section 4.1) with a discussion about trade-os in exibility and
performance and how one can avoid very high circuit compilation cost. In Section 4.2, we show
how a given application problem can actually be turned into a hardware circuit. In Sections 4.3
and 4.4 we then illustrate how generated circuits can be optimized by leveraging data and pipeline
parallelism (respectively).

4.1

RE-BUILD, PARAMETERIZE, OR PROGRAM THE

HARDWARE ACCELERATOR?

FPGAs provide the opportunity to modify and re-congure the FPGA at a very ne granularity,
even from one user query to the next. We will, in fact, discuss some systems that follow this
route later in Chapter 5. e strategy is not appropriate for all application scenarios, however.
e problem is that circuit (re-)compilation is an extremely CPU-intensive operation. Necessary
steps, such as component placement and routing are highly compute-intensive, and they scale
poorly with the circuit size. In practice, several minutes of compilation are the norm; some circuits
might require hours to be generated.
On the positive side, circuit re-building has the potential to generate highly ecient (if not
optimal) circuits for a very wide range of problem instances. Eectively, system designers face
three design goals:

(a) Runtime Performance. At runtime, the hardware solution should have good performance characteristics. Ideally, these characteristics should be close to those of a hand-crafted, tailor-made
circuit for the problem instance at hand.
(b) Flexibility/Expressiveness. e solution should support a wide range of problem instances.
SQL, for instance, is expressive enough to cover many useful applications.

4. FPGA PROGRAMMING MODELS

runtime
performance

.
exibility/
expressiveness

re-conguration
speed

Figure 4.1: Design space for FPGA programming model. Design goals are execution performance
at runtime; exibility/expressiveness to support dierent problem instances; and a fast way to realize
workload changes. Not all goals can be maximized at the same time.

(c) Re-Conguration Speed. When workloads change, the hardware solution should be able to
react to such changes with low latency. To illustrate, having to wait for minute- or hour-long
circuit routing is certainly inappropriate for ad hoc query processing.
Unfortunately, not all of these goals can be reached at the same time. Rather, designers have to
make compromises between the opposing goals. As illustrated with Figure 4.1, at most two goals
can be met satisfactoryat the expense of the third.

4.1.1 RE-BUILDING CIRCUITS AT RUNTIME

Circuit re-compilation enables optimizing runtime performance
and expressiveness, but sacrices re-conguration speed. It could
thus be placed in the design space as shown here on the right.
Because of the high re-conguration eort, the model seems
most appropriate whenever the workload is mostly static and long
re-compilation times are acceptable. Examples could be risk management for electronic trading systems (we shall see one example
in Section 5.3.1) or network intrusion detection (Section 5.1.4),
where rule sets to match change only at a longer time scale.

runtime perf.

.
exibility

re-conf. speed

Pre-Compiled Modules, Partial Reconguration

e re-compilation eort can be reduced through the use of pre-compiled modules that the circuit
generator merely stitches together to obtain a working hardware solution for a given problem
Note that circuit re-compilation is a software-only task. e FPGA must be taken o-line only to upload the compiled
bitstream. is time is relatively short and technology exists to eliminate it altogether (using multi-context FPGAs).

4.1. RE-BUILD, PARAMETERIZE, OR PROGRAM THE HARDWARE ACCELERATOR?

instance. e system of Dennl et al. [2012], for instance, includes modules for relational algebra
operators that can be used to construct a physical representation of an algebraic query plan on the
two-dimensional chip space.
Pre-compiled modules work well together with partial reconguration, selective replacement of only some areas on the
runtime perf.
FPGA chip (cf. Section 3.7.2). And when these areas are small,
also less time is needed to move the (partial) bitstream to the
device, which may further improve responsiveness to workload
.
changes.
Only so much variety can be pre-compiled, however, which
limits the exibility that can be achieved by following this route. In exibility re-conf. speed
our three-goal design space, this moves the approach toward faster
re-conguration speed, but at the expense of exibility (illustrated
on the right). Pre-compiled modules and partial re-conguration can also be combined with
parameterization to improve the exibility/performance trade-o. We will look at this technique
in a moment.
At this point we would like to mention that circuit re-construction is not only computeintensive. It also implies that, at application runtime, a stack of hardware design tools has to be
executed for synthesis, placement/routing, and bitstream generation. Installation, maintenance,
and licensing of these tools might be too complex and expensive to employ the approach in practical settings. If used with partial re-conguration, pre-compiled modules might not actually depend on the availability of these tools. But partial re-conguration has yet to prove its maturity
practical use. Most commercial users would likely refrain from using the technology today in
real-world settings.

4.1.2 PARAMETERIZED CIRCUITS

Commercial uses of FPGAs often do not re-generate circuits after system deployment at all
and treat them similarly to a hard-wired ASIC. Hardware re-congurability is used here only
to (a) perform occasional hardware/software updates and (b) avoid the high production cost of
ASICs when the applications market size is too small. Adaptation to application workloads is
then achieved via parameterization. e idea is that the hardware circuit itself is hard-wired, but
at runtime it reads out and interprets a number of conguration parameters that describe the
current workload.
For certain workloads, this model is a good t. If, for instance, the FPGA is used only to
lter a tuple stream based on a simple predicate, say
column-name constant
(where is some comparison operator D; <; ; : : : ), then column-name, , and constant can be
left as a conguration parameter in an otherwise hard-wired circuit.

4. FPGA PROGRAMMING MODELS

.
disk

memory

DMA

. compress
project

CPU

restrict
FPGA
Figure 4.2: Data ow in the Netezza FAST engine (adapted from Francisco [2011]).

An example of this model is the Netezza Data Appliance Architecture. e included

FAST engine (FPGA-accelerated streaming technology) uses a xed processing pipeline
to perform data preprocessing in FPGA hardware. is pipeline, illustrated in Figure 4.2,
rst decompresses the raw table data, then performs projection and
runtime perf.
row selection in sequence. Not shown in the gure is a visibility engine that discards tuples not valid for the current transaction (e.g.,
uncommitted data).
rough runtime parameters, the FAST engine can be con.
gured to run most of the critical tasks in the application areas that
exibility re-conf. speed Netezza was designed for. Parameter updates can be applied almost
instantaneously, since the necessary amount of conguration is very
small. But the engine cannot be congured beyond this well-dened query pattern. In terms of
our design space illustration, Netezza oers good runtime performance and fast re-conguration.
But it gives up exibility, compared to systems that employ circuit re-construction, as illustrated
here on the left.
Example: Skeleton Automata

Parameterization can actually be quite powerful. Its potential reaches far beyond only the setting
of selection parameters or column names. e approach is expressive enough to cover a large and,
most importantly, relevant subset of XPath, the de facto standard to access XML data [Teubner
et al., 2012]. Pushing this subset to an accelerator may speed up, e.g., an in-memory XQuery
processor by large factors.
XPath can be implemented with help of nite-state automata, driven by the sequence of
opening and closing tags in the XML input. e structure of these automata depends on the user
query. e relevant insight now is that the class of automata that can arise is constrained by the
XPath language specication. is constraint is sucient to build a skeleton automaton that includes any transition edge that could be expressed with XPath. By making the condition assigned
to each of these edges a conguration parameter, the skeleton automaton can be parameterized

4.1. RE-BUILD, PARAMETERIZE, OR PROGRAM THE HARDWARE ACCELERATOR?

skeleton automaton
XPath
spec.

:::

static part (o-line)

dynamic part (runtime)

user query
/a//b

a
?

:::

b
*

:::

FPGA

conguration param.
Figure 4.3: Parameterization oers excellent runtime and re-conguration performance, but limits
the expressiveness addressable by the hardware accelerator. Skeleton automata illustrate how a meaningful class of queries can be supported nevertheless. Edges of a general-purpose state automaton (the
skeleton automaton) can be parameterized to describe any relevant query automaton.

to run any XPath query (within the relevant dialect) as a true hardware NFA. Transitions not
needed for a particular query can be assigned a false condition parameter, eectively removing
the transition from the automaton.
Figure 4.3 illustrates this concept. e skeleton automaton is generated based on the language semantics of XPath and uploaded to the FPGA once. e user query is then used to infer
conguration parameters (printed blue in Figure 4.3), which are used to ll placeholders (indicated as in Figure 4.3) in the hardware circuit. Conguration parameters can be inferred and
installed in a micro-second time scale, which guarantees full ad hoc query capabilities.

4.1.3 INSTRUCTION SET PROCESSORS ON TOP OF FPGAS

e remaining corner in our design space (see gure on the right below) is easy to ll. An instruction set processor, realized in the FPGA fabric, oers high exibility and can be re-congured by
merely changing the program it executes.

4. FPGA PROGRAMMING MODELS

Possible processor designs could range from low-footprint

runtime perf.
general-purpose processors (such as ARM architectures or MicroBlaze soft cores oered by Xilinx) to very specialized processors whose instruction set is targeted to the particular applica.
tion problem. However, the former type of processors would compete with counterparts that are readily available as discrete silicon
componentswhich oer higher speeds and better energy e- exibility re-conf. speed
ciency at a signicantly lower price. Realizing a general-purpose
processor on FPGA hardware alone thus seems hardly attractive.
Specialized processors on FPGA hardware can have an edge over mainstream processors. is was demonstrated, e.g., by Vaidya and Lee [2011], which proposed a tailor-made coprocessor for column-oriented databases, implemented in Altera Stratix III hardware. ough
Vaidya and Lee [2011] did not implement an end-to-end system, their simulations indicated
that this can lead to an improvement in query execution speed by one or two orders of magnitude. e approach had been followed commercially by Kickre, which marketed their accelerator
as SQL chip. Kickre was acquired by Teradata in 2010.
Mixed Designs. e approaches listed above do not exclude one another. In fact, the potential
of congurable hardware lies in the possibility to build hybrid designs. For instance, a hardware
design could include a highly specialized, tailor-made circuit; fast, but only suited to run the
most common case. e tailor-made circuit could be backed up by an instruction set processor
that captures all remaining situations, such as infrequent request types, system setup, or exception
handling. If these situations are rare, processor speed will not really matter to the overall system
performance.
Commercially available FPGA hardware increasingly supports such mixed designs. In various forms, vendors oer FPGA chips where a general-purpose CPU is readily integrated as a
discrete silicon component. is eases system design, reduces cost (since the instruction set processor does not occupy precious congurable chip space), and improves performance (discrete
silicon blocks sustain higher clock frequencies). An example of such hardware is the Xilinx Zynq
product line, which integrates a hard ARM core with congurable chip space.

4.2

FROM ALGORITHM TO CIRCUIT

e true potential of FPGA technology lies, of course, in the ability to create tailor-made circuits
for a given application problem. But how can such circuit be inferred from a problem specication?

4.2.1 EXPRESSION ! CIRCUIT

As discussed in the hardware design part of this book, any logic circuit consists of combinational
logic and some form of memory. Both parts are glued together through a wiring that connects all
parts of a circuit. Combinational logic can describe operations where the output only depends on

4.2. FROM ALGORITHM TO CIRCUIT

the present operator input, but not on, e.g., previously seen input items. at is, combinational
circuits describe pure functions. Combinational circuits can be wired up into larger combinational
circuits simply according to the data ow of the functions that they describe. e operation

y D f g.x1 / ; h.x2 /
could thus be implemented as the circuit
x2

where f. indicates the sub-circuit that implements f . By construction, this leads to directed
acyclic graphs/circuits.
Some application problems require access to previous values from the input and/or from the
output. If we use xk to denote the value of an (input) item x at clock cycle k , the two examples
yk D

1 C xk
2

(A)

yk D yk

(S)

1 C xk

would access values from the preceding clock cycle to compute the average value of the last two
x seen and the running sum of all x , respectively.
To implement such behavior, memory elementsip-op registers in practicemust be
inserted into the data path(s) of the circuit. A register, indicated as rectangles . as before, briey
stores the value that it receives at its input and makes it available on its output during the next
clock cycle. e above two examples can then be expressed in hardware as follows:
x

1=.2

.
x

Observe how, in both cases, registers delay values by one clock cycles. To access even older
items, multiple delay registers can be used one after another.

4.2.2 CIRCUIT GENERATION

Oftentimes, programmers need not explicitly draw circuit diagrams to turn an application problem into a hardware circuit. Rather, expressions that describe the combinational parts of a circuit
can be phrased directly in a hardware description language (VHDL or Verilog). Modern design
tools will then not only compose circuits automatically as illustrated above. ey will also perform
cross-operator optimizations, generally improving area and/or speed eciency of the generated
circuit.
For instance, in the A example, addition and division can be combined, simply by discarding any logic that would compute
the least-signicant bit of the C output.

4. FPGA PROGRAMMING MODELS

For functions that require delay functionality, the circuit must be explicitly synchronized to
a clock signal in the VHDL/Verilog code. If a (sub-)result must be carried from one clock cycle
to the next (i.e., delayed) that result must explicitly be assigned to a register variable or signal.
Design tools will, however, try to eliminate redundant registers. ey will often also try to retime the resulting circuit: pushing combinatorial tasks before or after a delay register may help to
balance signal delay paths and thus improve the maximum propagation delay, which is one of the
key determinant for the circuits speed.
Circuits generated this way typically serve as an entry point for further tuning. In particular,
those circuits might have long and poorly balanced signal paths (despite automatic re-timing).
And they do not leverage parallelism, which is a key strength of tailor-made hardware. In the section that follows, we will thus discuss how circuits can be optimized by exploiting data parallelism
and pipeline parallelism (the latter also leads to optimized signal paths). But before that, we will
have a very brief look at high-level synthesis tools.

4.2.3 HIGH-LEVEL SYNTHESIS

e aforementioned techniques are quite eective to build fast hardware circuits based on data
ow-oriented task descriptions. By composing large circuits from smaller ones, even sizable application problems become manageable.
From a (software) programmer perspective, however, composition purely based on data ow
is often counter-intuitive; an automated compiler that can handle control and data ow parts of
a conventional programming language is clearly desirable. And indeed, several development
platforms today oer high-level language compilers that can compile C, Java, or similar languages
directly into a circuit description.
Much like a conventional software compiler, these systems usually turn the user program
rst into an internal representation consisting of basic blocks and annotated with data ow and
control ow information. Figure 4.4 illustrates this for the Optimus compiler of Hormati et al.
[2008]. e small code in Figure 4.4(a) results in four basic blocks as shown in Figure 4.4(b). e
graph in Figure 4.4(b) is also annotated with control ow and data ow information (the latter
only for variable sum).
To convert the sample code into a hardware circuit, Optimus will create a hardware unit
for each block that adheres to a xed port interface schema. is schema includes explicit ports
for data and control ow information. Hardware units are nally wired up by connecting these
ports according to the data and control ow information inferred from the user code. For more
details refer to [Hormati et al., 2008].
High-level language synthesis becomes more eective if the starting language already contains mechanisms to, e.g., express independence of operations and/or parallelism. e Optimus
compiler, for instance, is part of the Liquid Metal system of IBM Research [Auerbach et al.,
2012] and comes with its own programming language Lime to express such properties. Kiwi, a

4.3. DATA-PARALLEL APPROACHES

BB1

sum
0
i
0

BB2 tmp

sum
0
for i D 0 to 7 do
sum
sum C read ()
end for
write (sum)

read ()

sum
sum C tmp
BB3 i D i C 1
branch to BB2 if i < 8
BB4 write (sum)

(a) Small example code with data and control

ow.

(b) Basic blocks as inferred by Optimus, annotated with data ow

information . and control ow information . .

Figure 4.4: To compile high-level language code, Optimus [Hormati et al., 2008] breaks down user
code into basic blocks and annotates data ow and control ow information. After mapping all basic
blocks to hardware, their control and data ports are wired according to the control and data ow
information. Illustration adapted from [Hormati et al., 2008].

joint research project of U Cambridge and Microsoft Research [Greaves and Singh, 2008], uses
custom attributes to .NET assembly to achieve the same goal.
e Accelerator platform makes the interplay of language expressiveness and parallelism on
dierent hardware back-ends explicit. In the context of .NET, Accelerator describes a functionalstyle framework to express data-parallel programs. A set of back-end compilers can then generate runnable code for a wide range of data-parallel hardware, including commodity processors,
graphics processors, and FPGAs (the latter has been demonstrated by Singh [2011]).

4.3

DATA-PARALLEL APPROACHES

e strength of FPGAs (or any other bare-hardware platform) lies in their inherent hardware
parallelism. Fundamentally, any single gate, any sub-circuit, or any sub-area on the chip die can
operate independently of any other. is massive potential for parallelism is only limited by explicit synchronization or serialization, which the circuit designer chose to dene to match the
semantics of the application scenario.
In practice, application circuits do need a fair degree of synchronization. But hardware
allows such synchronization to be lightweight and very ecient. is is in sharp contrast to
heavyweight mechanisms (compare and swap or even just memory ordering), which programhttp://research.microsoft.com/en-us/projects/Accelerator/

4. FPGA PROGRAMMING MODELS

mers should avoid in software-based systems if they aim for performance. is capability for
lightweight synchronization is also the reason why we are particularly interested in ne-grained
parallelism in this chapter and look at task assignment on the level of assembly style microoperations.
Types of Parallelism. e available hardware parallelism can be applied to application problems
in many dierent ways. In the context of FPGAs and tailor-made circuits, two strategies have
been by far the most successful, and we will discuss them in turn. In this section, we rst look at
ne-grained data parallelism, which has many similarities to software techniques like vectorization
or partition/replicate schemes. Section 4.4 then looks at pipeline parallelism, which in hardware
is a lot more appealing than it typically is in software.

4.3.1 DATA PARALLELISM

ere are many real-world situations where the same processing task has to be applied to all items
in a data set. And with certain assumptions on independence and order indierence, hardware
is free to handle individual items in parallel. e classical example to this are matrix or vector
computations, where individual components can be calculated independently. Most tasks in the
database domain can be handled this way, too; the relational data model, for instance, explicitly
assumes set-oriented processing.
In the software domain, data parallelism has been successful at many levels. Vector processors operate on sets of data on the assembly level, a model that has been re-incarnated more
recently in the form of SIMD capabilities for mainstream processors or in the form of dataparallel kernel execution on modern graphics processors (GPUs). Modern compilers allow us to
parallelize situations of data parallelism across compute resources of multi-core architectures (an
example tool kit is Intels read Building Blocks library). And on the network scale, the MapReduce paradigm [Dean and Ghemawat, 2004] has written a remarkable story of success.
Data Parallelism in Hardware

collect

dispatch

On the hardware level, data parallelism corresponds to a

replication of components on the chip area. is is illusreplica 1
trated in Figure 4.5 here on the right. Input data is dis::
tributed over the q replicas of the sub-circuit and processed
:
in parallel. At the output, circuitry may be required to collect the results of parallel execution into a sequential output
. q
replica
stream.
Replication allows us to linearly increase throughput
by dedicating additional chip resources. q -fold replication Figure 4.5: Circuit replication.
roughly leads to a throughput improvement by a factor of
q (dispatch and collect logic can usually be integrated into surrounding components and thus be
neglected).

4.4. PIPELINE-PARALLEL APPROACHES

Besides availability of chip space, the practical limit to replication is the speed at which data
can be provided to the circuit and/or results consumed at the output of the circuit. For instance,
if data arrive as a single, sequential input stream, dissecting this stream into independent work
units often bottlenecks data-parallel execution.

4.4

PIPELINE-PARALLEL APPROACHES

Pipeline parallelism is a form of task parallelism and applies to situations where a given task f
can be broken down into a sequence sub-tasks f1 ; : : : ; fp , such that f D fp f1 . e subtasks f1 ; : : : ; fp can then be computed on separate processing units, and intermediate results are
forwarded from unit to unit (or pipeline stage to pipeline stage). e concept is similar to an
assembly line in, say, a car factory. And like in a car factory, the concept becomes highly parallel if
multiple data items ow through the pipeline one after another; all processing units then operate
in parallel, each one on a dierent input item.
At the logical level, pipelining is a frequent pattern in many software systems. e execution
engines of most database systems, for instance, are built on the concept and evaluate query plans
in a pipeline model. Physically, however, those systems use pipelining only rarely. e cost of
messaging between physical processing units is often prohibitive; forwarding database tuples,
e.g., between processor cores would dominate the overall execution cost.

4.4.1 PIPELINE PARALLELISM IN HARDWARE

Between components of a hardware circuit, however, communication can be realized very eciently, making pipeline parallelism a feasible concept even for very lightweight sub-tasks fi . In
fact, as we shall see in a moment, pipeline parallelism is the method of choice to gain performance
in hardware circuits.
e reason for this is that pipeline parallelism becomes relatively easy to apply to a hardware
circuit. To illustrate, suppose we are given a combinational circuit that performs an application
task f . By re-grouping the gates and wirings into, say, three blocks, we can break up the computation of f into sub-circuits f1 ; : : : ; f3 , even when their local semantics would be hard to express
on the application level:
f.

f.1

To compute the fi independent of one another (and thus allow for parallelism), as a next
step we need to introduce pipeline registers. To this end, we insert a ip-op on any signal path
from fi to fiC1 , as indicated here using gray bars:
f.1

4. FPGA PROGRAMMING MODELS

throughput

replication

combined

q D 13

p D 10
qD2

p D 100

p D 50

p D 10

p D 20

pipelining

qD7

qD4

pD4
pD2
pD1

q D 10

qD1

qD2

area

Figure 4.6: Eects of replication and pipelining on processing throughput of an example circuit. For
this graph, we modeled an example circuit along the lines of Kaeslin [2008]. Notice that the location
of, e.g., crossover points may depend on the particular circuit.

e eect of these registers is that they remember the outcome of computation fi , such that in
the following clock cycle they appear as a stable signal to the input of fiC1 . e circuit for fi then
becomes available again to accept the next item from its input.
us, the modied circuit could still accept a new input item on every clock cycle (like the
original circuit f could). But the introduction of pipeline registers reduced the longest signal path
of the circuit, which is now the longest path of any fi . Assuming that f could be broken into subcircuits of equal size, this means that the clock frequency can be increased by approximately a factor
of p (the pipeline depth). e increased clock frequency directly translates into a throughput
increase.
Cost of Pipelining. Pipelining is attractive in hardware, because the introduction of pipeline
registers causes only a small overhead, both in terms of space and performance. Notice in particular that we only had to introduce new registers; there was no need to replicate or expand the
combinational part of the circuit, however. Pipelining reaches its limit when the signal delay for
a single fi approaches the speed of the associated pipeline register.
is can be seen in Figure 4.6. In this gure, we illustrated how replication/data parallelism
and pipelining can turn additional chip area into increased throughput. To obtain the graph, we
assumed characteristics of a typical hardware circuit and idealized models for the eects of replication and pipelining, as detailed by Kaeslin [2008]. e graph shows how, initially, pipelining
requires little investment (some ip-ops only) to gain substantial throughput improvements. e
intrinsic latency of pipeline registers, however, limits the throughput that can be achieved with
pipelining alone.

4.4. PIPELINE-PARALLEL APPROACHES

In practice, replication and pipelining are often used in combination. To illustrate, we included an example where the (pipelined) circuit with p D 10 is replicated on the chip, which for
the example results in an additional performance boost.
Wiring Overhead. e simple model used to generate Figure 4.6 does not include any overhead
that may result from wiring and communication needs. If replication is used excessively, however,
this overhead may become signicant in practice, whereas pipeline parallelism tends to be less
aected by wiring costs. is dierence can be understood from the graphical representation that
we showed above for both approaches. As the degree of parallelism increases replicated circuits
will range over an increasing area of the chip die. us, dispatch and collect circuits at both ends
have to bridge an increasing distance as q grows, which directly aects signal path lengths. is
problem does not arise for pipeline parallelism, where communication remains short-ranged no
matter how large the parallelism degree.
A practical example of how pipelining may actually help to avoid
large fan-outs and long signal paths is the XML ltering engine that
.
:::
we showed earlier in Figure 4.3. Its actual implementation (detailed in
[Teubner et al., 2012]) uses pipeline registers in-between segments of
the state automaton circuit, as illustrated here on the right (Figure 4.7).
e design avoids that the input stream must be supplied to all automa- Figure 4.7: Pipelining
ton transitions in parallel, which would scale poorly as the number of and skeleton automata
segments increases. is keeps throughput rates high, even when the [Teubner et al., 2012].
length and complexity of the matching automaton increases [Teubner
et al., 2012].

4.4.2 PIPELINING IN FPGAS

e discussion above assumed that chip space can equally be used to instantiate (or replicate)
logic or for additional ip-ops used as pipeline registers. In FPGAs, the amount and proportion
of chip resources is xed, however. Combinational logic resources cannot be re-purposed for ipops or vice versa.
In Section 3.3.1, we saw that actual FPGA devices are composed of elementary logic units,
which refers to a static combination of lookup tables (i.e., combinational logic) and ip-ops that
can latch the lookup table output. Enabling these ip-ops as pipeline registers, hence, will not
increase the overall chip space demand of the circuit. At the same time, the size of an elementary
logic unit lies within the sweet spot of pipelining, where combinational logic and register count
are balanced to obtain a good area/throughput trade-o.
Good FPGA designs thus try to (quite aggressively) pipeline until the point where combinational logic and ip-ops are in balance. is explains, in other words, why the NFA construction mechanism of Yang and Prasanna [2012], which we looked at in Section 5.1.3, achieves
favorable performance characteristics. e basic building blocks of this mechanism exactly match
the lookup table/ip-op combination within a typical elementary logic block. Observe how, in

4. FPGA PROGRAMMING MODELS

the same line of work, Yang et al. [2008] also bounded signal paths and fan-outs through pipelining (Figure 5.8).

4.4.3 DESIGNING FOR PIPELINE PARALLELISM

Pipeline parallelism is attractive also becauseunlike data parallelismit permits many forms
of loop-carried dependencies. is is because, by construction, a data item that is processed at
a pipeline stage fi still sees any side eects that preceding items may have produced at fi . To
illustrate this with a toy example, the for clause in
sum
0
for all input items x do
sum
sum C x 2
end for
contains an inter-loop dependency (through variable sum) that violates the requirements for a
data-parallel loop processing.
e loop body can, however, be broken down into two sub-operations that are evaluated in
a pipeline fashion:
tmp . x 2

sum

sum C tmp

Pipeline registers in-between the two sub-operations would enable the loop body to be processed in parallel. In practice, operations like x 2 can be pipelined internally, increasing parallelism
and speed even further.
Example: Pipeline-Parallel Frequent Item Computation

Arguably, the above example could be replaced by a parallel computation of partial sums, followed
by a merging operation. e attractiveness of pipelining comes from the fact that also much more
intricate application problems can be accelerated through pipeline parallelism. To illustrate this,
consider the Space-Saving algorithm of Metwally et al. [2006], shown as Algorithm 1. To answer
top-k -type queries, Space-Saving counts the number of occurrences of items xi in a data stream.
Space-Saving is an approximate algorithm with bounded space (n bins, item/count pairs).
Space-Saving not only contains loop-carried dependencies (via count and item values). It
also requires accessing the same memory content according to dierent criteria (item lookups and
search for the bin with the minimum count value), which has made the algorithm notoriously
hard to parallelize through classical means. In fact, the parallel solution of Das et al. [2009] has a
lower throughput than the single-threaded code of Cormode and Hadjieleftheriou [2008], who
did an in-depth study and comparison of frequent item counting techniques.
Pipeline parallelism can help to signicantly speed up Space-Saving on FPGA hardware [Teubner et al., 2011]. e idea is a combination of classical pipeline parallelismlet input
items x ow through a sequence of hardware-based binsand a neighbor-to-neighbor communication mechanism between bins. e concept is illustrated in Figure 4.8. Items xj travel along

4.4. PIPELINE-PARALLEL APPROACHES

for all input items x do

nd bin bx with bx :item D x
if such a bin was found then
bx :count
bx :count C 1
else
bmin
bin with minimum count value
bmin :count
bmin :count C 1
bmin :item
x
end if
end for
Algorithm 1: Space-Saving algorithm of Metwally et al. [2006].
3

item
count

1 bi :item D x1

item
count

biC1

item
count

biC2

2 bi :count < biC1 :count

1 comFigure 4.8: Pipeline-parallel implementation of Space-Saving. For each item xj and bin bi ,
2
pare xj to the content of bi , compare the count values of bi and bi C1 (if necessary, swap bin
3 forward xj to bi C1 .
contents so the smaller count value ends up on the right), then
1 xj is compared to the local item and the count value
a pipeline of bins bi . At every bin bi ,
2 the bin contents of bi and biC1 are compared
incremented if a match was found. Otherwise,
according to their count values. If necessary, the contents of bi and biC1 are swapped, such that
3 the input item xj moves on to the next bin biC1 .
biC1 receives the smaller count value. en
e frequent item example also nicely illustrates how pipeline parallelism can lead to signicantly better scalability properties than data-parallel alternatives. In Teubner et al. [2011], we
discussed various parallelization schemes for the problem on FPGA hardware. Figure 4.9 shows
the scalability of the pipelined and data-parallel strategies of this comparative study.

4.4.4 TURNING A CIRCUIT INTO A PIPELINE-PARALLEL CIRCUIT

An electronic circuit reects the data ow of the application problem that it implements. If this
data ow is cycle-free, pipelining becomes straightforward and can be achieved as sketched at the
beginning of Section 4.4.1. To this end, the circuit is cut in half in such a way that all signal
wires which cross the cut have the same direction. A set of pipeline registers inserted into such

4. FPGA PROGRAMMING MODELS

throughput [million items / sec]

125

pipeline-parallel

100
75
50

data-parallel

25
0

32
64
128
256
512
1024
number of items monitored (degree of parallelism)

Figure 4.9: FPGA-based solutions to the frequent item problem. e pipeline-parallel strategy of
Figure 4.8 keeps wire lengths short and scales signicantly better than a data-parallel alternative (data
from [Teubner et al., 2011]).

a cut will keep the semantics of the circuit intact; results will only be delayed by one additional
clock cycle.
If the data ow contains cyclesthat is, if the output of a previous computation is needed
to process the next input itemsuch cuts may no longer be found, and the intuitive approach to
pipelining no longer be successful. It turns out, however, that many forms of such recursive circuits
can still be pipelined. A good starting point to get to a pipelined solution is to partially unroll the
iterative computation. Depending on additional properties, such as associativity of operations, the
unrolled circuit can then often be re-organized such that cuts in the above sense can be found.
A detailed discussion is beyond the scope of this book and we refer to Section 2.7 in the book of
Kaeslin [2008] for details.

4.5

RELATED CONCEPTS

In many ways, FPGAs are probably the most exible type of compute resources available today.
As such, they cover a very large design space in terms of their system design and integration
and in terms of how they exploit hardware parallelism, which has become the key approach to
performance in virtually any computing device.
Several specic points in this design space are also covered by o-the-shelf hardware devices
that may be usable as accelerators, co-processors, or even standalone processors. Probably the most
well-known representatives of such a device are graphics processors (GPUs), which meanwhile have
evolved into computing platforms with a remarkably wide range of applications.

4.5. RELATED CONCEPTS

Graphics Processors (GPUs). Graphics processing units make data parallelism explicit at a scale
unmatched by any other processor technology. e latest incarnation of the NVIDIA Kepler architecture, for instance, includes up to 15 SMX Streaming Multiprocessors [Corp., 2012]. Each
of them contains 192 cores (which in general-purpose CPU terminology would best compare
to an ALU), for a total of 2,880 parallel processing units. ereby, the granularity of data parallelism sits in-between the replicated parallel circuits that we discussed in Section 4.3 and the
course-grained parallelism available in general-purpose multi-core processors. NVIDIA GPUs
follow an SIMT (single instruction, multiple threads) execution model: groups of 32 threads
(a warp) are scheduled such that all 32 threads execute the same instruction at the same time.
is signicantly eases hardware scheduling logic and brings intrinsic data parallelism all the way
down to the execution layer.
At the programmers level, the graphics device can execute compute kernels. Many thousands, even millions or more, logical executions of such a kernel operate on independent data
items. ereby, synchronization across threads is highly limited, which again simplies the underlying hardware and enables higher compute density.
Graphics processors were identied as a potential substrate for database co-processing almost a decade ago. Govindaraju et al. [2004] showedeven before graphics processors oered
the programmability that they have todaythat the available data parallelism can be used to speed
up important database tasks, such as selection and aggregation. Two years later, Govindaraju et al.
[2006] showed how the use of a GPU co-processor can oer signicant performance/price advantages when implementing sorting. He et al. [2008] pushed the idea of using graphics processors
for database co-processing one step further by running database joins on GPU hardware.
Many-Core Processors. Inspired by graphics processor designs, other processor makers have
come up with acceleration platforms that emphasize data parallelism on the hardware level. Intel
has recently announced their Xeon Phi platform [Intel Corp., 2012], which packages 60 generalpurpose cores, each capable of executing four threads, into a single chip. Internally, each core
oers a 512-bit-wide SIMD unit for vector-oriented execution. While the processing model of
Xeon Phi is not strictly tied to data-parallel execution, its NUMA (non-uniform memory access)
architecture clearly favors data-parallel tasks.
Data Flow Computing on FPGAs. For reasons mentioned in this chapter, FPGAs are very attractive to leverage pipeline parallelism (and accelerate application tasks that can benet less from
data parallelism alone). Several vendors make this explicit in FPGA programming platforms. For
instance, Maxeler oers platforms for dataow computing [Pell and Averbukh, 2012]. A dedicated compiler extracts data ow information from a Java program. e resulting data ow graph
is then mapped to an FPGA circuit design that is highly susceptible to pipeline parallelism. Processing pipelines created this way may span entire FPGA chip dies and use thousands of pipeline
stages.

CHAPTER

Data Stream Processing

FPGA technology can be leveraged in modern computing systems by using them as a co-processor
(or accelerator) in a heterogeneous computing architecture, where CPUs, FPGAs, and possibly
further hardware components are used jointly to solve application problems.
is idea is particularly attractive for application settings that readily match the strengths
of programmable hardware. is has led the research community to specically investigate into
streaming scenarios, where data arrives as a continuous ow and has to be processed in real time.
Flow-oriented processing can leverage the I/O capabilities of modern FPGA devices, and if the
necessary amount of state can be kept (mostly) within the device, an FPGA-based stream processor will not suer from von Neumann or memory bottlenecks.
In this chapter, we look at some systems that have applied this idea successfully to important application problems. We start by showing how FPGAs can accelerate pattern matching in
network intrusion detection systems (Section 5.1), then lift the ideas to support complex event processing, which lies at the heart of many stream processing engines (Section 5.2). In Sections 5.3
through 5.5, we will generalize the use of FPGAs for data ow problems and illustrate how SQLstyle query functionality can be realized with the help of FPGAs.

5.1

REGULAR EXPRESSION MATCHING

Matching an input data stream to a (set of ) regular expression(s) is an important and useful task
on its own, but also as a pre-processing step for many stream analysis tasks. For instance, it is the
task of a network intrusion detection system to detect suspicious patterns in a network data ow
and perform a series of actions when a match is detected (e.g., alert users or block network trac).
Semantically rich stream processing engines depend on parsing and value extraction from their
input stream to perform higher-level computations.
Regular expressions correspond 1-to-1 to nite-state automata, which are the method of
choice to realize pattern matching in both hard- and software. eir low-level implementation
faces considerably dierent trade-os in hard- and software. By studying these trade-os, in
the following we illustrate some of the important characteristics of FPGA hardware and their
consequences on real-world applications.

5.1.1 FINITE-STATE AUTOMATA FOR PATTERN MATCHING

Figure 5.1 illustrates how nite-state automata and regular expressions relate. e automaton
shown implements the regular expression .*abac.*d. It is an instance of a non-deterministic

5. DATA STREAM PROCESSING

a
q.0

b
q1

a
q2

c
q3

q5
*

Figure 5.1: Non-deterministic automaton, corresponding to the regular expression .abac.d.

nite-state automaton (NFA). at is, multiple states in this automaton might be active at a time
(e.g., after reading a single a, states q0 and q1 are both active) and multiple transitions may have
to be followed for a single input symbol.
A beauty of non-deterministic automata lies in their easy derivation from regular expressions. Simple, syntax-driven construction rulesmost well known are those of McNaughton and
Yamada [1960] and ompson [1968]allow to mechanically convert any regular pattern into an
equivalent NFA. is simplicity is the reason why in Figure 5.1, the original pattern .*abac.*d
still shines through. For similar reasons, such non-deterministic automata are attractive in many
application scenarios, because new patterns can easily be added (or old ones removed) from an
existing automaton.
Deterministic and Non-Deterministic Automata

On the ip side, non-deterministic automata are less straightforward to implement in computer

software. To process an input symbol, all candidate states and transitions have to be considered
iteratively by a software program. is is inecient by itself and will likely break the O.1/ cost
characteristics (per input symbol) promised by regular expressions and state automata.
Most software systems resolve this ineciency by converting the generated NFA into a
deterministic nite-state automaton (DFA). In a DFA, exactly one state is active at any time, and
exactly one transition has to be followed for every input symbol. Systems can exploit this property and implement input processing as a series of lookups in a hold-state; symboli ! new-state
mapping table (e.g., realized through a hash table). An NFA can be converted into a DFA, e.g.,
by powerset construction: given an NFA with states qiN 2 S N , each subset of (active) states corN
responds to exactly one element in the powerset QD 2Q .
Figure 5.2 illustrates the result of such a conversion (we simplied the automaton by eliminating unreachable DFA states). e illustration shows that the conversion increases the number
of automaton states. Further, the correspondence to the original pattern has largely gone lost in
the conversion process.

In practice, many states in QD can be determined statically to be unreachable, so the actual number of states jQD j is typically
N
less than 2jQ j .

5.1. REGULAR EXPRESSION MATCHING

a
a

def

def
def

def
b

d
d

def

Figure 5.2: Deterministic automaton, equivalent to the non-deterministic automaton in Figure 5.1.
e automaton contains more states, but now only one of them can be active at a time.

input D a
q0
OR

input D b
q1

AND

input D c

input D d

q2
AND

q3
AND

q4
AND

.
Figure 5.3: Hardware implementation for the non-deterministic automaton in Figure 5.1. Each state
qi maps to a ip-op register; each transition maps to combinational logic between states.

5.1.2 IMPLEMENTING FINITE-STATE AUTOMATA IN HARDWARE

FPGAs can be viewed as essentially a generic implementation of a nite-state automaton in hardware. In fact, any logic circuit strictly is not more than a nite-state automaton (since the available
amount of memory is nite). As such, it is only natural to implement user-level automata using
congurable logic.
A given automaton can be mapped to an equivalent hardware circuit mechanically in the
following way (this construction assumes that " transitions have been eliminated beforehand, as
it is the case in Figure 5.1):
1. For each automaton state qi , instantiate a ip-op register FFi . Each ip-op can hold one
bit of information, which we interpret as active/inactive.
p

2. For each transition qi ! qj , instantiate combinational logic that forwards an active bit from
FFi to FFj if the condition p is satised. If the new content of FFj has multiple sources,
combine them with a logical OR.
Applying this strategy to the automaton in Figure 5.1 results in the hardware logic shown
in Figure 5.3. Observe how the construction preserves the structure of the source automaton and
how states/transitions map to ip-ops/combinational logic.

5. DATA STREAM PROCESSING

LUT consumption in %

An automaton can be mapped into hardware with this strategy whether the automaton is
deterministic or not. e ineciency that arises in software implementationsiterative processing of candidate statesdoes not apply to this hardware solution. All logic resources in an FPGA
chip operate independently, so possible transitions are naturally considered in parallel.
6
NFA
DFA (one-hot encoded)
DFA (binary encoded)

5
4
3
2
1
0 .
0

4 5 6 7 8
i in (0|1)* 1 (0|1)i

Figure 5.4: FPGA resource consumption (lookup tables) for automaton alternatives. DFAs can be
encoded with one bit per state (one-hot) or by enumerating all states and encoding them in a binary
number.

In fact, when implemented in hardware, non-deterministic automata are often the strategy
of choice. In addition to the advantages already mentioned above (correspondence with regular
expressions allows for easy construction or modication), NFAs tend to have a much simpler
structure. is makes the combinational logic to implement transitions simpler and more ecient. In Figure 5.4, we illustrated the FPGA resource demand for dierent implementation
strategies for the same regular expression (Xilinx Virtex-5 LX110T chip). Consumption of logic
resources is an important criterion in FPGA design on its own. Here the growing logic complexity may additionally lead to longer signal propagation delays, reduced clock frequencies, and
lower overall performance.

5.1.3 OPTIMIZED CIRCUIT CONSTRUCTION

Early on, alternative construction mechanisms were suggested that allow us to turn a given regular
expression into a hardware implementation directly, without explicitly building an automaton as
we saw in Figure 5.1. Such a direct construction is particularly meaningful for programmable logic,
such as programmable logic arrays (PLAs; addressed by Floyd and Ullman [1982]) or FPGAs
(which are the target of the technique of Sidhu and Prasanna [2001]).
Note that the number of states does not directly make a dierence. An NFA with jQN j states may lead to a DFA with 2jQ
states. But since only one of them can be active at any time, jQN j bits still suce to encode the current state.

5.1. REGULAR EXPRESSION MATCHING

a
q0.

q0.

b q3

d
a

c
c

b
(a) Simplied NFA to match the regular expression
b*c(a|b)*d.

(b) NFA for bc(a|b)d where all transitions incoming

to one state carry the same condition.

Figure 5.5: Rewriting nite-state automata for better FPGA resource utilization. Both automata recognize the same language, but the right one maps better to FPGA primitives.

::
:

.
OR

AND

input D x

(a) Hardware module to implement NFA.

(b) Automaton for expression bc(a|b)d, built from modules

shown on the left.

Figure 5.6: Once NFAs are in proper shape (cf. Figure 5.5(b)), they can be realized just with the
single type of module shown as (a). Figure (b) illustrates the resulting implementation of b*c(a|b)*d
(where abbreviates an OR gate and stands for an AND gate).

e utilization of FPGA logic resources can be improved if the construction is designed

to match the basic FPGA primitives. In particular, combinational logic is available in FPGAs in
the form of n-input lookup tables, which are typically paired with a ip-op register (in Xilinx
FPGAs, this makes the basic building block of an FPGA slice). A method to do so was suggested
by Yang and Prasanna [2012], Yang et al. [2008]. ey observed that the NFA for any user pattern
can be brought into a particular shape, namely such that for each state, all of its input transitions
have to match the same input symbol.
Formally, in a nite-state automaton, each input symbol xi brings the automaton from one
state qi to another, qiC1 . e new state is dened by a transition function f :
qiC1 D f .qi ; xi / :

(5.1)

Figure 5.5 illustrates this with two automata for the regular expression b*c(a|b)*d. Intuitively, the automaton in Figure 5.5(a) has lower resource consumption. But the alternative in

5. DATA STREAM PROCESSING

Figure 5.5(b) satises the constraint of having just one input symbol at the incoming transitions
of each automaton state. Automata that adhere to this shape can be realized in hardware with
just a single type of module. Such a module is shown in Figure 5.6(a) (assuming an n-ary OR gate
to accommodate ingoing transitions and an AND gate to account for the condition). As Yang and
Prasanna [2012] have shown, the combinational parts of this circuit map well to a combination
of lookup tables, which are readily paired with a ip-op in modern FPGAs. Figure 5.6(b) shows
how modules can be assembled to implement the automaton shown in Figure 5.5(b).

5.1.4 NETWORK INTRUSION DETECTION

Regular expression matching forms the heart of most publish/subscribe systems, where subscribers
register patterns that describe which parts of a stream they are interested in. A particular use case
is network intrusion detection, where the set of subscriptions is a rule set that describes known
network attacks. Yang and Prasanna [2012], Yang et al. [2008] have shown how the matching of
network packets to this rule set can be realized in FPGA hardware. Beyond design internals of
nite-state automata (as discussed above), these implementations also give a good sense of how
FPGA circuits can be optimized and how design trade-os are considerably dierent from those
in software-based systems.
Multiple-Character Matching
Consequently, state qiC2 can be obtained by applying f twice,
def
qiC1 D f .qiC1 ; xiC1 / D f f .qi ; xi /; xiC1 D F .qi ; xi ; xiC1 / ;

(5.2)

state

logic

input xi

and consuming two input symbols at once.

(a) initial situation
If realized in software, F might be dicult to construct and
require a very large lookup table. In hardware, by contrast, the conQ.
f
f
state
log.
log.
cept is rather simple to implement. Figure 5.7(a) shows a high-level
input xi
view of a hardware state automaton, grouping the circuit into state
input xi C1
(ip-op registers Q) and combinational logic that implements the
transition function f . As can be seen in the gure, f consumes the
(b) f replicated
initial state and an input symbol xi to compute a new state (as in
Equation 5.1).
Figure 5.7: Two input syme situation in Equation 5.2 can now be obtained by repli- bols per clock cycle.
cating the logic associated with f . As shown in Figure 5.7(b), this
means that the resulting hardware circuit can now consume two input symbols in a single clock
cycle. In practice, this trades higher chip space consumption for better matching throughput.
Signal Propagation Delays and Pipelining

Actual intrusion detection systems will have to match the input network stream against hundreds,
if not thousands, of rules. With the amount of chip space available in modern FPGAs, all state

5.1. REGULAR EXPRESSION MATCHING

match out

reg

RE5

RE6

RE7

RE8

reg

RE13

RE14

RE15

RE16

reg

reg
input

reg

RE1

RE2

RE3

RE4

reg

RE9

RE10

RE11

RE12

.
Figure 5.8: Pipeline registers (indicated as reg
. ) help to keep signal propagation delays low and avoid
high-fanout signals (illustration adapted from Yang et al. [2008]).

automata to match such large rule sets can be laid out side-by-side on the two-dimensional chip
area. anks to the intrinsic parallelism, all rules can then be matched fully in parallel, promising
high, rule set-independent matching speed.
In practice, however, such perfect scaling is limited by geometrical and physical eects.
As the rule set grows, the generated automata will cover a growing area on the chip, causing
the distance between components to increase (following an O.N / dependence, where N is the
number of NFAs). is, in turn, increases signal propagation delays in the generated circuit, such
that the clock frequency (and thus the achievable throughput) has to be reduced to maintain correct
behavior of the hardware circuit.
A solution to this problem is pipelining. ereby, signal paths are intercepted by pipeline
registers, which memorize their input signals from one clock cycle to the next. With shorter signal
paths, clock frequencies can be increased, with direct consequences on the observed throughput.
e price for this is a slight increase in latency. In an n-staged pipeline, the overall circuit output
(in this case a match information) is delayed by n FPGA clock cycles.
For the problem at hand, pipelining can be applied in a hierarchical fashion, as illustrated in
Figure 5.8. As can be seen in the gure, this keeps the number of stages (intuitively, the number of
pipeline registers along any path from the circuit input to its output) short, while still allowing for
a large rule set. In practice, clock cycle times are in the range 510 ns, such that pipelining causes
only negligible latency overhead (e.g., compared to the time the same packet needs to travel over
the network wire).

5. DATA STREAM PROCESSING

raw
stream

event
.
extraction
(NFA-based)

events

stream
partitioning
(state memory)

event
old
new
state

pattern
matcher
(NFA-based)

match

Figure 5.9: Complex event processing architecture. Events are extracted from raw input stream. A
partitioner component reads the corresponding state for each input event, hands it to an NFA-based
pattern matcher, then memorizes the new partition state.
Space $ roughput

Multi-character matching and pipelining both trade chip resources for better matching throughput. As a third space $ throughput trade-o, the entire matching logic can be replicated on
the chip, with incoming network packets load-balanced to either of the replicas. In practice, all
three strategies form a design space, and it depends on the hardware and problem characteristic, which combination of strategies maximizes the overall matching throughput. For a rule set of
760 patterns from the SNORT intrusion detection system, Yang et al. [2008] report an achievable
throughput of 14.4 Gb/s on a Virtex-4 LX100 FPGA device (this chip was released in 2004).

5.2

COMPLEX EVENT PROCESSING

As described above, nite-state automata allow the detection of symbol sequences in a single input
data stream. is model is adequate, e.g., to perform syntactic analyses on the stream or to match
patterns within a single message (as is the case in the above intrusion detection scenario). e true
strength of modern stream processing engines, however, comes from lifting pattern matching to
a higher semantical level. To this end, low-level events are derived from the input stream (e.g.,
through syntactic analyses). e resulting sequence of events is then analyzed according to complex
event patterns. As an example, a stock broker might want to be informed whenever the price for
any stock symbol has seen ve or more upward movements, then a downward change (pattern
up{5} down, where up and down are derived events).
e challenge in matching complex event patterns is that they usually depend on semantic
partitioning of the input events. For instance, prices for various stock symbols might arrive interleaved with one another; a matching pattern for one stock symbol might overlap with many
(partial) matches for other symbols. To detect such patterns, the stream processor needs to keep
track of the matching state for each of the stock symbols in the stream.

5.2.1 STREAM PARTITIONING

To realize semantic stream partitioning, Woods et al. [2010] devised a partitioner component that
plugs into the data ow of the stream processing system.

5.2. COMPLEX EVENT PROCESSING

input

hMSFT; 42i

events

YHOO

hGOOG; 17i

hAAPL; 29i

GOOG

AAPL

to
matcher
old
new

.
Figure 5.10: Hardware-based stream partitioning using a chain of hgroup-id; state-veci pairs. e
strategy keeps signal paths short and the design scalable.

e resulting architecture is illustrated in Figure 5.9. An NFA-based syntax analyzer extracts low-level events from the raw input stream (e.g., a network byte stream, as used for click
stream analysis in [Woods et al., 2011]). A hardware stream partitioner uses the partitioning criterion in each event (e.g., a stock symbol or a client IP address) to read out the current vector
of states from a local state memory. State vector and source event are then routed to the actual
pattern matcher which will (a) report eventual matches to the outside and (b) send an updated
state vector back to the partitioner component.

5.2.2 HARDWARE PARTITIONER

To sustain real-world input data rates, such as the line rate of a monitored network port, state
lookups and write-backs must adhere to tight performance constraints. On the one hand, this
rules out typical approaches that are optimized for average-case performance in software solutions, such as hash tables. On the other hand, the intrinsic parallelism of FPGA hardware allows
us to perform a large number of (lookup) tasks in parallel. Woods et al. [2010] showed how this
FPGA feature can be used to guarantee line-rate processing for hundreds of stream partitions,
with the only limit being the available chip space.
e idea of this approach is illustrated in Figure 5.10. hgroup-id; state-veci pairs are organized as a chain of hardware components. Incoming events (containing a group id and further
semantic content) are handed from one element in this chain to the next, one every FPGA clock
cycle. e key idea is that chain elements can be swapped under certain conditions. Swapping
means that two neighboring elements exchange their contents (state vector and group id).
In our case, whenever a group-id matches the one in the passing-by event, we swap that
element toward the end of the chain. us, when the event has been propagated through the
entire chain, the last element will contain its associated state vector. From there, both event and
state vector are handed over to the NFA that performs the actual pattern matching.

5. DATA STREAM PROCESSING

120
100

80
basic events

% packets processed

million packets/events per sec.

% processed
10

20
input packets

.
1

30
40
50
60
basic events per packet

Figure 5.11: Complex event processing in hardware guarantees line-rate performance, independent
of how events are packeted on the network wire. Results taken from Woods et al. [2011].

5.2.3 BEST-EFFORT ALLOCATION

e chain of hgroup-id; state-veci pairs denes a xed number of partitions (e.g., the number
of distinct stock symbols) that can be monitored concurrently. ese available chain elements are
allocated dynamically based on incoming events. To this end, elements are not only swapped when
they match the group id of a passing-by event, but also when they have not yet been assigned to
any group id. In Figure 5.10, for instance, the event hGOOG; 17i may have pushed the still-empty
element in the middle to its current position. As a consequence, when an event reaches the right
end of the chain, it will nd there either its matching state vector or a vacant chain element, which
will then be allocated for the respective group id.
Some input streams may exceed the hardware limit on the number of group ids (many
hundreds, in practice). In such cases, chain elements can be allocated using a best-eort strategy,
where recently seen group ids are prioritized over very old ones. Such best-eort allocation can be
achieved with a simple timeout counter associated with each chain element [Woods et al., 2010].
5.2.4 LINE-RATE PERFORMANCE
A key selling point of hardware-based solutions are their strong performance guarantees, even
under unforeseen load characteristics. In the case of monitoring or event processing applications,
the ability to process input data at full line rate is particularly desirable.

5.3. FILTERING IN THE DATA PATH

e hardware partitioning mechanism illustrated above was shown to guarantee such linerate performance, as illustrated in Figure 5.11. Software-based alternatives are often very sensitive
to the way events are presented at the input of the system. If events are sent as very small Ethernet
packets, for instance, most software solutions become overloaded because of their high per-packet
processing overhead. As the gure shows, hardware-accelerated complex event processing is not
sensitive to this problem and guarantees 100 % line rate performance independent of event packeting.

5.3

FILTERING IN THE DATA PATH

Up until now we had looked at the FPGA as an isolated component. It would listen to an incoming stream, but we left unspecied what kind of output or action is being generated by the
programmable hardware. In practice, most applications will be too complex to be solved entirely
in FPGA hardware, which suggests hybrid FPGA/CPU processing to jointly solve the application
task. Ideally, a hybrid system design would also allow for a transition path, where performancecritical functionality is o-loaded one-by-one without major disruptions on the user-level interface.
A system architecture that can satisfy all these criteria is when the FPGA is plugged into
the data path of the processing engine:

source
.

ltered
raw data

CPU

FPGA

data
In such an architecture, the FPGA consumes the raw input datausually high-volumeand
applies ltering or other pre-processing steps that reduce the volume of data. e remaining lowvolume data set is forwarded to a conventional CPU, where, for instance, it could be fed into the
processing pipeline of an existing system. A practical example of this model could be an FPGA
that applies selection and projection to a large data set while it is read from a hard disk (i.e.,
source disk). Netezza commercialized this concept in their Netezza Performance Server (NPS)
system.
A data path architecture elegantly separates the application task to match the strengths of
both parts of an FPGA/CPU hybrid. FPGAs are extremely good when the task to perform is
relatively simple, but the data volume is huge. Conversely, sophisticated control logic in generalpurpose processors makes them very ecient at complex operations; but their I/O capabilities
and their energy eciency fall way behind those of modern FPGA devices.

5.3.1 DATA PATH ARCHITECTURE IN THE REAL WORLD

Architecting an FPGA/CPU hybrid along its data path ts well also with needle in a haystack
application patterns that are becoming dominant in many business domains. Event processing

5. DATA STREAM PROCESSING

systems need to sift through large amounts of dynamic data, but typically only few events are
actually relevant for further processing. High expectations toward ad hoc querying force data
analytics engines to execute most of their work as brute force scans over large data volumes (Unterbrunner et al. [2009] describe a ight booking system as a concrete example).
Electronic stock trading is a prototype example of how FPGAs can signicantly accelerate
existing application systems or even enable new market opportunities that will remain hard to
address with software-only solutions even in upcoming CPU architectures. e challenge here
is a very high input data volume, combined with uniquely tight latency requirements. Any improvement in latencyon a micro-second scale!will bring a competitive advantage that may be
worth millions of dollars [Schneider, 2012].
High-Frequency Trading. In high-frequency trading, stock traders monitor market price information. Automated systems buy or sell stocks within fractions of a milli-second to, e.g., benet
from arbitrage eects. ese systems usually focus on a very particular market segment, stock
symbol, or class of shares. If the input stream from the stock markettypically a high-volume
stream with information about a large market subsetis pre-processed on an FPGA and with
negligible latency, the core trading system can focus just on the relevant parts of the market, with
lower latency and better forecasting precision.
Risk Management. In response to incidents on the stock market, where erroneous trades by
automated systems have led to severe market disruptions (e.g., the May 6, 2010 Flash Crash or
a technology breakdown at Knight Capital Group on August 1, 2012), the American SEC began
to impose risk management controls on stock brokers. Today, any broker with direct market access
has to sanity-check all orders sent to the stock market to prevent unintended large-scale stock
orders (SEC Rule 15c3-5).
Software alone would not be able to evaluate all SEC criteria without adding signicant
latency to the trading process. But if the control program listens in to the trading stream through
an FPGA with pre-processing capabilities, risk evaluation can be performed with a latency of only
a micro-second or less [Lockwood et al., 2012].

5.4

DATA STREAM PROCESSING

For certain application areaswe discussed network monitoring and electronic stock trading
hereFPGAs oer signicant advantages in terms of performance, but also in terms of their
energy eciency. Ideally, these advantages would carry over to more general cases of stream or
database processing.
is is exactly the goal of the Glacier system [Mueller et al., 2009, 2010]. Glacier is a compiler that can translate queries from a dialect of SQL into the VHDL description of an equivalent
hardware circuit. is circuit, when loaded into an FPGA, implements the given query in hardware and at a guaranteed throughput rate.

5.4. DATA STREAM PROCESSING

Table 5.1: Streaming algebra supported by the Glacier SQL-to-hardware compiler (a; b; c : eld
names; q; qi : sub-plans; x : parameterized sub-plan input). Taken from [Mueller et al., 2009].

operator
a1 ;:::;an .q/
a .q/
? aW.b1 ;b2 / .q/

q1 [ q2
aggbWa .q/
q1 grpxjc q2 .x/
q1 txjk;l q2 .x/
q1 . q2

semantics

projection
select tuples where eld a contains true
arithmetic/Boolean operation a D b1 ? b2
union
aggregate agg using input eld a, agg 2 favg; count; max; min; sumg
group output of q1 by eld c , then
invoke q2 with x substituted by the group
sliding window with size k , advance by l ; apply q2 with x substituted
on each window; t 2 ftime; tupleg: time- or tuple-based
concatenation; position-based eld join

.
tuple

xj4;1
SELECT avg (Price) AS avgprice
FROM (SELECT * . FROM Trades
WHERE Symbol = "UBSN")
[ SIZE 4 ADVANCE 1 TUPLES ]
(a) Example sliding-window query.

avgavgpriceW.Price/

D aW.Symbol;"UBSN"/

Trades
(b) Corresponding algebra plan.

Figure 5.12: Query compilation in Glacier. Streaming SQL queries are converted into an algebraic
representation, then compiled into a hardware circuit.

e heart of the Glacier compiler operates on a streaming algebra that assumes tuplestructured input events. is algebra, shown in Table 5.1, is sucient to express particularly those
aspects of a user query that can be realized as a pre-processing step in the sense of the data path
architecture discussed above. Operators can be nested arbitrarily. Following the usual design of a
database-style query processor, SQL queries stated by the user are rst converted into an internal
algebraic form, then translated to VHDL code. Figure 5.12 illustrates an example adapted from
Mueller et al. [2009] and its corresponding algebraic plan.

5. DATA STREAM PROCESSING

data_
valid

5.4.1 COMPOSITIONAL QUERY COMPILATION

Glacier derives its power from the ability to translate the aforementioned algebra in a fully compositional manner. To this end, the hardware equivalent for every compiled algebra (sub-)expression
follows the same wiring interface, which is sketched in Figure 5.13. Each n-bit-wide tuple
is represented as a set of n parallel wires on the hardware side. An additional data_valid
line signals the presence of a tuple in the remaining wires. For every
algebraic operator , Glacier knows a set of compilation rules that produces the wiring interface in Figure 5.13, provided that the operands
n wires
of are already compiled according to that interface.

Figure 5.14 illustrates two such compilation rules. e rule on
q
.
the left (translating the selection operator ) uses an AND gate to set
the data_valid line to false if the selection condition is not met, thus
Figure 5.13: Wiring in- invalidating the respective tuple. e projection operator eectively
terface in Glacier.
drops columns from the relation schema. In hardware, this can be
achieved by simply not wiring up the respective signals to any upstream
operator, as shown in the right rule in Figure 5.14.
Z)

a .q/

a
q

a1 ;:::;an .q/

Z)
.

a1 an
q

Figure 5.14: Glacier compilation rules for selection (left) and projection (right).

is compilation rule for algebraic projection illustrates an interesting interplay between

query compilation into a logical hardware circuit and the subsequent compilation stages (synthesis, map, and place & route) that generate the actual physical circuit. For many reasons, loose
wires often occur in generated hardware circuits. FPGA design tools thus optimize out all subcircuits whose output is not connected to any outside port. Glacier s compilation rule for projection
takes advantage of this feature. Eectively, Glacier o-loads projection pushdowna well-known
database optimization strategyto the FPGA design tools.
Push-Based Execution Strategy

Glacier -generated hardware execution plans follow a strictly push-based execution strategy. at
is, individual sub-plans write their output into a single register set, from where they assume an
upstream operator will pick up the result immediately in the next clock cycle.
Such a strategy is adequate for execution in hardware, because a circuits runtime characteristics can be inferred statically at circuit generation time with very high accuracy. More specifNote that in the Glacier algebra, operates on Boolean columns only. Complex selection criteria must be made explicit by
applying, e.g., arithmetic or Boolean operations beforehand.

5.4. DATA STREAM PROCESSING

ically, the latency, i.e., the number of clock cycles needed by an operator to present the result of
a computation at its output port, and the issue rate, i.e., the minimum gap (in clock cycles) between two successive tuples pushed into an operator, are precisely dened by the structure of the
hardware circuit. Glacier submits the generated hardware description together with the desired
clock frequency (e.g., sucient to meet the line rate of a network data stream) to the FPGA tool
chain, which will verify that the generated circuit can meet the frequency requirements. If a circuit cannot meet its requested throughput rate, the situation will be detected at compile time and
the user demand rejected by the Glacier system.
For most query types, Glacier can maintain an issue rate of one tuple per clock cycle [Mueller et al., 2009] (which in practice signicantly eases timing and interfacing with the
respective stream source). Typical queries roughly compile into a pipeline-style query plan, which
means that the latency of the generated plan depends linearly on the query complexity. In practice, the resulting latencies are rarely a concern; for clock frequencies of 100200 MHz, a few
cycles (typically less than a hundred) still result in a latency of less than a micro-second.
Resource Management

Glacier draws its predictable runtime performance from statically allocating all hardware resources
at circuit compilation time. Each operator instance, for example, receives its dedicated chip area
and no resources are shared between operators. In eect, the compiler lays out a hardware plan
on the two-dimensional chip space with a structure that resembles the shape of the input algebra
plan. Processed data ows through this plan, resulting in a truly pipelined query execution.
Resource management becomes an issue mainly in the context of stateful operators such as
grouping (aggregation) or windowing. Glacier assumes that the necessary state for such operations
can be determined at query compilation time (e.g., with knowledge about group cardinalities or
from the combination of window and slide sizes in the case of windowing operators). Glacier then
replicates dependent sub-plans and produces a dispatch logic that routes tuples to any involved
replica.
Figure 5.15 illustrates this mechanism for the query shown earlier in Figure 5.12. e botD combination (this part will
tom part of this generated hardware execution plan reects the -
result in strictly pipelined processing). e upper part contains ve replicas of the avg sub-plan
(the windowing clause 4; 1 allows at most four windows to be open at any time) and dispatch
logic to drive them. For windowing operators, the dispatch logic consists of cyclic shift registers
(CSRs) that let windows open and close according to the windowing clause given.
In the case of windowing clauses, Glacier will route tuples to sub-circuits for all active windows and typically many of them are active at a time. For grouping operations, by contrast, tuples
must be routed to exactly one group. Glacier still uses circuit replication to represent grouping in
hardware (and a content-addressable memory (CAM) to look up matching groups). If replicaGlacier does allow a restricted form of multi-query optimization. Because of the strictly push-based processing model, identical
sub-plans can be shared across queries.

5. DATA STREAM PROCESSING

5-way union
0

eos

&
0

eos

avgprc

avg

&
1

eos

avgprc

avg

eos

avgprc

avg

CSR2

CSR1

counter
.

adv

avgprc

avg

eos

avgprc

avg

a
a

Symbol

a
D

"UBSN"

Trades

D aW.Symbol;"UBSN"/

Figure 5.15: Hardware execution plan for the query in Figure 5.12(a). Glacier replicates the sub-plan
of the windowing operator . A combination of two cyclic shift registers (CSR) routes tuples to the
right replica(s). Illustration adapted from [Mueller et al., 2009].

tion is not desirable (or not feasible, e.g., because of hardware resource constraints), the hardware
partitioning mechanism discussed in Section 5.2.2 could be used to memorize group states and
create only one circuit instance to serve all groups.
Hash Tables in Hardware. For even larger groups, circuits may have to resort to external memory to keep group states. A hash table would be the obvious choice for a data structure for this
purpose. e crux of hashing is, however, its unpredictable performance. Hash collisions may
require multiple round trips to memory, leading to a response time dependence on key distributions.
Kirsch and Mitzenmacher [2010] describe and analyze hashing schemes for hardware implementations. If used correctly, multiple-choice hashing schemes can signicantly reduce the probability of hash collisions. And once collisions have become very rare (Kirsch and Mitzenmacher
[2010] report probabilities below one in a million entries), a small content-addressable memory
(CAM), installed side-by-side to the hash table, is enough to capture them. is way, predictable
performance can be achieved with minimal hardware resource consumption.
In multiple-choice hashing schemes, every key has multiple locations where it could be
placed in memory. Cuckoo hashing [Pagh and Rodler, 2001] is a known software technique based
on multiple-choice hashing to guarantee constant-time lookups. When realized in hardware, all

5.5. DYNAMIC QUERY WORKLOADS

possible locations of an entry can be tested in parallel, hence avoiding one of the down sides of
multiple-choice hashing.

5.4.2 GETTING DATA IN AND OUT

e design of Glacier, but also that of other FPGA-based stream processors, assumes that data can
be consumed and produced in a particular format dened by the processing engine. For instance,
Glacier assumes that data arrives as a parallel set of wires, with a data_valid signal indicating the
presence of data (see Section 5.4.1) and the same data format is produced as the engines output.
us, to interface with application-level stream formats, glue logic must be placed around
hardware query plans and translate between formats:
external
format

glue .logic
(de-serialize)

internal
format

hardware
query plan

internal
format

glue logic
(serialize)

external
format

FPGA
Incoming data is being de-serialized to the internal format of the hardware execution plan. e
produced query output is brought into an external wire format before it leaves the FPGA chip
again.
Observe how glue logic usually requires some additional chip area. Interfacing with outside
data formats does not, however, usually lead to a noticeable runtime overhead on the query processing task. Assuming that glue logic components can keep up with the stream data rates, they
cooperate with the hardware plan in a pipelining fashion. e additional latency (order of a few
FPGA cycles) is typically negligible.
Implementing a de-serialization component can be a tedious task. Conceptually, it resembles the writing of a software parser. But unlike in the software world, very few tools (such as
(f)lex or JLex for C/Java) exist to generate hardware parsers from high-level language specications. One tool to generate VHDL code from regular language grammars is Snowfall [Teubner
and Woods, 2011]. Similar in spirit to compiler generators for software parsers, Snowfall allows
a grammar specication to be annotated with action code. At runtime, this action code may, for
instance, drive signal wires depending on the syntactical structure of the input stream.
As in the software domain, serializing the output of a query processor into an application format usually amounts to a sequential program (implemented via a small, sequential state
machine) that emits eld values as required by the format.

5.5

DYNAMIC QUERY WORKLOADS

All of the systems and strategies discussed above assume a processing model that is illustrated in
Figure 5.16 (assuming the Glacier query compiler). In this model, a query compiler generates a
description for a dedicated hardware circuit (e.g., using VHDL) that implements the input query.

5. DATA STREAM PROCESSING

user
.
query

Glacier
compiler

VHDL
code

FPGA
tools

bitstream

FPGA
chip

Figure 5.16: Processing model for FPGA-accelerated query processing. A query-to-hardware compiler (e.g., Glacier ) compiles the user query into a circuit description. Standard FPGA design tools
generate a bitstream from that, which is uploaded to the FPGA chip.

FPGA design tools, such as Xilinx ISE convert this description into a runnable bitstream that is
uploaded to the FPGA for execution.
Intuitively, this model oers high exibility, allows maximum performance (by fully retailoring the chip for every workload change), and leverages the re-programming capabilities
of FPGA devices in a meaningful way. is intuition, however, is overshadowed by a massive
compilation overhead that the approach brings. e conversion of high-level circuit descriptions
into a runnable bitstream is highly compute intensive; the vpr program in the SPEC CPU2000
benchmark suite even uses the place & route task to measure CPU performance. By comparison,
compilation to VHDL and re-conguration of the FPGA can be performed quickly.

5.5.1 FAST WORKLOAD CHANGES THROUGH PARTIAL MODULES

A countermeasure for high circuit compilation cost was suggested by Dennl et al. [2012] and
applied to a Glacier -like scenario. e idea is to build hardware execution plans from a set of hardwired components, but composing and wiring them in exible ways, tailored to the particular
user query at hand. To this end, all components adhere to a very strict port interface, so arbitrary
components can be mixed and matched as needed.
Interfacing Between Components. e wiring interface used in Glacier, illustrated in Figure 5.13, is not suited for such a use. e width of the tuple data part of this interface depends
on the schema of the (sub-)expression q . If turned into generic components, those would have to
have a xed tuple width. Limited chip resources rule out straightforward workarounds, such as
over-provisioning the wiring interface for very wide tuples.
Dennl et al. [2012] thus forward tuple data in chunks of xed size (which is chosen to match
the processing word width). at is, tuples ti D hai ; bi ; ci ; di i are pushed through the series of
e latter is only bottlenecked by a limited conguration bus bandwidth, which can be mitigated, e.g., by compressing the
bitstream before uploading it to the chip [Li and Hauck, 2001]. Multi-context FPGA devices oer a double buering mechanism for conguration; Tabulas 3d programmable logic devices allow us to switch between congurations at multi-gigahertz
rates [Tabula, Inc., 2010].

5.6. BIBLIOGRAPHIC NOTES

processing components in a pipeline fashion:

b.3

op1

op2

op3

d1 :

(e illustration assumes an execution plan of three operators. e rst of them is just seeing the
last attribute of tuple t2 .) In this representation, the port width between neighboring components
is constant and tuples of arbitrary width could be processed.
Conguring Modules. Minimal compilation cost is achieved by using pre-compiled modules to
instantiate the opi in the above illustration. To guarantee the necessary exibility, most modules
will be generic operators that can be parameterized to t the particular operator requirements
and the given input/output tuple schemata. In the implementation of Dennl et al. [2012], conguration parameters are communicated over the same data path as the payload data. Additional
handshake signals ensure that both uses are handled as appropriate.
A particular conguration parameter are tuple eld names where hardware operators read
their input from or, equivalently, the chunk index where the respective eld can be found within
the input stream. For operator results (e.g., the outcome of an arithmetic or Boolean operation),
Dennl et al. [2012] use an in-band transfer mechanism; results are simply written into available
chunks of the data stream. To make this possible, the input data stream is interspersed with spare
chunks that can be used for that purpose.
Runtime Reconguration. Once the interfaces between hardware components are xed, the
mix & match idea of Dennl et al. [2012] may even be used in combination with partial reconguration. With some restrictions, this feature of modern FPGA chips allows us to swap sub-circuits
of a larger hardware design in and out at runtime. Such a strategy is best suited if multiple queries
are run concurrently on a single FPGA chip. Queries can then be added/removed from the chip
without a need to stop running queries for the reconguration.
e price to pay for the partial reconguration capability is that the size of all components
must t the granularity of a reconguration frame (a device-specic value; 20/40 congurable
logic blocks for Xilinx Virtex-5/6 devices). According to Dennl et al. [2012], this overhead is
bearable and amounts to about 30 % lost chip space.

5.6

BIBLIOGRAPHIC NOTES

FPGAs have been used for regular expression matching in a number of scenarios. Clark and
Schimmel [2004] suggested their use for network packet analysis. Mitra et al. [2009] showed
how XML publish/subscribe systems could be accelerated with FPGA-based pattern matching.
Later, they rened their approach to handle a larger class of twig-based XML patterns [Moussalli
et al., 2011]. Sadoghi et al. [2011] used FPGAs for a very similar application scenario, but their

5. DATA STREAM PROCESSING

system is not based on state automata and regular expression matching. Rather, they break down
their input data into attribute/value pairs and match them through their Propagation algorithm.
Vaidya et al. [2010] extended the Borealis stream processing engine [Abadi et al., 2005] and
accelerated a use case where trac information obtained through a video channel is pre-processed
using dedicated image processing logic on the FPGA. e design of this Symbiote system is such
that partial or entire plan trees can be migrated to the FPGA or handled on the CPU.
e data path concept of Section 5.3 resembles the idea of database machines. In this line
of research, various groups built special-purpose hardware that could pre-lter data as it is being
read from persistent storage. Most notable here is the D engine of DeWitt [1979].
Outside the database and stream processing world, researchers have very successfully used
FPGAs in, e.g., scientic applications. e use of FGPAs at CERNs Large Hadron Collider
(LHC) essentially follows the data path architecture that we looked at on page 61. Data rates
of several terabits per second, produced by the particle accelerator, are way above what could
be processed and archived on commodity hardware. us, FPGA-based triggers pre-analyze the
high-volume input stream, so only relevant information gets forwarded to the main processing
ow. Gregerson et al. [2009] illustrate this for (parts of ) the Compact Muon Solenoid (CMS)
Trigger at CERN. ere, an FPGA-based lter reduces a 3 Tb/s input stream to manageable
100 Mb/s.

CHAPTER

Accelerated DB Operators
In the previous chapter, we illustrated various ways of applying FPGAs to stream processing
applications. In this chapter, we illustrate that FPGAs also have the potential to accelerate more
classical data processing tasks by exploiting various forms of parallelism inherent to FPGAs. In
particular, we will discuss FPGA-acceleration for two dierent database operators (i) sort and (ii)
skyline.

6.1

SORT OPERATOR

Sorting is a fundamental operation in any database management system. Various other database
operators such as joins or GROUP BY aggregation can be implemented eciently when input
tuples to these operators are sorted. However, sorting is a rather expensive operation that can
easily become the bottleneck in a query plan. Hence, accelerating sorting can have great impact
on overall query processing performance in a database. In this section, we will discuss a number
of dierent approaches to sort small as well as large problem sets with FPGAs.

6.1.1 SORTING NETWORKS

We will start this chapter with a simple example, where only very small data sets are sorted in
hardware, consisting of, say, only eight values. Mueller et al. [2012] investigated how to do this on
FPGAs using so-called sorting networks. is example will underline once more the fundamental
dierences of laying out an algorithm in hardware versus executing a corresponding software
program on a microprocessor.
e key idea of sorting networks is to stream unsorted data
through a series of compare-and-swap elements, as the one il<
lustrated on the right. is will bring the data into the desired
sort order. e compare-and-swap element is constructed from a
32
32-bit comparator, which drives two cross-coupled 32-bit multi- A
0
A
1
plexers, i.e., if input B is smaller than input A both multiplexers
will select the other value and hence the values will be swapped.
32
In Figure 6.1, eighteen such compare-and-swap elements B
0
B
1
(illustrated as .) are combined to form an eight-way sorting network. We display the new values on the wires in gray whenever
a swap happens, e.g., at the rst stage four pairs of 32-bit wire buses with values (5,8), (3,1),
.
(2,7), and (4,6) are processed in parallel and only the values of the third and fourth bus need to

6. ACCELERATED
. DB OPERATORS

1
3

2
8

2
7

5
6

4
6

6
7

5
6

Figure 6.1: Eight-way even-odd sorting network using 18 compare-and-swap elements.

Figure 6.2: Fully pipelined even-odd sorting network with six pipeline stages.

be swapped: (3,1) ! (1,3). After all values have propagated through all the compare-and-swap
elements they appear in sorted order at the output of the circuit, i.e., the smallest value is at the
topmost output and the largest value appears at the bottom output.
ere are several ways that compare-and-swap elements can be arranged to build a sorting network. e arrangement shown here is known as even-odd sorting network. Mueller et al.
[2012] also discuss other sorting networks in detail such as bitonic sorting networks or networks
based on bubble sort. Dierent arrangements of the compare-and-swap elements mostly aect resource consumption and ease of implementation, and we will not discuss them any further here.
One problem with the circuit depicted in Figure 6.1 is that the longest signal path has to
traverse six compare-and-swap elements. e maximum clock frequency at which a circuit can be
clocked is determined by the longest signal path of a circuit. By inserting pipeline registers into the
circuit the longest signal path can be shortened. In Figure 6.2, we illustrate such pipeline registers

6.1. SORT OPERATOR

sorted runs

merged run

0
1

select-value component
Figure 6.3: Select-value component merges two sorted runs into a larger sorted run.

.
with gray boxes. With the pipeline registers in place, every signal now only needs to traverse one
compare-and-swap element to the next pipeline register. At every clock tick, intermediate states
are stored in the pipeline registers, allowing the circuit to partially process six dierent data sets
concurrently in six dierent stages.
Mueller et al. [2012] report that a similar pipelined circuit to the one shown in Figure 6.2
could be clocked at fclk D 267 MHz. Since the circuit processes 8 32 bits per clock cycle, a data
processing rate of 8.5 GB/s was achieved. A sorting network with twice as many inputs (i.e., 16
32-bit words) at the same clock rate would double the throughput. However, it quickly becomes
dicult to move data in and out of the FPGA at these high processing rates and the complexity
of the sorting circuits exponentially increases with more inputs. us, sorting networks are a very
ecient way to sort small sets of values that could be used, e.g., to implemented a hardware
accelerated sort instruction of a closely coupled microprocessor to sort SIMD registers.

6.1.2 BRAM-BASED FIFO MERGE SORTER

e sorting network example of the previous section suggests a
tightly coupled CPU/FPGA architecture, where the FPGA is
<
invoked at the granularity of individual instructions. If we could
o-load bigger chunks of data to be completely sorted by the
32
FPGA, this would reduce communication between FPGA and A 32
0
out
1
CPU, and the CPU would have more time to take care of other B
tasks while the FPGA is sorting. Koch and Torresen [2011]
proposed to use BRAM-based FIFO queues to implement a merge sorter that can eciently sort
larger problem sizes inside the FPGA.
At the heart of the merge sorter a select-value component is used, which is illustrated on the
right. e select-value component is half of the compare-and-swap component of the previous
.
section, i.e., two values A and B are compared and the smaller one is selected. e select-value
component can be used to merge two sorted runs into a larger sorted run, as depicted in Figure 6.3.
e inputs to the select-value component A and B are read from two FIFOs, where each
FIFO stores a sorted run. e smaller of the two inputs is selected and written to an output

6. ACCELERATED DB OPERATORS
>

0
1

1
2

0
1

1
2

0
1

Figure 6.4: A cascade of FIFO merge sorters used to produce large sorted runs.

.
FIFO. en the next value is requested from the input FIFO that has submitted the smaller
value. Figure 6.4 shows how a cascade of such merge sorters can be combined to create larger and
larger sorted runs. e output of the select-value component is directed to a rst output FIFO
until it is lled and then redirected to a second output FIFO. ose output FIFOs then serve as
inputs to the next select-value component.
Koch and Torresen [2011] observed that at each stage, processing can start after the rst
input FIFO has been lled and the rst value of the second FIFO arrives. Furthermore, the selectvalue component always only reads from one of the two input FIFOs and the result is written to
only one of the two output FIFOs. Hence, the overall ll level of the FIFOs is constant. erefore,
the second FIFO is not strictly necessary. Koch and Torresen [2011] showed how to get rid of
the second FIFO. To do so, however, they had to build a custom FIFO based on a linked list
structure to deal with the two read and two write pointers within a single FIFO.
Koch and Torresen [2011] evaluated the FIFO merge sorter described above. Using 98% of
the available BRAM it was possible to sort 43,000 64-bit keys (344 KB) in a single iteration, i.e.,
by streaming the keys through the FPGA once. e circuit could be clocked at fclk D 252 MHz
resulting in a throughput of 2 GB/s.

6.1.3 EXTERNAL SORTING WITH A TREE MERGE SORTER

To sort even bigger data sets that exceed the capacity of on-chip BRAM storage, runs need to be
stored outside the FPGA, e.g., in DRAM on the FPGA card. Nevertheless, merging of several
external runs can still be performed in a streaming manner inside the FPGA, i.e., as the FPGA
reads from several smaller runs stored in DRAM, it can continuously write tuples of the merged
result back to DRAM to generate larger runs, or back to the host (e.g., via PCI-Express) for the
nal run.
To merge several runs into one larger run a merge sorter tree can be used, e.g., as the one
illustrated in Figure 6.5 on the left, which merges eight runs. e tree is constructed from several
select-value components, represented by the gray boxes in the gure. DRAM latency can be hidden
by placing small FIFOs between the tree and external memory. A load unit then monitors the
ll-level of all FIFOs and coordinates DRAM read requests accordingly. To avoid clock speed
degradation, large merge sorter trees need to be pipelined, e.g., by adding tiny FIFOs between
every level of the tree.

6.1. SORT OPERATOR

memory

load unit

host
FIFO
merge
sorter

tree
merge
sorter

Figure 6.5: Tree merge sorter (left), and a combination of a FIFO merge sorter with a tree merge
sorter connected to external DRAM (right).

On the right-hand-side of Figure 6.5 a combination of the FIFO merge sorter, described
in the previous section, and a tree merge sorter is illustrated. In a rst step, unsorted data is
streamed through the FIFO merge sorter on the FPGA, which generates initial runs that are
stored in external DRAM. en these runs are read back from DRAM and merged by a tree
merge sorter before the nal result is sent back to the host. If the data set is so large that one
pass through the tree merge sorter is not enough then multiple round trips to DRAM would be
necessary.
Koch and Torresen [2011] report that the tree-merge-sorter achieved a throughput of
1 GB/s on their FPGA, and could merge up to 4.39 million keys (35.1 MB) in one pass. However,
this measurement assumes that the entire FPGA can be used for merging. For a conguration
like the one in Figure 6.5 on the right, where both a FIFO merge sorter and a tree merge sorter
need to t on the same FPGA 1,08 million keys (8.6 MB) could be sorted at a throughput of
1 GB/s.

6.1.4 SORTING WITH PARTIAL RECONFIGURATION

For a conguration as in Figure 6.5 on the right, where a FIFO merge sorter needs to share the
available FPGA resources with a tree merge sorter, each module can only solve half the problem size
per run. However, notice that the two modules do not run in parallel, i.e., rst the FIFO merge
sorter generates initial runs in DRAM, and once those runs are generated, they are merged by
the tree merge sorter.
To increase device utilization, an alternative is to rst use the entire FPGA for the FIFO
merge sorter, and then use dynamic partial reconguration to load the tree merge sorter after the
FIFO merge sorter has terminated, as illustrated in Figure 6.6.
With this approach, both the FIFO merge sorter and the tree merge sorter can sort larger
problem sizes in one pass. However, the time required for reconguration needs to be considered

6. ACCELERATED DB OPERATORS
memory

host

dynamic partial
reconfiguration

FIFO
merge
sorter

host

memory

tree
merge
sorter

Figure 6.6: Using dynamic partial reconguration to rst run the FIFO merge sorter and then the
tree merge sorter in the same partially recongurable region.

as well. e conguration time of a partially recongurable region is directly proportional to its

size. Koch and Torresen [2011] measured roughly 3 MB conguration data for the sorter modules.
At a conguration speed of 400 MB/s, this means that during the time it takes to swap in the tree
merge sorter module, the FIFO merge sorter module could have sorted 5 3 MB of data because
of the ve times higher throughput (2 GB/s).
us, dynamic partial reconguration only pays o for large problem sizes. For a sorting
problem with 4 million keys Koch and Torresen [2011] measured a conguration overhead of
46 %, while for 448 million keys (3.58 GB) the conguration overhead is negligible. To sort this
large data set, one run of the FIFO merge sorter and two runs of the tree merge sorter were
necessary resulting in an overall throughput of 667 MB/s.

6.2

SKYLINE OPERATOR

Skyline queries compute the Pareto-optimal set of multi-dimensional data points. ey are a good
example of a complex database task that can greatly benet from FPGA acceleration due to their
compute-intensive nature, especially when dealing with higher dimensions. Formally, the skyline
of a set of multi-dimensional data points is dened as follows:
A tuple ti dominates () another tuple tj i every dimension of ti is better than
or equal to the corresponding dimension of tj and at least one dimension of ti is strictly better than
the corresponding dimension of tj .
Denition 6.1

Denition 6.2 Given a set of input tuples I D ft1 ; t2 ; : : : tn g, the skyline query returns a set of
tuples S , such that any tuple ti 2 S is not dominated by any other tuple tj 2 I .
Here, better means either smaller or larger depending on the query.

6.2. SKYLINE OPERATOR

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

foreach tuple qi 2 queue do

isDominated D false;
foreach tuple pj 2 window do
if qi :timestamp > pj :timestamp then
output(pj );
window.drop(pj );

/* pj is part of the skyline */

else if qi pj then
window.drop(pj );

/* pj is dominated by qi */

else if pj qi then
isDominated D true;
break;

/* qi is dominated by pj */

if not isDominated then

timestamp(qi );
if window.isFull() then
queue.insert(qi );

/* qi is a potential skyline tuple */

else
window.insert(qi );

/* add qi to the window later */

/* there is space in the window */

Figure 6.7: Standard Block Nested Loops (BNL) Algorithm ( means dominates).

6.2.1 STANDARD BLOCK NESTED LOOPS (BNL) ALGORITHM

FPGAs exhibit tremendous aggregated compute power but they can only keep a limited amount
of state in the chip during computation, as we have already seen in Section 6.1. In the case of
skyline queries, it is possible that the intermediate state or even the nal result exceed the capacity
of current FPGAs. Woods et al. [2013] implemented a highly parallel variant of the block nested
loops (BNL) [Brzsnyi et al., 2001] algorithm on an FPGA. e BNL algorithm is appealing
for an FPGA implementation because it was designed exactly for this scenario, although in a
software context, where all data does not t into main memory.
In Figure 6.7, the BNL algorithm is given in pseudocode. All input tuples are stored in a
queue and then compared one by one against a window of w potential skyline tuples. If an input
tuple is dominated by a potential window tuple, it is discarded. On the other hand, all window
tuples that are dominated by an input tuple are removed from the window.
After an input tuple has been compared against all potential skyline tuples in the window,
the input tuple itself is either inserted into the window as a potential skyline tuple (when there
is room), or else it is inserted into the queue of input tuples again for later processing. In both
cases, the tuple is timestamped. A potential window tuple pi becomes a true skyline tuple when

6. ACCELERATED DB OPERATORS

n-dimensional tuple

PE 0

message channels

PE 1

PE 2

Figure 6.8: Window tuples (consisting of several dimensions) are distributed over a pipeline of processing elements. Neighboring processing elements are connected via 32-bit message channels.

it either encounters the rst tuple qj from the input queue that has a larger timestamp or when
the input queue is empty. A larger timestamp indicates that two tuples must have already been
compared and since the queue is totally ordered, all following tuples in the queue will also have
larger timestamps. e algorithm terminates when the input queue is empty.

6.2.2 PARALLEL BNL WITH FPGAS

In the FPGA implementation of Woods et al. [2013], the queue of input tuples is stored in
DRAM mounted on an FPGA card, i.e., outside of the FPGA chip, and only the window of
potential skyline tuples is maintained inside the FPGA. Result tuples (i.e., skyline tuples) are also
written back to DRAM. Hence, input tuples are streamed from the DRAM through the FPGA
and back to DRAM. Several iterations may be necessary until the algorithm terminates.
Pipeline of Processing Elements (PEs)

In BNL, each input tuple needs to be compared against all potential skyline tuples stored in the
window. is is an expensive process since the window may consist of several hundred tuples. To
achieve high throughput only a minimal number of clock cycles should be spent on each input
tuple before the next tuple is read from the queue. us, Woods et al. [2013] propose to distribute
the window of potential skyline tuples over a pipeline of daisy-chained processing elements, as
illustrated in Figure 6.8.
A processing element stores a single tuple of the window. An input tuple is submitted to the
rst processing element in the pipeline from where it is forwarded to the neighboring processing
element after evaluation via the specied message channel. us, w processing elements operate
on the window concurrently such that w dominance tests are performed in parallel, where w is
the size of the window.
Causality Guarantees

e processing elements are organized in a way that input tuples are evaluated in a strictly feedforward oriented way. is has important consequences that can be exploited in order to parallelize
the execution over many processing elements while preserving the causality of the corresponding
sequential algorithm.

6.2. SKYLINE OPERATOR

?
.

PE d

?
PE h

Figure 6.9: Causality guarantees. e earlier xi will see no eects caused by the later xj but xj sees
all eects of xi .

Feed-forward processing implies that the global working set is scanned exactly once in a
dened order. What is more, once an input tuple xi has reached a processing element h , its evaluation cannot be aected by any later input tuple xj that is evaluated over a preceding processing
element d (conversely, the later xj is guaranteed to see all eects caused by the earlier xi ).
ese causality guarantees hold even if we let the executions of xi on h and xj on d
run in parallel on independent compute resources, as illustrated in Figure 6.9. For example, once
an input tuple xi reaches the last processing element, it can safely be assumed that it has been
compared against all other working set tuples and appropriate actions can be invoked.
Parallel BNL as Two-Phase Algorithm

In summary, the parallel version of BNL works as follows. Input tuples propagate through the
pipeline of processing elements and are evaluated against a dierent window tuple at every stage
in the pipeline. If at one point an input tuple is dominated, a ag is set that propagates through
the pipeline together with every tuple. On the other hand, if a window tuple is dominated it is
deleted. When an input tuple reaches the last processing element, and was not dominated by any
window tuple (indicated by the domination ag), the tuple is timestamped and then inserted into
the window if there is space, or written back to DRAM otherwise. Notice that new potential
skyline tuples can only be inserted into the window at the last processing element to ensure that
they have been compared to all existing window tuples. is means that free slots in the window
that occur when window tuples are dominated need to propagate toward the end of the pipeline.
To enforce this, neighboring processing elements can swap their contents.
e processing elements execute the algorithm just described in two phases: (i) an evaluation phase and (ii) a shift phase. During the evaluation phase, a new state is determined for each
processing element; but these changes are not applied before the shift phase, which is the phase that
allows nearest neighbor communication. ose two phases run synchronously across the FPGA,
as depicted in Figure 6.10.

6.2.3 PERFORMANCE CHARACTERISTICS

In their evaluation, Woods et al. [2013] showed interesting performance characteristic of parallel BNL running on an FPGA. ey evaluated their FPGA-based skyline operator against a

6. ACCELERATED DB OPERATORS
eval.

shift

eval.

shift

eval.

shift

eval.

shift

eval.

shift

throughput (tuples/sec)

Figure 6.10: Two-phase processing in parallel BNL. Circles represent processing elements. A processing element consists of logic, storage for the window tuple, and communication channels to neighboring processing elements.

109

BNL Software
BNL FPGA

108

CPU
FPGA

107

. 0

41 M tuples/sec
25 ms exec. time

17 M tuples/sec
61 ms exec. time

8
16 32 64 128 256
window size : number of tuples

Figure 6.11: Correlated dimensions resulting in a small skyline. Performance is memory bound.

single-threaded software implementation, as well as a state-of-the-art multi-threaded skyline implementation [Park et al., 2009]. As we will discuss, the comparisons with the single-threaded
software implementation (Figure 6.11 and Figure 6.12) highlight several fundamental dierences
between CPUs and FPGAs.
For their experiments, Woods et al. [2013] use data sets with common data distributions to
evaluate skyline computation. Figure 6.11 shows the results for a data set, where the dimensions of
the tuples are correlated. is means that the values in all dimensions of a tuple are similar resulting
in very few skyline tuples that dominate all other tuples. As can been seen in the gure, both
implementations achieve a throughput that is close to the maximum possible throughput given
by the memory subsystem (dashed lines). e CPU achieves better performance here because it
has the more ecient memory subsystem. e total number of dominance tests is very low, i.e.,
essentially what is measured is how fast main memory can be read/written on the given platform.
By contrast, the results displayed in Figure 6.12 for a data set where the dimensions of the
tuples are anti-correlated, present the two implementations in a dierent light. Anti-correlated
means that tuples with high values in some dimensions are likely to have low values in other

throughput (tuples/sec)

6.2. SKYLINE OPERATOR

105

BNL Software
BNL FPGA

32 K tuples/sec
32 sec exec. time

104
103

. 0

1.8 K tuples/sec
579 sec exec. time

8
16 32 64 128 256
window size : number of tuples

Figure 6.12: Anti-correlated dimensions resulting in a large skyline. Performance is compute bound.

dimensions, i.e., there are many incomparable tuples leading to many dominance tests. Hence,
skyline computation is now heavily compute-bound.
For the software variant, increasing the window size has little eect on throughput because
the number of total dominance tests stays roughly the same independent of the window size. For
the FPGA implementation, on the other hand, increasing the window means adding processing
elements, i.e., compute resources, which is why throughput increases linearly with the window
size.
Notice that the computation of the skyline for the anti-correlated data set is signicantly
more expensive, e.g., the best execution time of the CPU-based version has gone from 18 milliseconds to almost 10 minutes. is slowdown is due to the increased number of comparisons
since all skyline tuples have to be pairwise compared with each other. us, the workloads where
the FPGA excels are also the ones where acceleration is needed most.
As mentioned previously, Woods et al. [2013] also compared their implementation against
PSkyline [Park et al., 2009], a state-of-the-art parallel skyline operator for multicores. e performance achieved with a low-end FPGA was comparable to the one of PSkyline on a 64-core Dell
PowerEdge Server using 64 threads. However, notice that there is a signicant dierence in price
(FPGA = $750 versus Dell = $12,000), as well as power consumption between the two systems.
Moreover, with 192 processing elements a throughput of 32,000 tuples/sec (anti-correlated distribution) is reached on the FPGA. is is more than two orders of magnitude below the upper
bound of 17 million tuples/sec (cf. Figure 6.11), i.e., with a larger FPGA, there is still a lot of
leeway to further increase performance by adding more processing elements.

Two tuples are incomparable if neither tuple dominates the other one.
FPGAs use between one and two orders of magnitude less power than CPUs.

CHAPTER

Secure Data Processing

So far, we have manly highlighted advantages of FPGAs with respect to performance, e.g., for
low-latency and/or high-volume stream processing. An entirely dierent area where FPGAs provide a number of benets is secure data processing. Especially in the cloud computing era, guaranteeing data condentiality, privacy, etc., are increasingly important topics. In this chapter, we
rst compare FPGAs to CPUs and ASICs regarding security aspects before we describe a few
key features of modern FPGAs that allow them to be used as even more secure co-processors.
Finally, we discuss a use caseCipherbasea recent project that achieves data condentiality in
Microsofts SQL Server using FPGAs.

7.1

FPGAS VERSUS CPUS

CPU-based systems are typically very complex, both in respect to hardware and software. is
means that there are many possibilities for attacks, i.e., bugs in the operating system, the device
drivers, the compiler, hardware components, etc., can all be exploited to attack the system, and
as a result it is very dicult to make such systems secure. FPGAs have a much smaller attack
surface.

7.1.1 VON NEUMANN ARCHITECTURE

Most CPU-based systems today are designed according to the Von Neumann architecture, i.e., data
and executable code are stored in the same contiguous memory. is is a major security problem,
allowing for various code injection attacks. On an FPGA, it is easy to physically separate data and
program, i.e., a program can be implemented in logic, while data is stored in on-chip dedicated
memory.
Another problem is isolating multiple applications from each other. Again, in an FPGA
two applications could run in physically isolated regions of the FPGA, only sharing perhaps some
I/O channel to which access can be granted in a secure manner. On a multicore CPU this is
virtually impossible, i.e., even if two applications are executed on dierent cores, those cores will
still share a lot of resources such as caches, memory, I/O devices, etc.
7.1.2 TRUSTED PLATFORM MODULE (TPM)
ere have been numerous attempts in the past to make CPU-based software systems more secure.
For instance, the Trusted Platform Module (TPM) chip is a secure cryptoprocessor that can store

7. SECURE DATA PROCESSING

cryptographic keys. Together with the BIOS, the TPM chip provides a root of trust that can
be used for authenticated boot. However, one limitation of this approach is that while software is
authenticated when loaded, there is no protection against modication of the code at runtime.
Circuits running on an FPGA are much more dicult to tamper with, especially if the security
features that we will discuss in Section 7.3 are enabled.
Another limitation is that the TPM chip cannot do encryption/decryption on its own. It
can only transfer the cryptographic keys to a region in memory, which the BIOS is supposed to
protect. Encryption/decryption are then performed by the processor. However, the BIOS cannot
always protect main memory, e.g., with physical access to a computer an attacker can retrieve
the encryption keys from a running operating system with a so-called cold boot attack. is is an
attack that relies on the fact that DRAM contents are still readable for a short period of time
after power supply has been removed. In Section 7.4, we will discuss how FPGAs can be used
as secure cryptoprocessors capable of encrypting/decrypting data, such that plaintext data never
leave the chip.

7.2

FPGAS VERSUS ASICS

With respect to protecting intellectual property, FPGAs have an important advantage over
ASICs. If circuit designer and circuit manufacturer are two dierent parties, then the designer
needs to provide the manufacturer with the sensitive circuit description. In fact, very few companies such as Samsung or IBM design circuits and also own high-end semiconductor foundries
to produce them. Most semiconductor companies are fabless. FPGAs allow a company to benet
from the latest manufacturing technology while keeping their circuit designs fully condential.
Another threat to which ASICs are susceptible is destructive analysis, where each layer of the
device is captured to determine its functionality. is technique is not applicable to determining
the functionality of the circuit loaded onto an FPGA since the entire circuit specication is stored
in on-chip conguration memory, i.e., there are no physical wires and gates of such a circuit that
can be analyzed.
While circuits running on FPGAs are better protected than ASICs against reverse engineering threats as the ones described above, FPGAs exhibit other vulnerabilities. For example, by
intercepting the bitstream during conguration of an FPGA the design could relatively easily be
cloned or tampered with. In the next section, we discuss common mechanisms that guard FPGAs
from such and other attacks.

7.3

SECURITY PROPERTIES OF FPGAS

FPGAs provide a number of security features to protect intellectual property (i.e., the circuit
specication uploaded to the FPGA) from reverse engineering, tampering, and counterfeiting. In
this section, we highlight the most important security-related features of modern FPGAs.

7.3. SECURITY PROPERTIES OF FPGAS

7.3.1 BITSTREAM ENCRYPTION

A number of FPGAs ship with an on-chip decryption engine to support encrypted bitstreams.
Typically, the advanced encryption standard (AES) is implemented, which is a state-of-the art
block cipher. e software that generates the bitstream also encrypts it before sending it to the
FPGA. e FPGA then decrypts the bitstream on the chip before it writes the conguration to
the corresponding locations. e key used for decryption is loaded into the FPGA once via the
JTAG port, i.e., physical access to the FPGA is required. en the key is stored using either onchip (battery-backed) memory or non-volatile key storage such as OTP fuses. Hence, turning o
power will not erase the key. Bitstream encryption ensures the condentiality of a circuit design,
preventing reverse engineering and device cloning.
7.3.2 BITSTREAM AUTHENTICATION
Besides bitstream encryption, Xilinx also introduced bitstream authentication with their Virtex6 series. us, an on-chip keyed-HMAC algorithm implemented in hardware ensures that only
authenticated users can modify an existing conguration or overwrite it with a dierent bitstream.
During bitstream generation a message authentication code (MAC) is generated and embedded in
the encrypted bitstream together with the HMAC key. During device conguration the MAC
is recomputed on the FPGA using the provided HMAC key. In case the MAC computed by
the FPGA does not match the MAC provided in the encrypted bitstream there are two options:
either a fall-back bitstream is loaded or the conguration logic is disabled all together. HMAC
ensures that the bitstream loaded into the FPGA has not been altered, preventing tampering of
the bitstream, i.e., even single bit ips are detected. is protects an application running on an
FPGA from attacks such as spoong and trojan horse attacks.
7.3.3 FURTHER SECURITY MECHANISMS
Besides bitstream encryption and authentication FPGAs incorporate several other security mechanisms to guard against dierent kinds of attacks. For instance, the JTAG portintended for
device conguration and debuggingcould be misused to reverse engineer the functionality of a
specic design. erefore, access to the JTAG port is usually highly restricted. In general, dedicated logic ensures that conguration memory cannot be read back via any external interface
including the JTAG interface.
Some FPGA vendors go even one step further and continuously compute a CRC of the
conguration data in the background, which is used to verify that no bits of conguration data
have changed. A bit ip in the conguration data could be caused by a so-called single-event upset
due to the impact of a high-energy neutron. us, monitoring the CRC of conguration data on
a regular basis even allows detecting highly sophisticated side channel attacks at runtime.

7. SECURE DATA PROCESSING

7.4

FPGA AS TRUSTED HARDWARE

e Cipherbase system [Arasu et al., 2013] extends Microsofts SQL Server with customized
trusted hardware built using FPGAs. e project targets data condentiality in the cloud. Cloud
computing oers several advantages that make it attractive for companies to outsource data processing to the cloud. However, a major concern is processing sensitive data. Companies might
not trust a cloud provider to keep their data condential. In fact, in some cases a company might
even want to protect highly condential data from its own employees.

7.4.1 FULLY HOMOMORPHIC ENCRYPTION WITH FPGAS

Fully homomorphic encryption [Gentry, 2010] is a way to encrypt data such that certain types
of computation can be executed directly on the ciphertext, i.e., without the need to decrypt the
ciphertext. For instance, addition of two encrypted numbers would directly produce the encrypted
result, i.e.:
Decrypt.Encrypt.A/ C Encrypt.B// D A C B
Since a program would never need to decrypt the data it is processing, fully homomorphic encryption would allow to safely execute computations on encrypted data in an untrusted environment.
Unfortunately, fully homomorphic encryption is prohibitively expensive in practice [Bajaj and
Sion, 2011].
In Cipherbase, fully homomorphic encryption is simulated by integrating closely coupled,
trusted hardware into an untrusted system. Cipherbase considers an FPGA as a trusted black box
that can compute a number of operations on data that are encrypted using a non-homomorphic
encryption scheme such as AES (advanced encryption standard). at is, the FPGA actually
decrypts data internally, computes the operation, and encrypts the result again before it leaves the
trusted hardware, as illustrated below:
EncryptedAES (A)
EncryptedAES (B)
Operation (+)

trusted
hardware

FPGA

EncryptedAES (A+B)

7.4.2 HYBRID DATA PROCESSING

In Cipherbase, the FPGA that implements the trusted cryptoprocessor is connected to the host
server (running MS SQL Server) via PCI Express. e bandwidth of the PCI Express bus is
lower than the bandwidth of the servers memory system, i.e., a performance penalty is associated
with using the trusted cryptoprocessor. erefore, a design goal of Cipherbase is to minimize data
transfers between host and FPGA, and to do as little computation on the FPGA as possible to
meet given security constraints.

7.4. FPGA AS TRUSTED HARDWARE

Table 7.1: Typical plaintext operations and corresponding primitives to execute the same operations
on ciphertext via FPGA.

Plaintext Operation
AD5
ACB
hash
ADB
sum.B/
Index lookup
Range lock

Primitive(s) executed on FPGA

Dec.fAgAES / D Dec.f5gAES /
Enc.Dec.fAgAES / C Dec.fBgAES //
Hash.Dec.fAgAES //; Dec.fAgAES / D Dec.fBgAES /
Enc.Dec.fBgAES / C Dec.fpart i alsumgAES //
FindPos.Dec.fkgAES /; hDec.fk1 gAES /; : : : ; Dec.fkn gAES /i/
Dec.fvgAES / 2 Dec.flgAES /; Dec.fhgAES /

We can distinguish several forms of hybrid data processing in Cipherbase, meaning that
some parts of the processing are handled on commodity hardware while other parts are executed
in the secure environment on the FPGA. First of all, users can specify the level of condentiality
guarantees at a column granularity. For example, when processing employee records, the salary
eld might require strong condentiality guarantees while employer address might not. Hence,
columns and tables that have no condentiality restrictions will not be processed by the trusted
hardware, allowing for a more ecient execution.
Furthermore, even individual operators can be broken down into the parts that need to
be executed by the cryptoprocessor and others that can be handled unprotected by the standard
database software. For instance, consider (B-tree) index processing, where a condential column
is indexed. Searching within an index page requires decrypting condential keys, i.e., needs to be
executed by the cryptoprocessor. On the other hand, many other index operations such as concurrency control, buer pool management, recovery, etc., can be handled outside the crpytoprocesor.
Since FPGA and host system are tightly coupled, a few primitives (cf. Table 7.1) that can be
called from the host system and are implemented on the FPGA are sucient to extend a complete
database system to support fully homomorphic encryption, as described above. Table 7.1 shows
how primitives for decryption (Dec./), encryption (Enc./), and expression evaluation are invoked
by the host system to execute a given operation on encrypted data. Coming back to our index
processing example, for each index page the FindPos./ primitive would be called to determine the
next page to visit. All other indexing logicwith the exception of checking key range lockscan
be handled by the standard database software running on the host.

Encrypted data are denoted f: : : gAES assuming an AES encryption scheme.

To check range locks another primitive needs to be invoked: Dec.fvgAES / 2 Dec.flgAES /; Dec.fhgAES /.

7. SECURE DATA PROCESSING

Table 7.2: Stack machine for the expression Dec.f$0gAES / D Dec.f$1gAES /.

Id
1
2
3
4
5

Instruction

GetData $0
Decrypt
GetData $1
Decrypt
Compare

7.4.3 TRUSTED HARDWARE IMPLEMENTATION

In Cipherbase, the main reason to use FPGAs is security rather than performance, i.e., expressiveness and simplicity are more important goals than speed of execution. On an FPGA, there
is always a trade-o between expressiveness and performance. In Cipherbase it is of utmost importance that the system is complete. Notice that expressions in a projection, for instance, can be
arbitrarily complex. Hence, to support all possible SQL queries the trusted hardware engine was
designed as a stack machine, i.e., a restricted, special-purpose processor that runs inside the FPGA.
e stack machine can execute a number of special instructions (e.g., for encryption, decryption,
etc.), where each one is directly implemented in hardware.
When a query is compiled by the modied MS SQL Server, code is generated for the stack
machine that runs on the FPGA. Each instruction is directly implemented in hardware and only
parameters can be supplied at runtime. An example is given in Table 7.2, where the stack machine
code is shown that evaluates a restriction expression of the form Dec.f$0gAES / D Dec.f$1gAES /
on encrypted data. e compiler generates a sequence of instructions, of which some can have
parameters (e.g., $0 and $1 in the example represent parameters).
e mechanisms described so far ensure that decrypted data never leave the FPGA. Nevertheless, access patterns are not hidden. While for many applications this is acceptable, Cipherbase
also introduces the notion of oblivious operators that even hide data access patterns for applications that require a higher degree of condentiality. It is beyond the scope of this book to discuss
oblivious operators, and we therefore refer to the work of Arasu et al. [2013], Goodrich [2011]
for a thorough discussion of the topic.

CHAPTER

Conclusions
Almost 50 years have passed since Gordon Moore observed that the number of transistors per
integrated chip would double approximately every 2 years. roughout these years, hardware technology followed the exponential growth with remarkable precision, and there is no indication that
the trend will change any time soon.
e consequences of Moores Law on chip technology, however, have changed dramatically
over time. As already sketched in Chapter 1, Moores Lawin combination with Dennard scaling allowed to drive up clock frequencies and micro-architecture sophistication for many years.
Power constraints set an end to this approach some ten years ago and hardware designers started
to focus on multi-core architectures instead: additional transistors oered by Moores Law were now
turned into replicated on-chip processors (cores).
Multi-core computing never solved the power consumption problem, however. At best, the
now-prevalent way of leveraging hardware parallelism can be considered a temporary mechanism
to mitigate the fundamental problem. A one-to-one conversion of Moores dividend into additional CPU cores would incur an exponentially growing power consumption. But todays chips
already operate at the limit in terms of heat dissipation and cooling.

POWER CONSTRAINTS AND DARK SILICON

Taylor [2012] discussed the search space for escapes to the urging power constraints and identied
four basic strategies that he termed the four horsemen of the coming dark silicon apocalypse:
e Shrinking Horseman. Increased transistor densities could be used to reduce chip sizes
and thus costrather than to improve performance. e strategy does not per se oer a
solution to the increased performance demand coming from the application side.
e Dim Horseman. An excess power budged from additional CPU cores can be avoided by
clocking down all cores in the chip. Since power consumption grows super-linearly with the
clock frequency, a chip with many low-frequency cores may be more power-ecient than
the alternative of fewer high-frequency cores.
e Specialized Horseman. Rather than replicating the same core over and over, the chip could
be populated with a diverse mix of processing units, each one (energy-)optimized for a
particular task. Only few of these can be active together at any time (to meet the power
budget). But hardware specialization will likely lead to a higher net energy eciency.

8. CONCLUSIONS

e Deus Ex Machina Horseman. ere might be an entirely unforeseen escape to the power
limitation, e.g., by leveraging technologies other than the current MOSFET. However,
waiting for miracles in the future can hardly be considered a strategy for problems that
applications are suering already today.

In this solution space, FPGAs could be considered a variation of the specialized horseman.
But by contrast to the strategy in Taylors narrow sense, an FPGA-based realization avoids the
need to decide on a set of specialized units at chip design time. Rather, new or improved specialized units could be added to the portfolio of the system at any time. At the same time, no chip
resources are wasted for specialized functionality that a particular system installation may never
actually need (e.g., a machine purely used as a database server will hardly benet from an on-chip
H 264 video decoder).

FPGAS VS. ASICS

Recongurable hardware always comes at a price, however. In particular, a circuit realized in an
FPGA will never be competitive with an implementation of the same functionality in dedicated
custom hardwareneither in terms of raw performance, nor in terms of energy eciency. Kuon
and Rose [2007], for instance, quantied the penalty of recongurability to about 3 in performance and 14 in energy eciency.
However, continuing integration densities will work more in favor of recongurable hardware than in favor of hard-wired alternatives, mainly for two reasons.
Reason 1: What to Specialize For? As transistor count continues to increase, more and more
tasks will have to be moved to specialized components (otherwise, technology will soon no longer
benet from Moores Law again). e need to continuously specialize, however, will emphasize
the challenge of deciding what to specialize for. As hardware moves toward specialized designs,
the rst specializations are rather easy to decide (for instance, a oating-point unit will benet a
wide range of applications at relatively little cost). But eectiveness will quickly wane for those
specializations that come after. As mentioned before, this limitation will not arise in congurable
hardware, since any application can instantiate just those accelerators that it actually can benet
from.
Reason 2: I/O Limitations While available logic resourcesfor FPGAs as well as for ASICs
are expected to still grow exponentially for at least some more years, the number of pins per
chipand thus its I/O capacityhas been mostly stagnating for years already. is leads to a
widening I/O gap between an increased compute capacity inside the chip and a lack of bandwidth
to feed the logic with data.
Eectively, conventional circuits have growing diculties to actually benet from all their
compute resources purely for bandwidth reasons (even if they were not constrained by energy). In
such a situation, the better chip space eciency of ASICs no longer provides an advantage over
congurable hardware.

In summary, the fundamental FPGA technologywe sketched the combination of lookup

tables and other components in Chapter 3has been around for quite some time, but remained
mostly a niche technology throughout most of this time. However, the conditions around the
technology have changed only recently. ASIC technology is reaching limits (as discussed above),
whereas todays FPGA chip sizes allows competitive functionality to be realized in congurable
hardware.

OPEN CHALLENGES
Under these premises, we expect the relevance of FPGAs in computing systems to increase in
the coming years. Hardware makers already demonstrated various ways to integrate congurable
logic with commodity CPUs. Hybrid chipswhere CPU(s) and congurable logic sit on the
same dieare available today, commercially and at volume.
It is thus no longer a question whether or not FPGAs will appear in (mainstream) computing systems. Rather, the database community should begin to worry about how the potential
of FPGAs can be leveraged to improve performance and/or energy eciency.
Tools and Libraries Several decades of research and development work have matured the software world to a degree that virtually any application eld receives a rich set of support by tools,
programming languages, libraries, but also design patterns and good practices. Hardware development is signicantly more complex and has not yet reached the degree of convenience that
software developers have long become used to.
FPGA development still has a rather steep learning curve and many software people shy
away from the technology, because they cannot see the quick progress that they are used to from
their home eld. is is unfortunate not only because the potential of FPGA technologyonce
the entrance fee has been paidis high, but also because hardware/software co-design has so far
mostly been left to the hardware community. Clearly, the concept could benet a lot from experience and technology (e.g., in software compilers, to name just one) that have become ubiquitous
in the software world.
Systems Architectures As discussed in Chapter 4, the nal word on what is the best system
architecture for hybrid CPU/FPGA processing has not yet been spoken. Moreover, data processing
engines would likely benet also from even more classes of modern system technology, including
graphics processors (GPUs), massively parallel processor arrays (MPPAs) [Butts, 2007], or smart
memories [Mai et al., 2000]. However, only small spots in the large space of possible system
architectures have been explored, yet.
Finding a system architecture that brings together the potential of hardware and the requirements of (database) applications requires a fair amount of experimentation, systems building, and evaluation. Early resultssome of which we also sketched in this bookare promising.
But they also still show rough edges that need to be ironed out before they become attractive for
practical use.

8. CONCLUSIONS

FPGAs as a Technology Enabler Specialization and the use of FPGAs is often seen as a mechanism to improve quantitative properties of a data processing engine, e.g., its throughput, its
latency, or its energy eciency.
In Chapter 7, we showed that there are also scenarios where FPGAs can act as an enabler
for functionality that cannot be matched with commodity hardware. With their outstanding exibility, FPGAs might serve as an enabler also in further ways. For instance, placing congurable
logic into the network fabric or near storage modules might open new opportunities that cannot
be realized by conventional means.
Flexibility vs. Performance In Chapter 4, we discussed trade-os between runtime performance,
exibility, and re-conguration speed in FPGA designs. Circuit and systems design in general
face very similar trade-os.
General-purpose CPUs were always designed for exibility, with application performance
or energy eciency only as secondary goals. ese priorities are necessary with a technology and
a market where one size must t all.
Designers of FPGA circuits, however, are not bound to this prioritization. Rather, their
circuit typically has to support just one very specic application type and the circuit can be rebuilt whenever it is no longer adequate for the current application load. And specialization usually
yields sucient performance advantages, so a slow path (e.g., using a general-purpose processor)
for exceptional situations will not seriously aect performance.
Given this freedom to re-decide on trade-os, it is really not clear which point in the design
space should be chosen for a particular use case. As we already discussed in Chapter 4, existing
research mostly explored congurations with fairly extreme decisions. We think, however, that
the true potential of FPGA lies in the open space in-between. Most likely, the sweet spot is when
dierent processing units, composed of one or more FPGA units; one or more general-purpose
CPUs; and potentially even more units, operate together in the form of a hybrid system design.

APPENDIX

Commercial FPGA Cards

Most commonly, FPGAs are mounted onto PCI Express plug-in cards. ese cards are tailored at
specic purposes (e.g., networking) but usually also have some general purpose extension slots to
support pluggable daughter cards that can directly interface with the high-speed I/O transceivers
of the FPGA.

A.1

NETFPGA

e NetFPGA project originated at Stanford University in 2007. It is an open source hardware

and software platform targeted at networking research. e rst generation development platform is called NetFPGA-1G. It is a PCI card with a Xilinx Virtex-II FPGA and a 4 1 Gbps
networking interface. NetFPGA-10G is the latest development platform. Besides a much more
powerful Xilinx Virtex-5 FPGA it also provides signicantly more bandwidth, e.g., 40 Gigabit
per second network bandwidth (4 10 Gbps Ethernet ports), and theoretical 4 Gigabyte per second maximum (bi-directional) PCI Express bandwidth (8 lanes, PCIe 2.0, 500MB/s per lane).
Network data are routed via high-speed I/O transceivers directly into the FPGA chip,
where it can be processed at full line-rate without dropping packets, etc., making these cards
very attractive for a variety of networking applications ranging from switches and routers to content processing applications such as deep packet inspection and network intrusion detection. e
NetFPGA project is the only one of its kind, in the sense that it supports an active community
of hardware and software developers that contribute to the project in the form of open source IP
cores and reference designs. is signicantly increases productivity for other developers.
NetFPGA has been successful also for commercial use, however. In particular, Algo-Logic,
a US-based company, utilizes NetFPGA for applications in the area of low-latency trading applications [Lockwood et al., 2012].

A.2

SOLARFLARES APPLICATIONONLOAD ENGINE

Solarare is one of the leading suppliers of low-latency Ethernet. e company provides

10 Gigabit Ethernet adapters, primarily targeting the nancial markets. Recently, Solarare has
launched a new productthe ApplicationOnload Engine (AOE)that combines one of their
network interface cards (NICs) with an Altera Stratix V (GX A5) FPGA. e AOE (SFA6902F)
provides two 10 Gbps Ethernet ports, PCI Express (8 lanes, PCIe 2.0, 500MB/s per lane), and
four SODIMM DDR3 memory sockets (supporting up to 16 GB each). Form a hardware per-

A. COMMERCIAL FPGA CARDS

spective, Solarares AOE and the NetFPGA are conceptually similar, however, the focus is different. While NetFPGA is a very FPGA-centric project, in Solarares AOE the FPGA is added
to an existing product as a bump-in-the-wire co-processor. at is, the existing software stack
for Solarfalres NICs still runs on AOE, and only users with extreme performance demands, say
for high-frequency trading, will start moving parts of the application into the FPGA on the NIC.
is makes the transition to an FPGA-based system very smooth.

A.3

FUSION I/OS IODRIVE

Fusion I/O operates in the PCIe SSD market. Solid state drives (SSDs) access ash storage
via SATA/SAS interface, which were designed for hard disk access. Fusion I/Os ioDrive cards
allow direct access to a ash memory storage tier via PCI Express, oering lower latency and
better overall performance than commodity SSDs. Since Fusion I/O is a young company with a
revolutionary product, they decided to implement the ash controller on the ioDrive card using
an FPGA, rather than an ASIC. is allows the company to easily modify the controller, and
provide hardware-updates to their customers. However, notice that here the FPGA really is a
means to an end, i.e., ioDrive is a pure storage solution, and it is not intended that users program
the FPGA themselves.

Bibliography
Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur etintemel, Mitch Cherniack,
Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stanley B. Zdonik. e design of the Borealis stream processing
engine. In Proc. 2nd Biennial Conf. on Innovative Data Systems Research, pages 277289, January 2005. 70
Arvind Arasu, Spyros Blanas, Ken Eguro, Raghav Kaushik, Donald Kossmann, Ravi Ramamurthy, and Ramaratnam Venkatesan. Orthogonal security with Cipherbase. In Proc. 6th
Biennial Conf. on Innovative Data Systems Research, January 2013. 7, 86, 88
Joshua Auerbach, David F. Bacon, Ioana Burcea, Perry Cheng, Stephen J. Fink, Rodric Rabbah,
and Sunil Shukla. A compiler and runtime for heterogeneous computing. In Proc. 49th Design
Automaton Conference, pages 271276, June 2012. DOI: 10.1145/2228360.2228411 40
Sumeet Bajaj and Radu Sion. Trusteddb: a trusted hardware based database with privacy and
data condentiality. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 205
216, June 2011. DOI: 10.1145/1989323.1989346 86
Zoran Basich and Emily Maltby. Looking for the next big thing? ranking the top 50 start-ups.
e Wall Street Journal, September 2012. 32
Stephan Brzsnyi, Donald Kossmann, and Konrad Stocker. e skyline operator. In Proc. 17th
Int. Conf. on Data Engineering, 2001. DOI: 10.1109/ICDE.2001.914855 77
Mike Butts. Synchronization through communication in a massively parallel processor array.
IEEE Micro, 27(5):3240, September 2007. DOI: 10.1109/MM.2007.4378781 91
Christopher R. Clark and David E. Schimmel. Scalable pattern matching for high speed networks. In Proc. 12th IEEE Symp. on Field-Programmable Custom Computing Machines, pages
249257, April 2004. DOI: 10.1109/FCCM.2004.50 69
Graham Cormode and Marios Hadjieleftheriou. Finding frequent items in data streams. Proc.
VLDB Endowment, 1(2):15301541, 2008. DOI: 10.1007/3-540-45465-9_59 46
NVIDIA Corp. NVIDIAs next generation CUDA compute architecture: Kepler GK110,
2012. White Paper; version 1.0. 49

BIBLIOGRAPHY

Sudipto Das, Shyam Antony, Divyakant Agrawal, and Amr El Abbadi. read cooperation
in multicore architectures for frequency counting over multiple data streams. Proc. VLDB
Endowment, 2(1):217228, August 2009. 46
Jerey Dean and Sanjay Ghemawat. MapReduce: Simplied data processing on large clusters.
In Proc. 6th USENIX Symp. on Operating System Design and Implementation, pages 137150,
December 2004. DOI: 10.1145/1327452.1327492 42
R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc. Design of ionimplanted mosfets with very small physical dimensions. Solid-State Circuits, IEEE Journal of,
9(5):256268, October 1974. DOI: 10.1109/JSSC.1974.1050511 1
Christopher Dennl, Daniel Ziener, and Jrgen Teich. On-the-y composition of FPGA-based
SQL query accelerators using a partially recongurable module library. In Proc. 20th IEEE
Symp. on Field-Programmable Custom Computing Machines, pages 4552, May 2012. DOI:
10.1109/FCCM.2012.18 35, 68, 69
David J. DeWitt. DIRECTa multiprocessor organization for supporting relational database
management systems. IEEE Trans. Comput., c-28(6):182189, June 1979.
DOI: 10.1109/TC.1979.1675379 70
Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug
Burger. Dark silicon and the end of multicore scaling. In Proc. 38th Annual Symp. on Computer
Architecture, pages 365376, 2011. DOI: 10.1145/2024723.2000108 4
Robert W. Floyd and Jerey D. Ullman. e compilation of regular expressions into integrated
circuits. J. ACM, 29(3):603622, July 1982. DOI: 10.1145/322326.322327 54
Phil Francisco. e Netezza Data Appliance Architecture: A platform for high performance data
warehousing and analytics. Technical Report REDP-4725-00, IBM Redguides, June 2011. 6,
36
Craig Gentry. Computing arbitrary functions of encrypted data. Commun. ACM, 53(3):97105,
March 2010. DOI: 10.1145/1666420.1666444 86
Michael T. Goodrich. Data-oblivious external-memory algorithms for the compaction, selection,
and sorting of outsourced data. In Proc. 23rd Annual ACM Symp. on Parallelism in Algorithms
and Architectures, June 2011. DOI: 10.1145/1989493.1989555 88
Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming C. Lin, and Dinesh Manocha. Fast
computation of database operations using graphics processors. In Proc. ACM SIGMOD Int.
Conf. on Management of Data, pages 215226, June 2004. DOI: 10.1145/1007568.1007594
49

BIBLIOGRAPHY

Naga K. Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha. GPUTeraSort:
High performance graphics co-processor sorting for large database management. In Proc.
ACM SIGMOD Int. Conf. on Management of Data, pages 325336, June 2006. DOI:
10.1145/1142473.1142511 49
David J. Greaves and Satnam Singh. Kiwi: Synthesis of FPGA circuits from parallel programs.
In Proc. 16th IEEE Symp. on Field-Programmable Custom Computing Machines, pages 312,
April 2008. DOI: 10.1109/FCCM.2008.46 41
Anthony Gregerson, Amin Farmahini-Farahani, Ben Buchli, Steve Naumov, Michail Bachtis,
Katherine Compton, Michael Schulte, Wesley H. Smith, and Sridhara Dasu. FPGA design analysis of the clustering algorithm for the CERN Large Hadron Collider. In Proc. 17th
IEEE Symp. on Field-Programmable Custom Computing Machines, pages 1926, 2009. DOI:
10.1109/FCCM.2009.33 70
Bingsheng He, Ke Yang, Rui Fang, Mian Lu, Naga K. Govindaraju, Qiong Luo, and Pedro V.
Sander. Relational joins on graphics processors. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 511524, June 2008. DOI: 10.1145/1376616.1376670 49
Martin C. Herbordt, Yongfeng Gu, Tom VanCourt, Josh Model, Bharat Sukhwani, and Matt
Chiu. Computing models for FPGA-based accelerators. Computing in Science and Engineering,
10(6):3545, 2008. DOI: 10.1109/MCSE.2008.143 24
Mark D. Hill and Michael R. Marty. Amdahls law in the multicore era. IEEE Computer, 41(7):
3338, July 2008. DOI: 10.1109/MC.2008.209 3
Amir Hormati, Manjunath Kudlur, Scott A. Mahlke, David F. Bacon, and Rodric M. Rabbah.
Optimus: Ecient realization of streaming applications on FPGAs. In Proc. Intl Conf. on
Compilers, Architecture, and Synthesis for Embedded Systems, pages 4150, October 2008. DOI:
10.1145/1450095.1450105 40, 41
Intel Corp. e Intel Xeon Phi coprocessor 5110P, 2012. Product Brief; more information
at http://www.intel.com/xeonphi. 49
Hubert Kaeslin. Digital Integrated Circuit Design. Cambridge University Press, 2008. ISBN
978-0-521-88267-5. 44, 48
Adam Kirsch and Michael Mitzenmacher. e power of one move: Hashing schemes for hardware. IEEE/ACM Transactions on Networking, 18(6):17521765, December 2010. DOI:
10.1109/TNET.2010.2047868 66
Dirk Koch and Jim Torresen. FPGASort: A high performance sorting architecture exploiting
run-time reconguration on FPGAs for large problem sorting. In Proc. 19th ACM SIGDA Int.
Symp. on Field Programmable Gate Arrays, 2011. DOI: 10.1145/1950413.1950427 73, 74, 75,
76

BIBLIOGRAPHY

Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. IEEE
Trans. Computer-Aided Design of Integrated Circuits, 26(2), February 2007. DOI: 10.1109/TCAD.2006.884574 90
Zhiyuan Li and Scott Hauck. Conguration compression for Virtex FPGAs. In Proc. 9th IEEE
Symp. on Field-Programmable Custom Computing Machines, pages 147159, April 2001. DOI:
10.1109/FCCM.2001.19 68
John W. Lockwood, Adwait Gupte, Nishit Mehta, Michaela Blott, Tom English, and Kees
Vissers. A low-latency library in FPGA hardware for high-frequency trading (HFT). In
IEEE 20th Annual Symp. on High-Performance Interconnects, pages 916, August 2012. DOI:
10.1109/HOTI.2012.15 62, 93
Anil Madhavapeddy and Satnam Singh. Recongurable data processing for clouds. In Proc. 19th
IEEE Symp. on Field-Programmable Custom Computing Machines, pages 141145, May 2011.
DOI: 10.1109/FCCM.2011.35 6
Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, and Mark Horowitz. Smart
memories: A modular recongurable architecture. In Proc. 27th Symp. on Computer Architecture,
pages 161171, June 2000. DOI: 10.1145/342001.339673 91
Robert McNaughton and Hisao Yamada. Regular expressions and state graphs for automata.
IEEE Trans. Electr. Comp., 9:3947, 1960. DOI: 10.1109/TEC.1960.5221603 52
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. An integrated ecient solution for
computing frequent and top-k elements in data streams. ACM Trans. Database Syst., 31(3):
10951133, September 2006. DOI: 10.1145/1166074.1166084 46, 47
Abhishek Mitra, Marcos R. Vieira, Petko Bakalov, Vassilis J. Tsotras, and Walid A. Najjar. Boosting XML ltering through a scalable FPGA-based architecture. In Proc. 4th Biennial Conf. on
Innovative Data Systems Research, January 2009. 69
Roger Moussalli, Mariam Salloum, Walid A. Najjar, and Vassilis J. Tsotras. Massively parallel
XML twig ltering using dynamic programming on FPGAs. In Proc. 27th Int. Conf. on Data
Engineering, pages 948959, April 2011. DOI: 10.1109/ICDE.2011.5767899 69
Rene Mueller, Jens Teubner, and Gustavo Alonso. Streams on wiresa query compiler for FPGAs. Proc. VLDB Endowment, 2(1):229240, August 2009. 62, 63, 65, 66
Rene Mueller, Jens Teubner, and Gustavo Alonso. Glacier: A query-to-hardware compiler. In
Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 11591162, June 2010. DOI:
10.1145/1807167.1807307 62
Rene Mueller, Jens Teubner, and Gustavo Alonso. Sorting networks on FPGAs. VLDB J., 21
(1):123, February 2012. DOI: 10.1007/s00778-011-0232-z 71, 72, 73

BIBLIOGRAPHY

Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. In Proc. 9th European Symp. on
Algorithms, pages 121133, August 2001. DOI: 10.1007/978-0-387-30162-4_97 66
Sungwoo Park, Taekyung Kim, Jonghyun Park, Jinha Kim, and Hyeonseung Im. Parallel skyline
computation on multicore architectures. In Proc. 25th Int. Conf. on Data Engineering, 2009.
DOI: 10.1109/ICDE.2009.42 80, 81
Oliver Pell and Vitali Averbukh. Maximum performance computing with dataow engines. Computing in Science and Engineering, 14(4):98103, 2012. DOI: 10.1109/MCSE.2012.78 49
Mohammad Sadoghi, Harsh Singh, and Hans-Arno Jacobsen. Towards highly parallel event
processing through recongurable hardware. In Proc. 7th Workshop on Data Management on
New Hardware, pages 2732, June 2011. DOI: 10.1145/1995441.1995445 69
David Schneider. e microsecond market. IEEE Spectrum, 49(6):6681, June 2012. DOI:
10.1109/MSPEC.2012.6203974 62
Reetinder Sidhu and Viktor K. Prasanna. Fast regular expression matching using FPGAs. In
Proc. 9th IEEE Symp. on Field-Programmable Custom Computing Machines, pages 227238,
April 2001. DOI: 10.1109/FCCM.2001.22 54
Satnam Singh. Computing without processors. Commun. ACM, 54(8):4654, August 2011.
DOI: 10.1145/1978542.1978558 41
Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem,
and Pat Helland. e end of an architectural era: (its time for a complete rewrite). In Proc.
33rd Int. Conf. on Very Large Data Bases, pages 11501160, 2007. 4
Tabula, Inc. Spacetime architecture, 2010. URL http://www.tabula.com/. White Paper.
68
Michael B. Taylor. Is dark silicon useful? harnessing the four horsemen of the coming dark silicon
apocalypse. In Proc. 49th Design Automaton Conference, June 2012.
DOI: 10.1145/2228360.2228567 89
Jens Teubner and Louis Woods. Snowfall: Hardware stream analysis made easy. In Proc. 14th
Conf. on Databases in Business, Technology, and Web, pages 738741, March 2011. 67
Jens Teubner, Rene Mueller, and Gustavo Alonso.
Frequent item computation on a
chip.
IEEE Trans. Knowl. and Data Eng., 23(8):11691181, August 2011. DOI:
10.1109/TKDE.2010.216 46, 47, 48
Jens Teubner, Louis Woods, and Chongling Nie. Skeleton automata for FPGAs: Reconguring
without reconstructing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages
229240, May 2012. DOI: 10.1145/2213836.2213863 36, 45

100

BIBLIOGRAPHY

Ken ompson. Programming techniques: Regular expression search algorithm. Commun. ACM,
11(6):419422, 1968. DOI: 10.1145/363347.363387 52
Maamar Touiza, Gilberto Ochoa-Ruiz, El-Bay Bourennane, Abderrezak Guessoum, and Kamel
Messaoudi. A novel methodology for accelerating bitstream relocation in partially recongurable systems. Microprocessors and Microsystems, 2012. DOI: 10.1016/j.micpro.2012.07.004
30
Philipp Unterbrunner, Georgios Giannikis, Gustavo Alonso, Dietmar Fauser, and Donald Kossmann. Predictable performance for unpredictable workloads. Proc. VLDB Endowment, 2(1):
706717, 2009. 62
Pranav Vaidya and Jaehwan John Lee. A novel multicontext coarse-grained recongurable architecture (CGRA) for accelerating column-oriented databases. ACM Trans. Recong. Technol.
Syst., 4(2), May 2011. DOI: 10.1145/1968502.1968504 38
Pranav Vaidya, Jaehwan John Lee, Francis Bowen, Yingzi Du, Chandima H. Nadungodage,
and Yuni Xia. Symbiote: A recongurable logic assisted data stream management system
(RLADSMS). In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 11471150,
June 2010. DOI: 10.1145/1807167.1807304 70
Louis Woods, Jens Teubner, and Gustavo Alonso. Complex event detection at wire speed with
FPGAs. Proc. VLDB Endowment, 3(1):660669, September 2010. 58, 59, 60
Louis Woods, Jens Teubner, and Gustavo Alonso. Real-time pattern matching with FPGAs. In Proc. 27th Int. Conf. on Data Engineering, pages 12921295, April 2011. DOI:
10.1109/ICDE.2011.5767937 59, 60
Louis Woods, Jens Teubner, and Gustavo Alonso. Parallel computation of skyline queries. In
Proc. 21st IEEE Symp. on Field-Programmable Custom Computing Machines, April 2013. 77,
78, 79, 80, 81
Yi-Hua E. Yang and Viktor K. Prasanna. High-performance and compact architecture for regular
expression matching on FPGA. IEEE Trans. Comput., 61(7):10131025, July 2012. DOI:
10.1109/TC.2011.129 45, 55, 56
Yi-Hua E. Yang, Weirong Jiang, and Viktor K. Prasanna. Compact architecture for highthroughput regular expression matching on FPGA. In Proc. ACM/IEEE Symp. on Architecture for Networking and Communication Systems, pages 3039, November 2008. DOI:
10.1145/1477942.1477948 46, 55, 56, 57, 58

101

Authors Biographies
JENS TEUBNER
Jens Teubner is leading the Databases and Information Systems Group at TU Dortmund in Germany. His main research interest is data processing on modern hardware platforms, including FPGAs, multi-core processors, and hardwareaccelerated networks. Previously, Jens Teubner was a postdoctoral researcher at ETH Zurich (20082013) and IBM Research (20072008). He holds a Ph.D. in Computer Science
from TU Mnchen (Munich, Germany) and an M.S. degree
in Physics from the University of Konstanz in Germany.

LOUIS WOODS
Louis Woods is a Ph.D. student, who joined the Systems
Group at ETH Zurich in 2009. His research interests include FPGAs in the context of databases, modern hardware,
stream processing, parallel algorithms, and design patterns.
Louis Woods received both his B.S. and M.S. degree in Computer Science from ETH Zurich in 2008 and 2009, respectively.

103

Index
bitstream, 28
bitstream authentication, 85
bitstream encryption, 85
block nested loops, 77
Block RAM (BRAM), 24
carry chain, 21
CEP (complex event processing), 58
chip space, 58
Cipherbase, 83, 86
CLB, see logic island
clock signal, 11, 39, 65
cloud computing, 6
co-processor, 51
combinational logic, 9
complex event processing, 58
congurable logic block, see logic island
cryptoprocessor, 86
dark silicon, 89
data ow computing, 49
data parallelism, 42
data path, 61
de-serialization, 67
Dennard scaling, 1
deterministic automaton, 52
DFA, 52
die stacking, 30
distributed RAM, 19
DSP unit, 25
electronic stock trading, 62

elementary logic unit, 20

encryption
bitstream, 85
nite-state automaton, 51
ip-op, 11, 39
frequent item computation, 46
Fusion I/O, 94

Glacier, 62
GPU, 3, 48
graphics processor, 3, 48
hard cores, 26
hardware description language, 12
hash table, 66
high-frequency trading (HFT), 62
high-level synthesis, 32, 40
homomorphic encryption, 86
instruction set processor, 37
interconnect, 9, 22
latch, 11
line-rate performance, 60, 93
logic island, 21
look-up table (LUT), 18
many-core processor, 49
map (design ow), 28
memory, 9
merge sorter, 73
Moores Law, 1, 89
multiplexer, 9, 13

104

INDEX

NetFPGA, 93
network intrusion detection, 5, 51, 56
NFA, 52
non-deterministic automaton, 52
parameterization, 35
partial reconguration, 28, 34, 69, 75
pattern matching, 51
PCI Express, 93
pipelining, 43, 57
in FPGAs, 45
place and route (design ow), 28
PLD (programmable logic device), 17
power consumption, 2
programmable logic devices, 17
propagation delay, 10, 56, 72
push-based execution, 64
query compilation, 64
RAM
distributed, 19, 24
reconguration
partial, 28
register-transfer level, 14
regular expression, 51
replication, 42
risk management, 62
RTL (register-transfer level), 14

security, 6
sequential logic, 10
asynchronous, 11
synchronous, 11
serialization, 67
simulation, 13
skeleton automata, 36
skyline operator, 76
slice, see elementary logic unit
soft cores, 26
Solarare, 93
sorting, 71
sorting network, 71
stack machine, 88
stream partitioning, 58
stream processing, 5
synthesis, 14, 27
time-multiplexed FPGAs, 31
translate (design ow), 27
tree merge sorter, 74
trusted platform module, 83
von Neumann architecture, 83
von Neumann bottleneck, 2
XPath, 36

Routing Algorithms in Networksonchip 2014 PDF
No ratings yet
Routing Algorithms in Networksonchip 2014 PDF
411 pages
Database Modeling For Industrial Data Management Emerging Technologies and Applications Zongmin Ma PDF Download
100% (1)
Database Modeling For Industrial Data Management Emerging Technologies and Applications Zongmin Ma PDF Download
79 pages
Atmel AVR Micro Controller Primer - Programming and Interfacing
89% (9)
Atmel AVR Micro Controller Primer - Programming and Interfacing
195 pages
Digital Systems Design and Prototyping
67% (3)
Digital Systems Design and Prototyping
633 pages
Fpga-Based Hardware Accelerators: Iouliia Skliarova Valery Sklyarov
No ratings yet
Fpga-Based Hardware Accelerators: Iouliia Skliarova Valery Sklyarov
257 pages
Fpga Programming Using Verilog HDL Language
50% (2)
Fpga Programming Using Verilog HDL Language
89 pages
FPGA Tutorial
100% (1)
FPGA Tutorial
424 pages
Bare Metal CPP
100% (2)
Bare Metal CPP
199 pages
FPGA Basics
100% (8)
FPGA Basics
11 pages
FPGA - Based Accelerators of Deep LearningNetworks For Learning and Classification
100% (1)
FPGA - Based Accelerators of Deep LearningNetworks For Learning and Classification
37 pages
Cloud-Based RDF Data Management
No ratings yet
Cloud-Based RDF Data Management
105 pages
SystemC Methodologies and Applications-Müller-Kluwer
No ratings yet
SystemC Methodologies and Applications-Müller-Kluwer
356 pages
FPGA
100% (4)
FPGA
122 pages
FPGAs For Software Programmers
No ratings yet
FPGAs For Software Programmers
331 pages
My First Fpga
100% (2)
My First Fpga
50 pages
Verilog Quick Start - Practical Guide To Simulation & Synthesis in Verilog (3rd Ed.)
100% (2)
Verilog Quick Start - Practical Guide To Simulation & Synthesis in Verilog (3rd Ed.)
378 pages
Best FPGA Development Practices 2014-02-20
No ratings yet
Best FPGA Development Practices 2014-02-20
16 pages
R-32 Refrigerant Gas Pressure Temperature Chart
100% (3)
R-32 Refrigerant Gas Pressure Temperature Chart
2 pages
Mixed Signal Embedded Programming
100% (15)
Mixed Signal Embedded Programming
268 pages
LTE Implementation Uing Xilinx FPGA
100% (1)
LTE Implementation Uing Xilinx FPGA
294 pages
Embedded Systems Design With The Atmel AVR Micro Controller
100% (4)
Embedded Systems Design With The Atmel AVR Micro Controller
183 pages
Engineering Applications of FPGAs
100% (3)
Engineering Applications of FPGAs
230 pages
Principles and Structures of FPGAs
100% (1)
Principles and Structures of FPGAs
234 pages
The LLVM Compiler Framework and Infrastructure
No ratings yet
The LLVM Compiler Framework and Infrastructure
44 pages
Beginning FPGA Programming - Partie73
0% (1)
Beginning FPGA Programming - Partie73
5 pages
Verilog by Example
100% (1)
Verilog by Example
18 pages
Icse - Computer - Question Paper-4
No ratings yet
Icse - Computer - Question Paper-4
8 pages
Stephen Naylor Thomas - Practical Reasoning in Natural Language
50% (2)
Stephen Naylor Thomas - Practical Reasoning in Natural Language
484 pages
Computer Science-Research Methods
No ratings yet
Computer Science-Research Methods
25 pages
Option MCQ
No ratings yet
Option MCQ
10 pages
Control Valves Regulators Actuators Product Catalogue Polna
100% (1)
Control Valves Regulators Actuators Product Catalogue Polna
243 pages
Data Processing On Fpgas
No ratings yet
Data Processing On Fpgas
14 pages
Designing With Xilinx FPGAs
90% (10)
Designing With Xilinx FPGAs
257 pages
Signal Processing
67% (3)
Signal Processing
487 pages
Verilog Text Book
100% (1)
Verilog Text Book
431 pages
Fpga VHDL Front
0% (2)
Fpga VHDL Front
1 page
Embedded Signal Processing
67% (3)
Embedded Signal Processing
103 pages
II 2 Reservoir Rock Properties Compressibility
100% (1)
II 2 Reservoir Rock Properties Compressibility
13 pages
FPGAs - Fundamentals, Advanced Features, and Applications in Industrial Electronics PDF
100% (4)
FPGAs - Fundamentals, Advanced Features, and Applications in Industrial Electronics PDF
267 pages
Pectinase
No ratings yet
Pectinase
4 pages
Hcia Ai 1 PDF
No ratings yet
Hcia Ai 1 PDF
171 pages
Personalized Medicine Recommendation System
No ratings yet
Personalized Medicine Recommendation System
17 pages
Environmental Engg Lab
No ratings yet
Environmental Engg Lab
72 pages
How To Properly Size A Steam Trap
100% (2)
How To Properly Size A Steam Trap
4 pages
APMR Series: Packaged Air Conditioners
100% (1)
APMR Series: Packaged Air Conditioners
28 pages
Isl Cif
No ratings yet
Isl Cif
3 pages
Preparation of Cement Samples Using The POLAB APM: Verification of Suitability According To ASTM C114
No ratings yet
Preparation of Cement Samples Using The POLAB APM: Verification of Suitability According To ASTM C114
2 pages
Effects of Rotational Inertia On A Fastball
No ratings yet
Effects of Rotational Inertia On A Fastball
10 pages
Class 7 Computer Term 1 CH 2 Gimp An Introduction CW
No ratings yet
Class 7 Computer Term 1 CH 2 Gimp An Introduction CW
4 pages
Descriptors: Fundamentals of Symbian C++
No ratings yet
Descriptors: Fundamentals of Symbian C++
115 pages
Engine Ice Protect Sys Show
No ratings yet
Engine Ice Protect Sys Show
21 pages
Steel Grades 2 PDF
0% (1)
Steel Grades 2 PDF
2 pages
BIO 11 Lect - Botany Part 2 Assignment
No ratings yet
BIO 11 Lect - Botany Part 2 Assignment
2 pages
Lecture 10 ADD, SUB, Adders, Subtractors & Comparator
No ratings yet
Lecture 10 ADD, SUB, Adders, Subtractors & Comparator
76 pages
High-Performance Thermoelectrics and Challenges For Practical Devices
No ratings yet
High-Performance Thermoelectrics and Challenges For Practical Devices
11 pages
MDB Simple Strain
No ratings yet
MDB Simple Strain
4 pages
LUZ MEDINA, Checkpoint 4 Corrected
No ratings yet
LUZ MEDINA, Checkpoint 4 Corrected
3 pages
Integrated Pollution Prevention and Control (IPPC) Reference Document On Best Available Techniques For The Textiles Industry July 2003
No ratings yet
Integrated Pollution Prevention and Control (IPPC) Reference Document On Best Available Techniques For The Textiles Industry July 2003
22 pages
6fm200se X PDF
No ratings yet
6fm200se X PDF
2 pages
Socket Set Screw Types
No ratings yet
Socket Set Screw Types
1 page
Experiment 7 Electric Field Data Sheet Group 1 June 8
No ratings yet
Experiment 7 Electric Field Data Sheet Group 1 June 8
6 pages
Olefini Catalogue 2015 - Industrial
No ratings yet
Olefini Catalogue 2015 - Industrial
2 pages
BOSCH Automotive Handbook - 2022
91% (33)
BOSCH Automotive Handbook - 2022
2,040 pages
Lab # 4-Head Loss in Pipes - Fillable
No ratings yet
Lab # 4-Head Loss in Pipes - Fillable
4 pages
Assignment 1 - 14-09-2021
No ratings yet
Assignment 1 - 14-09-2021
2 pages
Hands On Electronics
100% (14)
Hands On Electronics
228 pages
Copy of Modern Automotive Technology 7th Ed
100% (10)
Copy of Modern Automotive Technology 7th Ed
1,631 pages
Essential Calculus Skills Practice Workbook With Full Solutions
95% (84)
Essential Calculus Skills Practice Workbook With Full Solutions
528 pages
100 Electronic Projects With Circuit Diagram PDF
97% (33)
100 Electronic Projects With Circuit Diagram PDF
105 pages
Automotive Electrical Diagnosis Course
100% (29)
Automotive Electrical Diagnosis Course
304 pages
Understanding Automotive Electronics Se
100% (13)
Understanding Automotive Electronics Se
617 pages
Engine Mechanical Diagnosis Course
100% (17)
Engine Mechanical Diagnosis Course
147 pages
Basic Engineering Mathematics
89% (38)
Basic Engineering Mathematics
301 pages
Automotive Electronics Design
92% (12)
Automotive Electronics Design
270 pages
Troubleshooting Automotive Computers, Sensors & Network PDF
100% (23)
Troubleshooting Automotive Computers, Sensors & Network PDF
97 pages
All-In-One Electronics Guide Your Complete Ultimate Guide To Understanding and Utilizing Electronics!
96% (24)
All-In-One Electronics Guide Your Complete Ultimate Guide To Understanding and Utilizing Electronics!
469 pages
Make - Electronics 3rd Edition PDF
47% (17)
Make - Electronics 3rd Edition PDF
351 pages
Ecu Repair
90% (62)
Ecu Repair
146 pages
Arduino For Beginners EBOOK
100% (11)
Arduino For Beginners EBOOK
134 pages
BasicElectronics BernardGrob PDF
92% (12)
BasicElectronics BernardGrob PDF
786 pages
The Art of Electronics
91% (22)
The Art of Electronics
314 pages
The Giant Book of Electronics Projects
98% (49)
The Giant Book of Electronics Projects
496 pages
Make Your Own Diagnostic Equipment Mandy Concepcion PDF
96% (24)
Make Your Own Diagnostic Equipment Mandy Concepcion PDF
129 pages
All The Math You Missed - But Need To Know For Graduate School
100% (35)
All The Math You Missed - But Need To Know For Graduate School
417 pages
Automotive Mechatronics PDF
97% (29)
Automotive Mechatronics PDF
549 pages
How To Diagnose and Repair Automotive Electrical Systems
83% (29)
How To Diagnose and Repair Automotive Electrical Systems
135 pages
(Industrial and Applied Mathematics) Martin Brokate, Pammy Manchanda, Abul Hasan Siddiqi - Calculus For Scientists and Engineers (2019, Springer)
100% (14)
(Industrial and Applied Mathematics) Martin Brokate, Pammy Manchanda, Abul Hasan Siddiqi - Calculus For Scientists and Engineers (2019, Springer)
655 pages
Smith D. - Arduino For Complete Idiots - 2017
90% (21)
Smith D. - Arduino For Complete Idiots - 2017
175 pages
Automotive Electronics 1
90% (63)
Automotive Electronics 1
108 pages
Power Electronics Devices and Circuits Second Edition PDF
100% (19)
Power Electronics Devices and Circuits Second Edition PDF
383 pages
How To Tune and Modify Automotive Engine Management Systems
98% (61)
How To Tune and Modify Automotive Engine Management Systems
338 pages
Automotive Electronics - Vol 2
100% (18)
Automotive Electronics - Vol 2
40 pages
Hands On Electronics A Practical Introduction To Analog and Digital Circuits by Daniel M Kaplan and Christopher G White
100% (10)
Hands On Electronics A Practical Introduction To Analog and Digital Circuits by Daniel M Kaplan and Christopher G White
228 pages
Encyclopedia of Electronic Circuits Volume 1
95% (19)
Encyclopedia of Electronic Circuits Volume 1
800 pages
Programming the BeagleBone
From Everand
Programming the BeagleBone
Chavan Yogesh
No ratings yet
The FPGA Programming Handbook: An essential guide to FPGA design for transforming ideas into hardware using SystemVerilog and VHDL
From Everand
The FPGA Programming Handbook: An essential guide to FPGA design for transforming ideas into hardware using SystemVerilog and VHDL
Frank Bruno
No ratings yet
BeagleBone Black Cookbook: Over 60 recipes and solutions for inventors, makers, and budding engineers to create projects using the BeagleBone Black
From Everand
BeagleBone Black Cookbook: Over 60 recipes and solutions for inventors, makers, and budding engineers to create projects using the BeagleBone Black
Charles A. Hamilton
No ratings yet
Exploring BeagleBone: Tools and Techniques for Building with Embedded Linux
From Everand
Exploring BeagleBone: Tools and Techniques for Building with Embedded Linux
Derek Molloy
4/5 (1)
USB Embedded Hosts: The Developer's Guide
From Everand
USB Embedded Hosts: The Developer's Guide
Jan Axelson
3/5 (1)
Digital Filters
From Everand
Digital Filters
Richard W. Hamming
4/5 (1)
Fundamental Concepts of MATLAB Programming: From Learning the Basics to Solving a Problem with MATLAB
From Everand
Fundamental Concepts of MATLAB Programming: From Learning the Basics to Solving a Problem with MATLAB
Kulwinder Singh Parmar
No ratings yet
Fault Tolerant & Fault Testable Hardware Design
From Everand
Fault Tolerant & Fault Testable Hardware Design
Parag K. Lala
5/5 (2)
Optimal Control: Linear Quadratic Methods
From Everand
Optimal Control: Linear Quadratic Methods
Brian D. O. Anderson
4/5 (2)
Hardware Description Language Demystified: Explore Digital System Design Using Verilog HDL and VLSI Design Tools
From Everand
Hardware Description Language Demystified: Explore Digital System Design Using Verilog HDL and VLSI Design Tools
Rajkumar Sarma
No ratings yet
Bare-Metal Embedded C Programming: Develop high-performance embedded systems with C for Arm microcontrollers
From Everand
Bare-Metal Embedded C Programming: Develop high-performance embedded systems with C for Arm microcontrollers
Israel Gbati
No ratings yet
Digital Circuit Simulation Using Excel
From Everand
Digital Circuit Simulation Using Excel
Anthony Mazzurco
No ratings yet
Mastering C++ Network Automation
From Everand
Mastering C++ Network Automation
Justin Barbara
No ratings yet
Deep Learning on Microcontrollers: Learn how to develop embedded AI applications using TinyML (English Edition)
From Everand
Deep Learning on Microcontrollers: Learn how to develop embedded AI applications using TinyML (English Edition)
Atul Krishna Gupta
5/5 (1)
Professional Embedded ARM Development
From Everand
Professional Embedded ARM Development
James A. Langbridge
No ratings yet
Learning BeagleBone
From Everand
Learning BeagleBone
Hunyue Yau
No ratings yet
Multicore DSP: From Algorithms to Real-time Implementation on the TMS320C66x SoC
From Everand
Multicore DSP: From Algorithms to Real-time Implementation on the TMS320C66x SoC
Naim Dahnoun
No ratings yet
Implementing a Cpu using Fpga
From Everand
Implementing a Cpu using Fpga
Othman Ahmad
No ratings yet
Safety of Computer Architectures
From Everand
Safety of Computer Architectures
Jean-Louis Boulanger
No ratings yet
Stack Computers: The New Wave
From Everand
Stack Computers: The New Wave
Philip Koopman
No ratings yet
Mastering Embedded C: The Ultimate Guide to Building Efficient Systems
From Everand
Mastering Embedded C: The Ultimate Guide to Building Efficient Systems
Robert Johnson
No ratings yet
Embedded Systems Programming with C++: Real-World Techniques
From Everand
Embedded Systems Programming with C++: Real-World Techniques
Robert Johnson
No ratings yet
General Purpose Computing On Graphics Processing Units: Utilizing the Graphics Processing Unit (GPU) to carry out calculations that are normally performed by the CPU
From Everand
General Purpose Computing On Graphics Processing Units: Utilizing the Graphics Processing Unit (GPU) to carry out calculations that are normally performed by the CPU
Fouad Sabry
No ratings yet
Analog Dialogue Volume 46, Number 1: Analog Dialogue, #5
From Everand
Analog Dialogue Volume 46, Number 1: Analog Dialogue, #5
Analog Dialogue
5/5 (1)
Finite-state machine A Complete Guide
From Everand
Finite-state machine A Complete Guide
Gerardus Blokdyk
No ratings yet
Logic synthesis Standard Requirements
From Everand
Logic synthesis Standard Requirements
Gerardus Blokdyk
No ratings yet