Introduction
to Reconfigurable
Computing
Architectures, Algorithms,
and Applications
by
Christophe Bobda
University of Kaiserslautern, Germany
A C.I.P. Catalogue record for this book is available from the Library of Congress.
Published by Springer,
P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
www.springer.com
Foreword vii
Preface xiii
About the Author xv
List of Figures xvii
List of Tables xxv
1. INTRODUCTION 1
1 General Purpose Computing 2
2 Domain-Specific Processors 5
3 Application-Specific Processors 6
4 Reconfigurable Computing 8
5 Fields of Application 9
6 Organization of the Book 11
2. RECONFIGURABLE ARCHITECTURES 15
1 Early Work 15
2 Simple Programmable Logic Devices 26
3 Complex Programmable Logic Devices 28
4 Field Programmable Gate Arrays 28
5 Coarse-Grained Reconfigurable Devices 49
6 Conclusion 65
3. IMPLEMENTATION 67
1 Integration 68
2 FPGA Design Flow 72
3 Logic Synthesis 75
4 Conclusion 98
9. APPLICATIONS 285
1 Pattern Matching 286
2 Video Streaming 294
3 Distributed Arithmetic 298
4 Adaptive Controller 307
5 Adaptive Cryptographic Systems 310
6 Software Defined Radio 313
7 High-Performance Computing 315
8 Conclusion 317
References 319
Appendices 336
A Hints to Labs 337
1 Prerequisites 338
2 Reorganization of the Project Video8 non pr 338
B Party 345
C Quick Part-Y Tutorial 349
About the Author
Dr. Bobda received the Licence degree in mathematics from the Univer-
sity of Yaounde, Cameroon, in 1992, the diploma of computer science and the
Ph.D. degree (with honors) in computer science from the University of Pader-
born in Germany in 1999 and 2003, respectively. In June 2003, he joined the
department of computer science at the University of Erlangen-Nuremberg in
Germany as post doc. In October 2005, he moved to the University of Kaiser-
slautern as Junior Professor, where he leads the working group Self-Organizing
Embedded Systems in the department of computer science. His research inter-
ests include reconfigurable computing, self-organization in embedded systems,
multiprocessor on chip and adaptive image processing.
Dr. Bobda received the Best Dissertation Award 2003 from the University of
Paderborn for his work on synthesis of reconfigurable systems using temporal
partitioning and temporal placement.
Dr. Bobda is member of The IEEE Computer Society, the ACM and the
GI. He has also served in the program committee of several conferences (FPL,
FPT, RAW, RSP, ERSA, DRS) and in the DATE executive committee as pro-
ceedings chair (2004, 2005, 2006, 2007). He served as reviewer of several
journals (IEEE TC, IEEE TVLSI, Elsevier Journal of Microprocessor and Mi-
crosystems, Integration the VLSI Journal ) and conferences (DAC, DATE, FPL,
FPT, SBCCI, RAW, RSP, ERSA).
INTRODUCTION
A Von Neumann (VN) computer consists of three main components:
- A memory for storing program and data. Harvard architectures contain two
parallel accessible memories for storing program and data separately.
- A control unit (also called control path) featuring a program counter that
holds the address of the next instruction to be executed.
- An arithmetic and logic unit (also called data path) in which instructions
are executed.
Figure 1.2. Sequential and pipelined execution of instructions on a Von Neumann Computer
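The components above can be sketched as a toy interpreter in which program and data share a single memory, as in the Von Neumann model. The opcodes and instruction format below are hypothetical, chosen only for illustration:

```python
# Minimal sketch of a Von Neumann machine: one memory holds both the
# program and the data; a program counter (pc) drives the control unit,
# and an accumulator plays the role of the data path. The opcodes and the
# (op, arg) instruction format are made up for this illustration.

def run(memory):
    pc, acc = 0, 0                      # program counter and accumulator
    while True:
        op, arg = memory[pc]            # fetch and decode
        pc += 1
        if op == "LOAD":                # acc <- memory[arg]
            acc = memory[arg]
        elif op == "ADD":               # acc <- acc + memory[arg]
            acc += memory[arg]
        elif op == "STORE":             # memory[arg] <- acc
            memory[arg] = acc
        elif op == "HALT":
            return acc

# Program and data share the same memory: addresses 0-3 hold the code,
# addresses 4-5 hold the data.
mem = [("LOAD", 4), ("ADD", 5), ("STORE", 5), ("HALT", 0), 3, 4]
print(run(mem))  # 3 + 4 = 7
```

Each loop iteration performs one full fetch-decode-execute cycle, which is exactly the sequential behaviour depicted in figure 1.2.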
2. Domain-Specific Processors
A domain-specific processor is a processor tailored for a class of algorithms.
As mentioned in the previous section, the data path is tailored for an optimal
execution of a common set of operations that mostly characterizes the algo-
rithms in the given class. Also, memory access is reduced as much as possible.
Digital Signal Processors (DSPs) are among the most widely used domain-specific
processors.
A DSP is a specialized processor used to speed up the computation of repetitive,
numerically intensive tasks in signal processing areas such as telecommunication,
multimedia, automobile, radar, sonar, seismic and image processing.
The most often cited feature of DSPs is their ability to perform one or more
multiply-accumulate (MAC) operations in a single cycle. Usually, MAC operations
have to be performed on a huge set of data. In a MAC operation, data
are first multiplied and then added to an accumulated value. A normal VN
computer would perform a MAC in 10 steps: the first instruction (multiply)
would be fetched, then decoded, then the operands would be read and multiplied,
and the result would be stored back; next, the accumulate instruction would be
read, the result stored in the previous step would be read again and added to
the accumulated value, and the result would be stored back. DSPs avoid those
steps by using specialized hardware that directly performs the addition after
the multiplication without having to access the memory.
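As a sketch, the MAC inner loop that DSPs accelerate is the core of dot-product and filtering kernels. The plain Python below only illustrates the operation; on a DSP each iteration of the loop body would take a single cycle:

```python
# A MAC operation multiplies two operands and adds the product to a
# running accumulator in one step. Repeated over a data set, this is the
# inner loop of dot products and digital filters.

def mac_dot(xs, ys):
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y        # one fused multiply-accumulate per data pair
    return acc

print(mac_dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```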
Because many DSP algorithms involve performing repetitive computations,
most DSP processors provide special support for efficient looping. Often a
special loop or repeat instruction is provided, which allows a loop implementation
without expending any instruction cycles for updating and testing the loop
counter or branching back to the top of the loop. DSPs are also customized for
data of a given width, according to the application domain. For example, if a
DSP is to be used for image processing, then pixels have to be processed. If the
pixels are represented in the Red Green Blue (RGB) system, where each colour is
represented by a byte, then an image processing DSP will not need more than
an 8-bit data path. Obviously, such an image processing DSP cannot be reused
for applications requiring 32-bit computation.
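The effect of a fixed data width can be sketched in software by masking every result to the width of the data path. The helper below is illustrative, not an actual DSP instruction:

```python
# Emulating an 8-bit data path: every result is truncated to one byte.
# A pixel-processing DSP would behave like add8 below; values wrap around
# at 256, which is why such a chip cannot simply be reused for 32-bit
# arithmetic.

MASK8 = 0xFF  # 8-bit data path

def add8(a, b):
    return (a + b) & MASK8

print(add8(200, 100))   # 300 does not fit in 8 bits and wraps to 44
```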
3. Application-Specific Processors
Although DSPs provide a degree of application-specific features, such
as MAC units and data width optimization, they still follow the VN approach
and, therefore, remain sequential machines with limited performance. If
a processor has to be used for only one application, which is known and fixed
in advance, then the processing unit could be designed and optimized for that
particular application. In this case, we say that ‘the hardware adapts itself to
the application’.
In multimedia processing, processors are usually designed to perform the
compression of video frames according to a video compression standard. Such
processors cannot be used for anything other than compression. Even in compression,
the standard must exactly match the one implemented in the processor.
A processor designed for only one application is called an Application-Specific
Processor (ASIP). In an ASIP, the instruction cycles (IR, D, EX, W)
are eliminated. The instruction set of the application is directly implemented
in hardware. Input data stream into the processor through its inputs, the processor
performs the required computation, and the results can be collected at the
outputs of the processor. ASIPs are usually implemented as single chips called
Application-Specific Integrated Circuits (ASICs).
Algorithm 1
if a < b then
    d = a + b
    c = a · b
else
    d = b + 1
    c = a − 1
end if
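For concreteness, Algorithm 1 can be written as an ordinary sequential function. This sketch shows the VN view of the computation; the ASIP discussed next evaluates both branches in parallel and selects the results with the comparison a < b:

```python
# Algorithm 1 as a sequential Python function. A VN processor executes
# one branch instruction by instruction; an ASIP would compute a+b, a*b,
# b+1 and a-1 concurrently and multiplex the outputs on a < b.

def algorithm1(a, b):
    if a < b:
        d = a + b
        c = a * b
    else:
        d = b + 1
        c = a - 1
    return c, d

print(algorithm1(2, 3))   # a < b: c = 2*3 = 6, d = 2+3 = 5
```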
With tcycle being the duration of one instruction cycle, the three instructions
of a branch, each requiring five cycles (IR, D, R, EX, W), will be executed in
3 ∗ 5 ∗ tcycle = 15 ∗ tcycle without pipelining.
Let us now consider the implementation of the same algorithm in an ASIP.
We can implement the instructions d = a + b and c = a ∗ b in parallel. The
same is also true for d = b + 1 and c = a − 1, as illustrated in figure 1.3.
The four instructions a+b, a∗b, b+1 and a−1, as well as the comparison a < b,
will be executed in parallel in a first stage. Depending on the value of the
comparison a < b, the correct values of the previous stage computations will be
assigned to c and d as defined in the program. Let tmax be the longest time
needed by a signal to move from one point to another in the physical implementation
of the processor (this will happen on the path input-multiply-multiplex).
tmax is also called the cycle time of the ASIP processor. For two inputs a and
b, the results c and d can be computed in time tmax. The VN processor can
compete with this ASIP only if 15 ∗ tcycle < tmax, i.e. tcycle < tmax/15: the
VN clock must be at least 15 times faster than the ASIP to be competitive.
Obviously, we have assumed a VN without a pipeline. The case of a VN computer
with a pipeline can be treated in the same way.
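The break-even condition can be checked numerically. The timing values below are invented purely for illustration:

```python
# Hypothetical timings (arbitrary units) to illustrate the break-even
# condition between the sequential VN processor and the ASIP.
t_cycle = 2.0              # assumed VN instruction-cycle time
t_max = 40.0               # assumed ASIP cycle time

vn_time = 15 * t_cycle     # 3 instructions x 5 steps, no pipelining
asip_time = t_max

# The VN competes only if its total time does not exceed the ASIP's,
# i.e. t_cycle <= t_max / 15.
print(vn_time <= asip_time)   # 30.0 <= 40.0 -> True
```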
ASIPs use a spatial approach to implement only one application. The func-
tional units needed for the computation of all parts of the application must be
available on the surface of the final processor. This kind of computation is
called ‘Spatial Computing’.
Once again, an ASIP that is built to perform a given computation cannot be
used for tasks other than those for which it was originally designed.
4. Reconfigurable Computing
From the discussion in the previous sections, where we studied three
different kinds of processing units, we can identify two main criteria to
characterize processors: flexibility and performance.
The VN computers are very flexible because they are able to compute any
kind of task. This is the reason why the terminology GPP (General Purpose
Processor) is used for the VN machine. They do not bring much
performance, because they cannot compute in parallel. Moreover, the five
steps (IR, D, R, EX, W) needed to perform one instruction become a major
drawback, in particular if the same instruction has to be executed on huge
sets of data. Flexibility is possible because ‘the application must always
adapt to the hardware’ in order to be executed.
ASIPs bring much more performance because they are optimized for a particular
application. The instruction set required for that application can then be
built into a chip. Performance is possible because ‘the hardware is always
adapted to the application’.
If we consider two scales, one for performance and the other for flexibility,
then the VN computers can be placed at one end and the ASIPs at the
other end, as illustrated in figure 1.4.
Between the GPPs and the ASIPs lies a large number of processors. Depending
on their performance and their flexibility, they can be placed near to or
far from the GPPs on the two scales.
Given this, how can we choose a processor adapted to our computation
needs? If the range of applications for which the processor will be used is large,
or if it is not even defined at all, then a GPP should be chosen. However, if
the processor is to be used for one application, as is the case in embedded
systems, then the best approach will be to design a new ASIP optimized for
that application.
Ideally, we would like to have the flexibility of the GPP and the performance
of the ASIP in the same device. We would like to have a device able ‘to adapt
to the application’ on the fly. We call such a hardware device a reconfigurable
hardware or reconfigurable device or reconfigurable processing unit (RPU)
in analogy the Central Processing Unit (CPU). Following this, we provide a
definition of the term reconfigurable computing. More on the taxonomy in
reconfigurable computing can be found in [111] [112].
Definition 1.2 (Reconfigurable Computing) Reconfigurable com-
puting is defined as the study of computation using reconfigurable devices.
For a given application, at a given time, the spatial structure of the device
will be modified so as to use the best computing approach to speed up that
application. If a new application has to be computed, the device structure will
be modified again to match the new application. Contrary to the VN computers,
which are programmed by a set of instructions to be executed sequentially,
the structure of a reconfigurable device is changed by modifying all or part
of the hardware at compile-time or at run-time, usually by downloading a so-
called bitstream into the device.
Definition 1.3 (Configuration, Reconfiguration) Configuration
and reconfiguration denote the process of changing the structure of a reconfigurable
device at start-up time and at run-time, respectively.
Progress in reconfiguration has been amazing in the last two decades. This is
mostly due to the wide acceptance of Field Programmable Gate Arrays (FPGAs),
which are now established as the most widely used reconfigurable devices.
The number of workshops, conferences and meetings dealing with this topic
has also grown with the FPGA evolution. Reconfigurable devices can be
used in a wide range of fields, some of which we list in the next section.
5. Fields of Application
In this section, we present a non-exhaustive list of fields where
the use of reconfiguration can be of great interest. Because the field is still
growing, several new fields of application are likely to emerge in the
future.