Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications
CHRISTOPHE BOBDA
University of Kaiserslautern, Germany
Foreword to be written by Peter Cheung and Reiner Hartenstein
Contents

Foreword
Preface
List of Figures
List of Tables

1. INTRODUCTION
   1 General Purpose Computing
   2 Domain Specific Processors
   3 Application Specific Processors
   4 Reconfigurable Computing
   5 Fields of Application
   6 Organization of the Book

2. RECONFIGURABLE ARCHITECTURES
   1 Early Work
   2 Simple Programmable Logic Devices
   3 Complex Programmable Logic Devices
   4 Field Programmable Gate Arrays
   5 Coarse-Grained Reconfigurable Devices
   6 Conclusion

3. IMPLEMENTATION
   1 Integration
   2 FPGA Design Flow
   3 Logic Synthesis
   4 Conclusion

9. APPLICATIONS
   1 Pattern Matching
   2 Video Streaming
   3 Distributed Arithmetic
   4 Adaptive Controller
   5 Adaptive Cryptographic Systems
   6 Software Defined Radio
   7 High Performance Computing
   8 Conclusion

References
Appendices
A Hints to Labs
   1 Prerequisites
   2 Reorganization of the Project Video8 non pr
B Party
C Quick Part-Y Tutorial
INTRODUCTION
instruction is fetched from the memory at the address specified in the program counter and decoded. The required operands are then collected from the memory before the instruction can be executed. After execution, the result is written back into the memory. In this process, the control path is in charge of setting all signals necessary to read from and write to the memory: it interprets the instructions and sets the data path's signals accordingly, so that the data path performs the desired operation.
In general, the execution of an instruction on a Von Neumann computer is done in five cycles: Instruction Read (IR), in which an instruction is fetched from the memory; Decoding (D), in which the meaning of the instruction is determined and the operands are located; Read Operands (R), in which the operands are read from the memory; Execute (EX), in which the instruction is executed with the operands read; and Write Result (W), in which the result of the execution is stored back to the memory. In each of those five cycles, only the part of the hardware involved in the computation is activated; the rest remains idle. For example, if the IR cycle is performed, the program counter is activated to provide the address of the instruction, the memory is addressed, and the instruction register, which stores the instruction before decoding, is also activated. Apart from those three units (program counter, memory and instruction register), all the other units remain idle. Fortunately, the structure of instructions allows several of them to occupy the idle parts of the processor, thus increasing the computation throughput.
the impact of hazards in the computation. Those hazards can be reduced, for example, by the use of a Harvard architecture.
Figure 1.2. Sequential and pipelined execution of instructions on a Von Neumann Computer
If t_cycle is the time needed to execute one cycle, then the execution of one instruction requires 5 ∗ t_cycle. If three instructions have to be executed, the time needed to perform their execution without pipelining is 15 ∗ t_cycle, as illustrated in figure 1.2. Using pipelining, the ideal time needed to perform those three instructions, when no hazards have to be dealt with, is 7 ∗ t_cycle. In reality, hazards must be taken into account; this increases the overall computation time to 9 ∗ t_cycle.
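The cycle counts above can be checked with a small throughput model. This is only a sketch: the stall count is a free parameter standing in for hazard penalties, not a model of any particular pipeline.

```python
def execution_cycles(n_instructions, stages=5, stalls=0):
    """Total cycles for n instructions on a simple in-order machine.

    Without pipelining, every instruction passes through all stages
    exclusively; with pipelining, one instruction completes per cycle
    once the pipeline is full, plus any stall cycles caused by hazards.
    """
    sequential = n_instructions * stages
    pipelined = stages + (n_instructions - 1) + stalls
    return sequential, pipelined

# Three instructions, five stages: 15 cycles sequentially, 7 ideally
# pipelined, and 9 when two hazard stalls are charged (as in the text).
print(execution_cycles(3))            # (15, 7)
print(execution_cycles(3, stalls=2))  # (15, 9)
```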
The main advantage of the Von Neumann computing paradigm is its flexi-
bility, since it can be used to program almost all existing algorithms. However,
each algorithm can be implemented on a Von Neumann computer only if it is
coded according to the Von Neumann rules. We say in this case that "The al-
gorithm must adapt itself to the hardware". Also due to the temporal use of the
same hardware for a wide variety of applications, Von Neumann computation
is often characterized as "temporal computation".
Because all algorithms must be sequentially programmed in order to run on a Von Neumann computer, many algorithms cannot be executed with their best potential performance. Algorithms that perform the same set of inherently parallel operations on a huge set of data are not good candidates for implementation on a Von Neumann machine.
If the class of algorithms to be executed is known in advance, then the pro-
cessor can be modified to better match the computation paradigm of that class
of application. In this case, the data path will be tailored to always execute the
same set of operations, thus making the memory access for instruction fetching
as well as the instruction decoding redundant. Moreover, the memory access
for data fetching and storing can also be avoided if the sources and destinations
of data are known in advance. A bus could for instance provide sensor data to
the processor, which in turn sends back the computed data to the actuators
using another bus.
it cannot be used anymore to implement applications other than those for which it was optimally designed.
Algorithm 1
if a < b then
    d = a + b
    c = a · b
else
    d = b + 1
    c = a − 1
end if
With t_cycle being the instruction cycle time, the program will be executed in 3 ∗ 5 ∗ t_cycle = 15 ∗ t_cycle without pipelining.
Let us now consider the implementation of the same algorithm in an ASIP.
We can implement the instructions d = a + b and c = a · b in parallel. The same is also true for d = b + 1 and c = a − 1, as illustrated in figure 1.3.
The four instructions a + b, a · b, b + 1 and a − 1, as well as the comparison a < b, will be executed in parallel in a first stage. Depending on the value of
the comparison a < b, the correct values of the previous stage computations will be assigned to c and d as defined in the program. Let t_max be the longest time needed by a signal to move from one point to another in the physical implementation of the processor (this will happen on the path input-multiply-multiplex). t_max is also called the cycle time of the ASIP processor. For two inputs a and b, the results c and d can be computed in time t_max. The Von Neumann processor can compete with this ASIP only if 15 ∗ t_cycle < t_max, i.e. t_cycle < t_max/15: its cycle must be at least 15 times shorter than that of the ASIP to be competitive. Obviously, we have assumed a Von Neumann computer without a pipeline; the case of a pipelined Von Neumann computer can be treated in the same way.
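The spatial evaluation of Algorithm 1 can be mimicked in software. This is a behavioural sketch only: in the ASIP, all five operators are physical units evaluating simultaneously, and the final selection is done by multiplexers; the function name is ours.

```python
def spatial_algorithm1(a, b):
    # Stage 1: the four arithmetic operations and the comparison are
    # all computed concurrently by independent functional units.
    sum_ab = a + b
    prod_ab = a * b
    inc_b = b + 1
    dec_a = a - 1
    sel = a < b
    # Stage 2: multiplexers driven by the comparison pick the results.
    d = sum_ab if sel else inc_b
    c = prod_ab if sel else dec_a
    return c, d

print(spatial_algorithm1(2, 3))  # a < b holds: c = 2*3 = 6, d = 2+3 = 5
```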
ASIPs use a spatial approach to implement only one application. The func-
tional units needed for the computation of all parts of the application must be
available on the surface of the final processor. This kind of computation is
called "Spatial Computing".
Once again, an ASIP built to perform a given computation cannot be used for tasks other than those for which it has been originally designed.
4. Reconfigurable Computing
From the discussion in the previous sections, where we studied three differ-
ent kinds of processing units, we can identify two main means to characterize
processors: flexibility and performance.
The Von Neumann computers are very flexible, because they are able to compute any kind of task; this is why the terminology GPP (General Purpose Processor) is used for the Von Neumann machine. They do not provide much performance, however, because they cannot compute in parallel. Moreover, the five steps (IR, D, R, EX, W) needed to perform one instruction become a major drawback, in particular if the same instruction has to be executed on huge sets of data. Flexibility is possible because "the application must always adapt to the hardware" in order to be executed.
ASIPs provide high performance because they are optimized for a particular application. The instruction set required for that application can then be built into a chip. Performance is possible because "the hardware is always adapted to the application".
If we consider two scales, one for performance and the other for flexibility, then the Von Neumann computers can be placed at one end and the ASIPs at the other, as illustrated in figure 1.4.
Between the GPPs and the ASIPs lies a large number of processors. Depending on their performance and flexibility, they can be placed nearer to or farther from the GPPs on the two scales.
Given this, how can we choose a processor adapted to our computation needs? If the range of applications for which the processor will be used is large, or if it is not even defined at all, then a GPP should be chosen. However, if the processor is to be used for one application, as is the case in embedded systems, then the best approach is to design a new ASIP optimized for that application.
Ideally, we would like to have the flexibility of the GPP and the performance of the ASIP in the same device. We would like to have a device able "to adapt to the application" on the fly. We call such a hardware device reconfigurable hardware, a reconfigurable device or a reconfigurable processing unit (RPU), in analogy to the Central Processing Unit (CPU). Following this, we provide a definition of the term reconfigurable computing. More on the taxonomy of reconfigurable computing can be found in [111] [112].
For a given application, at a given time, the spatial structure of the device will be modified so as to use the best computing approach to speed up that application. If a new application has to be computed, the device structure will be modified again to match the new application. Contrary to the Von Neumann computers, which are programmed by a set of instructions to be executed sequentially, the structure of a reconfigurable device is changed by modifying all or part of the hardware at compile-time or at run-time, usually by downloading a so-called bitstream into the device.
Progress in reconfiguration has been amazing in the last two decades. This is mostly due to the wide acceptance of FPGAs (Field Programmable Gate Arrays), which are now established as the most widely used reconfigurable devices. The number of workshops, conferences and meetings dealing with this topic has also grown, following the FPGA evolution. Reconfigurable devices can be used in a wide number of fields, some of which we list in the next section.
5. Fields of Application
In this section, we present a non-exhaustive list of fields where the use of reconfiguration can be of great interest. Because the field is still
fixed devices like navigation systems, music and video players, as well as TV devices, are available in cars and at home. All those devices are equipped with electronic control units that run the desired application on the desired device. Furthermore, many of the devices are used in a time-multiplexed fashion: it is difficult to imagine someone playing mp3 songs while watching a video clip and taking a phone call. For a group of devices used exclusively in a time-multiplexed way, a single electronic control unit can be used. Whenever a service is needed, the control unit is connected to the corresponding device at the correct location and reconfigured with the adequate configuration. For instance, a domestic mp3 player, a domestic DVD player, a car mp3 player, a car DVD player, a mobile mp3 player and a mobile video player can all share the same electronic unit, if they are always used by the same person. However, if several persons have access to the same devices (as can happen in a household with several people), then sharing the electronic control unit will be rather difficult. In the single-user case, the user just needs to remove the control unit from the domestic devices and connect it to a car device when going to work. The control unit can be removed from the car and connected to a mobile device if the user decides to go for a walk. Coming back home, the electronic control unit is removed from the mobile device and used for watching video.
RECONFIGURABLE ARCHITECTURES
1. Early Work
Attempts to build a flexible hardware structure that can be dynamically modified at run-time to compute a desired function are almost as old as the development of other computing paradigms. In order to overcome the inflexibility of the first computer, the ENIAC (Electronic Numerical Integrator and Computer), which could be programmed only by hardwiring an algorithm, John von
Neumann proposed a first universal architecture made up of three main blocks (memory, data path and control path), able to run any given and well coded program. The Von Neumann approach was not intended to provide the most efficient hardwired structure for each application, but a platform to run all types of programs without spending too much effort on the rearrangement of the underlying hardware. This effort was then pursued by other researchers, with the goal of always having the best computation structure for a given application. We present some of those investigations in this section.
entailed four amplifiers and associated input logic for signal inversion, amplification, or high-speed storage. The second basic block, used for combinatorial logic, was made up of ten diodes and four output drivers.
The connection between the modules is done by a wiring harness (figure 2.3). With this architecture, reconfiguration was done manually, by replacing
some modules on the motherboard (figure 2.4) or by changing a wiring harness for a new connection among the existing modules.
The fix-plus machine was intended to be used for accelerating the eigenvalue computation of matrices, and showed a speed gain of 2.5 to 1000 over the IBM 7090 [80] [79]. The technology available at that time, however, made the use of the fix-plus machine difficult: reconfiguration had to be done manually, and substantial software efforts were required to implement applications.
The Rammig machine was heavily used as an emulation platform, which is also one of the largest application fields of today's FPGAs. It was therefore possible to control the complete behaviour of a circuit under observation from the software side. This was done by buffering module outputs in registers and transferring the contents of these registers to the host before clocking the system. The system was implemented with a 128x192 crossbar and was given the name META-46 GOLDLAC (figure 2.7).
Figure 2.8. General architecture of the XPuter as implemented in the Map oriented machine
(MOM-3) prototype.
The goal was to have a very high degree of programmable parallelism in the hardware, at the lowest possible level, in order to obtain performance not possible with the Von Neumann computers. Instead of sequencing instructions, the Xputer sequences data, thus exploiting the regularity in the data dependencies of some classes of applications, such as image processing, where repetitive processing is performed on a large amount of data. An Xputer consists of three main parts: the data sequencer, the data memory and the reconfigurable ALU (rALU), which permits the run-time configuration of communication at levels below the instruction set level. Within a loop, data to be processed were accessed via a data structure called a scan window. Data manipulation was done by the rALU, which had access to many scan windows. The most essential part of the data sequencer was the Generic Address Generator (GAG), which was able to produce address sequences corresponding to the data of up to three nested loops. A rALU subnet, which could be configured to perform all computations on the data of a scan window, was generally required for each level of a nested loop.
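The address sequences produced for nested loops can be sketched as follows. This is a hypothetical model: the real GAG was a hardware unit, and the stride/count parameterization here is purely our own illustration.

```python
from itertools import product

def gag_addresses(base, strides, counts):
    """Yield the addresses scanned by up to three nested loops.

    Loop level k (outermost first) runs counts[k] iterations, each
    advancing the address by strides[k], mimicking a scan window walk
    over a data array in memory.
    """
    for index in product(*(range(c) for c in counts)):
        yield base + sum(i * s for i, s in zip(index, strides))

# A 2x3 scan over a row-major array with rows of length 3:
print(list(gag_addresses(0, strides=(3, 1), counts=(2, 3))))
# [0, 1, 2, 3, 4, 5]
```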
The general XPuter architecture is presented in figure 2.8. Shown is the realization of the XPuter as a Map Oriented Machine (MoM). The overall system was made up of a host processor, whose memory was accessible by the MoM.
The rALU subnets received their data directly from local memory or from
the host main memory via the MoM bus. Communication was also possible
among the rALUs via direct serial connections. Several XPuters could also be
connected in order to provide more performance.
To execute a program, the hardware first had to be configured. If no reconfiguration took place at run-time, then only the data memory was necessary. Otherwise, a configuration memory was required to hold all the configurations to be used at run-time.
The basic building block of the reconfigurable ALU was the so-called reconfigurable Datapath Unit (rDPU) (figure 2.9). Several rDPUs were used within an rALU for data manipulation. Each rDPU had two registered inputs and two registered outputs with a data width of 32 bits. Input data were provided either from the north or from the west, while the south and the east were used for the output. Besides the interconnection lines for the rALUs, a global I/O bus is available for connecting designs to the external world. The I/O bus was principally used for accessing the scan windows.
The global view of the reconfigurable datapath attached to a host processor is given in figure 2.9. It consists of two main parts: the control unit and a field of rDPUs. The register file was used to optimize memory access when the GAG operated with overlapping scan windows. In this case, data in the actual scan window position will be reused in the following positions. Those data could therefore be temporarily stored in registers and copied back into the memory once they were no longer needed.
The control unit ran a program, loaded at reconfiguration, to control the different units of the rALU. Its instruction set consisted of instructions for loading the data as well as instructions for collecting results from the field.
Applications of the XPuter were in image processing, systolic arrays and signal processing.
A prototype of the PAM, the Perle, was built using a 5x5 array of Logic Array Cells (LCA), a CMOS cell designed by Xilinx. Perle-0 features a VME bus to interface the host CPU to which it is coupled. The host is in charge of downloading the bitstream to the Perle. The configuration controller, as well as the host-bus communication protocol, are programmed into two extra LCAs, statically configured at power-up time from a PROM.
Examples of applications implemented on the Perle are data compression, cryptography, image processing and high-energy physics.
1.8 DISC
The purpose of the Dynamic Instruction Set Computer (DISC) presented in [219] is to support demand-driven instruction set modifications. In contrast to the PRISM approach, where the specific instructions are synthesized and fixed at compile-time, the DISC approach uses partial reconfiguration of FPGAs to place hardware modules, each of which implements a given instruction, on the FPGA. Relocation was also proposed as a means to reduce fragmentation of the FPGA. Due to its partial reconfiguration capabilities, the National Semiconductor Configurable Logic Array (Clay) was chosen for building a prototype consisting of a printed circuit board with two CLA31 FPGAs and some memory. While the first FPGA was used to control the configuration process, the second FPGA was used for implementing the instruction-specific hardware blocks. The board was attached via an ISA bus to a host processor running Linux. A simple image mean filter, first implemented as an application-specific module and later as a sequence of general purpose instructions, was used to show the viability of the platform.
(a) PAL: The OR-connections are fixed (b) PLA: Programmable OR-connections
gate in the OR-plane whose outputs correspond to those of the PAL/PLA. The
connections in the two planes are programmable by the user.
Because every Boolean function can be written as a sum of products, PLAs and PALs can be used to program any Boolean function after fabrication. The products are implemented in the AND-plane by connecting the wires accordingly, and the sums of the products are realized in the OR-plane. While in the PLAs both planes can be programmed by the user, this is not the case in the PALs, where only the AND-plane is programmable; the OR-plane is fixed by the manufacturer. Therefore, the PALs can be seen as a subclass of the PLAs.
Example 2.1 Figure 2.12 shows the PLA and PAL implementations of the functions F1 = A·B̄ + Ā·B and F2 = A·B̄ + B·C. While the OR-plane of the PAL is not programmable, different sums of the products can be generated by modifying the connections in the OR-plane of the PLA.
Further enhancements (feeding an OR-output to a flip-flop, or feeding an OR-output back to an AND-input) are made on PLAs to allow more complex circuits to be implemented.
PALs and PLAs are well suited to implementing two-level circuits, that is, circuits made up of sums of products as described earlier: at the first level all the products are implemented, and all the sums are implemented at the second level.
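The two-level structure can be modelled directly in software: the AND-plane is programmed with product terms, and the OR-plane sums subsets of them. This is only a sketch; the mask/value encoding of a product term is our own convention, and the complemented literals in F1 = A·B̄ + Ā·B and F2 = A·B̄ + B·C are our assumed reading of example 2.1.

```python
def product_term(inputs, mask, value):
    # A product term uses input i only where mask[i] == 1; the literal
    # is complemented when the required value[i] is 0.
    return all(inputs[i] == value[i] for i in range(len(inputs)) if mask[i])

def pla(inputs, and_plane, or_plane):
    # AND-plane: evaluate every programmed product term.
    products = [product_term(inputs, m, v) for m, v in and_plane]
    # OR-plane: each output ORs ("sums") a subset of the product terms.
    return [any(products[j] for j in terms) for terms in or_plane]

# Assumed functions over inputs (A, B, C), sharing the product A·B':
AND_PLANE = [((1, 1, 0), (1, 0, 0)),   # A · B'
             ((1, 1, 0), (0, 1, 0)),   # A' · B
             ((0, 1, 1), (0, 1, 1))]   # B · C
OR_PLANE = [(0, 1), (0, 2)]            # F1 = p0 + p1, F2 = p0 + p2

print(pla((1, 0, 1), AND_PLANE, OR_PLANE))  # [True, True]
```

Reprogramming only the OR-plane connections, as in a PLA, yields different sums over the same product terms; a PAL would fix OR_PLANE at fabrication.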
The main limitation of PLAs and PALs is their low capacity, which is due to the nature of the AND-OR plane: its size grows too quickly as the number of inputs increases. Due to their low complexity, PALs and PLAs belong to the class of devices called simple programmable logic devices (SPLD).
4.1 Technology
The technology defines how the different blocks (logic blocks, interconnect, input/output) are physically realized. Basically, two major technologies exist: antifuse and memory-based. While the antifuse paradigm is limited to the realization of interconnections, the memory-based paradigm is used for computation as well as interconnection. In the memory-based category, we can list the SRAM-, EEPROM- and Flash-based FPGAs.
4.1.1 Antifuse
Contrary to a fuse, an antifuse is normally an open circuit. An antifuse-based FPGA uses special antifuses included at each connection customization point. The two-terminal elements are connected to the upper and lower layers of the antifuse, in the middle of which a dielectric is placed (figure 2.15). In its initial state, the high resistance of the dielectric does not allow any current to flow between the two layers. Applying a high voltage causes a large power dissipation in a small area, which melts the dielectric. This operation drastically reduces the resistance, and a link can be built which permanently connects the two layers. The two types of antifuses actually commercialized are the PLICE (Programmable Low-Impedance Circuit Element) (figure 2.15(a)), which is
manufactured by the company Actel, and the metal antifuse, also called ViaLink, made by the company QuickLogic.
The PLICE antifuse consists of an Oxygen-Nitrogen-Oxygen (ONO) dielectric layer sandwiched between a polysilicon layer and an n+ diffusion layer that serves as conductor. The ViaLink antifuse (figure 2.15(b)) is composed of a sandwich of a very high resistance layer of programmable amorphous silicon between two metal layers. When a programming voltage is applied, a metal-to-metal link is formed by permanently converting the silicon to a low-resistance state.
The main advantage of antifuse chips is their small area and their significantly lower resistance and parasitic capacitance compared to transistors. This helps to reduce the RC delays in the routing. However, antifuse-based FPGAs are not suitable for devices that must be frequently reprogrammed, as is the case in reconfigurable computing. Antifuse FPGAs are normally programmed once by the user and will not change anymore; for this reason, they are also known as one-time programmable FPGAs.
4.1.2 SRAM
Unlike the antifuse, which is mostly used to configure the connections, a static RAM (SRAM) cell is used to configure the logic blocks as well as the connections. SRAM-based FPGAs are the most widely used.
In an SRAM-based FPGA, the states of the logic blocks, i.e. their functionality bits, as well as those of the interconnections, are controlled by the outputs of SRAM cells (figure 2.16).
4.1.3 EPROM
Erasable Programmable Read Only Memory (EPROM) devices are based on a floating gate (figure 2.18(a)). The device can be permanently programmed by applying a high voltage (10-21 V) between the control gate and the drain of the transistor (12 V). This causes the floating gate to be permanently and negatively charged. The negative potential on the floating gate compensates the voltage on the control gate and keeps the transistor closed.
In a UV (Ultra Violet) EPROM, the programming process can be reversed by exposing the floating gate to UV light. This process reduces the threshold voltage and makes the transistor function normally. For this purpose, the device must be removed from the system in which it operates and plugged into a special device.
In Electrically Erasable and Programmable ROMs (EEPROMs) as well as in Flash EPROMs, the erase operation is accomplished electrically rather than by exposure to ultraviolet light. A high negative voltage must be applied at the control gate for this purpose. This process is faster than using a UV lamp, and the chip does not have to be removed from the system. In EEPROM-based devices, two or more transistors are typically used in a ROM cell: one access transistor and one programmed transistor. The programmed transistor performs the same function as the floating gate in an EPROM, with both charge and discharge being done electrically.
In the Flash EEPROMs that are used as logic tile cells in the Actel ProASIC chips (figure 2.18(b)), two transistors share the floating gate, which stores the programming information. The sensing transistor is only used for writing and for verification of the floating gate voltage, while the other is used as a switch. The switch can connect or disconnect routing nets to or from the configured logic, and is also used to erase the floating gate.
4.2.1 Multiplexer
A 2^n:1 (2^n-input, 1-output) multiplexer (MUX) is a selector circuit with 2^n inputs and one output. Its function is to allow only one input line to be fed to the output. The line to be fed to the output is selected using some selector inputs; in order to select one of the 2^n possible inputs, n selector lines are required. A MUX can be used to implement a given function. The straightforward way is to place the possible results of the function at the 2^n inputs of the MUX and to place the function arguments at the selector inputs. In this case, the MUX works like a look-up table, which will be explained in the next section. Several other possibilities exist to implement a function in a MUX, for instance by using some arguments as inputs and others as selectors. Figure 2.19 illustrates this case for the function f = a · b: the argument a is used as input, in combination with a second input 0, and the second argument b is used as selector.
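A behavioural sketch of this use of a multiplexer (the helper names are ours):

```python
def mux2(sel, in0, in1):
    # 2:1 multiplexer: the selector decides which data input reaches
    # the output.
    return in1 if sel else in0

# f = a · b as in figure 2.19: b drives the selector; the data inputs
# are the argument a and the constant 0. When b = 0 the output is
# forced to 0; when b = 1 the output follows a.
def and_from_mux(a, b):
    return mux2(sel=b, in0=0, in1=a)

print([and_from_mux(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
```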
The Shannon expansion theorem can be used to decompose a function and implement it in a MUX. The theorem states that a given Boolean function F(x1, · · · , xn) of n variables can be written as shown in the following equation:

F(x1, · · · , xn) = F(x1, · · · , xi = 1, · · · , xn) · xi + F(x1, · · · , xi = 0, · · · , xn) · x̄i

Expanding a second time, around a variable xj, gives:

F(x1, · · · , xn)
= [F(x1, · · · , xi = 1, · · · , xj = 1, · · · , xn) · xi + F(x1, · · · , xi = 0, · · · , xj = 1, · · · , xn) · x̄i] · xj
+ [F(x1, · · · , xi = 1, · · · , xj = 0, · · · , xn) · xi + F(x1, · · · , xi = 0, · · · , xj = 0, · · · , xn) · x̄i] · x̄j
= F(x1, · · · , xi = 1, · · · , xj = 1, · · · , xn) · xi · xj + F(x1, · · · , xi = 0, · · · , xj = 1, · · · , xn) · x̄i · xj
+ F(x1, · · · , xi = 1, · · · , xj = 0, · · · , xn) · xi · x̄j + F(x1, · · · , xi = 0, · · · , xj = 0, · · · , xn) · x̄i · x̄j
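The theorem can be verified exhaustively in a few lines (a sketch only; the sample function is arbitrary):

```python
from itertools import product

def shannon_expand(f, i):
    """Return the expansion of f around variable i:
    F = F(x_i = 1) · x_i + F(x_i = 0) · not(x_i)."""
    def g(*x):
        cofactor1 = f(*(x[:i] + (1,) + x[i + 1:]))
        cofactor0 = f(*(x[:i] + (0,) + x[i + 1:]))
        return (cofactor1 and x[i]) or (cofactor0 and not x[i])
    return g

# Check equivalence on every input of an arbitrary 3-variable function:
f = lambda a, b, c: (a and not b) or (b and c)
g = shannon_expand(f, 1)
print(all(bool(f(*x)) == bool(g(*x)) for x in product((0, 1), repeat=3)))
# True
```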
ai  bi  ci−1 | si  ci
0   0   0    | 0   0
0   0   1    | 1   0
0   1   0    | 1   0
0   1   1    | 0   1
1   0   0    | 1   0
1   0   1    | 0   1
1   1   0    | 0   1
1   1   1    | 1   1

Table 2.1. Truth table of the Full adder
Figure 2.20. Implementation of a Full adder using two 4-input one output MUX
The Actel ACT X Logic module. Multiplexers are used as function generators in the Actel FPGA devices (figure 2.21). In the Actel ACT1 device, the basic computing element is the logic module, an 8-input, 1-output logic circuit that can implement a wide range of functions. Besides combinatorial functions, the logic module can also implement a variety of D-latches. The C-modules present in the second generation of Actel devices, the ACT2, are similar to the logic module. The S-modules, which are found in the second and third generations of Actel devices, contain an additional dedicated flip-flop. This avoids building flip-flops from the combinatorial logic, as is the case in the logic module.
Figure 2.21. The Actel basic computing blocks uses multiplexers as function generators
Example 2.3 The full adder of the previous section can be implemented using two 3-input, 1-output LUTs, as shown in figure 2.23. The sum is implemented in the first LUT and the carry-out in the second.
The first three columns of table 2.1 represent the input values; they build the address used to retrieve the function value from the corresponding LUT location. The contents of the fourth and fifth columns of the truth table must therefore be copied into the corresponding LUTs, as shown in figure 2.23: the sum values are copied into the upper LUT, while the carry values are copied into the lower LUT.
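The LUT contents can be sketched in software (the list encoding is our own; in the device, each LUT is a small memory addressed by the input lines):

```python
# The 3-bit address (a, b, c_in) corresponds to the first three columns
# of table 2.1; the stored bits are the fourth (sum) and fifth (carry)
# columns.
SUM_LUT   = (0, 1, 1, 0, 1, 0, 0, 1)
CARRY_LUT = (0, 0, 0, 1, 0, 1, 1, 1)

def full_adder(a, b, c_in):
    # The inputs form the LUT address, exactly as the truth-table rows
    # are ordered.
    address = (a << 2) | (b << 1) | c_in
    return SUM_LUT[address], CARRY_LUT[address]

print(full_adder(1, 1, 0))  # (0, 1): 1 + 1 = 10 in binary
```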
SRAM-based LUTs are used in most commercial FPGAs as function generators. Several LUTs are usually grouped in a larger module in which other functional elements like flip-flops and multiplexers are available. The connection between the LUTs inside such modules is faster than connections via the routing network, because dedicated wires are used. As examples of devices using LUTs as function generators, we consider the Xilinx and Altera FPGAs, and we next explain how the LUTs are used in those devices.
The Xilinx Configurable Logic Block. The basic computing block in the Xilinx FPGAs consists of a LUT with a variable number of inputs, a set of multiplexers, arithmetic logic and a storage element (figure 2.24).
The LUT is used to store the configuration, while the multiplexers select the right inputs for the LUT and the storage element, as well as the right output of the block.
The arithmetic logic provides facilities like an XOR gate and a fast carry chain to build fast adders without wasting too many LUT resources.
Several basic computing blocks are grouped in a coarse-grained element called the Configurable Logic Block (CLB) (figure 2.25). The number of basic blocks in a CLB varies from device to device. In the older devices, like the 4000 series, the Virtex, the Virtex E and the Spartan devices, two basic blocks were available in a CLB. In the newer devices, like the Spartan 3, the Virtex II, the Virtex II Pro and the Virtex 4, the CLBs are divided into four slices, each of which contains two basic blocks. The CLBs in the Virtex 5 devices contain only two slices, each of which contains four basic blocks.
In the newer devices, the left-hand slices of a CLB, also called SLICEMs, can be configured either as combinatorial logic, as 16-bit SRAM, or as shift registers, while the right-hand slices, the SLICELs, can only be configured as combinatorial logic.
Except for the Virtex 5, all LUTs in Xilinx devices have four inputs and one
output. In the Virtex 5 each LUT has six inputs and two outputs. The LUT can
be configured either as a 6-input LUT, in which case only one output can be
used, or as two 5-input LUTs, in which case each of the two outputs is used as
output of a 5-input LUT.
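The use of a LUT as function generator can be illustrated with a short Python sketch (an illustration of the principle only, not vendor tooling): the LUT stores the truth table of the implemented function, and the input vector simply selects one of the stored configuration bits.

```python
# Sketch: a K-input LUT is a 2**K-entry truth table indexed by its inputs.

def make_lut(func, k):
    """Build the 2**k-entry truth table of a k-input Boolean function."""
    table = []
    for idx in range(2 ** k):
        bits = [(idx >> i) & 1 for i in range(k)]
        table.append(int(func(*bits)))
    return table

def eval_lut(table, inputs):
    """Read the stored configuration bit selected by the input vector."""
    idx = sum(b << i for i, b in enumerate(inputs))
    return table[idx]

# Configure a 4-input LUT as f = (a AND b) XOR (c OR d)
lut4 = make_lut(lambda a, b, c, d: (a & b) ^ (c | d), 4)
print(eval_lut(lut4, [1, 1, 0, 0]))  # -> 1
```

A fracturable 6-input LUT, as in the Virtex 5, simply exposes the two halves of its 64-entry table as two 5-input LUTs sharing the same inputs.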
The Altera Logic Array Block. Like the Xilinx devices, Altera’s FPGAs
(Cyclone, FLEX and Stratix) are also LUT based. In the Cyclone II as well
as in the FLEX architecture, the basic unit of logic is the logic element (LE),
which typically contains a LUT, a flip-flop, a multiplexer and additional logic
for the carry chain and register chain. Figure 2.26 shows the structure of the
logic element in the Cyclone FPGA. This structure is very similar to that of
the Altera FLEX devices. The LEs in the Cyclone can operate in different
modes, each of which defines a different usage of the LUT inputs.
In the Stratix II devices, the basic computing unit is called the adaptive logic
module (ALM) (figure 2.27). The ALM is built from a mixture of 4-input and
3-input LUTs that can be used to implement logic functions with a variable
number of inputs. This ensures backward compatibility with 4-input-LUT-based
designs, while providing the possibility to implement coarse-grained modules
with a variable number of inputs (up to eight). Additional modules including
flip-flops, adders and carry logic are also provided.
Figure 2.25. CLB in the newer Xilinx FPGAs (Spartan 3, Virtex 4 and Virtex 5)
Altera logic cells are grouped to form coarse-grained computing elements
called Logic Array Blocks (LABs). The number of logic cells per LAB varies
from device to device. The FLEX 6000 LAB contains ten logic elements
while the FLEX 8000 LAB contains only eight. Sixteen LEs are available in
each LAB of the Cyclone II, while the Stratix II LAB contains eight ALMs.
4.3.1 Symmetrical Array: The Xilinx Virtex and Atmel AT40K Families
A symmetrical array-based FPGA consists of a two dimensional array of
logic blocks immersed in a set of vertical and horizontal lines. Switch elements
exist at the intersections of the vertical and horizontal lines to allow for the
connections of vertical and horizontal lines.
Examples of FPGAs arranged as a symmetrical array are the Xilinx Virtex
FPGA and the Atmel AT40K (figure 2.29).
On the Xilinx devices, CLBs are embedded in the routing structure that
consists of vertical and horizontal wires. Each CLB element is tied to a switch
matrix to access the general routing structure, as shown in figure 2.30(a). The
switch matrix provides programmable multiplexers, which are used to select
the signals in the given routing channel that should be connected to the CLB
terminals. The switch matrix can also connect vertical and horizontal lines,
thus making routing possible on the FPGA.
Each CLB has access to two tri-state drivers (TBUF) over the switch matrix.
Those can be used to drive on-chip busses. Each tri-state buffer has its own
tri-state control pin and its own input pin that are controlled by the logic built
in the CLB. Four horizontal routing resources per CLB are provided for
on-chip tri-state busses. Each tri-state buffer has access alternately to two
horizontal lines, which can be partitioned as shown in figure 2.30(b). Besides
the switch matrix, CLBs connect to their neighbors using dedicated fast
connection tracks.
On the Atmel chips, the routing is done using a set of busing planes. Seven
busing planes are available on the AT40K. Figure 2.31 depicts a part of the
plane with five identical busing planes. Each plane has three bus resources: a
local-bus resource (the middle bus) and two express-bus resources (both sides).
Repeaters are connected to two adjacent local-bus segments and two express-bus
segments. Local-bus segments span four cells while express-bus segments span
eight cells. Long tri-state busses can be created by bypassing a repeater.
Figure 2.29. Symmetrical array arrangement in a) the Xilinx and b) the Atmel AT40K FPGAs
(a) CLB connection to the switch matrix (b) Tri-state buffer connection to horizontal lines
Figure 2.32. Row-based arrangement on the Actel ACT3 FPGA Family
Locally, the Atmel chip provides a star-like connection resource that allows
each cell (the basic unit of computation) to be connected directly to all of its
eight neighbors. Figure 2.31 depicts direct connections between a cell and its
eight nearest neighbors.
Figure 2.33. Actel’s ACT3 FPGA horizontal and vertical routing resources
three types: input, output and long. They are further divided into segments.
Each segment in an input track is dedicated to the input of a particular module
and each segment in an output track is dedicated to the output of a particular
module. Long segments are uncommitted and can be assigned during routing.
Each output segment spans four channels (two above and two below), except
near the top and the bottom of the array. Vertical input segments span only the
channel above or the channel below. The tracks dedicated to module inputs
are segmented by pass transistors in each module row. During normal user
operation the pass transistors are inactive, which isolates the inputs of a module
from the inputs of the module above it.
The connections inside Actel FPGAs are established using antifuses. Four
types of antifuse connections exist for the ACT3: horizontal-to-vertical (XF)
connections, horizontal-to-horizontal (HF) connections, vertical-to-vertical (FF)
connections and fast-vertical connections (figure 2.33).
of the eight surrounding tiles (figure 2.34). The long-line resources provide
routing for longer distances and higher fanout connections. These resources,
which vary in length (spanning one, two, or four tiles), run both vertically and
horizontally and cover the entire device. The very long lines span the entire
device. They are used to route very long or very high fanout nets.
Signals between LEs or ALMs in the same LAB and those in the adjacent
LABs are routed via local interconnect signals. Each row of a LAB is served
by a dedicated row interconnect, which routes signals between LABs in the
same row. The column interconnect routes signals between rows and routes
signals from I/O pin rows.
Figure 2.38. Structure of a Xilinx Virtex II Pro FPGA with two PowerPC 405 Processor blocks
a large acceptance and their future is not really easy to predict. This does not
mean that the philosophy behind coarse-grained devices is wrong. Companies
investigating coarse-grained reconfigurable devices must also face the FPGA
competition, dominated by large companies that provide many coarse-grained
elements in their devices according to market needs.
A wide variety of coarse-grained reconfigurable devices, which we classify
in three categories, will be presented in this section. In the first category, the
dataflow machines, functions are usually built by connecting some PEs in
order to build a functional unit that is used to compute on a stream of data.
The second category contains the network-based devices, in which the
connection between the PEs is done using messages instead of wires. The third
category contains the embedded FPGA devices, which consist of a processor
core that cohabits with programmable logic on the same chip.
The Processing Array Element (PAE). There exist two different kinds of
PAEs: the ALU-PAE and the RAM-PAE. An ALU-PAE contains an ALU that
can be configured to perform basic arithmetic operations, while the RAM-PAE
is used for storing data. The Back-Register (BREG) provides routing channels
for data and events from bottom to top as well as additional arithmetic and
register functions, while the Forward-Register (FREG) is used for routing
signals from top to bottom and for the control of the dataflow using event
signals. All objects can be connected to horizontal routing channels using
switch objects. DataFlow Registers (DF-Registers) can be used at the object
output for data buffering in case of a pipeline stall. Input registers can be
pre-loaded with configuration data and always provide a single-cycle stall.
A RAM-PAE is similar to an ALU-PAE. However, instead of an ALU, a
dual ported RAM is used for storing data. The RAM generates a data packet
after an address was received at the input. Writing to the RAM requires two
data packets: one for the address and the other for the data to be written. Figure
2.40 shows an ALU-PAE. The structure is the same for a RAM-PAE; however,
a RAM is used instead of the ALU.
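The packet behaviour of the RAM-PAE described above can be sketched as follows (the class and method names are our illustration, not part of the XPP tool set): a read address packet yields a data packet, while a write consumes two packets, address first, then data.

```python
# Sketch of the RAM-PAE packet protocol (illustrative names, not XPP API).

class RamPAE:
    def __init__(self, size=256):
        self.mem = [0] * size
        self.pending_addr = None  # address packet waiting for its data packet

    def read(self, addr_packet):
        """An address packet on the read port generates a data packet."""
        return self.mem[addr_packet]

    def write(self, packet):
        """Writing requires two packets: first the address, then the data."""
        if self.pending_addr is None:
            self.pending_addr = packet
        else:
            self.mem[self.pending_addr] = packet
            self.pending_addr = None

ram = RamPAE()
ram.write(5)        # first packet: address
ram.write(42)       # second packet: data
print(ram.read(5))  # -> 42
```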
Figure 2.40. The XPP ALU Processing Array Element. The structure of the RAM-PAE is
similar.
Besides the horizontal and vertical channels, a configuration bus exists, which
allows the CMs to configure the PAEs.
Horizontal buses are used to connect PAEs within a row, while the vertical
buses are used to connect objects to a given horizontal bus. Vertical
connections are made using configurable switch objects which segment the
vertical communication channels. The vertical routing is enabled using
register objects integrated into the PAEs.
The results can be stored in the PE's register file or can be sent to the output.
The configuration is done by the State Transition Controller, which sets a
pointer to a corresponding instruction register according to the operation mode
of the ALU. Having many configuration registers allows configuration data to
be stored directly in the proximity of the PEs and enables fast switching from
one configuration to the next.
A system controller
(a) The ACM node structure (b) The ACM QS2412 Resources
An algorithmic engine that defines the node type. The node type can be
customized at compile-time or at run-time by the user to match a given
algorithm. Four types of nodes exist: the Programmable Scalar Node
(PSN) provides a standard 32-bit RISC architecture with 32 general
purpose registers, the Adaptive Execution Node (AXN) provides variable
word size Multiply Accumulate (MAC) and ALU operations, the Domain Bit
Manipulation (DBN) node provides bit manipulation and byte oriented
operations, and the External Memory Controller node provides DDRRAM,
SRAM, memory random access and DMA control interfaces for off-chip
memory access.
The node memory for data storage at node level.
A node wrapper which hides the complexity of the network architecture. It
contains a MIN interface to support communication, a hardware task manager
for task management at node level, and a DMA engine. The wrapper
envelops the algorithmic engine and presents an identical interface to the
neighboring nodes. It also incorporates dedicated I/O circuitry, memory,
memory controllers and data distributors and aggregators (figure 2.45).
terfaces accessible via the MIN for testing (JTAG) and communication with
off-chip devices (figure 2.45(b)).
1 That is, a 32-bit RISC processor operating at 100 MHz with a data cache of 8 Kbytes and an instruction
cache of 8 Kbytes
5.4.4 MorphoSys
The complete MorphoSys [194] chip comprises a control processor (TinyRISC),
a frame buffer (data buffer), a DMA controller, a Context Memory (configura-
tion memory), and an array of 64 Reconfigurable Cells (RC). Each RC com-
prises an ALU-multiplier, a shift unit, and two multiplexers at the RC inputs.
Each RC also has an output register, a feedback register and a register file. A
context word, loaded from the Context Memory and stored in the Context
Register, defines the functionality of the RC. Besides standard logic/arithmetic
functions, the ALU has other functions such as computation of absolute value
of the difference of two operands and a single cycle multiply-accumulate oper-
ation. There are a total of 25 ALU functions. The RC interconnection network
features three layers. In the first layer, all cells are connected to their four near-
est neighbors. In the second layer, each cell can access data from any other
cell in the same row or column of the same array quadrant. The third layer of
hierarchy consists of buses spanning the whole array and allowing transfer of
data from a cell in a row or column of a quadrant to any other cell in the same
row or column in the adjacent quadrant. In addition, two horizontal 128-bit
buses connect the array to the frame buffer.
Figure 2.48. Pipeline Reconfiguration: Mapping of a 5-stage virtual pipeline onto a 3-stage
shows the mapping of the virtual blocks on the physical modules of the device.
Because the configuration of each single unit in the pipeline is independent
of the others, the reconfiguration process can be broken down into a cycle-by-cycle
configuration. In this way, part of the pipeline can be reconfigured while
the rest is computing.
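The cycle-by-cycle configuration just described can be illustrated with a small Python sketch (the round-robin schedule below is our simplification of the principle, not the actual PipeRench control logic): each cycle, one physical stage is reconfigured with the next virtual stage while the remaining stages compute.

```python
# Sketch of hardware virtualization: v virtual pipeline stages time-share
# p physical stages; one physical stage is reconfigured per cycle.

def virtualization_schedule(virtual_stages, physical_stages, cycles):
    """Return, per cycle, which virtual stage each physical stage holds."""
    schedule = []
    occupancy = [None] * physical_stages
    for cycle in range(cycles):
        phys = cycle % physical_stages        # stage reconfigured this cycle
        occupancy[phys] = cycle % virtual_stages
        schedule.append(list(occupancy))
    return schedule

# 5-stage virtual pipeline on 3 physical stages, as in figure 2.48
for cycle, occ in enumerate(virtualization_schedule(5, 3, 6)):
    print(f"cycle {cycle}: physical stages hold virtual stages {occ}")
```

After five cycles every virtual stage has been resident once, and the oldest configurations are overwritten in turn, which is exactly why the process must proceed cycle by cycle.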
Based on the pipeline reconfiguration concept, a class of reconfigurable de-
vices called PipeRench was proposed in [98] as co-processor in multimedia
applications. A PipeRench device consists of a set of physical pipeline stages,
also called stripes. A stripe is composed of interconnect and processing
elements (PE), each of which contains registers and LUT-based ALUs. The PEs
access operands from registered outputs of the previous stripe as well as reg-
istered or unregistered outputs of the other PEs in the stripe. All PipeRench
devices have four global busses, two of which are dedicated to storing and
restoring stripe state during hardware virtualization (configuration). The other
two are used for input and output.
5.4.7 RaPiD
The last work that we would like to consider in this section is the one of
Ebeling et al. [100], whose goal was to overcome the handicaps of FPGAs2. A
one-dimensional coarse-grained reconfigurable architecture called RaPiD, an
acronym for Reconfigurable Pipelined Datapath, is proposed for this purpose.
The structure of RaPiD datapaths resembles that of systolic arrays. The
structure is built from linear arrays of functional units communicating mostly
in a nearest-neighbour fashion. This can be used, for example, to construct
a hardware module which comprises different computations at different
stages and at different times, resulting in a linear array of functional units that
can be configured to form a linear computational pipeline. The resulting array
of functional units is divided into identical cells which are replicated to
form a complete array. A RaPiD cell consists of an integer multiplier, two
integer ALUs, six general purpose registers and three small local memories.
Interconnections among the functional units are realized using a set of ten seg-
mented busses that run the length of the datapath. Many of the registers in a
pipelined computation can be implemented using the bus pipeline registers.
Functional unit outputs are registered. However, the output registers can
be bypassed via configuration control. Functional units may additionally be
pipelined internally depending on their complexity.
The control of the datapath is done using two types of signals: static
control signals that are defined by the configuration memory, as in ordinary
FPGAs, and dynamic control signals that must be provided on every cycle. To program
2 In this case, the large amount of resources deployed to build macro instructions and the difficulty in
6. Conclusion
Our goal in this chapter was not to provide all possible details on the individual
architectures of the reconfigurable computing chips presented. We rather focused
on the main characteristics of the technology as well as the operations. More
details on the individual devices can be found in the corresponding datasheets.
Despite the large number of architectures presented in this section, the market is
still dominated by the FPGAs, in particular those from Xilinx and Altera. The
coarse-grained reconfigurable device market has not really taken off so far,
despite the large number of concepts and prototypes developed in this direction.
Research and development in the architecture of reconfigurable computing
systems is very dynamic. By the time this book is published, some concepts
presented in this section will probably no longer be current. However, as
we have seen with the FPGAs, the basic structure of reconfigurable devices will
still remain the same in the future. In the case of FPGAs, we will experience
some changes in the number of LUT inputs, modifications in the I/O elements
and also the inclusion of various coarse-grained elements in the chip.
However, LUTs will still be used as computational elements connected to each
other using programmable crossbar switches. I/O elements will still be used for
external communication. Coarse-grained devices will still have the same
structure consisting of ALU computing elements, programmable interconnections
and I/O components. Understanding the concepts presented here will therefore
help to understand, better and faster, the changes that will be made to the
devices in the future.
Chapter 3
IMPLEMENTATION
In the first part of this chapter, we present the different possibilities for the
use of reconfigurable devices in a system. According to the way those devices
are used, the target applications and the systems in which they are integrated,
different terminologies will be defined for the systems. We follow up by
presenting the design flow, i.e. the steps required to implement an application on
those devices. Because the programming of coarse-grained reconfigurable
devices is very similar to that of processors, we will not focus on the variety of
tools that exist for this purpose. The implementation on FPGA devices is rather
unusual in two respects. First, the programming is not aimed at generating a set
of instructions to be executed sequentially on a given processor. We seek the
generation of the hardware components that will be mapped at different times
onto the available resources. According to the application, the resources needed
for the computation of the application will be built as components to be
downloaded to the device at run-time. The generation of such components is called
logic synthesis. It is an optimization process whose goal is to minimize some
cost functions aimed at producing, for instance, the fastest hardware with the
smallest amount of resources and the smallest power consumption. The mapping
of the application to the FPGA resources is a step of the logic synthesis
called technology mapping. The second unusual point with FPGAs is that the
technology mapping targets look-up tables rather than NAND gates, as is the
case with many digital devices. In the last part of the chapter, we will therefore
shortly present the steps required in logic synthesis and focus in more detail
on technology mapping for FPGAs.
1. Integration
Reconfigurable devices are usually used in three different ways:
1 Today's leading edge reconfigurable devices contain millions of gates and run at a few hundred MHz
the device and communication between newly placed modules. The management
of the reconfigurable device is usually done by a scheduler and a
placer that can be implemented as part of an operating system running on
a processor (figure 3.1). The processor can reside either inside or outside
the reconfigurable chip.
The scheduler manages the tasks and decides when a task should be executed.
The tasks, which are available as configuration data in a database, are
characterized by their bounding box and their run-time. The bounding
box defines the area that a task occupies on the device. The management of
task execution at run-time is therefore a temporal placement problem that
will be studied in detail in chapter 5. The scheduler determines which task
should be executed on the RPU and then gives the task to the placer, which
will try to place it on the device, i.e. allocate a set of resources for the
implementation of that task. If the placer is not able to find a site for the
new task, then the task is sent back to the scheduler, which can then decide
to send it later and to send another task to the placer. In this case, we say
that the task is rejected.
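The interplay between scheduler and placer can be sketched as follows. This is a deliberately simplified model of our own in which placement is reduced to a one-dimensional area budget; the real problem, studied in chapter 5, is a two-dimensional temporal placement.

```python
# Sketch of the scheduler/placer loop: a task is placed if the placer can
# allocate its bounding-box area, otherwise it is rejected back to the
# scheduler for a later attempt.

def schedule(tasks, device_area):
    """tasks: list of (name, area). Returns (placed, rejected)."""
    free = device_area
    placed, rejected = [], []
    for name, area in tasks:
        if area <= free:       # the placer found a site for the task
            free -= area
            placed.append(name)
        else:                  # no site: the task is rejected
            rejected.append(name)
    return placed, rejected

placed, rejected = schedule([("fir", 40), ("fft", 50), ("aes", 30)], 100)
print(placed, rejected)  # -> ['fir', 'fft'] ['aes']
```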
bus that is used for the data transfer between the processor and the
reconfigurable device. In supercomputing, systems are built from a high-speed
processor from Intel or AMD and an FPGA board attached to a bus like the PCI.
Several systems (Cray XD1, SGI, Nallatech, SRC Computers) in which FPGAs
cohabit with processors and communicate using a dedicated high-speed bus are
available on the market. In embedded systems, the processors are more and more
integrated in the reconfigurable devices and are heavily used for management
purposes rather than for computation. The RPU acts like a coprocessor with
varying instruction sets, accessible by the processor through function calls.
The computation flow can be summarized as shown in algorithm 2.
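Algorithm 2 itself is not reproduced here; a minimal processor-side rendering of the described flow, with illustrative names of our own, might look as follows: the RPU is configured once, then operands are streamed to it and results collected, like a function call onto a coprocessor.

```python
# Sketch of the processor/RPU computation flow (names are illustrative).

def run_on_rpu(rpu, jobs):
    """Processor-side view: configure the RPU once, then stream the data."""
    rpu.configure()                    # download the configuration
    results = []
    for job in jobs:
        rpu.send(job)                  # transfer operands over the bus
        results.append(rpu.receive())  # collect the computed result
    return results

class ToyRPU:
    """Stand-in RPU that doubles its input (for illustration only)."""
    def configure(self): self.buf = None
    def send(self, x): self.buf = x
    def receive(self): return 2 * self.buf

print(run_on_rpu(ToyRPU(), [1, 2, 3]))  # -> [2, 4, 6]
```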
2 It is already possible to trigger the reconfiguration of the Xilinx FPGAs from within the device using
their ICAP port. This might allow the self-reconfiguration of a system from a processor running within the
device
3 The data might also be collected by the RPU itself from an external source
can also send the data directly to an external sink. In the computation flow
presented above, the RPU is configured only once. However, in frequently
reconfigured systems several configurations might be done. If the RPU has to be
configured more than once, then the body of the while loop must be repeated
according to the number of reconfigurations to be done.
3. Logic Synthesis
A function, assigned to the hardware in the hardware/software co-design
process, can be described as a digital structured system. As shown in Figure
3.4, such a digital structured system consists of a set of combinatorial logic
modules (the nodes), memory (the registers), inputs and outputs.
The inputs provide data to the system while the outputs carry data out of the
system. Computation is performed in the combinatorial parts and the results
might be temporarily stored in registers that are placed between the
combinatorial blocks. A clock is used to synchronize the transfer of data from
register to register via the combinatorial parts. The description of a design at
this level is usually called a register transfer description, due to the register
to register operation mode previously described.
For such a digital system, the goal of logic synthesis is to produce an
optimal implementation of the system on a given hardware platform. In the
case of FPGAs, the goal is the generation of configuration data which
satisfies a set of given constraints like maximal speed, minimum area,
minimum power consumption, etc. In a structured system, each combinatorial
block is a node that can be represented as a two-level function or as a
multi-level function. Depending on the node representation, the two following
synthesis approaches exist:
Multi-Level Logic Synthesis: In multi-level synthesis, functions are
represented using multi-level logic. Those are circuits in which the longest
path from input to output goes through more than two gates.
Most of the circuits used in practice are implemented using multi-level logic.
Multi-level circuits are smaller, faster in most cases and consume less power
than two-level circuits. Two-level logic is most appropriate for PAL and PLA
implementations, while multi-level logic is used for standard-cell,
mask-programmable or field-programmable devices.
We formally represent a node of the structured system as a Boolean network,
i.e. a network of Boolean operators that reflects the structure and function of
the nodes. A Boolean network is defined as a directed acyclic graph (DAG)
in which a node represents an arbitrary Boolean function and an edge (i, j)
represents the data dependency between the two nodes i and j of the network.
The final implementation matches the node representation. In the second case,
the representation is technology independent, i.e. the design is not tied to any
library. A final mapping to the target library must be done in order to have
an implementation. The technology independent method is the most used, due to
the large set of available optimization methods. With a technology
independent representation, synthesis for FPGA devices is done in two steps.
In the first step, all the Boolean equations are minimized, independently of the
function generators used. In the second step, the technology mapping process
maps the parts of the Boolean network to a set of LUTs.
In general, the following choices are made for the representation of a node:
Sum of products form: A sum of products (SOP) is the most trivial form
to represent a Boolean function. It consists of a sum of product terms and
it is well adapted for two-level logic implementation on PALs and PLAs.
Example: f = xyz + xyz + wxy.
This representation has the advantage that it is well understood and easy
to manipulate. Many optimization algorithms are available (AND,
OR, tautology, two-level minimizers). The main disadvantage is that it is not
representative of the logic complexity. In fact, designs represented as
sums of products are not easy to estimate, as the complexity of the design
decreases through manipulation. Therefore, estimating the progress during
logic minimization on SOPs is difficult.
Factored form: A factored form is defined recursively as either a single
literal or a product or a sum of two factored forms: a product is either
a single literal or the product of two factored forms, and a sum is either a
single literal or the sum of two factored forms. For example, c(a + b(d + e))
is a product of the factored forms c and a + b(d + e), and a + b(d + e) is a
sum of the factored forms a and b(d + e).
Factored forms are representative of the logic complexity. In many design
styles, the implementation of a function corresponds to its factored form.
Therefore, factored forms are good estimates of the complexity of the logic
implementation. Their main disadvantage is the lack of manipulation
algorithms. They are usually converted into SOPs before manipulation.
Binary decision diagram (BDD): A binary decision diagram (BDD) is a
rooted directed acyclic graph used to represent a Boolean function. Two
kinds of nodes exist in BDDs: variable and constant nodes.
– A variable node v is a non terminal having as attribute its argument
index4 index(v) ∈ {1, ..., n} and its two children low(v) and high(v).
5 Two BDDs G1 and G2 are isomorphic ⇐⇒ there exists a bijective function σ from G1 to G2 such
that: 1) for a terminal node v ∈ G1, σ(v) = w is a terminal node in G2 with value(v) = value(w); 2)
for a non-terminal node v ∈ G1, σ(v) = w is a non-terminal node of G2 with index(v) = index(w),
σ(low(v)) = low(w) and σ(high(v)) = high(w)
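The node definition above translates into a simple traversal: starting at the root, follow low(v) when the variable of index index(v) is 0 and high(v) when it is 1, until a constant node is reached. A minimal Python sketch (ours, not a full BDD package — it omits reduction and ordering):

```python
# Sketch of BDD nodes and evaluation, matching the definitions above:
# a variable node carries index(v), low(v), high(v); a constant node a value.

class Const:
    def __init__(self, value): self.value = value

class Var:
    def __init__(self, index, low, high):
        self.index, self.low, self.high = index, low, high

def evaluate(node, assignment):
    """assignment maps a variable index to 0 or 1."""
    while isinstance(node, Var):
        node = node.high if assignment[node.index] else node.low
    return node.value

# BDD for f(x1, x2) = x1 AND x2
zero, one = Const(0), Const(1)
x2 = Var(2, zero, one)
bdd = Var(1, zero, x2)
print(evaluate(bdd, {1: 1, 2: 1}))  # -> 1
```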
Definition 3.2 (Primary Input, Primary Output, Node Level, Node Depth, Fan-in, Fan-out)
Given a Boolean network G, we define the following:
1 A primary input (PI) node is a node without any predecessor.
2 A primary output (PO) node is a node without any successor.
3 The level l(v) of a node v is the length of the longest path from the primary
inputs to v.
4 The depth of a network G is the largest level of a node in G.
5 The fan-in of a node v is the set of gates whose outputs are inputs of v.
6 The fan-out of v is the set of gates that use the output of v as input.
7 Given a node v ∈ G, input(v) is defined as the set of nodes of G which
are fan-in of v, i.e. the set of predecessors of v.
With input(H), we denote the set of all nodes not included in H, which are
predecessors of some nodes in H.
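The level and depth definitions above can be computed directly on a Boolean network given as a node-to-fan-in mapping (a small Python sketch of our own):

```python
# Levels and depth of a Boolean network (DAG): a PI has no predecessors
# and level 0; the level of any other node is 1 + the maximum level of
# its fan-in; the depth of the network is the largest node level.

def levels(fanin):
    """fanin: dict mapping node -> list of its fan-in nodes."""
    lvl = {}
    def level(v):
        if v not in lvl:
            preds = fanin.get(v, [])
            lvl[v] = 0 if not preds else 1 + max(level(u) for u in preds)
        return lvl[v]
    for v in fanin:
        level(v)
    return lvl

net = {"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "b"], "o": ["g2"]}
lv = levels(net)
print(lv, "depth =", max(lv.values()))  # depth = 3
```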
With the previous definition of K-feasible cones, LUT technology mapping
becomes the problem of covering the graph with a set of K-feasible cones
that are allowed to overlap. The technology mapping results in a new DAG in
which nodes are K-feasible cones and edges represent communication among
the cones. Figure 3.8 shows the covering of a graph with 3-feasible cones and
the resulting mapping to 3-input LUTs.
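Checking whether a cone is K-feasible amounts to counting input(H), the nodes outside the cone that feed it. A short sketch of our own, with the network again given as a node-to-fan-in mapping:

```python
# A cone H is K-feasible when |input(H)| <= K, i.e. the nodes outside the
# cone that are predecessors of some node in it fit on a K-input LUT.

def cone_inputs(fanin, cone):
    """input(H): nodes outside `cone` feeding some node inside it."""
    return {u for v in cone for u in fanin.get(v, []) if u not in cone}

def k_feasible(fanin, cone, k):
    return len(cone_inputs(fanin, cone)) <= k

net = {"a": [], "b": [], "c": [], "g1": ["a", "b"], "g2": ["g1", "c"]}
print(k_feasible(net, {"g1", "g2"}, 3))  # inputs {a, b, c} -> True
```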
Next, we present some existing LUT technology mapping algorithms and
explain their advantages as well as their drawbacks.
Figure 3.8. Example of a graph covering with K-feasible cone and the corresponding covering
with LUTs
the implementation of a given circuit. It operates in two steps: In the first step,
the original Boolean network is partitioned into a forest of trees that are then
separately mapped into circuits of K-input LUTs. The second step assembles
the circuits implementing the trees to produce the final circuit.
The transformation of the original network into a forest is done by partitioning
the network at each fan-out node v: the sub-network rooted at v is duplicated
for each fan-out edge of v. The resulting sub-networks are either trees or
leaf-DAGs. The leaf-DAGs are converted into trees by creating a unique
instance of a primary input for each of its fan-out edges.
Mapping the trees. The strategy used by Chortle to map a tree is a combination
of bin packing and dynamic programming. Each tree is traversed from the
primary inputs to the primary outputs. At each node v, a circuit referred to as
the best circuit, implementing the cone at v extending from the node to the
primary inputs of the network, is constructed. The best circuit is characterized
by two main factors: the tree rooted at v and represented by a cone must contain
the minimum number of LUTs, and the output LUT (the root LUT) implementing
v should contain the maximum number of unused input pins. For a primary
input p the best circuit is a single LUT whose function is a buffer. Using
dynamic programming, the best circuit at a node v can be constructed from
its fan-in nodes, because each of them is already optimally implemented. The
procedure enforces the use of the minimum number of LUTs at a given node.
The best circuit is then constructed from the minimum number of LUTs used
to implement its fan-in nodes. The secondary goal is to minimize the number
of unused inputs of the circuit rooted at node v.
The construction of the tree is done in two steps. First, a two-level
decomposition of the cone at v is constructed, and then this decomposition is
converted into a multi-level decomposition.
Multi-level Decomposition. In the second step, the first-level nodes are
implemented using a tree of LUTs. The number of LUTs used is minimized by
using second-level LUTs that have unused pins to implement a portion of the
first-level tree, as shown in figure 3.10. The detailed procedure for converting
a two-level decomposition into a multi-level decomposition is given in
algorithm 4, and figure 3.10 provides an illustration of this process.
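The bin packing underlying Chortle's decomposition is first-fit decreasing (FFD): fan-in boxes are sorted by size, and each is placed into the first K-input bin (LUT) with enough free inputs. A minimal sketch of our own, ignoring the merging of shared inputs discussed later:

```python
# First-fit-decreasing (FFD) bin packing: boxes are fan-in input counts,
# bins are K-input LUTs; each box goes into the first bin with room.

def ffd(box_sizes, k):
    bins = []
    for size in sorted(box_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= k:  # first bin with enough free inputs
                b.append(size)
                break
        else:
            bins.append([size])     # no bin fits: open a new LUT
    return bins

print(ffd([3, 2, 2, 1], 4))  # -> [[3, 1], [2, 2]]
```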
The fact that the most filled unconnected LUTs are always selected pushes
less filled LUTs to the root of the tree being built. Therefore, the LUTs with
the most unused inputs will be found near the root of the tree.
a given node. The goal of the optimization is to pack the reconvergent paths
caused by a given input into just one LUT.
This strategy is illustrated in figure 3.11: two paths going through the blocks
g and h are placed in the same LUT, thus reducing the number of LUTs used
from five to four.
Figure 3.11. Exploiting reconvergent paths to reduce the amount of LUTs used
If more than one pair of fan-in LUTs share inputs, there will be several
pairs of reconvergent paths. To determine which one should be packed in the
same LUT, two approaches exist in the Chortle algorithm. The first one is an
exhaustive approach that first finds all pairs of fan-in LUTs that share inputs.
Then every possible combination is constructed by first merging the fan-in
LUTs and then proceeding with the FFD bin packing algorithm. The two-level
decomposition that produces the fewest bins and the smallest least filled bins
is retained as the solution.
A large number of pairs of fan-in LUTs sharing the same inputs causes the exhaustive approach to become impractical. To overcome this limitation, a second heuristic, called Maximum Share Decreasing (MSD), can be used. The goal is to maximize the sharing of inputs when fan-in LUTs (boxes) are packed into bins. The MSD iteratively chooses the next box to be packed into a bin according to the following criteria: 1) the box has the greatest number of inputs, 2) the box shares the greatest number of inputs with any existing bin, and 3) it shares the greatest number of inputs with any of the remaining boxes.
The first criterion ensures that the MSD algorithm works like the FFD if no reconvergent path exists. The second and third criteria help to place boxes that share inputs into the same bins. The chosen box is then packed into the bin with which it shares the most inputs without exceeding the bin capacity, i.e. the number of inputs. If no existing bin can accommodate the chosen box, then a new bin is created and the chosen box is packed into it.
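The first-fit-decreasing packing step described above can be sketched as follows. The function name and the example boxes are hypothetical, and input sharing is modelled simply by set unions: a box fits a bin when the union of their input sets stays within the capacity K, so shared inputs are counted only once.

```python
def ffd_pack(boxes, K):
    """First-fit-decreasing: pack boxes (each a set of input signals)
    into bins holding at most K distinct inputs each."""
    bins = []  # each bin is a set of inputs
    # Process boxes by decreasing number of inputs (the FFD order).
    for box in sorted(boxes, key=len, reverse=True):
        for b in bins:
            # A box fits if the union of inputs stays within capacity K;
            # shared inputs are counted only once.
            if len(b | box) <= K:
                b |= box
                break
        else:
            bins.append(set(box))
    return bins

# Packing six fan-in boxes into 4-input bins:
boxes = [{'a', 'b'}, {'b', 'c'}, {'d', 'e', 'f'}, {'g'}, {'a', 'c'}, {'h', 'i'}]
print(ffd_pack(boxes, 4))  # three bins, e.g. {d,e,f,g}, {a,b,c}, {h,i}
```

Because the boxes sharing inputs (`{a,b}`, `{b,c}`, `{a,c}`) collapse into one bin, six boxes fit into three bins instead of the naive one-bin-per-box packing.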
Figure 3.12. Logic replication at fan-out nodes to reduce the number of LUTs used
The Chortle algorithm exists in two versions: one with replication of all nodes and one without replication. The solution that produces the fewest LUTs is retained.
The Chortle algorithm presented in this section is known in the literature [87] as Chortle-crf.6
As seen in figure 3.10, the paths generated by the Chortle algorithm can become very long if no effort is spent on reducing the delay. This problem is addressed in another version of the Chortle algorithm, Chortle-d.7 Instead of using bin packing to minimize the number of LUTs used, Chortle-d focuses its bin packing strategy on reducing the number of levels in the final design. Because we investigate a delay-optimal algorithm in the next section, the Chortle-d algorithm will not be considered further. The FlowMap algorithm that we present next focuses on delay minimization using a network flow approach, a technique that was also used in the MIS-pga algorithm of Murgai et al. [163].
6 c stands for the constructive bin packing, r for the reconvergent paths and f for the logic replication.
7 d stands for delay.
the nodes in the LUTs according to their level number. We present the two
steps in the next paragraphs.
The edge cut-size e(X, X̄) of a cut (X, X̄) is the sum of the capacities of the crossing edges.
The label l(t) of a node t is defined as the depth of the LUT, which contains
t in an optimal mapping of the cone at t.
From the previous discussion, K-LUT mapping with minimal delay reduces to the problem of finding a K-feasible cut with minimum height8 for each node in the graph. The label l(t) of the node t is therefore given by the following formula:

l(t) = min { h(X, X̄) + 1 : (X, X̄) is a K-feasible cut of Nt }
The following lemma results from the previous discussion and is therefore given without proof.
Lemma 3.7 The minimum depth of any mapping solution of Nt is given by l(t) = hmin(Nt) + 1, where hmin(Nt) is the minimum height of a K-feasible cut of Nt.
Figure 3.14 illustrates the FlowMap labelling method. Because there is a minimum-height 3-feasible cut in Nt, we have l(t) = 2, and the optimal 3-LUT mapping solution for Nt is shown in the figure.
The goal of minimizing the delay therefore reduces to efficiently computing the minimum-height K-feasible cut for each node visited in the graph. The FlowMap algorithm constructs a mapping solution with minimum delay in
8 It is assumed here that no cut (X, X̄) is computed with primary input nodes in X̄.
time O(Km), where m is the number of edges in the network. Further transformations are required on the networks Nt in order to reach this goal. The node labels defined by the FlowMap scheme satisfy the following property.
Lemma 3.8 Let l(t) be the label of node t; then l(t) = p or l(t) = p + 1, where p is the maximum label of the nodes in input(t).
Proof Let t′ be any node in input(t). Then for any cut (X, X̄) in Nt, either
1 t′ ∈ X, or
2 (X, X̄) also determines a K-feasible cut (X′, X̄′) in Nt′ with h(X′, X̄′) ≤ h(X, X̄), where X′ = X ∩ Nt′ and X̄′ = X̄ ∩ Nt′. These two cases are illustrated in figure 3.15.
Figure 3.15. Illustration of the two cases in the proof of Lemma 3.8
Proof Let Ht denote the set of nodes in Nt that are collapsed into t′.
⇐ If Nt′ has a K-feasible cut (X′, X̄′), let X = X′ and X̄ = (X̄′ − {t′}) ∪ Ht; then (X, X̄) is a K-feasible cut of Nt. Because no node in X′ (= X) has a label of p or larger, we have h(X, X̄) ≤ p − 1. Furthermore, according to Lemma 3.8, l(t) ≥ p, which implies h(X, X̄) ≥ p − 1. Therefore h(X, X̄) = p − 1.
Figure 3.17. Transforming the node cut constraints into the edge cut ones
each bridging edge is one, the edge cut size in Nt′′ is equivalent to the node cut size in Nt′. Based on this observation, the following lemma can be stated.
Lemma 3.10 Nt′ has a K-feasible cut if and only if Nt′′ has a cut whose edge cut size is no more than K.
The Ford and Fulkerson method [84] [55] can be used to check whether a cut with edge cut-size less than or equal to K exists. We first briefly provide some more background on flow networks, which is important for understanding the testing procedure.
In a network N with source s and sink t, a flow can be seen as a streaming of data in different directions along the edges of the network. A node might have data coming in and data going out through its edges, each of which has a given capacity. Data streaming in the s-direction (resp. in the t-direction) causes a negative (resp. positive) value of the flow at a given node. The value of the flow in the network is therefore the sum of the flows on the network edges. Formally, a flow f in N can be defined as a function from N × N to ℝ.
A residual value exists on an edge if the flow on that edge has a lower value than the edge capacity. The residual value is then the difference between the capacity and the flow's value; it can be added to the flow in order to saturate the edge. The capacity of a cut in the network is defined as the sum of all positive crossing edge capacities; negative crossing edges do not influence the capacity of a cut. An edge not saturated by the flow is called a residual edge.
The residual network of a network is made up of all the residual edges as well as the nodes that they connect. An augmenting path is a path from the source s to the sink t that contains only residual edges, i.e. a path in the residual network.
A very important relationship between a flow and a cut in a network is given
by the following corollary:
Corollary 3.11 The value of any flow in the network is bounded from
above by the capacity of any cut in the network.
This implies that the maximum flow value in the network is bounded from above by the capacity of the minimum cut of the network.
Based on the above observations and on the notions of cut and residual network, the famous max-flow min-cut theorem [55] of Ford and Fulkerson states that a flow is maximal in the network if and only if the residual network does not contain any augmenting path. The value of the maximum flow is then equal to the capacity of the minimum cut.
Applying the max-flow min-cut theorem to our problem, we can state the following: if a cut with edge cut-size less than or equal to K exists in Nt′′, then the maximum value of any flow between s and t′′ in Nt′′ will be no more than K.
Because we are only interested in testing whether the value of the cut is smaller than K, the augmenting-path method of Ford and Fulkerson can be applied to compute the maximum flow. The approach starts with a flow f whose value is set to 0, and then iteratively finds an augmenting path P in the residual network and increases the flow on P by the residual capacity cf (P ), that is, the value by which the flow on each edge of P can be increased. If no path from s to t exists, then the computation stops and returns the maximum flow value.
Because each bridging edge in Nt′′ has a capacity of one, each augmenting path from s to t′′ in the residual graph of Nt′′ increases the flow by one. Therefore, augmenting paths can be repeatedly used to check whether the maximum value of a flow associated with a cut is less than K. If K + 1 augmenting paths can be found, then the maximum flow in Nt′′ has a value greater than K; otherwise, the residual graph will not contain a (K + 1)-th path.
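This bounded augmenting-path test can be sketched as follows, under the unit-capacity assumption made above for the bridging edges; the function name and the example network are illustrative, not taken from the text.

```python
from collections import defaultdict, deque

def cut_at_most_k(edges, s, t, K):
    """Check whether the minimum s-t edge cut is <= K by searching for
    augmenting paths (Ford-Fulkerson with a BFS path search).  Every
    edge has unit capacity, as for the bridging edges in N''_t, so each
    augmenting path raises the flow by one; we stop as soon as K+1
    paths are found, since the max flow (= min cut) then exceeds K."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        residual[u][v] += 1
    flow = 0
    while flow <= K:
        # BFS for an augmenting path in the residual network.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break  # no augmenting path left: the flow is maximal
        # Augment by 1 along the path, updating residual capacities.
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= 1
            residual[v][u] += 1
            v = u
        flow += 1
    return flow <= K

# A diamond network with two edge-disjoint s-t paths: the min cut is 2.
edges = [('s', 'a'), ('a', 't'), ('s', 'b'), ('b', 't')]
print(cut_at_most_k(edges, 's', 't', 2))  # min cut 2 <= 2: True
print(cut_at_most_k(edges, 's', 't', 1))  # min cut 2 > 1: False
```

Stopping after at most K + 1 augmentations is what gives the O(Km) bound cited below: each path search costs O(m).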
Testing whether Nt′′ has a cut whose value is no more than K can therefore be done through a depth-first search starting at s and including in X′′ all nodes reachable from s. The run-time of the Ford and Fulkerson method is O(m|f∗|), where |f∗| is the value of the maximal flow computed. In the FlowMap algorithm, this value is at most K, which corresponds to the number of iterations performed to find the K augmenting paths. Since finding an augmenting path takes O(m) time (m being the number of edges of Nt′′), testing whether a cut with edge cut-size less than or equal to K exists can be done in time O(Km). The resulting cut (X′′, X̄′′) in Nt′′ induces a cut (X′, X̄′) in Nt′, which in turn induces a K-feasible cut (X, X̄) in Nt.
Because the above procedure must be repeated for each node in the original Boolean network, we conclude that the labelling phase, i.e. computing the labels of all nodes in the graph, can be done in O(Kmn) time, where n is the number of nodes and m the number of edges in N.
Node Mapping. In its second phase, the FlowMap algorithm maps the nodes into K-LUTs. The algorithm works on a set L of node outputs to be implemented in LUTs. Each output will therefore be implemented as a LUT output.
Initially, L contains all primary outputs. For each node v in L, it is assumed that the minimum-height K-feasible cut (X, X̄) in Nv has been computed in the first phase. A K-LUT LUTv is then generated to implement the function of v as well as that of all nodes in X̄. The inputs of LUTv are the crossing edges from X to X̄, of which there are at most K, since the cut is K-feasible. L is then updated to (L − {v}) ∪ input(X̄). A node w belonging to two different cut-sets X̄u and X̄v will automatically be duplicated. Algorithm 5 provides the pseudo-code that summarizes all the steps of the FlowMap method presented here.
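The mapping phase can be sketched as the following worklist loop. The helpers `cut_of` and `inputs_of` are stand-ins for the labelling-phase results (not part of the text): `cut_of[v]` is the node set that the min-height K-feasible cut places inside LUTv, and `inputs_of(cluster)` returns the signals crossing into that cluster.

```python
def map_nodes(primary_outputs, cut_of, inputs_of):
    """Second phase of FlowMap (a sketch): generate one K-LUT per
    output in the working list L, then add the non-primary inputs of
    that LUT back to L so they become LUT outputs themselves."""
    luts = {}
    L = set(primary_outputs)
    while L:
        v = L.pop()
        if v in luts:
            continue
        cluster = cut_of[v]              # nodes implemented inside LUT_v
        lut_inputs = inputs_of(cluster)  # crossing signals, at most K
        luts[v] = (cluster, lut_inputs)
        # Every non-primary input of LUT_v must itself be implemented;
        # a node needed by two different cuts is duplicated implicitly.
        L |= {w for w in lut_inputs if w in cut_of}
    return luts

# Tiny network: c = f(a, b), t = g(c, a); a and b are primary inputs.
fanin = {'c': {'a', 'b'}, 't': {'c', 'a'}}
def inputs_of(cluster):
    return {p for n in cluster for p in fanin.get(n, set())
            if p not in cluster}

luts = map_nodes(['t'], {'t': {'t'}, 'c': {'c'}}, inputs_of)
print(sorted(luts))  # ['c', 't']
```

Starting from the single primary output t, the loop generates LUTt with inputs {c, a}, then discovers that c needs its own LUT with inputs {a, b}.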
Improvements to the FlowMap algorithm have the goal of reducing the number of LUTs used while keeping the delay minimal. The first improvement is applied during the mapping of the nodes into the LUTs: the algorithm tries to find the K-feasible cut with maximum volume. This allows the final LUTs to contain more nodes, thus reducing the area used. The second possibility is the use of the so-called flow-pack method, which generalizes the predecessor packing used in DAG-Map [46]. The idea of predecessor packing is to pack a K-input LUT v into the same K-input LUT as u, if v is a fan-out-free fan-in LUT of u and if the total number of inputs of u and v is at most K. The flow-pack method generalizes predecessor packing to the set of predecessors Pu of the node u (including u), provided that Pu has a number of inputs less than or equal to K.
Figure 3.18 illustrates the predecessor packing approach. The gate decomposition presented in the Chortle algorithm can be used as well to reduce the number of LUTs.
Figure 3.18. Improvement of the FlowMap algorithm through efficient predecessor packing
4. Conclusion
In this chapter, the general use of FPGAs in different systems was explained, as well as the different steps needed to synthesize the hardware modules that will be executed in the FPGAs at run-time. We have taken a brief tour of logic synthesis, with the focus placed on LUT technology mapping, which is the step by which FPGAs differ from other logic devices. Our goal was to present two algorithms that best match the optimization requirements, namely minimal delay and minimal area. Research in FPGA synthesis keeps on going, and several new LUT-based technology mapping algorithms are expected to be developed.
One of the key points for the success of microprocessors is the ease of programming such systems. This is due in part to the maturity of compilers as well as to the operation mode of microprocessors, the Von Neumann paradigm, which allows any sequential program to be executed on the underlying hardware. In almost three decades of progress in compilers and microprocessors, a huge number of algorithms have been developed, coded and deployed in high-level languages like FORTRAN, C, C++, Java, etc. Most of those existing programs can be executed with few modifications and little porting effort on new platforms. Given the very attractive nature of software programming, a very large community of programmers has grown, providing software code for most existing problems. The consequence of this development is that high expectations in programmability are placed on new hardware platforms. A new and highly innovative hardware computing platform, providing the best architectural organization to speed up applications, will certainly fail to be adopted if its programmability is poor. A less competitive hardware platform, in turn, will certainly succeed if the portability of existing algorithms is shown to be easy, with only a small increase in computation time. This observation has led to the development of languages and frameworks which allow a compiler to be generated for a specified hardware description [168] [187].
While a certain degree of maturity has been reached in software programming, or better said in sequential programming languages and compilers, parallel programming has not experienced the same advancement. The parallel implementation of a given application requires two main steps. First, the application must be partitioned in order to identify the dependencies among the different parts and determine the parts that can be executed in parallel. This partitioning step requires a good knowledge of the structure of the application. After partitioning, a mapping phase is required to allocate the independent blocks
1. Modelling
High-level synthesis (HLS) deals with the specification of a given application at a very high level of abstraction as well as with its implementation on a given platform. The starting point is to specify the application in a given high-level language or tool. For this modelling step, several possibilities exist for capturing the behaviour of a system, from the very simple finite state machines (FSM) and their extensions like Statecharts and control dataflow graphs, to very complex tools like Petri nets, each of which has a different expressive power. We first present the dataflow graph, which is used as the model in this chapter. We also consider two extensions of dataflow graphs, the sequencing graphs and the finite state machine with datapath, and explain why those two models are not used in this context.
Definition 4.3 (Latency, length, height, area, weight of nodes and edges)
Given a node vi ∈ V and its implementation Hvi as a rectangular module in hardware:
1 li denotes the length and hi the height of Hvi ,
2 ai = li × hi denotes the area of Hvi .
For the sake of simplicity, we will use the notation vi to denote a node of the graph as well as its hardware implementation Hvi .
Any program written in a high-level language can be compiled into a dataflow graph, provided that the program is free of loops and branching instructions. This restriction does not match reality, since loops and branch instructions appear in most programs, to evaluate branching conditions at run-time and to decide on the segment to be executed according to the value of the condition variables. In the case of non-nested loops, the body of a loop is always a set of instructions that can be represented using a dataflow data structure. Several extensions of dataflow graphs exist to capture programs with control structures and loops. We will consider two of them in this chapter: the sequencing graphs and the finite state machines with datapath.
Example 4.4 Figure 4.2 shows an example of a sequencing graph with a branching node BR containing two branching paths.
According to the conditions that node BR evaluates, one of the two sub-sequencing graphs (1 or 2) can be activated. In order to implement a loop, only one sub-sequencing graph, in which the body of the loop is implemented, will be considered. The node BR will then evaluate the exit condition and branch to the next node of the hierarchy level if the condition holds. Otherwise, the body of the loop is re-entered by reactivating the corresponding sub-sequencing graph.
Figure 4.2. Sequencing graph with a branching node linking to two different sub graphs
A further extension of the dataflow graph model is the finite state machine with datapath (FSMD).1 We adopt in this section the FSMD definitions and terminology from Vahid et al. [203]. A finite state machine with datapath (FSMD) can be formally defined as a 7-tuple < S, I, O, V, F, H, s0 >, where:
S = {s0 , s1 , · · · , sl } is a set of states,
I = {i0 , i1 , · · · , im } is a set of inputs,
O = {o0 , o1 , · · · , on } is a set of outputs,
V = {v0 , v1 , · · · , vn } is a set of variables,
F : S × I × V → S is a transition function that maps a tuple (state, inputs, variables) to a next state,
H : S → O + V is an action function that maps the current state to outputs and variables,
s0 is an initial state.
FSMDs differ in some fundamental ways from traditional finite state machines. First, the transition function operates on arbitrarily complex data types, as in high-level programming languages; second, the transition and action functions may include arithmetic operations rather than just Boolean operations. The arithmetic operations and the complex data types implicitly define a datapath structure in the specification.
The transformation of a program into an FSMD is done by transforming the statements of the program into FSMD states. The statements are first classified into three categories:
the assignment statements: For an assignment statement, a single state is created that executes the assignment action. An arc connecting the newly created state with the state corresponding to the next program statement is created.
the branch statements: For a branch statement, a condition state C and a join state J, both with no action, are created. An arc, labeled with the first branch condition, is added from the conditional state to the first statement of the first branch. A second arc, labeled with the complement of the first condition ANDed with the second branch condition, is added from the conditional state to the first statement of the second branch. This process is repeated until the last branch condition. Each state corresponding to the last statement in a branch is then connected to the join state. The join state is finally connected to the state corresponding to the first statement after the branch.
1 In the original definition from Gajski [91], the extension is done on a finite state machine (FSM) in order to support more complex data types and variables as well as complex operators.
and the loop statements: For a loop statement, a condition state C and a join state J, both with no action, are created. An arc, labeled with the loop condition and connecting the conditional state C with the state corresponding to the first statement in the loop body, is added to the FSMD. Accordingly, another arc is added from C to the state corresponding to the first statement after the loop body; this arc is labeled with the complement of the loop condition. Finally, an edge is added from the state corresponding to the last statement in the loop to the join state, and another edge is added from the join state back to the conditional state.
The transformation steps from a given program to an FSMD are illustrated in figure 4.3.
Example 4.5 Let us model the greatest common divisor (GCD) of two numbers using an FSMD, as explained in [203]. The sequential version of the GCD is given in algorithm 6, and the corresponding FSMD is shown in figure 4.4.
The loop state (C1) and the branching state within the loop are white, and the other states are grey. We also labeled the states with the action to be taken in each state.
Figure 4.4. Transformation of the greatest common divisor program into an FSMD
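The behaviour of such an FSMD can be simulated directly. The sketch below mirrors the GCD machine: a loop-condition state, a branch-condition state, and two assignment states updating the variables; the state names are ours, not those of figure 4.4.

```python
def gcd_fsmd(a, b):
    """Simulate the GCD FSMD: state C1 tests the loop condition,
    state C2 the branch condition, and two assignment states update
    the datapath variables a and b."""
    state = 'C1'
    while True:
        if state == 'C1':        # loop condition: a != b ?
            state = 'C2' if a != b else 'DONE'
        elif state == 'C2':      # branch condition: a > b ?
            state = 'SUB_A' if a > b else 'SUB_B'
        elif state == 'SUB_A':   # assignment state: a := a - b
            a = a - b
            state = 'C1'
        elif state == 'SUB_B':   # assignment state: b := b - a
            b = b - a
            state = 'C1'
        else:                    # DONE: output the result
            return a

print(gcd_fsmd(12, 18))  # prints 6
```

Note how the arithmetic (the subtractions and comparisons) lives in the transition and action functions, which is exactly what distinguishes an FSMD from a plain FSM.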
number of instances. In the GCD case, the types of resources needed are comparators, subtractors, registers and multiplexers. Two registers and two multiplexers are instantiated, while only one subtractor and one comparator are used. Instead of using just one subtractor and two multiplexers, we could choose to use two subtractors and no multiplexers. The next step after the allocation is the binding: in this step, each operation is mapped to an instance of a given resource. Since many operators can be mapped to the same instance of a resource, a schedule is used in the third step to decide which operator should be assigned a given resource at a given period of time. Formally, allocation, binding and scheduling can be defined as follows:
Figure 4.5. The datapath and the corresponding FSM for the GCD-FSMD
Example 4.9 Let us illustrate this major difference using the dataflow graph for the computation of the functions x = ((a×b)−(c×d))+((c×d)−(e−f )) and y = ((c × d) − (e − f )) − ((e − f ) + (g − h)), as shown in figure 4.6.
in the same step, although enough basic resources are available. But those resources were used to implement one adder and one subtractor instead of two subtractors. The adder cannot be used in the first step, because it depends on a subtractor that must be executed first. The minimum execution delay is four steps if we use chaining3 in the schedule (figure 4.7).
Figure 4.7. HLS of the graph in figure 4.6 on an architecture with one instance of each of the resource types +, ∗ and −
3 Chaining operations means sequentially executing a set of operations in the same time slot, provided that their overall delay does not exceed the length of the time slot.
The second major difference concerns the control of the computation steps of a given application. In general high-level synthesis, the application is specified using a structure that encapsulates a datapath (a set of computational resources and their interconnections) and a control part. The synthesis process then allocates the resources to operators at different times according to a computed schedule. This temporal resource assignment is controlled by a separate part of the system, for which a synthesis is required. In reconfigurable devices, a set of hardware modules implemented as datapaths normally competes for execution on the chip. Instead of a separate control module setting the control lines of the datapath modules, a processor is used to control the selection of the hardware modules by means of reconfiguration. The same processor is also in charge of activating the single resources in the corresponding hardware accelerators. For the modelling of the datapath, a dataflow graph is usually sufficient, since loop and branch control is left to the processor. While general high-level synthesis has to control each single operator, high-level synthesis for reconfigurable devices deals with a set of operators with different execution delays at a time. This means that for a given application, the resources on the device are not allocated to only one operator, but to a set of operators that must be placed at the same time and removed at the same time. Consequently, an application must be partitioned into sets of operators. The partitions will then be successively implemented at different times on the device. This process, called temporal partitioning, allows an application to be sequentially computed by allowing a temporal sharing of resources among different sets of operators. We next present the temporal partitioning problem and some of the solution approaches developed to solve it.
4 Hardware module.
The rest of the chapter is based on the assumption that an underlying dataflow
graph model is available. We therefore redefine the schedule on the basis of
the dataflow graph.
5 A crossing edge is an edge that connects a component in a partition with another component outside the partition.
use of the device area. Temporal partitioning normally targets non-partially reconfigurable devices. It has the advantage that the resulting partition can be implemented as just one circuit; therefore, floorplanning efforts are left to the synthesis tools. However, when components with shorter run-times are placed in the same partition as components with longer run-times, the shorter-running components remain idle for a long period of time, resulting in a waste of device resources. The wasted resources of a given partition can be formally defined as follows:
Figure 4.11 graphically illustrates the wasted resources of a partition. The run-time of the partition is determined by the component v1; the shaded area represents the overall wasted resources of the partition.
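The shaded area of figure 4.11 can be computed as in the sketch below. The formula is our reading of the figure (not a quote from the text): each component v occupies its area for the whole partition run-time T but computes only for its own run-time, wasting the difference.

```python
def wasted_resources(partition):
    """Wasted resources of a partition, assuming the definition
    suggested by figure 4.11: the partition runs for T = max run-time,
    and each component v wastes a_v * (T - t_v) area-time units.
    `partition` maps component name -> (area, runtime)."""
    T = max(t for _, t in partition.values())
    return sum(a * (T - t) for a, t in partition.values())

# v1 runs longest (T = 100), so only v2 and v3 waste area-time:
parts = {'v1': (40, 100), 'v2': (20, 30), 'v3': (10, 50)}
print(wasted_resources(parts))  # 20*70 + 10*50 = 1900
```

A partitioner that groups components with similar run-times drives this quantity toward zero, which is the point made in the paragraph above.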
tween the number of edges in E and the number of all edges that can be built with the nodes of G.
For a given subset V ′ of V , the connectivity of V ′ is defined as the ratio between the number of edges connecting the nodes of V ′ and the number of all edges that can be built with the nodes of V ′ .
The connectivity of a set provides a means to measure how strongly the components of the set are connected. High connectivity means a strongly connected set, while low connectivity reflects a graph in which many modules are not connected to each other. The connectivity may be used to evaluate how well a partitioning algorithm, whose goal is the minimization of the communication cost, has performed. We formally define the quality of a partitioning as follows:
Definition 4.18 (Quality) Given a dataflow graph G = (V, E) and a partitioning P = {P1 , ..., Pn } of G, we define the quality

Q(P) = \frac{1}{n} \sum_{i=1}^{n} con(P_i)

of P as the average connectivity over all the partitions Pi , 1 ≤ i ≤ n.
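Connectivity and quality are straightforward to compute; the sketch below assumes an undirected view of the graph, so the number of possible edges over a node set V′ is |V′| · (|V′| − 1)/2, and the function names are ours.

```python
def connectivity(nodes, edges):
    """Connectivity of a node set: the ratio of the number of edges
    joining nodes of the set to the number of all possible edges
    (|V'| * (|V'| - 1) / 2) over those nodes."""
    possible = len(nodes) * (len(nodes) - 1) / 2
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    return inside / possible

def quality(partitions, edges):
    """Quality of a partitioning (definition 4.18): the average
    connectivity over all partitions."""
    return sum(connectivity(p, edges) for p in partitions) / len(partitions)

# A 4-node graph with 5 of the 6 possible edges:
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(connectivity({1, 2, 3, 4}, edges))  # 5/6
print(quality([{1, 2}, {3, 4}], edges))   # (1 + 1) / 2 = 1.0
```

Here the partitioning {1, 2} / {3, 4} keeps one internal edge in each partition, so every partition is fully connected and the quality reaches 1.0, even though two edges still cross the cut.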
Most of the target architectures for which temporal partitioning is used are made up of a reconfigurable device connected to a host processor by a bus, for instance the PCI bus.6 The number of bus lines is limited; therefore, the communication has to be time-multiplexed on the bus if a large amount of data has to be transported. This happens, for example, when a partition has to be replaced. Recall that the communication between the partitions is done by a set of registers inside the reconfigurable device. All the temporary data in those registers have to be saved in the communication memory before reconfiguration. The device is then reconfigured, and the data are copied back into the reconfigurable device registers. Minimizing the communication overhead can be done by minimizing the weighted sum of crossing edges among the partitions. This will also minimize the set of registers needed to communicate between the generated partitions, thus reducing the size of the communication memory on one side and the communication time on the other. This goal is likely to be reached if highly connected components are placed in the same partition.
After partitioning with a given algorithm, the quality of the partitioning determines whether the algorithm performed well or not. If a graph is highly connected and the partitioning algorithm performs with low quality, then there will be more edges connecting different partitions and therefore more data exchange among the partitions. But if the graph is highly connected and the partitioning algorithm performs with high quality, then the components within the partitions are highly connected and there will be fewer edges connecting the partitions.
Example 4.19 Figures 4.12 and 4.13 illustrate the connectivity of a graph
and the quality of the algorithm that was used for the partitioning. In figure
4.12 a graph with a connectivity of 0.24 is partitioned by a first algorithm,
which produces a quality of 0.25. The same graph is partitioned by another
algorithm in figure 4.13 with a quality of 0.45. In the first case, we have 6
edges connecting the two partitions, while there are only two edges connecting
the partitions in the second case. The second case is, therefore, better than the
first case for data communication between the partitions.
Figure 4.12. Partitioning of the graph G with connectivity 0.24 with an algorithm that pro-
duces a quality of 0.25
Figure 4.13. Partitioning of the graph G with connectivity 0.24 with an algorithm that pro-
duces a quality of 0.45
can be grouped into four different categories. In the first category are the list-scheduling based methods. The second category comprises exact methods, which use integer linear programming equations to capture the optimization problem; the solution of the equations provides the exact solution of the temporal partitioning problem. The third category covers network-flow based algorithms, while in the fourth category we have spectral methods, based on the computation of the eigenvalues of a matrix derived from the dataflow graph. The methods in the first, third and fourth categories can also be grouped under the umbrella of recursive bipartitioning, because the partitions are iteratively generated through the bipartition of a remaining set of components.
highest end time among all its predecessors. The ASAP algorithm idealizes the binding process by assuming an unlimited amount of available resources, and assigns each operation as soon as it is ready to be executed, therefore providing the lowest starting time of the tasks in the graph. The pseudo-code of the ASAP algorithm, which computes the resulting schedule ς, is provided in algorithm 7.
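The ASAP rule, each node starts at the highest end time among its predecessors, can be sketched as a short recursive traversal; the function name and example data are illustrative, with latencies chosen to match example 4.20.

```python
def asap(nodes, edges, delay):
    """ASAP schedule (a sketch of algorithm 7): every operation starts
    as soon as all of its predecessors have finished; resources are
    assumed unlimited, so only data dependencies constrain starts.
    `edges` holds (u, v) precedence pairs, `delay[v]` the latency."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    start = {}
    def visit(v):
        if v not in start:
            # start time = highest end time among all predecessors
            start[v] = max((visit(u) + delay[u] for u in preds[v]), default=0)
        return start[v]
    for v in nodes:
        visit(v)
    return start

# Two multiplications (100 clocks) feed a subtraction, which feeds an
# addition (50 clocks each), as with the latencies of example 4.20:
delay = {'v1': 100, 'v2': 100, 'v3': 50, 'v4': 50}
edges = [('v1', 'v3'), ('v2', 'v3'), ('v3', 'v4')]
print(asap(list(delay), edges, delay))  # {'v1': 0, 'v2': 0, 'v3': 100, 'v4': 150}
```

The multiplications start at time 0, the subtraction at 100 when both products are ready, and the addition at 150.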
Example 4.20 Assuming a latency of 100 clocks for the multiplication and 50 clocks for the addition as well as for the subtraction, an example of ASAP scheduling is provided in figure 4.14. The nodes are labeled with their numbers as well as their starting times as computed by the algorithm. The labels on the edges represent the computation delay of the preceding node; the data transmission delay is neglected.
criteria, it is possible that all operators in the ready list are on a critical path, which means that their mobility is zero. As a consequence, the depth of each operator is increased by one, thus increasing the latency of the graph's execution.
Example 4.22 We consider the graph of figure 4.16 on the left side, in which each node is labelled with its priority. The nodes must be scheduled on a resource set consisting of an adder and a multiplier. With the priority of a node defined as its depth, the resulting list schedule is shown on the right side.
Figure 4.16. An example of list scheduling using the depth of a node as priority
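A minimal list-scheduling sketch follows, with hypothetical names and a deliberately small example: at every time step, the ready operations compete for the limited resource instances and ties are broken by the priority (here, the node's depth).

```python
def list_schedule(nodes, edges, kind, delay, resources, priority):
    """List scheduling sketch: `kind[v]` names the resource type of v,
    `resources` gives the number of instances per type, and `priority`
    orders the ready list (highest priority scheduled first)."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    start, finish, t = {}, {}, 0
    busy_until = {k: [0] * n for k, n in resources.items()}
    while len(start) < len(nodes):
        # ready list: unscheduled nodes whose predecessors have finished
        ready = [v for v in nodes
                 if v not in start
                 and all(u in finish and finish[u] <= t for u in preds[v])]
        ready.sort(key=priority, reverse=True)
        for v in ready:
            units = busy_until[kind[v]]
            for i, free_at in enumerate(units):
                if free_at <= t:          # a free instance exists
                    start[v], finish[v] = t, t + delay[v]
                    units[i] = finish[v]
                    break
        t += 1
    return start

# Two multiplications compete for a single multiplier, so one of them
# is deferred; the dependent addition then waits for both:
kind = {'m1': 'mul', 'm2': 'mul', 'a1': 'add'}
delay = {'m1': 2, 'm2': 2, 'a1': 1}
edges = [('m1', 'a1'), ('m2', 'a1')]
depth = {'m1': 1, 'm2': 1, 'a1': 0}
print(list_schedule(list(kind), edges, kind, delay,
                    {'mul': 1, 'add': 1}, lambda v: depth[v]))
# {'m1': 0, 'm2': 2, 'a1': 4}
```

Unlike ASAP, the resource constraint pushes m2 back to step 2, illustrating how a limited resource set stretches the schedule.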
for each node v of the graph. The probability p(v, t) is zero outside the mobility interval of v and equals the reciprocal of the mobility within the mobility interval. Formally, we have:

p(v, t) = \begin{cases} \dfrac{1}{mob(v) + 1} & \text{if } t \in [ASAP(v), ALAP(v)] \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)

where mob(v) = ALAP (v) − ASAP (v).
A distribution function d(t, k) is calculated as the sum of the probabilities of all the operations of a resource type k. Formally, we have:

d(t, k) = \sum_{\{v \in V,\ \alpha(v) = k\}} p(v, t) \qquad (2.2)
This can be plotted in a graph, called the distribution graph, that indicates the concurrency of similar operations over the schedule steps. Whenever many operators compete for a small number of resources, the operations that produce a global increase of concurrency in the graph are selected and assigned the resources.
Example 4.23 Consider once again the graph of figure 4.6, whose ASAP and ALAP times are computed in figures 4.14 and 4.15. Nodes v1 , v2 , v5 , and v8 each have a mobility of zero. Therefore, for these nodes, p(v, i) equals 1 within their single scheduling step and 0 for all other i ∈ [0, 200].
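Equations (2.1) and (2.2) translate directly into code; the sketch below uses our own tiny example (one fixed and one mobile addition), not the graph of figure 4.6.

```python
def p(v, t, asap_t, alap_t):
    """Probability that operation v is scheduled in step t (eq. 2.1):
    the reciprocal of mob(v)+1 inside the mobility interval, 0 outside."""
    if asap_t[v] <= t <= alap_t[v]:
        return 1.0 / (alap_t[v] - asap_t[v] + 1)
    return 0.0

def d(t, k, kind, asap_t, alap_t):
    """Distribution of resource type k at step t (eq. 2.2): the sum of
    the probabilities of all operations bound to type k."""
    return sum(p(v, t, asap_t, alap_t) for v in kind if kind[v] == k)

# v1 may run in step 0 or 1 (mobility 1), v2 is fixed to step 0:
asap_t = {'v1': 0, 'v2': 0}
alap_t = {'v1': 1, 'v2': 0}
kind = {'v1': 'add', 'v2': 'add'}
print(d(0, 'add', kind, asap_t, alap_t))  # 0.5 + 1.0 = 1.5
print(d(1, 'add', kind, asap_t, alap_t))  # 0.5 + 0.0 = 0.5
```

Plotting d(t, 'add') over t gives the distribution graph: step 0 is the congested one, so a force-directed scheduler would try to move v1 to step 1.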
As stated earlier, the scheduling is done on the basis of the force concept, whose role is to attract operators to, or repel them from, a scheduling step. Two types of forces are defined: the self-force and the predecessor-successor force. The self-force is the one that relates an operation to the different control steps in which it can be scheduled. For a given node v, the self-force is formally defined as follows:

\text{self-force}(v, t) = d(t, k) - \frac{1}{mob(v) + 1} \sum_{m = ASAP(v)}^{ALAP(v)} d(m, k) \qquad (2.3)
tDFG = k × CH + Σ_{i=1}^{n} tPi   (2.4)
High-Level Synthesis for Reconfigurable Devices 127
The sum of the assignment variables of a given node over all partitions is
exactly one, which means that there is exactly one value i for which the variable
yvi is one. This is the index i of the partition in which v is placed.
2 Precedence constraint: This constraint is used to control the assignment
of data-dependent nodes of the dataflow graph to partitions. A node v that
is data dependent on a node u must be placed in a partition with an index
greater than or equal to that of the partition into which the node u is placed.
This constraint is captured by the following equation:

∀(u, v) ∈ E : Σ_{i=1}^{n} i × yui ≤ Σ_{i=1}^{n} i × yvi   (2.7)
The two sums define the index of the partition in which the components are
placed.
3 Resource constraint: The resource constraint defines the constraints imposed
by the architecture used. This can be limited to the reconfigurable device or
extended to the complete system into which the reconfigurable device is
integrated. For a reconfigurable device, the area constraint as well as the
constraint on the number of terminals must be specified.
The area constraint states that the total amount of resources assigned to a
given partition must not exceed the amount of resources available on the
chip. Recall that the computation resources in a reconfigurable device are
defined in terms of the area they occupy on the device. The constraint can
therefore be expressed in terms of device area: the total area assigned to a
given partition must not exceed the device area. This is formally defined by
equation 2.8:

∀Pi ∈ P : Σ_{v∈Pi} a(v) ≤ a(H)   (2.8)
The terminal constraint, defined in equation 2.9, states that the total number
of inputs and outputs of a partition must not exceed the total number of pins
on the device:

∀Pi ∈ P : Σ_{(u∈Pi)∧(v∉Pi)} wuv + Σ_{(u∉Pi)∧(v∈Pi)} wuv ≤ p(H)   (2.9)

where wuv is the width of the edge (u, v) and Ms is the size of the commu-
nication memory.
Resource constraint: assuming a device with a size of 200 LUTs, and 100
LUTs for the multiplication, 50 LUTs each for the addition, the comparison
and the multiplexer, we have:

Partition P1 : Σ_{u=1}^{7} a(u) × yu1 = (100 + 50 + 50) ≤ a(H) = 200
Partition P2 : Σ_{u=1}^{7} a(u) × yu2 = (100 + 50 + 50) ≤ a(H) = 200
Partition P3 : Σ_{u=1}^{7} a(u) × yu3 = (100) ≤ a(H) = 200
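The precedence and area constraints above can be checked mechanically for a candidate partitioning. The sketch below is a brute-force verifier, not an ILP solver; node names, LUT costs and the device size are illustrative assumptions that mirror the numbers of the example.

```python
# Checks a candidate temporal partitioning against the precedence
# constraint (eq. 2.7) and the area constraint (eq. 2.8).

def valid_partitioning(part, edges, area, device_area):
    """part: {node: partition index}, edges: (u, v) with u producing for v,
    area: {node: LUT cost}. Returns True iff both constraints hold."""
    # precedence: a consumer may not run in an earlier configuration
    if any(part[u] > part[v] for u, v in edges):
        return False
    # area: the nodes of each partition must fit on the device together
    totals = {}
    for v, i in part.items():
        totals[i] = totals.get(i, 0) + area[v]
    return all(total <= device_area for total in totals.values())
```

An ILP solver searches the space of such assignments; the checker only states whether a given one is feasible.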
The general problem of the ILP approaches for partitioning a DFG is the
computation time, which grows drastically with the size of the problem in-
stance. The algorithm can be applied only to small examples. Branch and
bound strategies are sometimes used to limit the search space. In this case,
132 RECONFIGURABLE COMPUTING
the bounds can be computed using the ASAP/ALAP approach previously de-
scribed. To overcome this problem, some authors reduce the size of the model
by reducing the set of constraints in the problem formulation, but the number
of variables and precedence constraints to be considered still remains high.
Reconfigurable devices are no longer the tiny devices that could not hold
more than four multipliers. Their sizes have increased very fast in the past
and this will continue in the future. Temporal partitioning algorithms should
therefore be able to partition very large graphs (graphs with thousands of
nodes). Trying to formulate all the precedence constraints with the ILP ap-
proach can drastically increase the size of the model, thus making the algo-
rithm intractable.
1 s(Pi ) ≤ s(H) and p(Pi ) ≤ p(H), i.e. Pi must not be further partitioned.
2 With the ordering relation of definition 4.13, only one of the following
conditions should hold:
(Pi ≤ P̃i+1 )
(P̃i+1 ≤ Pi )
Pi and P̃i+1 are not in relation
Those conditions ensure that no cycle exists between Pi and P̃i+1 . The net-
work flow method is used in each step of the recursive bipartitioning process
to compute a cycle-free bipartition. At each step, the following processing
operations are applied to the rest graph P̃i :
1 P̃i is first transformed into a network graph P̃i′ by introducing two new
nodes into the graph P̃i . The first one is an input-free source node, whose
outputs are connected to all the primary inputs. In the same way, a sink node
is inserted in the dataflow graph and all the primary outputs are connected
as inputs to it.
2 A second transformation is done on the resulting network P̃i′ in order to
generate the graph P̃i2 :
For an edge (v1 , v2 ) ∈ P̃i′ × P̃i′ , two edges e1 = (v1 ′ , v2 ′ ) with a
capacity of c1 = 1 and e2 = (v2 ′ , v1 ′ ) with a capacity of c2 = ∞ are
added to P̃i2 .
For a multi-terminal edge in P̃i′ × P̃i′ , a bridging node is added to P̃i2 .
An edge weighted with 1 connects the source node with the bridging
node in P̃i2 × P̃i2 . For each sink node in the multi-terminal net, an
edge weighted with ∞ is added between the bridging node and the sink
nodes and between the sink nodes and the source node.
The four steps of the transformation and partitioning using the network flow
approach are illustrated in figure 4.20.
The min-cut max-flow theorem of Ford and Fulkerson is a powerful tool to
minimize the communication across a cut in polynomial time. However, the
model is constructed by inserting a great number of nodes and edges into the
original graph. The resulting graph P̃i2 may grow too big. In the worst case,
the number of nodes in the new graph can be twice the number of nodes in the
original graph. The number of additional edges also grows dramatically and
becomes difficult to handle.
Figure 4.20. Transformation and partitioning steps using the network flow approach
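The min-cut computation at the heart of the network flow method can be sketched with the classical Edmonds-Karp variant of the Ford-Fulkerson algorithm. The graph representation below is our own minimal one, not the construction of [199] [82]; by the min-cut max-flow theorem, the returned flow value equals the weight of a minimum cut.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow: cap is a dict-of-dicts of edge capacities.
    The returned value equals the weight of a minimum s-t cut, which the
    partitioner uses as the communication cost of a bipartition."""
    flow = {u: {v: 0 for v in cap[u]} for u in cap}
    # ensure reverse entries exist for residual edges
    for u in list(cap):
        for v in cap[u]:
            cap.setdefault(v, {}).setdefault(u, 0)
            flow.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in cap[u]:
                if v not in parent and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total
        # find the bottleneck capacity and augment along the path
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += aug
            flow[v][u] -= aug
        total += aug
```

Its O(V E²) bound is polynomial, which is what makes the min-cut step tractable even though the transformed graph is large.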
We start with the first problem by considering the one-dimensional7 version
as defined by Hall in [107] and illustrated in figure 4.21 [195]. Subfigure (a)
shows an example of a directed graph with four nodes and three edges
borrowed from [195], while subfigure (b) shows the one-dimensional place-
ment on a line. The small black squares represent the centers of the logic cells.
Subfigure (c) shows the two-dimensional placement of the same nodes. The
method takes into account neither the sizes of the logic cells nor the actual
locations of the logic cell connectors. The computed positions are scaled by
the width and height of the node bounding box. In subfigure (d) a complete
layout is made by placing the logic cells on valid locations.
[Figure 4.21: (a) example graph; (b) one-dimensional placement on a line; (c) two-dimensional placement with node bounding box; (d) final layout on valid locations]
7 Placement on a line
To avoid the trivial case in which xi = 0 for all i, we impose the following
normalization condition:

X^T X = 1   (2.12)

We also assume that the uninteresting solution xi = xj (for all i, j ∈ {1, .., n}),
which would place all components at the same location, is to be avoided.
Next, we define the connection matrix, the degree matrix and the Laplacian or
disconnection matrix of G as follows:
For two nodes vi and vj of the DFG connected by an edge, the connection
matrix has an entry of one in line i and column j. The degree matrix is a
diagonal matrix whose entry in line i, column i corresponds to the number of
nodes adjacent to vi . The Laplacian matrix is the difference between the degree
matrix and the connection matrix. Hall has proved in [107] that:

r = X^T BX   (2.13)
Since B is positive semi-definite (B ≥ 0) and B is of rank |V | − 1 whenever
G is connected [107], the initial problem is now reduced to the following:

minimize r = X^T BX with B ≥ 0, subject to X^T X = 1   (2.14)

minimize R = X1^T BX1 + X2^T BX2 + ... + Xk^T BXk, subject to X1^T X1 = X2^T X2 = ... = Xk^T Xk = 1   (2.20)
(Xi defines the coordinates of the nodes of V in the i-th dimension) has to be
solved. Analogous to the 1-dimensional case, the Lagrange multiplier method
is applied with k Lagrange multipliers λ1 , λ2 , ...., λk (one for each dimension).
The solutions are the eigenvectors associated with the k smallest non-zero
eigenvalues λ1 , λ2 , ...., λk . This approach is known in the literature as the
spectral method. Spectral methods have been widely used in the past for
partitioning and placement [6] [7] [118] [44] [74] [73]. Their run-time is
dominated by the computation of the eigenvalues, which can be done using
various methods on
different architectures. The most used algorithm for computing the eigenval-
ues of a matrix is the Golub-Kahan method [99], which needs O(n^3) on a
single processor for an n by n matrix. Using O(n) processors, the eigenvalues
can be computed with a parallel version of the Hestenes method [200] [172]
[119] [38] in O(n^2 S), where S is the number of so-called sweeps [119] [38].
Brent and Luk [38] conjectured that S = log(n) and therefore S ≤ 10 in
general. For sparse and quadratic matrices, the eigenvalues can be computed
in O(n^1.4) using the more efficient Lanczos method [99, 6].
1 0 0 0 0 0 0 0 0 −1 0 0 0 0
0 1 0 0 0 0 0 −1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 −1 0 0
0 0 0 1 0 0 0 0 0 0 −1 0 0 0
0 0 0 0 1 0 0 0 0 0 −1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 −1 0 0
0 0 0 0 0 0 1 0 0 0 0 0 −1 0
0 0 0 0 0 0 0 3 −1 −1 0 0 0 0
0 0 0 0 0 0 0 −1 3 0 −1 −1 0 0
−1 0 0 0 0 0 0 −1 0 3 0 0 −1 0
0 0 0 −1 −1 0 0 0 −1 0 3 0 0 0
0 0 −1 0 0 −1 0 0 −1 0 0 3 0 0
0 0 0 0 0 0 −1 0 0 −1 0 0 3 −1
0 0 0 0 0 0 0 0 0 −1 0 0 −1 1
In this example, the primary inputs and primary outputs were taken into ac-
count in the building of the matrices. This is useful in that it allows one to
consider the placement of components in the vicinity of the pins that they use.
However, the eigenvectors only define the positions of the components in the
different dimensions.
Figure 4.24. Derived partitioning from the spectral placement of figure 4.23
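The three matrices just defined can be built mechanically from an edge list. The sketch below is a minimal construction for an undirected graph with integer node indices, not the exact matrix-building code behind the example above.

```python
# Build the connection (adjacency) matrix C, the degree matrix D and the
# Laplacian B = D - C of an undirected graph with nodes 0..n-1.

def laplacian(n, edges):
    C = [[0] * n for _ in range(n)]
    for i, j in edges:
        C[i][j] = C[j][i] = 1          # connection matrix entry for edge (i, j)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = sum(C[i])            # degree of node i on the diagonal
    B = [[D[i][j] - C[i][j] for j in range(n)] for i in range(n)]
    return B
```

A quick sanity check on any such B: every row sums to zero, which is also visible in the 14 × 14 matrix of the example.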
target directed graphs in this work, some modifications have to be made to the
original KL algorithm.
The fundamental idea behind the KL-algorithm is the definition of a cut of
a bisection as well as the notion of the gain of moving a vertex from one side
of the bisection to the other. For an undirected graph, a cut is defined as the
weighted sum of all the edges crossing from one partition to another. By mov-
ing a node from one partition into the other, the number of crossing edges is
also modified and the value of the cut changed. The KL-algorithm allows a
series of moves, which reduce the bisection cut. If the gain of moving a vertex
is positive, then making that move will reduce the total cost of the cut in the
partition. During one iteration of the KL-algorithm, nodes are moved from
one side of the bisection and locked on the other side. The cost of swapping
unlocked nodes in opposite parts is then computed and the nodes with the best
gain (greatest decrease or less increase of the cut) are swapped. If all the nodes
are locked, the lowest cost partition is set to the current computed partition, if
it improves the cost of the cut. One iteration of the KL-algorithms is called
a pass. After one pass, all the nodes are unlocked and a new pass is com-
puted. The iteration terminates if a pass produces no further improvement on
the cut. For a more detailed description of the KL-methods and it’s extension
by Fiduccia and Mattheyses, refer to [137] [83] [144].
The KL-method works on undirected graphs and does not differentiate be-
tween the directions of edges. It does not matter if an edge crosses from
the first partition to the second partition or vice versa. In temporal partition-
ing, the graphs are directed due to precedence constraints, thus the original
KL-algorithm has to be modified to fit our needs. To better explain how the
one, thus producing a cycle of improvement and alteration on the cost of the
cut. To avoid this we apply two instances of the KL-algorithm on the same
bisection in parallel. The objective is to have EPi ,P̃i+1 = ∅ on the first compu-
tation path and EP̃i+1 ,Pi = ∅ on the second one. Obviously the cost of the cut
is set to |EPi ,P̃i+1 | on the first path and |EP̃i+1 ,Pi | on the second path. After one
pass on each path, we check if the objective has been reached on one path. If
this is the case, then the result is set to be the partition generated on that path.
Otherwise, a new pass is computed on the two computation paths. The gain of
moving a node is defined differently on the two computation paths.
On the first path where the goal is to have |EPi ,P̃i+1 | = 0, the gain of
moving a node j from Pi to P̃i+1 is IPi (vj ) − EP̃i+1 (vj ) and the one of
moving a node k from P̃i+1 to Pi is EP̃i+1 (vk ) − IPi (vk ).
On the second path, where the goal is to have |EP̃i+1 ,Pi | = 0, the gain of
moving a node j from Pi to P̃i+1 is EPi (vj ) − IP̃i+1 (vj ), while the gain of
moving a node k from P̃i+1 to Pi is IP̃i+1 (vk ) − EPi (vk ).
The gain defined on each computation path is the same as the one defined
in the original KL-algorithm. The modified version of the KL-algorithm pre-
sented here will produce the desired result on one path. Because the targeted
graphs are acyclic dataflow graphs, there exists a partition in which all the
edges cross from the first part to the second. Such a partition is provided, for
example, by a list-scheduling algorithm.
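The classical KL gain (external minus internal edge count), which the text reuses on each computation path, can be sketched as follows. The directed bookkeeping of the two parallel paths is omitted, and the helper names are our own.

```python
# KL-style gain computation: for a node v in block `own`, the gain of moving
# v to block `other` is the number of edges crossing to `other` (external)
# minus the number of edges staying inside `own` (internal).

def edges_between(v, part, edges):
    """Number of edges connecting node v with nodes of the set `part`."""
    return sum(1 for (a, b) in edges
               if (a == v and b in part) or (b == v and a in part))

def gain(v, own, other, edges):
    ext = edges_between(v, other, edges)
    internal = edges_between(v, own - {v}, edges)
    return ext - internal
```

A positive gain means the move reduces the cut, which is exactly the criterion each computation path of the modified algorithm applies to its own (direction-restricted) cut.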
flip-flop outputs. In logic engine mode, the device emulates a large design in
many microcycles. In each microcycle, resources are allocated to a new con-
figuration. A similar architecture has been proposed by Scalera et al. [188].
Data pipes are used here to exchange data between different configurations
also called contexts. A data pipe contains a plurality of Context Switching
Logic Arrays (CSLA) which can be used to process two 16-bit words. An in-
coming context can then pick its input data where its predecessor left off by
acquiring the intermediate data deposited on the rightmost portion of the pipe
and processing it in a pipeline from right to left. Unfortunately, the methods
developed have remained in a conceptual stage. Neither time multiplexing nor
context switching FPGAs have ever been commercialized.
A more practical approach has been proposed in [30] to reduce the recon-
figuration time. It is based on the observation that traditional list-scheduling-
based temporal partitioning algorithms can produce a series of configurations
based on the same set of operators if the components are "well ordered"8 in
the list. For example, for two consecutive configurations ζi = {C1 , ..., Cki }
and ζi+1 = {C1 ′ , ..., Cki+1 ′ } representing two partitions Pi and Pi+1 , one can
be a subset of the other, i.e. either ζi ⊆ ζi+1 or ζi+1 ⊆ ζi . If one
of those two situations arises, then the reconfiguration overhead can be re-
duced by implementing the two partitions Pi and Pi+1 in one configuration
ζnew = ζi ∪ ζi+1 . The components of ζnew will then be shared among the two
partitions Pi and Pi+1 . That means the modules required for both configu-
rations are placed on the device and wired in two different ways. Each way
corresponds to one configuration. Figure 4.26 shows a partitioning of a graph
into two partitions P0 and P1 . The set of components required to implement P1
is a subset of the set of components required to implement P0 .
With the use of multiplexers on the inputs of the operators and the use
of selection signals to connect the corresponding signals to the module inputs,
it is then possible to implement the connections defined in configuration ζi as
well as those defined in ζi+1 together. Switching from the configuration ζi
to the configuration ζi+1 can be done by setting the corresponding values of
the selection signals. The device is therefore "reconfigured" without changing
the physical configuration. We call this process configuration switching. With
configuration switching there is no need to save the registers of the FPGA in
the processor address space during reconfiguration, because no register is al-
tered. With this, configuration switching helps to reduce the data exchange
between the processor and the reconfigurable device.
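The behaviour of configuration switching can be modelled in software: the shared operator is instantiated once, and a select signal chooses its input wiring. This is an illustrative behavioural sketch; the signal names and wirings are assumptions, not the circuit of figure 4.28.

```python
# Behavioural model of configuration switching: the union of the operators
# of two partitions is instantiated once, and multiplexers on the operator
# inputs select, per configuration, which signals are wired to them.

def make_switched_adder(wiring_0, wiring_1):
    """wiring_k maps the adder's two inputs to signal names for config k."""
    def adder(signals, config):
        a, b = (wiring_0 if config == 0 else wiring_1)  # input multiplexers
        return signals[a] + signals[b]
    return adder
```

Changing `config` switches the wiring without touching the (physical) operator, which is exactly why no registers need to be saved across the switch.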
Configuration switching can be extended to a series of configurations ζ1 , ..., ζr
by extracting the smallest amount of common operators needed to implement
8 A well order defines the way components with the same level number should be placed in the list before
partitioning.
Figure 4.26. Partitioning of a graph into two sets with a common set of operators
Figure 4.28. Implementation of configuration switching with the partitions of figure 4.27
Although configuration switching can save time and reduce the data trans-
fer between the processor and the reconfigurable device, its implementation
usually requires additional resources. If the amount of additional resources
becomes too big, then it makes no sense to have a smaller number of components
implemented, but a larger logic area on the device just to realize the switch.
The tradeoff between the number of additional resources and the number of
configurations to be implemented has to be chosen carefully. The search for
a tradeoff between the number of configurations to reside in the device and
the complexity of the final design is an optimization problem, namely the
exploration of the design space, that we do not address in this book.
3. Conclusion
High-level synthesis for reconfigurable devices has benefited from the various
methods provided in the past for the general high-level synthesis problem. The
main difference between the two approaches lies in the freedom of choice of
the operator type in the case of reconfigurable devices. However, one must deal
with temporal partitioning, which allows the reuse of chip resources over time.
Time-multiplexed reuse of resources was a great matter of concern for a decade,
as the capacity of FPGAs was too small to accommodate the complete
functions to be implemented. Meanwhile, the size of FPGAs has grown a lot
and it is quite difficult to find functions that need to be temporally partitioned
in order to exploit the chip resources. The interest in temporal partitioning
methods has considerably decreased as a consequence of this growth in the
size of FPGAs. This does not mean that temporal partitioning is useless. We
still have areas of computation, like rapid prototyping, where reconfigurable
devices can provide a very cheap alternative to the very expensive existing
systems. Rapid prototyping systems are usually built with expensive machines
containing a set of FPGAs. Applications are then partitioned among the
FPGAs, which each compute a part of the function to be implemented. With
temporal partitioning, a small number of FPGAs can be used by several parts
of the applications that must not be computed at the same time.
Chapter 5
TEMPORAL PLACEMENT
In the last chapter, we presented the high-level synthesis problem for recon-
figurable devices and some solution approaches. The result is a set of partitions
that are used to reconfigure the complete device. While the implementation of
single partitions is easy, the amount of wasted resources in partitions can be very
high. Recall that the wasted resource of a component is the amount of resources
occupied by that component multiplied by the time during which the component
is idle on the device.
Wasting resources on the chip can be avoided if any single component is
placed on the chip only when its computation is required and remains on the
device only for the time it is active. With this, idle components can be replaced
by new ones, ready to run at a given point in time. Exchanging a single compo-
nent on the chip means reconfiguring the chip only at the location previously
occupied by that component. This process is called partial reconfiguration,
in contrast to full reconfiguration, where the full device must be reconfigured,
even for the replacement of a single component. In order to be exploited, par-
tial reconfiguration must be technically supported by the device, which is not
the case for all available devices. While most of the existing devices support
full reconfiguration, only a few can be partially reconfigured.
For a given set of operations to be executed, the resource allocation on the
device is a time-dependent process in which not only the placement of the com-
ponents on the device is defined, but also the time slot in which the execution
of the task must be performed. The time-dependent placement of tasks on the
device is called temporal placement.
Temporal placement can be graphically illustrated through an arrangement
of rectangular boxes in a 3-dimensional container whose base is defined by
the surface of the device and whose height is the time axis. Each box represents a
surface on the device defined by its length and width, for a time slot (here 0 to 55),
which corresponds to its computation latency. Component v5 starts its execution at
time 90 and occupies, among others, part of the device that was previously
occupied by component v4 . Component v8 occupies a part of the device for the
whole computation time.
Temporal placement has the advantage of being highly flexible and efficient
in terms of device utilization; however, it also has a major drawback. Efficient
temporal placement algorithms are difficult and cost intensive. On-line place-
ment requires solving some computationally intensive problems at run-time in a
fraction of a millisecond. Those are: the efficient management of the free space,
the selection of the best site for a new component, and the management of com-
munication. For a new module to be placed at run-time, it is not sufficient to
only compute the best placement site. The communication between the mod-
ules running on the chip must also be considered. The communication between
modules running on the chip and the external world must be taken into account
as well. A reconfigurable device must provide a viable on-line communication
mechanism to help establish the communication among modules on the chip
at run-time. This problem is not easy to solve and most of the time requires
some prerequisites on the device architecture.
Although communication aspects like the distance among components are
considered in some of the methods provided in this chapter, we do not deal
with the technical realization of communication here. Chapter 6 is devoted
to this topic. We assume that communication among modules placed on the
device is somehow possible.
The first part of this chapter addresses off-line temporal placement, while
the second part deals with on-line temporal placement. In both cases we
start with some definitions related to the part being addressed and then we
present existing approaches to solve the corresponding problem.
For a given node vi , the values px (vi ), py (vi ) and pt (vi ) denote the coordi-
nates of the node vi in a 3-dimensional vector space. px (vi ) and py (vi ) are the
coordinates of vi on the device H, while pt (vi ) defines the starting time of vi
on H.
We next present some solution approaches for the off-line temporal place-
ment problem. We first introduce a simple incremental method based on the
first-fit/best-fit concept. In the second method, a clustering of components in
blocks that can be placed together on the chip is done. Finally, we introduce an
Example 5.3 Figure 5.2 illustrates the first-fit temporal cluster placement
on a set of ten clusters. We assume that the precedence constraints are satisfied.
The first-fit and best-fit approaches provide fast methods to compute a tem-
poral placement solution. However, no effort is spent on the efficiency of the
space management. Integer linear programming can be used as an exact method
to solve the temporal placement problem. In this case, a set of constraint equa-
tions that must be fulfilled by each solution must be formulated. Solving the
equations will then provide a solution to the temporal placement problem. Be-
cause the computation of an ILP solution is computationally intensive, integer
linear programming is usually suitable only for small-size problems. An exact
method that we present next was proposed in [199] [82] to solve temporal
placement for large-size problems. Equations are formulated in a similar
way as in integer linear programming. However, a branch and bound strategy
is used in order to reduce the search space and allow much larger problems to
be solved.
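The first-fit strategy mentioned above can be sketched, for a single time step, as a raster scan over candidate sites. This is a minimal illustration; the box representation and device size are assumptions of ours.

```python
# First-fit placement sketch: scan candidate positions in raster order and
# take the first site where the new module overlaps no running module.
# Modules are (x, y, w, h) boxes on a device of size W x H.

def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def first_fit(placed, w, h, W, H):
    """Return the first (x, y) where a w x h module fits, or None."""
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            if not any(overlaps((x, y, w, h), p) for p in placed):
                return (x, y)
    return None
```

The scan is fast but, as the text notes, it makes no attempt to manage the free space well: a poor early placement can fragment the device.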
Temporal Placement 155
2 Strip Packing Problem (SPP): Given a set of boxes and a base (x, y), the
problem of finding the minimum-height h container with size (x, y, h) that
can hold all the boxes is called the Strip Packing Problem (SPP). The
analogy with temporal placement is to find the minimum run-time t = h
for a set of components, given a reconfigurable device with size (x, y).
With the optimal solution of the BMP, one can select the best reconfigurable
device, in terms of size, on which a set of tasks can be computed in a limited
amount of time. This device is optimized for the given set of tasks and for the
given run-time. In reconfigurable computing, the main task is not necessarily to
choose an optimal device for a fixed set of tasks, but to have a fixed device on
which various sets of tasks can be optimally implemented. The BMP is therefore
not the right tool to be used.
The SPP, which we next focus on, best matches the requirement of having
only one device on which different sets of tasks can be implemented at different
times. In its original version, the goal of the SPP was to minimize the overall
computation delay. This may be changed according to the objectives sought.
As stated earlier in this section, integer linear programming can be used to
formulate all constraints that must be fulfilled by a given solution as a set of
equations to be solved. However, this approach can be used only for small-size
problems. The concept of packing classes that we present next was used in
[199] [82] as a means to define a feasible solution for the SPP.
The first restriction states that in a direction i, the total length of components
placed on the device and pairwise overlapping in the dimension perpendicular
to i should not exceed the length of the device in dimension i. To better un-
derstand this restriction, consider a line perpendicular to direction i that runs
through the device. All the components that are cut by the line pairwise over-
lap in the direction perpendicular to i. The total sum of their widths in the i
direction should therefore not exceed the line segment that lies within the device,
i.e. the size of the device in dimension i. Placing many such lines at the dif-
ferent coordinate points in direction i of the device will help to capture all the
restrictions stated by the first condition.
With the second restriction, we avoid placements in which components
overlap in all dimensions. A placement can only be valid if, for any pair of
components, at least one dimension exists in which the two components
do not overlap.
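The second restriction can be checked directly: a placement is rejected as soon as some pair of boxes overlaps in every dimension. A minimal sketch, with boxes given as (position, size) pairs per dimension (a representation of our own):

```python
# A box is ((x, w), (y, h), (t, dt)): position and extent in each dimension.

def separated(a, b, dim):
    """True if boxes a and b do not overlap in dimension dim."""
    (pa, sa), (pb, sb) = a[dim], b[dim]
    return pa + sa <= pb or pb + sb <= pa

def valid_packing(boxes, dims=3):
    """Check that every pair of boxes is separated in at least one
    dimension (x, y, or time for temporal placement)."""
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if not any(separated(boxes[i], boxes[j], d) for d in range(dims)):
                return False
    return True
```

Two components may share the same device area provided they are separated in time, which is precisely what makes temporal placement a 3-dimensional packing problem.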
Example 5.7 The two-dimensional placement of figure 5.3 is a valid packing
with the corresponding packing classes G1 and G2 . In figure 5.4, we have an
example of an invalid packing, for two reasons. First, component v5 is
placed with a part outside the device and second, components v3 and v4 overlap
in all directions.
Using the pure packing classes as previously defined helps to define the
placement aspect of the temporal placement problem, without caring about the
constraints in the dataflow graph. Because those constraints must be preserved
in a valid temporal placement, a modification of the packing classes must be
done in order to ensure that a component depending on another one starts its
execution only after the component on which it depends has completed its
execution. The modification that we present here is the orientation of the
packing classes.
Example 5.10 Figure 5.5 provides a 3-D placement and its characteriza-
tion through interval graphs, complement graphs and the orientation of the
packing class associated with the placement. Only the graphs related to the
dimensions x and y are shown.
Figure 5.5. 3-D placement and corresponding interval graphs, complement graphs and ori-
ented packing
The methods presented in this section target problems well specified at compile-
time, with no change in the computational flow at run-time. In many systems,
in particular real-time ones, the set of tasks as well as their interconnectivity is
not known at compile-time. A run-time or on-line placement capability must
be provided to deal with unpredictable events that create tasks which must be
executed at run-time. The next section deals with this issue.
and it is not successful in doing so, the scheduler must decide what to do
with the task. It can later try again to run the task on the RPU, or it can decide
to let the task run at a lower speed on the CPU. In any case, not being able
to satisfy a placement request creates additional delay in the execution of the
program. In order to avoid this penalty, care should be taken in the design of
the placer. The placer should be able to place as many tasks as possible on
the device in order to avoid delay penalties. This goal can be reached if the
amount of wasted resources is kept small. We first state the on-line placement
problem and then present some solution approaches.
Definition 5.11 (On-line Placement) Given a reconfigurable process-
ing unit H at time t with a given configuration Ct , find an optimal position
for a new incoming task v such that v does not overlap with any running com-
ponent in the configuration Ct .
As defined in mathematics, the goal when solving an on-line problem
does not only consist of providing an optimal method at a given time step, but
a method that is optimal for a sequence of computations not fixed in advance.
The method should be developed to cope with unknown parameters that may
arise in the future. A non-optimal partial solution can be preferred to an optimal
one at time t, if this sub-optimal partial solution contributes to a global
optimum in the future. In this section, we limit the on-line definition to the
stepwise optimization of a partial problem in a given time slot, without caring
about the globally optimal solution.
We present in the next sections two approaches for the on-line placement of
incoming components on the device. While the first approach manages the
free space on the device by keeping track of all empty rectangles, from which
one will be selected to place the new module, the second one keeps track only
of the occupied area on the device and tries to place a new component in such a
way that no overlap with a running module occurs.
of wasted space in the resulting layout. Once again, we start with definitions
and then continue with solution approaches.
Based on the empty rectangle concept, a strategy to solve the on-line tem-
poral placement problem was proposed in [18]. The method, named "Keeping
All Maximum Empty Rectangles" (KAMER), permanently keeps
track of all MERs. Whenever a request for placing a component v arrives, the
track of all MERs. Whenever a request for placing a component v arrives, the
list of MERs is searched for a rectangle that can accommodate v. Because it
is possible to have many MERs in which v can fit, strategies like the first-fit
or best-fit are used to select one rectangle from the set of possible free rect-
angles. Once a rectangle is chosen, the candidate points that can be chosen
as reference point for placing the new component are those that do not allow
an overlap with the external part of the rectangle. In [18], the authors use the
bottom left point as the reference one to place the component.
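The rectangle-selection step can be sketched as follows — a minimal Python sketch, assuming MERs are kept as axis-aligned (x, y, w, h) tuples; the MER bookkeeping itself, which is the hard part of KAMER, is not shown, and the names are illustrative rather than taken from [18]:

```python
# Sketch of the rectangle-selection step in a KAMER-style placer.
# Rectangles are axis-aligned (x, y, w, h) tuples; the list `mers` of
# maximal empty rectangles is assumed to be maintained elsewhere.

def fits(mer, comp):
    """True if component (w, h) fits into the empty rectangle."""
    _, _, mw, mh = mer
    cw, ch = comp
    return cw <= mw and ch <= mh

def first_fit(mers, comp):
    """Return the first MER that can accommodate the component."""
    for mer in mers:
        if fits(mer, comp):
            return mer
    return None  # no feasible rectangle: the placement request is rejected

def best_fit(mers, comp):
    """Return the feasible MER with the least leftover area."""
    cw, ch = comp
    candidates = [m for m in mers if fits(m, comp)]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m[2] * m[3] - cw * ch)

def bottom_left_reference(mer):
    """Reference point used in [18]: the bottom-left corner of the MER."""
    x, y, _, _ = mer
    return (x, y)
```

First-fit stops at the first feasible rectangle and is cheaper; best-fit scans all feasible rectangles and minimizes the leftover area.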
Temporal Placement 163
The great advantage of the KAMER approach is that all the free space on the device is captured, thus providing a good quality². This comes at the cost of computation time: as the following example shows, the number of empty rectangles does not grow linearly with the number of components placed. Whenever a new component is inserted, the number of empty rectangles can grow large, depending on the configuration.
Figure 5.7. Increase of the number of MERs through the insertion of a new component: (a) the configuration contains 11 maximal empty rectangles before the insertion of a new task; (b) the number grows to 14 after the insertion of the new task v4.
Also, the number of free rectangles can be drastically reduced after the removal of a module running on the chip. The insertion of components on the chip, as well as their later removal, creates large fluctuations in the number of empty rectangles to be managed, thus increasing the complexity of the algorithm. In [116], the run-time of the free-rectangle placement is shown to be O(n²).
In order to avoid the quadratic run-time of the KAMER, a simpler heuristic was proposed in [18], with lower quality but a linear run-time. The strategy consists of keeping only non-overlapping empty rectangles, thus reducing the number of empty rectangles to be managed. This allows a linear run-time at the cost of quality.
² If a free space exists in which the module can be placed, then there must be a free rectangle that can accommodate that module, and therefore the MER approach will find a placement.
The problem with this approach is that the non-overlapping empty rectangles are not necessarily maximal: a module may exist that could fit onto the device, but cannot be placed due to a bad non-overlapping representation.
As shown in figure 5.9, whenever a new component v1 is placed in a non-overlapping rectangle, two possibilities exist to split that rectangle: a horizontal split using segment Sa, and a vertical split using segment Sb.
Choosing one of the two split directions may have a negative impact on the placement of the next components. Assume for example that the algorithm keeps selecting the horizontal split; after the placement of module v1 (figure 5.9), the free rectangles left are (A,D) and (C,E). Assume now that a new module, whose width is the same as (C,E) but whose height is slightly bigger than that of (C,E), is chosen. The algorithm will not be able to place the component, although enough free space is available on the chip to accommodate the new module. Bazargan et al. proposed several strategies for choosing one split direction, with the goal of favoring those splits that create square-like free space on the chip. In order to increase the quality of the non-overlapping free-rectangle heuristic, a strategy is proposed in [197] that simply consists of delaying the split decision for a number of steps. This may allow a bad choice that could have been made earlier to be avoided.
While the KAMER algorithm always finds a rectangle to place a new component, if one exists, the position of the component within the rectangle must be selected from a set of points whose number is, in the worst case, the area of the rectangle. An optimal algorithm must choose one point from the set of possible points according to some optimization criteria. In [116], the bottom-left position is chosen. Because no relation exists between the rectangle to be placed and those already placed, an arbitrary position can be chosen for the new rectangle. In reconfigurable devices where the communication between pairs of tasks on one side, and between a task and the device boundary on the other side, plays an important role, components should be placed in such a way that the communication can be realized efficiently. If a position in the middle of the selected rectangle is better suited for optimizing the goal sought, then the component should be placed there, no matter if the number of empty rectangles increases. An optimal algorithm should consider every single point where the placement is possible and develop a strategy to choose the best position.
This approach is followed by the algorithms that we present next. Instead of managing the free space, the strategy consists of managing the occupied space on the device and using the set of running components to compute the best position for placing the incoming component.
2 Select the best site to place the component according to a set of given cri-
teria.
Assuming that a set of possible placement sites is identified, several criteria can be used to choose between the feasible positions. Here, we consider the communication cost between the task and its environment as the objective to be minimized. The connections between two different components, as well as those between modules on the device and the device boundary, are of great importance. While the first allows two modules to communicate with each other, the second allows a module within the device to communicate with a module outside the device.
The straightforward way to solve this subproblem is to use a brute-force algorithm. For each new component c to be placed, the brute-force algorithm solves the first subproblem by scanning all the positions on the device. For each position p = (xp, yp), it checks whether an overlap would occur between c and a placed module if the component c were placed at location p.
Having solved subproblem 1 through the scanning of all possible positions, the optimal placement position is computed from the set of feasible positions by computing the placement cost for each of the locations found in the first step and then selecting the best one as the optimal solution.
The brute force requires O(H × W × n) time to solve subproblem 1, H being the height and W the width of the reconfigurable device, and n the number of running tasks on the hardware. This approach is not practical for large reconfigurable devices.
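The brute-force scan of subproblem 1 can be sketched as follows — an illustrative Python sketch assuming an integer grid and lower-left reference points:

```python
# Brute-force solution of subproblem 1: scan every device position and
# keep those where the new component would not overlap a running module.
# Axis-aligned modules are (x, y, w, h); positions are lower-left corners.
# This is the O(H * W * n) approach described in the text.

def overlaps(a, b):
    """True if the axis-aligned rectangles a and b intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def feasible_positions(W, H, running, comp):
    """All lower-left positions where comp = (w, h) fits without overlap."""
    cw, ch = comp
    positions = []
    for x in range(W - cw + 1):          # stay inside the device in x
        for y in range(H - ch + 1):      # stay inside the device in y
            cand = (x, y, cw, ch)
            if not any(overlaps(cand, m) for m in running):
                positions.append((x, y))
    return positions
```

Each of the (roughly) H × W candidate positions is tested against all n running modules, which makes the quadratic-in-device-size cost explicit.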
Without loss of generality, we will consider that components are placed relative to their lower-left corners. Later, we will use the middle of the device as reference point. We next provide some definitions that are important for understanding the placement strategy explained in this section.
Definition 5.15 (IPR relative to placed modules) For a new com-
ponent v to be placed on the device and a placed component v ′ , the Impossible
Placement Region (IPR) Iv′ (v) of v relative to v ′ is the region on the chip,
where v cannot be placed without overlapping with v ′ .
Similarly, the IPR of v relative to the device is the region of the device where v cannot be placed without overlapping with the external area of the device.
Having defined the impossible placement region relative to the device and
that relative to other components on the chip, we can now define the overall
impossible placement region of a component on a device with a running set of
components.
The IPR of v relative to v ′ is the sum of the augmented margin and the area
of v ′ .
The computation of the IPR relative to the device is slightly different from that relative to a component. Instead of computing a left-side and a bottom margin, we compute a right-side and an upper margin. The right-side margin has the width wv − 1, while the upper margin has the height hv − 1, as shown in figure 5.11.
Figure 5.11. Impossible and possible placement region of a component v prior to its insertion
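With a lower-left reference point on an integer grid, both kinds of IPR reduce to simple rectangle arithmetic. The sketch below follows the margin construction described above; the integer grid and all names are assumptions of this sketch:

```python
# Computing impossible placement regions (IPRs) on an integer grid, with
# components placed by their lower-left corner.

def ipr_of_module(placed, comp):
    """IPR of comp = (w, h) relative to a placed module (x, y, w, h):
    the module's area augmented by a left margin of width w-1 and a
    bottom margin of height h-1, returned as one rectangle (x, y, w, h)."""
    px, py, pw, ph = placed
    cw, ch = comp
    return (px - cw + 1, py - ch + 1, pw + cw - 1, ph + ch - 1)

def ipr_of_device(W, H, comp):
    """IPR of comp relative to the device: a right-side margin of width
    w-1 and an upper margin of height h-1 (reference points from which
    the component would cross the device boundary)."""
    cw, ch = comp
    right = (W - cw + 1, 0, cw - 1, H)
    upper = (0, H - ch + 1, W, ch - 1)
    return [right, upper]

def in_rect(p, rect):
    """True if grid point p lies inside the rectangle."""
    x, y = p
    rx, ry, rw, rh = rect
    return rx <= x < rx + rw and ry <= y < ry + rh
```

The union of all these rectangles is the overall IPR; every remaining point of the device belongs to the possible placement region (PPR).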
Now that we know how to compute the region (set of points) where the component can be placed, the next sub-problem consists of choosing the best location for the new component. The position where the new component is placed should be a feasible position that minimizes the communication cost. A simple approach consists of scanning all the possible placement positions, computing the communication cost for each position, and then selecting the optimal one. This straightforward but inefficient approach requires O(|PPR| × n) time, where n is the number of placed components and |PPR| is the size (number of points) of the possible placement region, which can be very large depending on the current configuration of the device.
Alternatively, we can first compute the point popt which gives us the optimal placement cost. If popt is located within the PPR of v, then we have the solution. Otherwise, the optimal position is not in the PPR, and we choose the point closest to popt that is located in the PPR as the placement position.
\[
\min\left\{\sum_{i=1}^{n-1}\left(\left(x_n+\frac{w_n}{2}-x_i-\frac{w_i}{2}\right)^2+\left(y_n+\frac{h_n}{2}-y_i-\frac{h_i}{2}\right)^2\right)\times w_{in}\right\} \tag{4.1}
\]
In equation 4.1, xn and yn are variables and the other parameters are fixed. Because xn and yn are positive and independent of each other, we can replace equation 4.1 by the two equations 4.2 and 4.3.
\[
\min\left\{\sum_{i=1}^{n-1}\left(x_n+\frac{w_n}{2}-x_i-\frac{w_i}{2}\right)^2\times w_{in}\right\} \tag{4.2}
\]
\[
\min\left\{\sum_{i=1}^{n-1}\left(y_n+\frac{h_n}{2}-y_i-\frac{h_i}{2}\right)^2\times w_{in}\right\} \tag{4.3}
\]
The minimums can be computed through the partial derivative of the sums in
equations 4.2 and 4.3. We therefore have for xn
\[
\frac{\partial\left\{\sum_{i=1}^{n-1}\left(x_n+\frac{w_n}{2}-x_i-\frac{w_i}{2}\right)^2\times w_{in}\right\}}{\partial x_n}=0 \tag{4.4}
\]
This leads to the optimal value:
\[
x_n=\frac{\sum_{i=1}^{n-1} w_{in}\left(x_i+\frac{w_i-w_n}{2}\right)}{\sum_{i=1}^{n-1} w_{in}} \tag{4.5}
\]
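Equation 4.5 — and its analogue for yn — is a weighted average that can be evaluated directly. A small numeric sketch with illustrative names:

```python
# Evaluating equation (4.5): the unconstrained optimum x_n for the new
# component n, given placed components i with reference coordinates x_i,
# widths w_i, and communication weights w_in.

def optimal_x(placed, wn):
    """placed: list of (x_i, w_i, w_in); wn: width of the new component.
    Returns the x_n minimizing the weighted squared center distances."""
    num = sum(w_in * (x_i + (w_i - wn) / 2.0) for x_i, w_i, w_in in placed)
    den = sum(w_in for _, _, w_in in placed)
    return num / den
```

For two equally weighted modules of width 2 at x = 0 and x = 10, the optimum for a new module of width 2 is x_n = 5, i.e. the new center lies at the midpoint of the two existing centers, as expected from the least-squares formulation.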
Figure 5.12. Nearest possible feasible point of an optimal location that falls within the IPR
In the worst case, we may face the situation where several IPRs overlap, such that moving out of the IPR related to one component brings us into the IPR of another component. In this case, we simply keep moving out of the IPRs until we reach a point in the PPR. However, the price of such successive moves can be very high: a situation can be constructed in which we have to go through all IPRs, thus creating a quadratic run-time O(n²). Because we need O(n) time to compute all the IPRs, and O(n²) in the worst case for moving out of all IPRs if the computed optimal point falls within one of them, the run-time of the algorithm is quadratic.
Algorithm 11 provides pseudo-code for the approach presented here.
Example 5.19 Figure 5.13 shows an example of a recursive move out of the IPRs. The optimal point falls in the IPR relative to module A. The four nearest points inserted in the list are 1, 2, 3 and 4. The point closest to the optimum is 4, which is selected next but falls in the IPR of module B. The points 5, 6 and 7 are then inserted into the list. The closest point to the optimum after the removal of point 4 is point 1, which is selected and kept as the solution, because it is in the PPR.
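The procedure of example 5.19 can be sketched as a best-first search: candidate points just outside each blocking IPR are kept in a priority queue ordered by their distance to the optimum, and the first point found outside all IPRs is returned. This is an illustrative sketch, not a reproduction of Algorithm 11; device boundaries are ignored and the candidate generation is simplified:

```python
# Best-first search for the nearest feasible point, as in example 5.19.
# `iprs` are rectangles (x, y, w, h) on an integer grid.

import heapq

def in_ipr(p, iprs):
    x, y = p
    return any(rx <= x < rx + rw and ry <= y < ry + rh
               for rx, ry, rw, rh in iprs)

def boundary_candidates(p, iprs):
    """Nearest points just outside every IPR containing p (4 per IPR)."""
    x, y = p
    out = []
    for rx, ry, rw, rh in iprs:
        if rx <= x < rx + rw and ry <= y < ry + rh:
            out += [(rx - 1, y), (rx + rw, y), (x, ry - 1), (x, ry + rh)]
    return out

def nearest_feasible(opt, iprs):
    """Return opt if feasible, else the closest point outside all IPRs."""
    dist = lambda p: (p[0] - opt[0]) ** 2 + (p[1] - opt[1]) ** 2
    heap, seen = [(0, opt)], {opt}
    while heap:
        _, p = heapq.heappop(heap)          # candidate closest to opt
        if not in_ipr(p, iprs):
            return p                        # first feasible point wins
        for q in boundary_candidates(p, iprs):
            if q not in seen:
                seen.add(q)
                heapq.heappush(heap, (dist(q), q))
```

The heap realizes the "closest point to the optimum is tried next" behavior of the example; in the worst case the search still walks through all overlapping IPRs, matching the quadratic bound discussed above.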
The approach presented here has the drawback of moving from IPR to IPR until a valid placement point is found. In the normal case, only a few steps are needed. However, the worst case still remains, and when it happens, the penalty is a quadratic run-time. A better characterization of the total set of IPRs can help to improve the efficiency as well as the quality of the method. This is the topic of the next section.
Example 5.20 In figure 5.14, the result of the transformation for a new component that must be placed is shown. All points outside the impossible placement regions are feasible placement locations, from which the optimal one must be selected.
Figure 5.14. Expanding the existing modules and shrinking the chip area for the new component
Among all feasible locations, the points on the contour of the occupied space, i.e. those at the boundary between the free space and the occupied space, form a set from which the point nearest to the optimum will be selected, if the optimal point falls in an occupied region. This helps to avoid the recursive jumping out of the IPRs of the components presented in the previous section. The computation of the contour is therefore at the center of our interest.
Figure 5.15. Characterization of IPR of a given component: The set of contours (left), the
contour returned by the modified CUR (middle), the contour returned by the CUR (right)
In the following we will describe CUR and the modification done in [2].
previous section, the Manhattan distance is used. The Euclidean distance defines the shortest distance between two points. However, this shortest path may be a diagonal line that cannot be realized as a signal on the chip. The Manhattan distance captures the smallest routing distance between two components on the chip: such a route is made up only of vertical and horizontal segments, weighted by the width of the communication segments.
This can still be achieved in time Θ(n log n), making use of local optimal-
ity properties, the occupied space manager, and another application of plane
sweep techniques.
Figure 5.16. Placement of a component on the chip (left) guided by the communication with its environment (right).
Because we are dealing with the Manhattan metric, this can be reformulated as
\[
\min\left\{c^x(x_i) : x_i \in \left[\frac{l_i}{2},\, H_l-\frac{l_i}{2}\right]\right\} + \min\left\{c^y(y_i) : y_i \in \left[\frac{h_i}{2},\, H_h-\frac{h_i}{2}\right]\right\},
\]
with
\[
c^x(x)=\sum_{i=1}^{k} w_{ij}\,|x_i-x| \quad\text{and}\quad c^y(y)=\sum_{i=1}^{k} w_{ij}\,|y_i-y|.
\]
The gradient
\[
\nabla c^x(x)=\sum_{x_i<x} w_{ij}-\sum_{x_i>x} w_{ij}
\]
is the sum of the required bandwidths to the left minus the sum of the required bandwidths to the right, and
\[
\nabla c^y(y)=\sum_{y_i<y} w_{ij}-\sum_{y_i>y} w_{ij}
\]
is the sum of the required buswidths to the bottom minus the sum of the required buswidths to the top.
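Since c^x is a weighted sum of absolute differences, it is minimized at a weighted median of the terminal coordinates, and the gradient is exactly the left-minus-right bandwidth sum just described. A one-axis sketch with illustrative names:

```python
# Manhattan communication cost along one axis and its (sub)gradient.
# terminals: list of (x_i, w_i) pairs -- terminal coordinate and the
# bandwidth (weight) of the connection to it.

def cost_x(terminals, x):
    """Weighted Manhattan cost c_x(x) = sum_i w_i * |x_i - x|."""
    return sum(w * abs(xi - x) for xi, w in terminals)

def grad_x(terminals, x):
    """Sum of bandwidths to the left minus sum to the right of x."""
    left = sum(w for xi, w in terminals if xi < x)
    right = sum(w for xi, w in terminals if xi > x)
    return left - right

def weighted_median_x(terminals):
    """Smallest coordinate where the accumulated weight reaches half of
    the total, i.e. a minimizer of cost_x (a weighted median)."""
    total = sum(w for _, w in terminals)
    acc = 0.0
    for xi, w in sorted(terminals):
        acc += w
        if 2 * acc >= total:
            return xi
```

The same functions apply to the y-axis; the constrained one-dimensional minima from the reformulation above are then obtained by clamping the weighted median into the admissible interval.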
Figure 5.17. Computation of the union of contours. The points on the boundary represent the potential moves from the median out of the IPR.
All points of the first type can be found by intersecting the contour of the
occupied space with the median axes lx = {(xmed , y) : y ∈ [0, Hl ]} and
ly = {(x, ymed ) : x ∈ [0, Hh ]}. In these points one of the gradients ∇cx and
∇cy vanishes. We cannot move in the direction of a better solution because
that way is blocked by either a vertical or a horizontal segment of the contour.
The second type of points are some of the vertices of the contour. These points are the intersections of horizontal and vertical segments forming an interior angle of π/2 pointing in the direction of the median. At these points, neither of the gradients vanishes; both of the directions indicated by the gradients are blocked by contour segments.
By simply inspecting all the local optima one finds the one closest to the
global optimum. In the next subsection we describe how this can be done
efficiently.
i.e., for every vertex of the contour and every intersection point of the contour with one of the median axes. Let L denote this set of points. Computing the communication cost for a single point takes O(n) time, so evaluating all objective values in a brute-force manner would take O(n²) time. However, by means of two more plane sweeps, we can achieve a complexity of O(n log n).
For this purpose, we observe that the communication costs for the x- and y-coordinates of the contour segments can be computed separately and the precomputed values added for every point of L. The crucial step is to use the fact that we only need to compute the communication cost for the leftmost x-coordinate and for the bottommost y-coordinate; the other values can be obtained through appropriate fast updates during the plane sweep.
5. Conclusion
This chapter has presented some of the most important approaches for temporal placement. While very solid mathematical foundations have been developed and simulated, a physical investigation on reconfigurable chips is still missing. This is mainly due to the lack of devices providing partial reconfiguration facilities, a precondition for the implementation of temporal placement algorithms. The one and almost only device type to provide partial reconfiguration is very limited and does not allow for evaluating the large set of available algorithms. Besides the difficulty of designing for partial reconfiguration, many restrictions are imposed on the use of the device resources during and after the reconfiguration. Some hope was placed on coarse-grained reconfigurable devices; however, they have so far failed to take off.
Chapter 6
ON-LINE COMMUNICATION
Several methods were presented in the last chapter for the placement of components at run-time on a reconfigurable device. The last two use the communication cost between a component to be placed and its environment, i.e. the set of modules on and off the chip that exchange data with this component, as a means to select the best position for the new component. While this communication cost, defined as the average minimum distance from the new component to its neighbors, is useful to guide the placement process, it does not tell us how data should be exchanged among the different components on the chip. In this chapter, we provide some answers to this question by presenting some approaches to enable run-time communication between modules on the chip. The approaches can be classified into different categories, depending on the way the communication is realized: direct interconnection, communication over a third party, bus-based communication, circuit switching, and network-on-chip oriented communication.
1. Direct Communication
The direct communication paradigm allows modules placed on the chip to communicate using dedicated physical channels, configured at compile-time. The configuration of the channels remains until the next full reconfiguration of the device. A configuration defines the set of physical lines to be used, their direction, their bandwidth and speed, as well as the terminals, i.e. the components that are connected by the lines. Components must be designed and placed on the device in such a way that their ports can be connected to the predefined terminals. Feedthrough channels must also be available in each component to allow signals used by modules beside the component to cross it.
The main disadvantage of this approach is the restriction imposed on the design of components. For each component, dedicated channels must be foreseen to allow signals that are not used by this component at its placement location to cross. This increases the amount of resources needed to implement the component. Also, the placement algorithm must deal with additional restrictions, like the availability of signals at a given location. This increases its complexity and makes the approach possible only for off-line temporal placement, where all the configurations can be defined and implemented at compile-time.
a bus and a user program running on the host processor. The module inputs and outputs are controlled by registers that are mapped into the address space of the processor. This approach can be used not only to allow the communication between a reconfigurable module and a user program, but also between several reconfigurable modules connected together via a bus. The system is controlled by an operating system, whose role is to manage the device resources, control the reconfiguration process, and allow the communication to happen between the components temporally placed on, or removed from, the device. All modules willing to send a message must first copy those messages into their sending registers. Thereafter, the operating system copies the message from those registers to the input register of the destination module. In [211] [212], the central module is not an operating system running on a separate processor, but a set of fixed resources on the device. It provides dedicated channels to access peripheral devices and also to connect directly neighboring modules. The communication between non-adjacent modules is done via fixed resources implemented on the device.
3. Bus-based Communication
The communication between the reconfigurable modules on a given device can also be done using a common bus. In order to prevent the bus resources from being destroyed at run-time by components dynamically placed on the device, predefined slots must be available where the modules can be placed at run-time, such that no alteration of the bus is possible. At the predefined slots, connection ports must be available to dynamically attach the placed component to the bus. While the predefinition of locations where components may be placed simplifies the placement algorithm, it is not flexible at all. Using a common bus reduces the amount of resources needed in the system, since only one medium is required for all components. However, the additional delay introduced by the bus arbitration can drastically affect the performance of the system. The approaches of Walder et al. in [211] [212], as well as that of Brebner, are both based on a restricted bus in which no arbitration module is required to manage the bus access.
4. Circuit Switching
Introduced in the 1980s under the name reconfigurable massively parallel computers [146], [205], circuit switching is the art of dynamically establishing a connection between two processing elements (PEs) at run-time, using a set of physical lines connected by switches. The system consists of a set of processing elements arranged in a mesh. Switches are available at the column and row intersections to allow a longer connection using the vertical and horizontal lines at an intersection point. In this way, two arbitrary processing elements can be connected.
Figure 6.2. Drawback of circuit switching in temporal placement: placing a component using 4 PEs will not be possible, although enough free resources are available.
The previous example has shown how circuit switching might increase the device fragmentation in devices allowing a 2-D placement. In devices allowing only a 1-D placement, as is the case with Xilinx Virtex-II FPGAs, which are only column-wise reconfigurable, circuit switching can be used to connect a small number of modules, usually 2 to 8, and allows dynamic communication to be established between the components running on the device at run-time.
5. Network on Chip
Many reputable authors [21] [117] [60] have predicted that wiring modules on a chip will not be a viable solution in the billion-transistor chips of the future. Instead, they propose the Network on Chip (NoC) as a good solution to support communication on future Systems on Chip. NoCs offer many advantages (performance, structure and modularity) over global signal wiring. A chip employing a NoC consists of a set of network clients such as DSPs, memories, peripheral controllers and custom logic that communicate on a packet basis instead of using direct connections. As shown in figure 6.6, several modules (network clients) placed at fixed locations on the chip can exchange packets over the common network. This provides a very high flexibility, since no route has to be computed before components start communicating. Components just send packets and do not care about how the packets are routed in the network. The network on chip is viewed as the ultimate solution to avoid the problems that will arise due to the growing size of chips. The Quicksilver chip [89] can be seen as an example of a NoC-based architecture.
A generic NoC architecture is characterized by the number of routers, each of which is attached to processing elements in the array, the bandwidth of the communication channels between the routers, the topology of the network, and the mechanism used for packet forwarding. The topology of the network is defined by the arrangement of routers and processors on the device and the way those processors are connected together. One of the most used topologies is the 2-D mesh network, because it naturally fits the tile-based architecture of the chip.
Compared to typical macro networks, a network on chip is by far more resource-limited. To minimize the implementation cost, the network should be implemented with little area overhead. This is especially important for those architectures composed of tiles with fine-grained granularity, as is the case in FPGAs. Thus, instead of having huge memories (e.g. SRAM or DRAM) as buffer space for the routers, like in a macro network, only a few registers are available to be used as buffers for on-chip routers. This leads to a much simpler model with little overhead, compared to its macro network peer.
We next take a look at the design and implementation of the major NoC components, which are the routers and the processing elements.
It consists of two counters to count the written and read registers. A write-enable signal allows data to be written into the registers, and a read-enable signal notifies about the readiness of the data in the registers. Two signals are also used as flags to inform about the fullness and emptiness of the FIFO. If full, then no more data can be written to the FIFO. An empty FIFO means that no data is available in the FIFO registers. If a common clock is used for reading and writing the data, then the FIFO is said to be synchronous. Otherwise, there will be two different clocks for reading and writing the data; in this case, we say that the FIFO is asynchronous. The FIFO can be parameterized with the data width (number of bits in a register) and the FIFO depth (number of registers in the FIFO).
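The behavior just described can be captured in a small software model — a hedged sketch of such a FIFO (depth-parameterized, data width left implicit), not the actual hardware description:

```python
# Behavioral model of the router FIFO described above: a register file
# with write/read counters and full/empty flags.

class Fifo:
    def __init__(self, depth):
        self.depth = depth
        self.regs = [None] * depth   # the FIFO registers
        self.wr = 0                  # write counter
        self.rd = 0                  # read counter
        self.count = 0               # number of valid entries

    def full(self):
        return self.count == self.depth

    def empty(self):
        return self.count == 0

    def write(self, data):
        """Write is refused when the FIFO is full (full flag semantics)."""
        if self.full():
            return False
        self.regs[self.wr] = data
        self.wr = (self.wr + 1) % self.depth
        self.count += 1
        return True

    def read(self):
        """Read returns None when the FIFO is empty (empty flag semantics)."""
        if self.empty():
            return None
        data = self.regs[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return data
```

In a synchronous FIFO, write() and read() would be driven by the same clock; an asynchronous FIFO would clock the two counters from separate domains, which this functional model does not capture.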
The first part is the packet address, which is used by the router control to determine the direction in which to send the packet. The controller has an address decoder that decodes the address into the (x,y) coordinates of the destination router or PE. In the simple case of XY-routing, as we will see later, a comparator is used to compare the (x,y) coordinates of the destination PE to those of the router in order to compute the direction (LOCAL, EAST, WEST, SOUTH, or NORTH) in which the packet will be sent. The packet is then sent out by writing it into the input FIFO of the corresponding neighbor router, if this FIFO is not full. If the FIFO in the neighbor router is full, then the router can decide to take some action. It can, for instance, block all incoming packets or send the packet in another direction to decongest a given data line.
Figure 6.10. Arbiter to control the write access at output data lines
which the PE is connected. The wrapper reads the packet from the FIFO only when there is a packet in that FIFO. In the implementation, the processing element is instantiated as a functional block within the wrapper.
5.3.2 Latency
The latency consists of the time needed to set up a route and the time needed to transfer the payload to the destination. In circuit switching, the route setup time
other acknowledgment through the same route to the source. The disadvantage
of this technique is the time required to establish a dedicated link from source
to destination. It can be advantageous when the time to set up the path is min-
imal, compared to the transfer time of the messages, and when long messages
are present, especially continuous data streams.
5.4.2 Store-and-Forward
At each node, the packets are stored in memory and the routing information is examined to determine to which output channel to direct the packet. This is why the technique is referred to as store-and-forward (SAF). Additionally, by transferring an entire packet at a time, the latency for a packet is the number of routers through which the packet must travel, multiplied by the time to transfer the packet between two routers.
and deadlock problems can occur. Therefore, other techniques, such as virtual
channels, are needed. An advantage of virtual channels is the ability to share a
single physical channel.
1 If Xrouter < Xdest, the packet is forwarded in the east direction.
2 If Xrouter > Xdest, the packet is forwarded in the west direction.
3 If Xrouter = Xdest and Yrouter > Ydest, the packet is sent to the south of the current router.
4 If Xrouter = Xdest and Yrouter < Ydest, the packet is sent to the north of the current router.
5 If Xrouter = Xdest and Yrouter = Ydest, the packet is sent to the local PE.
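The five rules above translate directly into a routing function; a minimal sketch, with direction names following the text:

```python
# XY-routing decision: route in the X direction first, then in Y.
# Inputs are the (x, y) coordinates of the current router and of the
# destination; the result is the output port to use.

def xy_route(x_router, y_router, x_dest, y_dest):
    if x_router < x_dest:
        return "EAST"
    if x_router > x_dest:
        return "WEST"
    # x coordinates match: route vertically
    if y_router > y_dest:
        return "SOUTH"
    if y_router < y_dest:
        return "NORTH"
    return "LOCAL"   # packet has arrived at its destination router
```

Because the x-coordinate is always corrected before the y-coordinate, packets never turn from a vertical back to a horizontal channel, which is the property that makes plain XY-routing deadlock free on a full mesh.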
i.e. the resources needed by the component are no more than those available on a PE. In this case, we place the component on one PE and attach it to the corresponding router to allow communication with other components. In the second case, the component does not fit on one PE, i.e. more than the amount of resources offered by a PE is necessary to implement the component. In this case, the component will be split into pieces, each of which can fit on a PE. The pieces placed on different PEs will use the router network for communication. In other words, the communication within a component's boundary must be done using packets that are sent and routed in the network. This increases the complexity of the module, and more resources are wasted than needed. This situation is best illustrated in figure 6.12.
such a way that the device is surrounded by a ring of routers. This arrangement increases the flexibility and, as we will see later, is a prerequisite for the accessibility of placed components.
A further requirement on the structure of the network is that routers should be able to notify their neighbors about their activity. This can be done using an additional activation line that is set to one if the router is active, i.e. no component is placed on top of that router, and zero if a component is using that router in its internal implementation. In this case, the activation line is controlled by the component, which notifies the surrounding routers about the state of the routers that it internally uses.
pair of components abuts or a component abuts the device boundary. Let us consider the first case; the second one can be handled in a similar way. Either the two components overlap, or at least one component uses some routers on its internal boundary (this is illustrated in figure 6.14). The first case is impossible, because only overlap-free placements are valid. The second case contradicts the requirement of the theorem, thus completing the proof.
Example 6.5 Consider the placement of figure 6.14 with two abutting components. The first component, attached to router A, is not implemented according to the guideline of theorem 6.4, since routers are used on its internal boundary. The second component, attached to router B, is developed according to the rule of theorem 6.4. No route is available between those two components, because the only routers available are consumed by the first component.
While in a static NoC each router always has four active neighbor routers¹, this is not always the case in the DyNoC presented here. Whenever a component is placed on the device, it covers the routers in its area. Since those routers cannot be used, they are deactivated. The component therefore sets the activation signals of the neighboring routers to notify them not to send packets in its direction. Upon completion of its execution, the deactivated routers are reset to their default state. A routing algorithm used for a common NoC cannot work on the DyNoC without modification. An improvement or an extension of the existing routing algorithms should be done in such a way that packets are able to go around the components on their way to their destinations.
¹ Routers around the chip do not have four neighbors. However, the package pins can be considered as further neighbors to which the routers can send a message. This leads to the availability of four neighbor routers everywhere on the chip.
2 The decision where to send a packet is taken locally.
The strategy is based on XY-routing: because of its simplicity, its efficiency
and its deadlock freeness, the XY-routing algorithm presented in section
5.5 was adapted for the DyNoC. Recall that in a full mesh, XY-routing is a
deadlock-free, shortest-path routing algorithm that first routes packets in the X
direction to the correct X-coordinate and then in the Y direction until the correct
Y-coordinate is reached. The adapted XY-routing, the S-XY-Routing (Surrounding
XY-routing), operates in three different modes:
The N-XY (Normal XY) mode. In this mode the router behaves as a
normal XY router.
The SH-XY (Surrounding horizontal XY) mode. The router enters this
mode when its left neighbor or its right neighbor is deactivated.
The SV-XY (Surrounding vertical XY) mode. The router enters this mode
when its upper neighbor or its lower neighbor is deactivated.
The normal mode is active when all the neighbors of a router are active. In
this case, plain XY-routing is used.
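As an illustration of these three modes, the following Python sketch decides the next hop of a packet on a mesh whose routers may be deactivated by placed components. The grid representation, the function name and the fixed detour directions are our own assumptions for illustration only; the actual S-XY router additionally follows the component boundary up to its corner before resuming normal XY-routing.

```python
def next_hop(cur, dst, active):
    """One routing step of a simplified S-XY router.

    cur, dst -- (x, y) coordinates of the current router and the destination;
    active(x, y) -- True if the router at (x, y) is active, i.e. not
    covered by a placed component.
    """
    x, y = cur
    dx = dst[0] - x
    if dx != 0:
        # N-XY mode: route in the X direction first, as in plain XY-routing.
        nxt = (x + (1 if dx > 0 else -1), y)
        if active(*nxt):
            return nxt
        # SH-XY mode: the horizontal neighbor is deactivated, so the packet
        # detours perpendicularly (here: always downward, a fixed choice).
        return (x, y + 1)
    dy = dst[1] - y
    if dy != 0:
        # Correct X-coordinate reached: route in the Y direction.
        nxt = (x, y + (1 if dy > 0 else -1))
        if active(*nxt):
            return nxt
        # SV-XY mode: the vertical neighbor is deactivated; detour right.
        return (x + 1, y)
    return cur  # packet has arrived
```

Stepping this function repeatedly routes a packet around an obstacle: with routers (2,1) and (2,2) deactivated, a packet travelling from (0,1) to (4,1) detours around the component and then returns to its row.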
[Figure: two alternative routing paths around an obstacle component, from a Ping-Pong game component to a destination component.]
In both cases, a path will always be available from the source to the destination. With XY-routing used as the underlying routing technique, we conclude
that each packet sent will reach its destination in a finite number of steps.
Theorem 6.6 With very high probability, the S-XY algorithm presented here
is deadlock free and livelock free.
Proof We first need to prove that there is always a path from each packet to its
destination; with this, the only possibility for a livelock is that the packet moves
permanently through the network and never reaches its destination, despite the
availability of a path. Second, we must prove that each packet reaches its
destination after a finite number of steps. The first requirement is guaranteed
by theorem 6.4.
We now assume that a packet never reaches its destination. This can happen
only if the packet is blocked or if the packet is looping in a given region. Because a path always exists from one active router to all other active routers,
no packet can be blocked in the network, i.e. the packet must be looping. Since
this situation is not possible in normal XY-routing, it can only arise in the
surrounding phase. When a packet is blocked in a given direction, it takes the
perpendicular direction. This is done until the packet reaches the last router
on the component boundary, which is at one corner of the module to be
surrounded. From there on, the normal XY-routing resumes. A looping of a
packet around a component is therefore not possible.
It may be possible to construct a placement sequence of components that
blocks a single packet whenever that packet moves in a given direction, forcing
it to take the perpendicular direction. Such a sequence may lead to a packet
that loops in a given region. However, doing this will only block that single
packet; all other packets will move to their destinations as previously described.
This leads to a very low livelock probability, since only one packet out of an
infinite number is livelocked. Moreover, such placement sequences are very
unlikely to happen in reality.
In the S-XY routing, the direction in which to send a packet whenever an obstacle is encountered is fixed a priori for all routers. In this particular case,
packets are always sent to the right whenever they are blocked in the vertical
direction. This may lead to extremely long routing paths like that of figure
6.18. Such extreme paths are not usual, but may occur as a result of some
placements.
max(wi, hi). The half perimeter of the component is not added here, since it
is already contained in the sum H + W; only the maximum of the component's
height and width is considered. This additional delay is unpredictable and
depends on the temporal placement of components on the device.
In a static NoC, where each processing element is assigned a fixed place, the
delay is also fixed. Packet communication can be seen as a pipelined computation in which the length of the pipeline is the packet delay from source to
destination. Once the pipeline is filled, one packet is received at the destination
on each clock cycle.
In the DyNoC we have the same situation; however, delays created by newly
placed components create new situations in which the path must first be filled
with packets. It is even possible that some packets sent later reach their destination before others that were sent earlier but had to surround some
obstacles. This makes the problem difficult for a formal analysis.
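The delay bound sketched above can be written as a small helper. This is our own illustrative reading of the text, not code from the book: H and W are the height and width of the router array, and each component of size w × h that a packet may have to surround contributes at most max(w, h) additional hops.

```python
def worst_case_delay(H, W, surrounded):
    """Rough upper bound on the hop count of a packet in an H x W DyNoC.

    surrounded -- list of (w, h) sizes of the components the packet may
    have to surround on its way; each contributes max(w, h) extra hops,
    since its half perimeter is already contained in the sum H + W.
    """
    return H + W + sum(max(w, h) for (w, h) in surrounded)

# Example: a 10 x 10 array with one 3 x 2 component on the path.
print(worst_case_delay(10, 10, [(3, 2)]))  # -> 23
```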
                 VirtexII-1000        VirtexII-6000
A/M/S (8 bit)    8% / 4% / 77.2       1% / 0% / 77.2
A/M/S (16 bit)   12% / 7% / 75.4      2% / 1% / 75.4
A/M/S (32 bit)   21% / 12% / 77.3     3% / 2% / 74.9
A/M/S (64 bit)   46% / 28% / 70.1     7% / 4% / 73.7

                 VirtexII-1000        VirtexII-6000
A/M/S (TLC)      53% / 32% / 100.2    8% / 5% / 100.2
A/M/S (CG)       47% / 33% / 91.2     20% / 11% / 121.9
As we can see, the sizes of the routers are very large, making the use of
NoCs on FPGAs very difficult. This is partly due to the fact that the routing
resources of the router are built on top of the programmable resources available
in an FPGA (LUTs and routing matrices). Direct access to the available resources of the FPGA would reduce the amount of resources needed to a minimum. However, this can be done only if the vendors provide the necessary
knowledge to control the resources at a very low level.
7. Conclusion
In this chapter, we have addressed the communication paradigm needed to
allow components dynamically placed at run-time on a reconfigurable device
to communicate with their environment. The most promising approach is the
Network on Chip, which has attracted a lot of attention in the research community in the last decade. A large amount of work is available, each with its
own level of complexity. The DyNoC is an extension of the NoC paradigm.
In our opinion, it provides the best NoC-based approach to solve the dynamic communication needs in temporal placement. We have also presented
a circuit switching approach and shown why it is not a good solution in temporal placement. However, the 1-D RMB circuit switching approach can be
used to connect a small number of components at predefined locations on a 1-D
reconfigurable device.
Chapter 7
∗ This chapter and the corresponding Appendix were prepared by Christophe Bobda and Dominik Murr
components which share a common column with that component. From the
Virtex-4 upwards, a frame does not span a complete column any more,
but only a complete tile. Therefore, the reconfiguration of a component affects
only those components in the same block which share a common column
with the reconfigurable module.
It is then up to the partial reconfiguration design flow to produce the
partial bitstream, i.e. the set of data required to configure the needed frames
and therefore to allow the replacement of a component on the chip. The extracted partial bitstreams are then used to move the device from one configuration state to the next one without reconfiguring the whole device.
Two possibilities exist to extract the partial bitstream representing a given
module. The first is a constructive approach, in which each component is first
implemented separately using common development tools. The components
developed in this way are then constrained to be placed at given locations using
bounding-box and position constraints. The complete bitstream is finally built as
the sum of all partial bitstreams. This approach is the one followed by the modular
design and the early access design flows. The second possibility consists of
first implementing the complete bitstreams separately. The fixed parts as well
as the reconfigurable parts are implemented with components constrained at
the same locations in all the bitstreams, which differ from each other only in
the reconfigurable parts. The difference between two bitstreams is then computed to
obtain the partial bitstream needed to move from one configuration to the next
one.
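The second, difference-based possibility can be modeled in a few lines. The sketch below is purely illustrative: a real bitstream additionally carries headers, device-specific frame addressing and checksums, and the dictionary representation and function name are our own.

```python
def partial_bitstream(full_a, full_b):
    """Return the configuration frames needed to move the device from
    configuration A to configuration B.

    full_a, full_b -- dicts mapping a frame address to its contents.
    Only the frames whose contents differ are kept, which models the
    'difference of two bitstreams' described in the text.
    """
    return {addr: data
            for addr, data in full_b.items()
            if full_a.get(addr) != data}
```

Applied to two full configurations that differ only in one reconfigurable region, only the frames of that region are returned.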
Example 7.1 Consider for instance the two designs of figure 7.1, where the
full designs are named Top1 and Top2. The fixed part consists of the modules F-Module 1, F-Module 2, and F-Module 3, which are placed at the same locations
in the two designs. In Top1, the reconfigurable module is R-Module 1, which
is placed at the same location, with the same bounding-box constraints, as the
reconfigurable module R-Module 2 in Top2.
The addition of R-Module 2 to the fixed part generates
the full bitstream Top2; the same is true for R-Module 1 and Top1. Also,
as shown in the figure, adding the two bitstreams Top1 and R-Module 2 will
produce the bitstream Top2, because R-Module 2 will replace R-Module 1 in
Top1. The subtraction of Top2 from Top1 produces the partial bitstream for
R-Module 1.
The JBits API is constructed from a set of Java classes, methods and tools
which can be used to set the LUT values as well as the interconnections; this
is all that is required to implement a function in an FPGA. JBits also provides
functions to read back the content of an FPGA currently in use.
The content of a LUT can be set in JBits using the set function. Connections
are established with the function
connect(outpin, inpin)
which uses the JBits routing capabilities to connect the pin outpin to
the pin inpin anywhere inside the FPGA; outpin and inpin should be
CLB terminals.
Connecting the output of a LUT to a CLB terminal can be done with a
dedicated function.
Partial reconfiguration design 2 217
Figure 7.2. Routing tools can route the same signals on different paths
(design entry and synthesis, initial budgeting, active implementation, and assembly), for which a viable directory structure must be used.
We recommend using ISE version 6.3, which is, in our opinion, the
most stable version for implementing partial reconfiguration with the modular
design flow.
The HDL Design directory (HDL): This directory is used to store the HDL
descriptions of the modules and that of the top-level designs. The HDL
description can be in VHDL, Verilog or any other hardware description
language.
Figure 7.3. The recommended directory structure for the modular design flow
Signals that connect modules to each other and to I/O pins must be defined
in the top-level design.
Besides those constraints, an area group must be defined for each component,
and specific properties for the partial reconfiguration design flow must be
added to the area groups of the components.
2 IOB constraints: The Input Output Blocks (IOBs) must be constrained within
the columnar space of their associated reconfigurable modules. This condition is important to avoid the connection between a module and a pin being
destroyed during the reconfiguration of another component. Also, all IOBs
must be locked down to exact sites.
3 Global logic constraint: All global logic like clocks, power and ground
lines must be constrained in the top level. No global logic should be locked
in the module topX.ucf file.
The output NGD file is named after the top-level design and contains implementation information for both the top-level design and the individual
modules. The -uc option ensures that the constraints from the local topX.ucf
file are annotated.
3 Map the module to the library elements using the following command.
map topX.ngd
In this step, only the logic of the design with the expanded active module is
mapped.
4 The place and route of the library elements assigned to the module is done
by invoking the placer and router with:
The -w option ensures that any previous versions of the produced file
design_name_routed.ncd are overwritten. If the area specified for the module
cannot contain the physical logic for the module because it is sized incorrectly, then resizing must be done and the .ucf file generated during
the initial budgeting phase must be regenerated with the changes made.
The entire initial budgeting phase should be performed again. This implies
that the new .ucf files need to be copied to each module and
that each module needs to be reimplemented. If everything went fine, the
module is now placed and routed at the correct location in the given
bounding box.
5 Run trace on the implemented design to check the timing report in order to
verify that the timing constraints are met.
This command creates the appropriate module directory inside the PIMs
directory. It then copies the local, implemented module files, including the
.ngo, .ngm and .ncd files, to the module directory inside the PIMs directory and renames the .ncd and .ngm files to module_name.ncd and
module_name.ngm. The -ncd option specifies the fully routed NCD file that
should be published.
1 In order to incorporate all the logic for each module into the top-level de-
sign, run ngdbuild as follows:
Ngdbuild generates a topX.ngd file from the top-level topX.ucf file, the
top-level .ngo file, and each PIM's .ngo file.
2 The logic for the full design can then be mapped as follows:
map topX.ngd
MAP uses the module_name.ncd and module_name.ngm files from each of the module directories inside the PIMs directory to accurately recreate the module logic.
3 The full design can then be placed and routed using the following com-
mand:
The place and route command, par, uses the module_name.ncd file from each of the
module directories inside the PIMs directory to accurately reimplement the
module logic.
4 In order to verify whether the timing constraints were met, the trace command must
be invoked as follows:
Figure 7.4. Limitation of a PR area to a block (dark) and the actual dimensions (light)
Figure 7.5. Scheme of a PR application with a traversing bus that may not be interrupted
design flow. Traversing signals no longer need to be guided through a bus macro,
which used to be a tedious piece of handwork.
Figure 7.6. Improved directory structure for the Early Access Design Flow
Finally, the design is assembled in the merge step, and full and partial bitstreams are
generated.
all design modules, instantiated as black boxes with only ports and port
directions specified and
Again, top-level modules include no logic except clock generation and dis-
tribution circuits.
device can be
xc2vp - for a Virtex-II Pro,
xc2v - for a Virtex-II, or
xc4v - for a Virtex-4 FPGA.
direction is one of
r2l - right-to-left,
l2r - left-to-right,
b2t - bottom-to-top (Virtex-4 only), or
t2b - top-to-bottom (Virtex-4 only).
synchronicity may be
sync - synchronous or
async - asynchronous.
width is either
wide - spanning four CLBs, or
narrow - reaching over two CLBs.
A sample name is busmacro_xc2vp_r2l_async_narrow.nmc.
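The naming scheme lends itself to a small helper that assembles and validates such file names. The function below is our own illustration, not part of the Xilinx tools; the valid field values are the ones listed above.

```python
def busmacro_name(device, direction, synchronicity, width):
    """Assemble an Early Access bus macro file name from its four fields,
    checking each field against the values given in the text."""
    assert device in ("xc2vp", "xc2v", "xc4v")
    assert direction in ("r2l", "l2r", "b2t", "t2b")
    if direction in ("b2t", "t2b"):
        # vertical bus macros exist for Virtex-4 only
        assert device == "xc4v"
    assert synchronicity in ("sync", "async")
    assert width in ("wide", "narrow")
    return "busmacro_%s_%s_%s_%s.nmc" % (device, direction,
                                         synchronicity, width)

print(busmacro_name("xc2vp", "r2l", "async", "narrow"))
# -> busmacro_xc2vp_r2l_async_narrow.nmc
```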
The direction of the macro is to be understood geographically: right-to-left means
data flowing from the right side of a border to the left side. If the right side
is inside a partially reconfigurable area, the bus macro is an output bus macro;
otherwise it is an input bus macro. The same applies to the left-to-right,
bottom-to-top and top-to-bottom bus macros. The usage is
shown in figure 7.7.
Figure 7.8 shows the difference between a wide and a narrow type bus
macro. The classification concerns only the range of CLBs that one of these
bus macros bridges. Both kinds offer eight signal lines to cross a boundary.
The unoccupied CLBs between wide bus macros can be used by user logic
or other bus macros. They can also be nested up to three times in the same y-
row as depicted in figure 7.9. With the old bus macros, only four signals could
traverse the boundary at this position whereas the new macros put through up
to 24 lines. The three staggered bus macros do not have to be of the same type
or direction.
Figure 7.8. Narrow (7.8(a)) and wide (7.8(b)) bus macro spanning two or four CLBs
the x-coordinate of the CLB to which a bus macro is locked has to
be divisible by two
the y-coordinate also has to be divisible by two
Virtex-4 devices require coordinates divisible by four, due to the location of
block RAMs and DSPs
bus macros have to traverse a region boundary; each of the two endpoints has
to lie completely in a different area
Further actions for the Initial Budgeting like setting the timing constraints
follow the same guidelines as the Modular Design. One must keep in mind
that signals that will not interfere with a module can cross that module without
having to run through a bus macro.
To close the Initial Budgeting Phase, the top-level design has to be implemented. In the directory pr_project/top, the command
ngdbuild -uc system.ucf -modular initial system.ngc
has to be invoked. pr_project/static now budgets the static portion of the design. The .nmc files for the used bus macros have to be placed in this directory.
The necessary commands are:
ngdbuild -uc ../top/system.ucf -modular initial ../top/system.ngc
map system.ngd
par -w system.ncd system_base_routed.ncd
The routed outputs are placed in pr_project/top and pr_project/static, respectively.
The process yields another file, static.used, which contains a list of routes within
the partially reconfigurable regions that are already occupied by static design connections. It is used as an input for the placement and routing of the
partially reconfigurable modules later on; the file allows signals to be guided
through PRMs without using bus macros.
4.5 Activation
In the Activation Phase the partially reconfigurable modules are implemented.
First, the previously generated static.used is copied to each of the module directories (e.g. pr_project/reconfigmodules/prm_a1) and renamed to arcs.exclude.
Each module is then run through the known sequence of ngdbuild, map and
par, as follows for prm_a1. In the directory pr_project/reconfigmodules/prm_a1,
do the following:
ngdbuild -uc ../../top/system.ucf -modular module -active \\
prm_component_name ../../top/system.ngc
map system.ngd
par -w system.ncd prm_des_routed.ncd
Note that prm_component_name is the component name, not the instance
name, of the module. par automatically looks for and incorporates the arcs.exclude
file.
1 Merge each PRM of one reconfigurable area with the static design.
pr_verifydesign and pr_assemble generate intermediate designs containing
just the one PRM, and partial bitstreams that can be used to reconfigure the FPGA at this particular region afterwards. The necessary
commands are
merges/prmB1withA1/pr_verifydesign \\
../../prmA/prm_a1/base_routed_full.ncd \\
../../reconfigmodules/prm_b1/prm_b1_routed.ncd
merges/prmB1withA1/pr_assemble \\
../../prmA/prm_a1/base_routed_full.ncd \\
../../reconfigmodules/prm_b1/prm_b1_routed.ncd
merges/prmB1withA2/pr_verifydesign \\
../../prmA/prm_a2/base_routed_full.ncd \\
../../reconfigmodules/prm_b1/prm_b1_routed.ncd
merges/prmB1withA2/pr_assemble \\
../../prmA/prm_a2/base_routed_full.ncd \\
../../reconfigmodules/prm_b1/prm_b1_routed.ncd
merges/prmB2withA1/pr_verifydesign \\
../../prmA/prm_a1/base_routed_full.ncd \\
../../reconfigmodules/prm_b2/prm_b2_routed.ncd
merges/prmB2withA1/pr_assemble \\
../../prmA/prm_a1/base_routed_full.ncd \\
../../reconfigmodules/prm_b2/prm_b2_routed.ncd
merges/prmB2withA2/pr_verifydesign \\
../../prmA/prm_a2/base_routed_full.ncd \\
../../reconfigmodules/prm_b2/prm_b2_routed.ncd
merges/prmB2withA2/pr_assemble \\
../../prmA/prm_a2/base_routed_full.ncd \\
../../reconfigmodules/prm_b2/prm_b2_routed.ncd
Several of the files created in these iterations are equivalent to each
other. In this case, for the partially reconfigurable area B,
prmB1withA1/prm_b1_blank.bit,
prmB1withA2/prm_b1_blank.bit,
prmB2withA1/prm_b1_blank.bit and
prmB2withA2/prm_b1_blank.bit
are equal, as are the corresponding partial bitstreams for module b1.
Four variations of the full bitstream are generated, representing the four
combinations of the two partially reconfigurable modules in this example.
In practice, only one full bitstream is needed to load the FPGA; the reconfiguration can then be conducted with the partial bitstreams. Which full
bitstream to take is up to the user.
Figure 7.11. Two patterns that can be created with modules of Animated Patterns. Left: the
middle beam is moving. Right: the diagonal stripes are moving.
5.2 Video8
The second sample design is called Video8; it is used for the code and
command execution examples in the following guide. It deploys the Digilent
VDEC1 extension card [70] for the Virtex-II Pro XUP Development Board
[228] to digitize an analog video stream from a camera. The extension card
is connected to the XUP via a proprietary slot and communicates
through an I2C bus.
As shown in figure 7.12, the data stream is taken into the system by video_in.
This module converts the stream to the RGB color space and hands it over to a filter,
employing e.g. a Sobel implementation for edge detection. This filter module will be
reconfigurable at the end of this tutorial. The filtered data is then processed in
vga_out to be sent to a VGA compliant screen.
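For illustration, the kind of computation such a filter module performs can be sketched in software. The pure-Python Sobel operator below is our own sketch of the algorithm, not the hardware filter shipped with Video8:

```python
def sobel(img):
    """3x3 Sobel edge detector on a grayscale image (list of lists).
    Returns the gradient magnitude |Gx| + |Gy|; border pixels are 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    gx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    gy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sx = sum(gx[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            sy = sum(gy[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = abs(sx) + abs(sy)
    return out
```

A uniform image yields zero everywhere, while a vertical brightness edge produces large responses along the edge; in the actual design this kind of computation is applied to the RGB stream between video_in and vga_out.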
The embedded hardcore processor on the FPGA is used to configure the
video digitizer card and the ICAP on start-up. It then runs a simple
program to choose which partial bitstream to load next. Input and output are
brought to the user's console on a remote PC through the serial port.
reconstruct the initial design such that the reconfigurable parts, and also
the module instantiations connected to them, are placed in the top-level. On the one
hand, the resulting code is much easier to understand, and modules can
be developed straightforwardly by different engineers according to the modular design
flow. On the other hand, the facilitation that comes, for
example, from using EDK for the basic design might be outweighed
by the work involved in extracting and reconnecting all the
necessary modules. This weighs all the more in the synthesis
process. The placement tool places modules according to their connectivity with each other and the constraints given in the UCF, not according to their actual position in the hierarchy. Thus reconstructing the
design might be useless, since the placement is done independently.
In the Video8 non pr example, video_in and vga_out are directly attached
to the partially reconfigurable module-to-be. As suggested in figure 7.13, at least these two modules would have to be instantiated in
the top-level module. Additional difficulties arise from the use of the
OPB to connect the filter module. Either the developer has to adapt the
filter module and provide other means of connection to the rest of the
system or the OPB has to be brought up to the top-level as well. This
again requires work, and other modules have to be changed accordingly.
Figure 7.13. Reconstructing example Video8 to place the partially reconfigurable part and
connected modules in the top-level.
The other solution is to replace only the old instantiations of the partially reconfigurable modules with instantiations at the top-level, and to
amend the entities that formerly hosted the modules: ports have to be
added to redirect the input data of the partially reconfigurable modules
up the module hierarchy to their new location at the top-level module,
and the output data back down to be fed into the system again. Figure 7.14
depicts the new structures. The big advantage of this approach is the
minimal adaptation work to be done. An initial EDK design may be used
as is and can easily be changed.
entity video_in is
...
Figure 7.14. Moving a partially reconfigurable module to the top-level design on the example
of Video8
...
-- signals out to the filter:
R_out_to_filter <= R1;
G_out_to_filter <= G1;
B_out_to_filter <= B1;
h_counter_out_to_filter <= h_counter1;
v_counter_out_to_filter <= v_counter1;
valid_out_to_filter <= valid1;
-- processed signals in from the filter:
240 RECONFIGURABLE COMPUTING
R2 <= R_in_from_filter;
G2 <= G_in_from_filter;
B2 <= B_in_from_filter;
h_counter2 <= h_counter_in_from_filter;
v_counter2 <= v_counter_in_from_filter;
valid2 <= valid_in_from_filter;
...
docm_ctlr and iocm_ctrl are the controllers for data and instructions
to be stored in on-chip block RAMs. The program that controls the
reconfiguration of the FPGA is, in this case, too large to fit in a smaller
RAM area. It could in general be put into the DDR-RAM as well, but in
this particular design the framebuffer for the VGA output is placed in
the DDR-RAM, consuming a great share of the bandwidth of the bus
it is connected to. Thus placing the program in the DDR-RAM as well
generates unwanted distortions in the VGA output.
2.4. Rewriting the Control Software
The program deployed to control the non-pr design has to be extended
by the initialization of the ICAP module and a means to choose which
partial bitstream to load next.
For the sample design, the control program main.c contains just the
initialization of the video decoder card. The initialization routines for
the ICAP and the SystemACE have to be added, as well as a menu that
prompts the user for the next partial bitstream.
2.5. Export to ISE
The generated design has to be synthesized to become a submodule of the
top-level design. For this, EDK supports the export of a project to
ISE, where detailed options for the synthesis can be set. Alternatively, the module can be synthesized using xst. To export the project,
the following settings have to be made in
Project|Project-Options|Hierarchy and Flow:
Design is a sub-module with name system_i
use project navigator implementation flow (ISE)
don't add modules to an existing ISE-project
tailed information on how and why to use bus macros, see sections 3.3 and
4.3. An example of the resulting UCF can be found on the book's web page.
4. Synthesis
The synthesis of the partially reconfigurable project takes place in the sub-
folder synth of the project base folder. Here, corresponding subfolders for
static and reconfigurable modules and the top-level module can be found.
It should be mentioned again that any submodules, including the partially
reconfigurable ones, have to be synthesized without automatically adding
I/O-buffers; thus the option -IOBUF has to be set to no. I/O-buffers only have
to be included in the top-level module. The exported project contains the
system as a submodule system_i of type system and should already include the needed ports for the data flow to and from the reconfigurable module. system_i is placed under a generated super module system_stub
and can therefore directly be included in a top-level module. These ports
should be created by EDK when entering and wiring external ports as described above. If not, the generated HDL files describing and instantiating entity system resp. system_i (system.vhd resp. system_stub.vhd)
have to be extended with the necessary ports.
For the sample design, folders for the reconfigurable modules (mod_*) as well
as a top and an edk directory must be created. There is no separate static
module except the EDK system; therefore, no static folder is required.
The ports to add to entity system are the same as when changing the former
super module of the one that should be partially reconfigurable.
5. Amending the UCF
Taking a UCF built for example with EDK, several amendments have to be
made to make partial reconfiguration work. As stated before, the definition
of the areas in which to place the partial and static blocks and the locations
of the bus macros have to be declared in the UCF.
To set a region to be partially reconfigurable, the UCF is extended by lines
like
"pblock_fixed_area" RANGE=SLICE_X46Y16:SLICE_X87Y149;
Omitting MODE=RECONFIG tells the parser that the block will be fixed. Although the UCF is a simple text file and can be edited manually, tools like
PlanAhead can be a big help with their architecture-dependent graphical
user interfaces.
6. The Build Process
The build process starts with the synthesis of all the involved modules. The
submodules must not include I/O-buffers, which are all included only
in the top-level module. The initial budgeting is already partly done and
must be completed now. All other phases, from the activation to the final
assembly, follow the steps described in section 4.
void main() {
    unsigned 32 res;

    // Interface definitions
    interface port_in(unsigned 32 var_1
        with {busformat="B<I>"}) Invar_a();
    // second input interface, analogous to Invar_a
    interface port_in(unsigned 32 var_2
        with {busformat="B<I>"}) Invar_b();

    // Addition
    res = Invar_a.var_1 + Invar_b.var_2;
}
The first part of the code is the definition of the interfaces for communication
with other modules; the second part realizes the addition of the input values
coming from the input interfaces, and the output values are sent to the output
interface. We need not provide the implementation of the subtractor, since
it is the same as that of the adder, except that we subtract instead of adding.
Having implemented the two modules separately, each module is inserted in a
separate top-level. Except for the use of the adder or subtractor, the two top-levels
are the same. The design can then be reconfigured later to exchange the adder
for the subtractor. Connecting the module in a top-level design is done as
shown in the following code segment.
unsigned 32 operand1, operand2;
unsigned 32 result;
// interface instantiation of the adder module (declaration shortened here)
with {busformat="B<I>"};
void main() {
    operand1 = produceoperand(0);
    result = my_adder.Res;
    dosomethingwithresult(result);
}
Depending on the top-level design being implemented, the adder will be replaced by a subtractor. The next question is how to keep the signal integrity
of the interfaces. Since there are no bus macros available in Handel-C, the bus
macros provided by Xilinx must be used as VHDL components in a Handel-C
7. Platform design
One of the main factors that have hindered the implementation of the on-line
placement algorithms developed in the past and presented in chapter 5 is the
development process of reconfigurable modules for the Xilinx platforms. As
we have seen in the previous section, partial reconfiguration design is done
with many restrictions, which make a systematic development process for partial reconfiguration very difficult. Each module placed at a given location on
the device is implicitly assigned all the resources in that area. This includes
the device pins, clock managers and other hard macro components like the
embedded multipliers and the embedded BlockRAMs.
Figure 7.15. Modules using resources (pins) not available in their placement area (the com-
plete column) must use feed-through signals to access those resources
Most of the FPGA platforms available on the market do not provide solutions to the problems previously mentioned. Many systems on the market offer
various interfaces for audio and video capturing and rendering, for communication and so forth. However, each interface is connected to the FPGA using
dedicated pins at a fixed location. Modules that want to access a given interface,
like the VGA, must be placed in the area of the chip where the VGA signals
are available, such that they can be assigned those signals among other resources.
This makes a relocation of modules at run-time impossible. Relocation provides the possibility of having only one bitstream per module, representing the
implementation of that module for a given location. At run-time, the coordinates of the module are modified to match the location assigned to the module
by the on-line placer. Without relocation, each module must be compiled for
each possible position on the device where it may later be placed. A copy of
the component's bitstream must therefore be available for each location on the
chip. The amount of storage needed for such a solution does not make this
approach attractive.
A research platform, the Erlangen Slot Machine (ESM), which we present next,
was developed at the University of Erlangen to overcome the drawbacks of
existing platforms, and therefore to allow unrestricted on-line placement on an
FPGA platform. Through the presentation of the ESM, we also hope to highlight some of the requirements in platform design for reconfigurable computing.
The goals that the ESM designers had in mind while developing a new platform were to overcome the deficiencies of existing FPGA platforms by providing:
A new, highly flexible FPGA platform in which no component is
permanently fixed at a given chip location.
248 RECONFIGURABLE COMPUTING
While the tooling aspect is important, it is not part of the basic requirements
for designing the platform itself. We therefore just focus on the architectural
aspect of the platform design in the next sections.
through the reconfigurable module must be used, in order for the VGA
module to access the pins covered by the reconfigurable module.
Similar situations are not present only on the Celoxica boards. The XESS
boards [56], the Nallatech boards [165] and the Alpha boards [8] face the same
limitations.
On the XF-Board [177] [211] from the ETH in Zurich, the connection to
the peripherals is made on the sides of the device. Each module accesses
its peripheral through an operating system (OS) layer implemented on the
left and right parts of the device, between which the components are swapped in
and out. This approach provides only a restricted solution to the problem,
since all modules must be implemented with a given number of feed-through
lines and interfaces to access the OS layer for communication with the peripheral.
The intermodule communication as well as the communication
between a module and its peripheral is done via buffers provided by the
OS. This indirect communication affects the performance of the system.
Many other existing platforms like the RAPTOR board [132] and the Celoxica
RC1000 and RC2000 [43] are PCI systems, which require a workstation
for operation. Their use in stand-alone systems, as is the case in embedded
applications, is not possible.
The previous limitations were the primary motivation in the design of the
Erlangen Slot Machine (ESM), whose concept we present next.
slots, each of which has access to the SRAM on the top part and to the
crossbar at the bottom.
The name Erlangen Slot Machine was chosen because it was developed
at the University of Erlangen-Nuremberg in Germany, but mainly because of
this organization in slots. This modular organization of the device simplifies
relocation, a primary condition for a viable reconfigurable computing
system. A module moved from one slot to another will find the same
resources there. The architecture is illustrated in figure 7.17. The Baby-Board
is logically divided into columns of 2 CLBs called micro slots. The
micro slots are the smallest allocatable units in the system. Each module
must therefore be implemented in a given number of micro slots. Due to
the number of pins needed to access one of the external SRAMs, each module
wishing to use an SRAM module must be implemented in a multiple of 3
micro slots.
The CPLD is used to download the Spartan II's configuration from the flash
on power-up. It also contains board initialization routines for the on-board
PLL and the flash.
The configuration program for the main FPGA is implemented in the Spartan II.
Due to its small size, this reconfiguration program could have been
implemented directly in the CPLD, which could configure the main FPGA at
power-on. However, a much larger device was chosen to increase the degree of
freedom in the reconfiguration management. As stated before, relocation is an
important operation of the ESM. It can be done in different ways: the
first approach is to keep a bitstream in memory for each possible module and
each possible position. However, the size of the flash does not allow us to
implement this solution. The second approach is the on-line modification
of the coordinates of the resources used in the module's bitstream to match
the new position. This modification can be done, for example, through a
manipulation of the bitstream file with the Java program JBits [102] previously
presented. However, file manipulation is a time-consuming operation
which increases the reconfiguration time. The last and most efficient
solution is to compute the new module's position while the bitstream is being
downloaded, using dedicated circuitry. This requires much more hardware
than simple reconfiguration would. This is why the Spartan II was chosen.
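The on-the-fly relocation idea can be sketched in software. The fragment below assumes a deliberately simplified, hypothetical bitstream format, a plain list of (column address, frame data) pairs; real Virtex bitstreams encode frame addresses and checksums differently, so this only illustrates the principle of adding a column offset while the stream is copied:

```python
def relocate(bitstream, col_offset):
    """Rewrite frame addresses on the fly while the bitstream is copied.

    `bitstream` is a list of (col_addr, frame_data) pairs -- a purely
    hypothetical format; real Xilinx bitstreams encode frame addresses
    differently and also contain a CRC that must be recomputed.
    """
    return [(col + col_offset, data) for col, data in bitstream]

# A module compiled for columns 0..2 is moved to a slot starting at column 6.
module = [(0, b'\x11'), (1, b'\x22'), (2, b'\x33')]
relocated = relocate(module, 6)
```

In hardware, the same address rewriting is done by a small circuit inserted in the configuration path, so the stored bitstream never needs to be modified.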
Partial reconfiguration design 253
7.0.5 Memory
Six SRAM banks of 2 MBytes each are attached vertically on the top side of
the device, thus providing enough memory space for six different slots for
temporal data storage. The SRAMs can also be used for shared-memory
communication between neighboring modules, e.g., for streaming applications.
They are connected to the FPGA in such a way that the reconfiguration of a
given module will not affect the access of other modules.
[Figure: modules M1, M2 and M3 on the FPGA, connected to the SRAM banks and to the crossbar]
one side with its own frequency, while the receiver reads the data at the
other end with its own frequency.
First, the crossbar switch can now be implemented in the FPGA rather than
on a separate device. In this case, we need to attach all the peripherals on
one side of the block in which the crossbar is implemented. Modules can
then be connected on the other side of the block. We can even think of a
system in which all components are attached around the chip. The crossbar
can then be implemented as a ring, distributed around the chip.
The distribution of the resources, like the external RAM, need not be done in
a column-wise manner anymore. Resources should now be spread homogeneously
around the device, in order to allow different modules, which are
placed on different blocks, to access their own resources.
Figure 7.21 illustrates these enhancements of the Erlangen Slot Machine, using
the Virtex 4 and Virtex 5 FPGAs as previously explained.
Figure 7.21. Possible enhancement of the Erlangen Slot Machine on the Xilinx Virtex 4 and
Virtex 5 FPGAs
Despite the advantages provided by the new class of Virtex devices, a great
disadvantage results from this architectural enhancement: the communication.
For the column-wise reconfigurable devices, 1-D circuit switching was a simple
and efficient possibility to enforce communication among different modules
running on the FPGA. In a two-dimensional reconfigurable device, as is the
case with the Virtex 4 and Virtex 5, the difficulty of implementing the
communication increases. In 2-D we have a quadratic growth in the amount
of resources needed to implement the communication infrastructure, i.e., the
amount of resources needed to implement a 2-D communication infrastructure
is not merely twice the amount needed for a 1-D communication infrastructure,
but four times as much.
9. Conclusion
We have presented in this chapter the different possibilities to design for
partial reconfiguration on Xilinx devices. Our goal was not to rewrite a complete
manual on partial reconfiguration, since several descriptions of this exist
[128] [227]. The Xilinx manuals as well as the work of Sedcole [190] provide
very good descriptions of using the modular design flow and the early access flow.
Our primary motivation was to provide a kind of tutorial, based on our experience,
for which a workable design exists that is not too complex, but also not
trivial. The designs as well as all the scripts needed for compiling are available
for download from the book's web page. The difficulty in designing for partial
reconfiguration can be reduced if the target platform is well designed. One
such platform is the Erlangen Slot Machine, which was presented with the goal
of emphasizing the challenges in designing such a platform. The ESM is, however,
strongly influenced by the column-wise reconfigurable Virtex, and the price
we pay for flexibility is very high. With the advent of the new Virtex 4 and Virtex
5, enhancements can be made in the design in order to increase the flexibility
while reducing the costs.
One of the main problems remains the communication between the modules
at run-time. While this problem is somewhat better solved in a 1-D reconfigurable
device through the use of circuit switching and dedicated channels on
the module, its extension to a 2-D reconfigurable device is not feasible due
to the amount of resources needed. Viable communication approaches like
the DyNoC were presented in chapter 6; however, with the amount of resources
needed by those approaches, their feasibility is only given if manufacturers
provide coarse-grained communication elements in their devices.
Chapter 8
SYSTEM ON A PROGRAMMABLE CHIP
Developments in the field of FPGAs have been amazing over the last two
decades. FPGAs have moved from tiny devices with a few thousand gates,
only able to implement some finite state machines and glue logic, to very complex
devices with millions of gates as well as coarse-grained cores. In 2003,
a growth of 200% was observed in the capacity of Xilinx FPGAs over
less than 10 years, while in the meantime a 50% reduction in power
consumption was reached, with prices decreasing at a similar rate.
Other FPGA vendors have seen similar developments, and this trend is
likely to continue for a while. This development, together with the progress
in design tools, has boosted the acceptance of FPGAs in different computation
fields. With coarse-grained elements like CPUs, memory and arithmetic units
available in recent FPGAs, it is now possible to build a complete system consisting
of one or more processors, embedded memory, peripherals and custom
hardware blocks in a single FPGA. This opportunity limits the number of components
that must be soldered on a printed circuit board in order to get an FPGA
system working.
In this chapter, we present the different possibilities that exist to build those
systems consisting of one or more processors, peripherals and custom hardware
components.
We start with an introduction to systems on programmable chip, and then
we present some of the various elements usually needed to build those systems.
At the end of the chapter, we present a design approach for adaptive
multiprocessor systems on chip.
1. Introduction to SoPC
A system on chip (SoC) is usually defined as a chip that integrates the major
functional elements of a complete end product. Figure 8.1 presents the achieved
integration of PCB modules into a single chip.
Figure 8.1. Integration of PCB modules into a single chip: from system on PCB to SoC
1.1 Processor
Processors are the central processing and coordination units in systems on
programmable chip. Besides coordinating the complete system, they
are in charge of collecting data from the different peripheral modules or from
the memory, processing those data, and storing them in the memory or sending
them to the peripheral modules. The processor also initializes the peripherals as
well as the dedicated hardware modules on the chip, and manages the memory.
The most widely used processors are those used on Xilinx and Altera devices,
since those two companies control the largest part of the programmable device
market. Besides the processors offered by those two companies, other platform-independent
implementations exist. We list some of the available processors
next.
System on a Programmable Chip 261
Harvard architecture with separate 32-bit address bus and 32-bit data bus
seven Fast Simplex Links (FSL) that can be used to connect the processor
to custom hardware
a 32-bit single precision Floating Point Unit (FPU) in the IEEE-754 format
debug logic
The MicroBlaze soft processor core is available as part of the Xilinx Embed-
ded Development Kit (EDK), which includes a comprehensive set of system
tools to design an embedded application in a Xilinx FPGA.
Separate instruction and data caches, whose size, associativity and replacement
policy can be configured
1.2 Memory
A system on programmable chip needs memory for storing instructions and
data. Memory elements are available in small amounts on almost all modern
FPGAs. They are usually used to build the caches directly on the chip. For
applications that require more memory, external SRAM or DRAM must be
used. On-chip memory can be built from the dedicated memory elements, the
block RAMs, or from the LUTs. The use of LUTs in the second case to build
memory has two drawbacks: first, the amount of available computing resources
decreases, thus reducing the complexity of the logic that can be implemented
on the chip. Second, the memory built with LUTs is distributed across the
chip, meaning that chip interconnections must be used to connect the spread
LUTs. This leads to a decrease in memory access performance. On-chip
block RAMs are usually dual-ported. They therefore provide a nice possibility
to integrate custom hardware modules, in particular those working in two
different clock domains. The memory is used in this case to synchronize the
modules with each other.
1.3 Peripheral
Peripherals are used in the system for communication with the external
world and for debugging purposes. The peripheral components are usually provided
by the board manufacturers as ready-to-use modules that can be inserted
in the SoPC design. Because of their lower frequency, compared to that of
processors, peripherals should be connected to the low-speed bus, which in
turn can be connected to the processor bus through a bridge. The communication
between the processor and the available peripherals can be done either
through the I/O-mapping paradigm or through the memory-mapping paradigm. In
the first case, the processor addresses each peripheral using special instructions
that directly address the peripheral on the bus. In the second case, the peripheral is
assigned an address space in the memory, and the communication happens in
this space, with the processor writing into control registers that can be read by the
peripheral. The peripheral responds by writing into status registers that can be
read by the processor.
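The memory-mapped paradigm can be sketched with a small simulation. The register addresses, the command encoding and the peripheral's "computation" below are all invented for illustration; only the handshake pattern, the processor writing a control register and the peripheral answering in a status register, follows the description above:

```python
# Minimal model of memory-mapped communication: processor and peripheral
# see the same address space; the processor writes a command into a control
# register and polls a status register that the peripheral updates.
# Addresses, bit encodings and the doubling "computation" are invented.
CTRL_REG, STATUS_REG, DATA_REG = 0x00, 0x04, 0x08
START, DONE = 0x1, 0x1

memory_map = {CTRL_REG: 0, STATUS_REG: 0, DATA_REG: 0}

def processor_start(value):
    """Processor side: place the operand, then raise the START command."""
    memory_map[DATA_REG] = value
    memory_map[CTRL_REG] = START

def peripheral_step():
    """Peripheral side: on START, compute, signal DONE, clear the command."""
    if memory_map[CTRL_REG] & START:
        memory_map[DATA_REG] *= 2
        memory_map[STATUS_REG] = DONE
        memory_map[CTRL_REG] = 0

processor_start(21)
peripheral_step()
result = memory_map[DATA_REG]
```

On real hardware, the same pattern appears as volatile loads and stores to the peripheral's address window instead of dictionary accesses.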
1.5 Interconnection
The interconnection provides the mechanism for integrating all the
components previously described into a workable system on programmable chip.
The interconnection mechanism provides the communication channels, the interface
specification, the communication protocols and the arbitration policy
on the channels. The design of the interconnection infrastructure depends on
the target application. In a traditional system on chip, where the infrastructure
is fixed at chip production, the communication infrastructure must be as general
as possible in order to serve all classes of applications that can be implemented on
the chip. In programmable devices, the flexibility can be used to modify the
interconnection infrastructure according to the type of application to be implemented.
In this case, each application can first be analyzed in order to derive
and implement the best communication infrastructure for its computation. Despite
the great interest in networks on chip in the last couple of years, interconnection
on chip is dominated by the SoC communication paradigm, which is in
most cases bus-based. Leading existing solutions were previously developed
for SoCs before being adapted to SoPCs. The general connection mechanism
is shown in figure 8.2.
It usually consists of two different buses: a high-performance bus that is
used by the processor to access the memory, and a slow bus used to connect
the peripherals. High-performance dedicated hardware modules are connected
to the high-performance bus, while low-performance custom hardware components
are connected to the low-performance bus. A bridge is used to allow
communication to happen between two modules attached to the two different
buses. Besides those two buses, several possibilities exist to connect the
components directly. Dedicated lines or crossbar switches can be used, even in
cohabitation with the two previously described buses. The two well-established
bus systems in the world of programmable systems on chip are the CoreConnect
from IBM and the ARM AMBA.
the On-Chip Peripheral Bus (OPB), which is a secondary bus that can be
used to decouple the peripherals from the PLB in order to avoid a loss of
system performance. Peripherals like serial ports, parallel ports, UARTs,
GPIO, timers and other low-bandwidth devices should be attached to the
OPB. Access to the peripherals on the OPB by PLB masters is done
through a bridge, which is used as a slave device on the PLB and as a master
on the OPB. The bridge performs dynamic bus sizing, in order to allow
devices with different data widths to communicate.
Figure 8.3 illustrates the implementation of the OPB. Note that no tri-state
logic is required. The address and data buses are instead implemented using
a distributed multiplexer. This is a common technique for implementing
buses in programmable logic devices. All the master inputs to the bus are
ORed and the result is provided to all the slaves. The arbitration module
defines which master is granted the bus. All other masters must then place
a zero on their outputs. The OR of all outputs thus carries the value of
the bus master onto the bus.
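The ORed bus can be modeled in a few lines. The sketch below, with invented values, shows why the scheme works: since every non-granted master drives zero, the OR of all outputs equals the word driven by the granted master:

```python
def or_bus(master_outputs, grant):
    """Model of a distributed-multiplexer bus: only the granted master
    drives its value, all other masters drive 0, and the bus value seen
    by the slaves is the bitwise OR of all master outputs."""
    driven = [v if i == grant else 0 for i, v in enumerate(master_outputs)]
    bus = 0
    for v in driven:
        bus |= v
    return bus

# Two masters; the arbiter grants master 1, so its word appears on the bus.
value = or_bus([0xDEAD, 0xBEEF], grant=1)
```

The OR tree replaces the tri-state drivers that a traditional shared bus would need, which is why this structure maps well onto FPGA logic.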
the Device Control Register (DCR) Bus to allow lower performance status
and configuration registers to be read and written. It is a fully synchronous
bus that provides a maximum throughput of one read or write transfer every
two cycles. The DCR bus removes configuration registers from the memory
address map, reduces loading and improves bandwidth of the processor
local bus.
Figure 8.3. Implementation of the OPB for two masters and two slaves
nels, they allow multiple masters, and support split transactions and burst transfers.
The AMBA 2 interconnection infrastructure consists of two main buses:
the Advanced Peripheral Bus (APB), used to connect the slower peripherals.
A bridge is used to interface between the AHB and APB buses, and allows
slow peripherals to be accessed from components attached to the AHB.
using the large amount of available chip area to implement several processors,
thus creating so-called chip multiprocessing systems.
processors. The Intel Pentium D and the AMD Dual-Core Athlon 64
are other examples of multiprocessors on chip.
The work in [167] proposed an adaptive chip-multiprocessor architecture,
where the number of active processors is dynamically adjusted to the current
workload in order to save energy while preserving performance. The
proposed architecture is shared-memory based, with a fixed number of embedded
processor cores. The adaptivity consists in the possibility of switching the
individual processors on and off.
Programmable logic devices in general, and FPGAs in particular, have experienced
a continuous growth in their capacity, a constant decrease in their power
consumption and a permanent reduction of their price. This trend, which has
increased interest in using FPGAs as flexible hardware accelerators, is likely
to continue for a while.
After an unsuccessful attempt in the 90s to use FPGAs as co-processors in
computing systems, the field of FPGAs in high-performance computing is experiencing
a renaissance, with manufacturers like Cray now producing systems
made of high-speed processors coupled to FPGAs that act as hardware accelerators
[58]. However, current FPGA solutions for high-performance computing
still use fast microprocessors, and the result is systems with a very high
power consumption and power dissipation.
A combination of the chip multiprocessor paradigm and flexible hardware
accelerators can be used to increase the computation speed of applications. In
this case, FPGAs can be used as target devices in which a set of processors
and a set of custom hardware accelerators are implemented and communicate
together in a network.
FPGA manufacturers like Xilinx have been very active in this field by providing
ready-to-use components (soft or hard-core processors, bus-based interconnection
facilities, and peripherals) on the basis of which multiprocessors
on chip can be built. The Virtex II Pro FPGA, for example, provides up to
four PowerPC processors and several features like memory and DSP modules.
The adaptivity of the whole system can be reached by modifying the computation
infrastructure. This can be done at compile-time using full reconfiguration,
or at run-time by means of partial device reconfiguration. This results
in a multiprocessor on chip whose structure can be adapted at run-time to the
computation paradigm of a given application.
Unfortunately, the capabilities of those devices are not fully exploited. FPGA
devices like the Xilinx Virtex II Pro are usually used only to implement a small
hardware component. The two available PowerPC processors remain unused
most of the time. This is in part due to the lack of support for multiprocessor
design.
In this section, we investigate the use of Adaptive Multiprocessor on Chip
(AMoC) and present a design approach for such systems. We also present an
One of the main tasks in the design and implementation of a system like
that of figure 8.4 is the interconnection network. Contrary to other modules
(processors, peripherals, memory controllers and some custom hardware) that
for the whole class of communication media. It comprises buses, networks and
point-to-point connections. Periphery and HWAccelerator differ only in their
type of connection. While a HWAccelerator is used for internal components,
a Periphery is used to describe modules with external connections.
Having specified the system, the description may be functionally simulated
to observe the system behaviour, and afterwards be validated.
In the last step, the final system can be created by supplying a concrete
system description for the target platform.
When transforming an abstract system description into a concrete one, traceability
must be maintained. Therefore, in each of the two steps, the respective
XML input file is validated against its document type definition to prevent invalid
component specifications.
Subsequently, the given specification is processed by a Java application con-
sisting of three consecutive phases. In phase one, an XML parser extracts re-
be run on each processor, the configuration of the operating system and the
generation of the files needed for system start-up.
As in the automatic generation of the hardware infrastructure, we have
a SW-specification, which may be provided by the user, as with the HW-specification,
or may be the result of an analysis and code generation process. The SW-specification
describes whether a given processor is to be run with a standalone
application or whether an operating system is foreseen for that processor. For each
standalone application, additional information is given about the task it executes;
the task allocation results from splitting the application to be computed
MicroBlaze processors on the Xilinx ML310 board, featuring a Virtex II Pro 30
FPGA, were built. The primary goal was to test the bandwidth performance of a
system with multiple rings and a router for the ring interconnection. The design,
in a first non-optimized implementation, revealed an average real bandwidth of
36 Mbytes/s. This includes the complete transaction needed to code, send and
decode a message.
In order to have the system running with a real-life application, the singular
value decomposition (SVD) was implemented on the ML310 board, with the
number of processors varying from one to eight, and with matrices varying
from (4 × 200) to (128 × 200). The performance increase due to the use of
multiple processors was shown to be almost linear in the number of microprocessors.
The bulk of the computation in the SVD is the computation of the dot-products
of the column pairs, which is a good candidate for hardware implementation.
Custom multiply-accumulate (MAC) modules for computing those
column-pair multiplications were designed and directly connected to
the MicroBlaze processors via the fast simplex link (FSL) ports (figure 8.12), in
order to increase the performance of the whole system.
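Functionally, such a MAC module computes an ordinary dot product over the streamed column pair: one multiply and one accumulate per element pair. The sketch below shows this behavior in software; the actual FSL streaming interface is not modeled:

```python
def mac_dot(col_a, col_b):
    """Multiply-accumulate over a column pair, as a custom MAC module
    attached over FSL would compute it: one multiply and one add per
    streamed pair of elements."""
    acc = 0
    for a, b in zip(col_a, col_b):
        acc += a * b
    return acc

# Dot product of two length-4 columns, e.g. in one SVD rotation step.
d = mac_dot([1, 2, 3, 4], [5, 6, 7, 8])  # 5 + 12 + 21 + 32 = 70
```

Because the accumulation is a simple running sum, the hardware version needs only a multiplier, an adder and a register, which is why it maps so cheaply onto the FPGA.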
2.7 Adaptivity
As stated at the beginning of the chapter, adaptivity can be used for several
purposes. According to the application, a physical topology consisting of a
given arrangement of routers and processors may be built to best match the
computation paradigm of that application. The topology can be modified, even
at run-time, to better adapt to the changes observed in the application.
On the other hand, the adaptivity of the system can be reached by using
reconfigurable hardware accelerators, provided that the target device supports
partial reconfiguration.
Designing partially reconfigurable systems with unlimited capabilities is a
complex engineering task that requires a lot of hand work. In order to avoid this
and to keep things simple, a template for partial reconfiguration can first
be created for the target device. In this template, fixed locations must be selected
where the reconfiguration is allowed to take place. These act as placeholders
to accommodate reconfigurable modules at run-time. Any hardware
accelerator can be plugged into such a hardware block at run-time by means of
partial reconfiguration. This step can even be performed from within the device,
using the appropriate port, like the ICAP in the case of Xilinx devices.
3. Conclusion
In this chapter, we have addressed the system on programmable chip paradigm
and presented some advantages of using those architectures. The main components
needed to build such a system were presented, with a focus on the leading
manufacturers. The goal was not a comparative study of the different system
on programmable chip infrastructures. We rather wanted to provide the user with a
brief view of the existing solutions.
In the second part of the chapter, we have presented a design approach for
multiprocessor systems on FPGAs. Using reconfiguration, it is possible to
modify the hardware infrastructure to adapt it to the best computation paradigm
of the application being computed.
We have presented a framework to specify and implement a complete system
without knowledge of the vendor tools. Obviously, everything that is feasible
with the PinHat tool is also feasible with the current vendor tools. However, the
complexity and the difficulty of using those tools do not allow a newcomer
to generate designs with them. The goal here was to hide from the user the
complexity of multiprocessor design, whose support is very limited in the
current tools, and to allow him to focus on the efficient analysis and partitioning
of his application.
We believe that multiprocessing on FPGAs has a great potential to speed up
computation in embedded systems. In order to facilitate their use, systems like
PinHat are welcome to hide the complexity of the hardware generation and
let the designer focus on the analysis of the application and the definition of
the best adapted structure for its computation.
The adaptivity of the whole system is provided through the use of partial
reconfiguration in order to exchange running hardware modules at run-time.
However, in order to overcome the difficulty of designing for partial reconfiguration
with the current tools, a template-based approach is recommended, which
foresees predefined locations for placing components at run-time.
Chapter 9
APPLICATIONS
1. Pattern Matching
Pattern matching can be defined as the process of checking if a character
string is part of a longer sequence of characters. Pattern matching is used in
a large variety of fields in computer science. In text processing programs like
Microsoft Word, pattern matching is used in the search function. The purpose
is to match the keyword being searched against a sequence of characters
that builds the complete text. In database information retrieval, the contents of
different fields of a given database entry are matched against the sequence of
characters that builds the user request. Searching in genetic databases also uses
pattern matching to match the contents of character sequences from database
entries with the sequence of characters that builds a given query. In this case,
the alphabet is built upon a set of genomic characters. Speech recognition
and other pattern recognition tools also use pattern matching as a basic operation
on top of which complex intelligent functions may be built, in order to
better classify the audio sequences. Other applications using pattern matching
are: dictionary implementation, spam avoidance, network intrusion detection
and content surveillance.
Because text mining, whose primary role is the categorization of documents,
makes heavy use of pattern matching, we choose in this section to present
the use of pattern matching in text mining and to point out the possible use of
reconfiguration.
Document categorization is the process of assigning a given set of documents
from a database to a given set of categories. The categories can either
be manually defined by a user, or they can be computed automatically by a software
tool. In the first case the categorization is supervised, and in the second
case we have an unsupervised categorization. In categorization, the first step
usually consists of indexing the collection of available documents. This is usually
done through the so-called vector space representation. In this model, a
document is represented as a vector of keywords or terms present in the document.
A complete collection of n documents over a list of m keywords is
then represented by a term-by-document matrix A ∈ R^(m×n). An entry a_ij
in A represents the frequency of word i in document j. The term-by-document
matrix is then used for indexing purposes or for statistical analysis like
LSI (Latent Semantic Indexing) [67]. Building a term-by-document matrix
is done by scanning the documents of the given collection in order to find the
occurrences of keywords and return the corresponding entries in the matrix for
each document. Pattern matching is used for this purpose.
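The construction can be sketched in a few lines; the representation of documents as word lists and the sample keywords below are illustrative only:

```python
def term_by_document(docs, keywords):
    """Build the term-by-document matrix A: entry a_ij is the frequency
    of keyword i in document j (documents given as lists of words)."""
    return [[doc.count(w) for doc in docs] for w in keywords]

# Two documents over three keywords: A has one row per keyword and
# one column per document.
docs = [["fpga", "design", "fpga"], ["design", "flow"]]
A = term_by_document(docs, ["fpga", "design", "flow"])
```

The inner `doc.count(w)` is exactly the pattern-matching step that a hardware search engine accelerates: counting the occurrences of each keyword in the scanned text.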
The first advantage of using a reconfigurable device here is the inherent
fine-grained parallelism that characterizes the search. Many different words
can be searched for in parallel by matching the input text against a set of words
on different paths. The second advantage is the possibility of quickly exchanging
the list of searched words by means of reconfiguration.
Pattern matching was investigated in different works using different approaches
on FPGAs [52] [85] [104] [151] [101] [180] [34], each with a different overhead
in compiling the set of words down to the hardware and different capacities.
We define the capacity of a search engine as the number of words that
can be searched for in parallel. A large capacity also means a high complexity
of the function to be implemented in hardware, which in turn means a large
amount of resources. The goal in implementing a search engine in hardware is
to have a maximal hardware utilization, i.e., as many words as possible that can
be searched for in parallel. We present in this section various hardware implementations
of pattern matching, each with its advantages and drawbacks.
1 By dictionary, we mean here the set of keywords compiled into the reconfigurable device
Figure 9.1. Sliding windows for the search of three words in parallel
This method however has a main drawback, which is the redundancy in the
amount of register fields used to store the words as well as in the number of
comparators. Redundancy in the sliding-windows approach reduces the capac-
ity of such an implementation, thus making its use not competitive. In the case of
the Xilinx Virtex FPGA XCV300 with a maximum of 6144 flip-flops, a sliding
window of length 10 needs 10 × 8 = 80 flip-flops to store one target word in
the device. Assume that placement and routing permit a device utiliza-
tion of 80%. The number of words which can be folded into such an FPGA will
theoretically be in the range of 60, which is small compared to the number of
words which can be folded into the same device with a more efficient implemen-
tation. A better approach is to perform the comparison only once on a given
alphabet consisting of the set of common characters, and to use the result in the
evaluation of each single word. This leads to a reduced amount of register
fields and a reduced amount of comparators.
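The basic sliding-window scheme of figure 9.1 can be modeled behaviorally as follows (a Python sketch with illustrative names; in hardware every window cell is a register and all character comparisons run in parallel in a single cycle):

```python
from collections import deque

def sliding_window_search(text, words):
    """Software model of the sliding-window matcher: one window per target
    word, shifted by one character per step, compared in parallel.
    In an FPGA each window cell is a register and each comparison a
    comparator; here we simply iterate."""
    hits = []
    windows = {w: deque(maxlen=len(w)) for w in words}
    for pos, ch in enumerate(text):
        for w, win in windows.items():
            win.append(ch)                # shift the window by one character
            if list(win) == list(w):      # all comparators match at once
                hits.append((w, pos - len(w) + 1))
    return hits
```

The model makes the redundancy discussed above visible: each target word keeps its own copy of the last characters and its own set of comparisons, which is exactly what the shared-comparator improvement removes.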
In the SPLASH implementation, a hashing technique is used. A hash table in
memory stores the value of a presence bit for a particular target word. An
entry of zero in this table means that the word is not present, and a value of
one means that the word is probably present. A hash register (with a length
of 22 bits in the SPLASH implementation), which is initially set to zero, is
incrementally modified by the incoming superbytes. At the end of a word,
marked by a delimiter, the hash register is reset to zero to allow the next word
to be processed. The hash register is used to address the hash table of 2^22
pseudo-random mappings of words. The modification of the hash register
happens as follows: when a non-delimiter superbyte is encountered, the content
of the 22-bit hash register is updated by first XOR-ing the upper 16 bits of the
register with the incoming superbyte and a value of a hash function. The
modified value of the hash register is then circularly shifted by a fixed number
of bits. Upon reaching the end of a word, the content of the register is used to
address the hash table and determine the value of the presence bit. To reduce
the likelihood of false hits, a number of independent hash functions are
calculated for each word of the text, with the stipulation that each hash
function lookup of a word must result in a hit for that word to be counted as a
target word.
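The register update described above can be sketched as follows; the exact hash function values and the shift amount used in SPLASH are not given in the text, so both are illustrative assumptions here:

```python
def update_hash(h, superbyte, hfunc_val):
    """One step of the SPLASH-style hash register update (a sketch).
    The 22-bit register h is updated by XOR-ing its upper 16 bits with the
    incoming superbyte and a hash-function value, then rotating the whole
    register left by a fixed amount (3 is an arbitrary choice here)."""
    MASK22 = (1 << 22) - 1
    upper = (h >> 6) & 0xFFFF            # upper 16 bits of the 22-bit register
    upper ^= superbyte ^ hfunc_val       # fold in superbyte and hash value
    h = ((upper & 0xFFFF) << 6) | (h & 0x3F)
    h = ((h << 3) | (h >> 19)) & MASK22  # circular left shift by 3 bits
    return h
```

At a word delimiter, the resulting 22-bit value addresses the 2^22-entry presence-bit table, after which the register is reset to zero for the next word.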
Estimates of the performance of SPLASH text searching have been made
mainly on the basis of the critical path length returned by the place and route
tool. Overheads such as the communication between the host computer and the
boards, or the memory latency, have not been taken into consideration.
Figure 9.2. FSM recognizers for the word "conte": a) state diagram, b) transition table, c) basic
structure of the hardware implementation: 4 flip-flops are needed to encode a 5 × 6 transition table
Words that share a common prefix use a common path from the root corresponding to the
length of their common prefix. A split occurs at the node where the common
prefix ends. In the hardware implementation of a group of words with a com-
mon prefix, common flip-flops are used to implement the common path
(figure 9.3 a)).
Figure 9.3. a) Use of the common prefix to reduce the number of flip-flops of the common
word detector for "partir", "paris", "avale", "avant". b) Implementation without use of a common
prefix and common comparator set
For each character in the target alphabet only one comparator is needed.
The comparison occurs in this case, once at a particular location in the device.
Figure 9.4. Basic structure of an FSM-based word recognizer that exploits the common prefix
and a common set of characters
one leaf of the automaton is reached and the hit output is set to the index of
the matched target word.
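A behavioral model of the prefix-sharing recognizer of figures 9.3 and 9.4 can be sketched as follows (a Python sketch with illustrative names; trie nodes correspond to flip-flop states, and the shared per-character lookups model the common comparator set):

```python
def build_trie(words):
    """Build the trie of figure 9.3 a): words sharing a prefix share the
    states along that prefix; a split occurs where the prefix ends.
    A leaf stores the index of the matched word."""
    root = {}
    for idx, w in enumerate(words):
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['#'] = idx                   # leaf marker: index of the word
    return root

def match_stream(trie, text):
    """Run the recognizer over a text. A new match may start at every
    character, so the root is always re-activated; all active states
    advance in parallel, and reaching a leaf emits the word's index."""
    hits, active = [], []
    for ch in text:
        active.append(trie)               # activate the root state
        nxt = []
        for node in active:
            if ch in node:
                child = node[ch]
                if '#' in child:
                    hits.append(child['#'])
                nxt.append(child)
        active = nxt
    return hits
```

In hardware the set of active states is a bit vector updated in one cycle, whereas this model iterates over it sequentially.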
Figure 9.5. Processing steps of the FSM for the word "tictic"
2 Can the state machine recognize the word ’issip’ in a file containing the word Mississippi?[52]
2. Video Streaming
Video streaming is the process of performing computations on video data,
which is streamed through a given system, picture after picture.
Implementations of video streaming on FPGAs have attracted several researchers
in the past, resulting in the building of various platforms [39] [115] [198] [72]
[207] [140] [201] for various computations. Two main reasons are always
stated as motivation for implementing video streaming in FPGAs: the perfor-
mance, which results from exploiting the inherent parallelism in the target
applications in order to build a tailored hardware architecture, and the
adaptivity, which can be used to adjust the overall system by exchanging parts
of the computing blocks with better adapted ones.
In [193], Richard G.S. discusses the idea of parameterized program gener-
ation of convolution filters in an FPGA. A 2-D filter is assembled from a set
of multipliers and adders, which are in turn generated from a canonical serial-
parallel multiplier stage. Atmel application notes [13] discuss 3x3 convolu-
tion implementations with a run-time reconfigurable vector multiplier in Atmel
FPGAs. To overcome the difficulties of programming devices with classic
Hardware Description Languages (HDL) like VHDL and Verilog, Celoxica has
developed the Handel-C language. Handel-C is essentially an extended subset
of the standard ANSI-C language, specifically designed for use in a hardware
environment. The Celoxica Handel-C compiler, the DK1 development envi-
ronment, includes a tool, PixelStream, for the easy generation of video pro-
cessing functions for FPGA implementations. PixelStream offers a set of ba-
sic modules and functions that can be (graphically) combined to build a more
complex datapath in an FPGA. The resulting design can be mapped to one of the
Celoxica FPGA development boards.
We will not focus in this section on the details of video or image process-
ing. A comprehensive description of video processing algorithms and their
hardware implementation on FPGA is provided in [153].
The Sonic architecture [115] is a configurable computing platform for accel-
eration and real-time video image processing. The platform consists of plug-
in processing elements (PIPEs), which are FPGA daughter cards that can be
mounted on a PCI board plugged into a PCI slot of a workstation. A PIPEflow
bus connects adjacent PIPEs, while a global PIPE bus provides a connection
to all PIPEs. Sonic's architecture exploits the reconfiguration ca-
pabilities of the PIPEs to adapt part of the computation flow at run-time.
The Splash [97] architecture was also used in video processing. Its systolic
array structure makes it well suited to image processing.
The ESM platform introduced in section 7.0.2 presents an optimal pipelined
architecture for the modular implementation of video streams. Its organization
in exchangeable slots, each of which can be reconfigured at run-time to per-
form a different computation, makes it a viable platform for video streaming.
Moreover, the communication infrastructure available on the platform provides
modules with unrestricted access to their data, no matter on which slot they are
placed, thus increasing the degree of flexibility in the system.
The computation on video frames is usually performed in different steps,
while the pictures stream through the datapath. It therefore presents a nice
structure for a pipelined computation. This has led to a chained architecture
on the basis of which most video streaming systems are built. The chain con-
sists of a set of modules, each of which is specialized for a given computation.
The first module in the chain is in charge of capturing the video frames, while
the last module outputs the processed frames. Output can be rendered on a
monitor or can be sent as compressed data over a network. Between the first
and the last modules, several computational modules can be used according
to the algorithm implemented. A module can be implemented in hardware or
in software according to the goals. While software provides a great degree of
flexibility, it is usually not fast enough to carry out the challenging computations
required in video applications. ASICs can be used to implement the computa-
tionally demanding parts; however, ASICs do not provide the flexibility needed
in many systems. On a reconfigurable platform, the flexibility of a processor
can be combined with the efficiency of ASICs in order to build a high-perfor-
mance flexible system. The partial reconfiguration capability of reconfigurable
devices provides the possibility to replace a given module in the chain at run-
time.
Most video capture modules provide the frames to the system in a
pixel-by-pixel manner, leading to a serial computation on the incoming pixels.
Since many algorithms need the neighborhood of a pixel to compute its new
value, a complete block must be stored and processed for each pixel. Capturing
the neighborhood of a pixel is often done using a sliding-window data
structure of varying size. The size of the sliding window corresponds to that
of the neighborhood region of a pixel needed for the computation of the new value. As
shown in figure 9.6, a sliding window is a data structure used to sequentially
capture the neighborhood of pixels in a given image.
A given set of buffers (FIFOs) is used to update the window. The number
of FIFOs varies according to the size of the window. In each step, a pixel is read
from the memory and placed in the lower left cell of the window. Except for the
upper right pixel, which is disposed of, i.e. output, all the pixels in the right
part of the window are placed at the tail of the FIFO one level higher. The
processing part of the video is a normal image processing algorithm combining
some of the basic operators like:
Median filtering
Basic Morphological operations
Convolution
Edge detection
the sliding window and passes it to the third module, which processes the
pixel and saves it in its own memory or directly passes it to the next module.
This architecture presents a pipelined computation in which the computational
blocks are the modules that process the data frames. RAMs are used to tempo-
rarily store frames between two modules, thus allowing a frame to stream from
RAM to RAM and the processed pictures to the output.
This is also true for a median operator, which cannot be replaced by a Gauss
operator by just changing the parameters. The network capture module and
the camera capture module require two different algorithms for collecting the
pixels and bringing them into the system. The second possibility consists of re-
placing the complete module at run-time with a module of the same size, but
different in its structure, while the rest of the system keeps running. Recon-
figurable devices in general and FPGAs in particular fulfill these requirements.
Many available FPGAs can be partially reconfigured while the rest of the system
is running. Moreover, many FPGAs provide small on-chip memories, able
to hold part of a frame (the so-called region of interest). It is therefore possible
to perform many computations in parallel on different regions of interest, thus
increasing the performance and flexibility of video applications.
3. Distributed Arithmetic
Distributed arithmetic (DA) is certainly one of the most powerful tools for
computing the dot product of two vectors, one of which is constant, i.e. it
consists of constant values. DA exploits the nature of the LUT-based
computation provided by FPGAs by storing in a LUT all the possible results
for a set of variable combinations. Computation at run-time then only consists of
retrieving the results from the LUT, where they were previously stored. The el-
ements of the variable vector are used to address the look-up table and retrieve
partial sums in a bit-serial (1-BAAT, 1 Bit At A Time) manner. Investigations
in distributed arithmetic have been done in [215] [158] [223] [224] [33] [95]
[63] [62] [31].
One notable DA contribution is the work of White [215]. He pro-
posed the use of ROMs to store the precomputed values. The surrounding logic
to access the ROM and retrieve the partial sums had to be implemented on a
separate chip. Because of this impractical architecture, distributed arithmetic
could not be successful. With the appearance of SRAM-based FPGAs, DA
became an interesting alternative for implementing signal processing applications
in FPGAs [158] [223] [224]. Because of the availability of SRAMs in those
FPGAs, the precomputed values could now be stored on the same chip as the
surrounding logic.
In [33], the design automation of DA architectures was investigated and a
tool was designed to help the user in the code generation of a complete DA
design, and to perform investigations on the various tradeoffs. Also, an initial
attempt to implement DA for floating-point numbers in order to in-
crease the accuracy was presented. However, the amount of memory required
to implement such a solution makes it applicable only to small examples.
The idea behind distributed arithmetic is to distribute the bits of one
operand across the equation to be computed. This operation results in a new
one, which can then be computed in a more efficient way. The product of two
vectors X and A is given by the following equation:
\[
Z = X \times A = \sum_{i=0}^{n} (X_i \times A_i) \qquad (3.1)
\]
then added to the value in the accumulator. After w steps, the result is collected
from the accumulator.
Many enhancements can be made to a DA implementation. The size of the
DALUT, for example, can be halved if only positive values are stored. In this
case, the sign bit, which is the first bit of a number, is used to decide whether the
retrieved value should be added to or subtracted from the accumulated sum.
On the other hand, it is obvious that all the bit operations, i.e. the retrievals of
a value from the DALUT, are independent from each other. The computation
can therefore be performed in parallel. The degree of parallelism in a given
implementation depends on the memory available to implement the DALUTs.
In the case where w DALUTs and datapaths can be instantiated in parallel,
the retrieval of all partial sums can be done in only one step, meaning that the
complete computation can be done in only one step instead of w as in the serial
case.
In general, if k DALUTs are instantiated in parallel, i.e. for a computation
performed on a k-BAAT basis, then w/k steps are required to retrieve all the
partial sums, which can be directly accumulated, thus leading to a run-time of
w/k steps. Figure 9.9 shows a datapath for the DA computation.
The input values are segmented into different fields, each of which is assigned
to a datapath for the retrieval of the partial sums from the corresponding DA-
LUT. The values retrieved from the k DALUTs are shifted by the correspond-
ing amounts and added to the accumulated sum. After w/k steps, the result can
be collected from the accumulator.
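The 1-BAAT scheme can be summarized in software as follows (a Python sketch for unsigned w-bit operands with illustrative names; the sign-bit halving and the k-BAAT parallelization described above are left out for brevity):

```python
def build_dalut(A):
    """Precompute the DALUT for the constant vector A: entry j holds the
    sum of those constants A[i] whose bit i is set in the address j."""
    n = len(A)
    return [sum(A[i] for i in range(n) if (j >> i) & 1)
            for j in range(1 << n)]

def da_dot_product(X, A, w=8):
    """1-BAAT distributed arithmetic (equation 3.1): in step b, bit b of
    every X[i] forms the DALUT address; the retrieved partial sum is
    weighted by 2^b and accumulated. After w steps the dot product is
    complete."""
    dalut = build_dalut(A)
    acc = 0
    for b in range(w):                    # w steps, one bit at a time
        addr = 0
        for i, x in enumerate(X):
            addr |= ((x >> b) & 1) << i   # bit-slice of the variable vector
        acc += dalut[addr] << b           # shift-accumulate the partial sum
    return acc
```

The correctness follows from exchanging the two sums: accumulating 2^b times the bit-slice sums over all b reproduces each X_i bit by bit.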
for each number. The integer part is separated from the fractional part using
a point. Because the point is only imaginary and not physically present
in the number representation, operations on fixed-point numbers do not differ
from those on normal integer numbers. The datapath previously presented
can therefore be used without modification for real numbers represented as fixed-
point. Representing real numbers as fixed-point is advantageous in terms of compu-
tation speed and the amount of resources required to implement the datapath.
However, the ranges of fixed-point representations as well as their precision
are small compared to those of floating-point representations. Therefore we
would like to handle real numbers as floating-point.
In this section we present a concept first investigated in [33] for handling
real numbers, represented as floating-point in the IEEE 754 format, using dis-
tributed arithmetic. In IEEE 754 format, a number X is represented by a sign
bit S, a biased exponent E, and a mantissa M, i.e. X = (−1)^S × 1.M × 2^(E−bias).
We call the first DALUT, which stores the exponents, the EDALUT, and the sec-
ond DALUT, which stores the mantissas, the MDALUT. The size of the EDALUT,
size(EDALUT), as well as that of the MDALUT, size(MDALUT), are given in
equations (3.7).
Figure 9.10. Datapath of the distributed arithmetic computation for floating-point numbers
For each possibility, the user is provided with the area and speed of the design. Real numbers can
be handled either as fixed-point or as floating-point in the IEEE 754 format
with the technique previously defined. The width of the mantissa as well as
that of the exponent has to be provided. For the final design, the tool generates
a description in the Handel-C language.
3.4 Applications
In this section, we present one application, the implementation of a recur-
sive convolution algorithm for time-domain simulation of optical multimode
intrasystem interconnects, which was substantially sped up through the use of
distributed arithmetic. Another application that has benefited from an imple-
mentation as DA is the adaptive controller. Because section 4 is devoted to
adaptive controllers, we explain only the first application here.
Figure 9.11. An optical multimode waveguide is represented by a multiport with several trans-
fer paths.
For this equation, different tradeoffs were investigated in the framework pre-
sented in [33] for the generation and evaluation of DA trade-off implementations.
Handel-C code was generated and the complete design was implemented
on a system built around the Celoxica RC1000-PP board, equipped with a Xilinx
Virtex 2000E FPGA and plugged into a workstation.
4. Adaptive Controller
In this section, we investigate the next field of application of reconfigurable
devices, namely the implementation of adaptive controllers, also identified here
as multi-controllers, using partial reconfiguration.
We will first introduce the multi-controller concept. Thereafter, we inves-
tigate its implementation using distributed arithmetic. Finally, the imple-
mentation of the whole design using partial reconfiguration is shown.
Figure 9.12. Screenshot of the 6-parallel DA implementation of the recursive convolution
equation on the Celoxica RC1000-PP platform
4 At a given time, the active controller is the one which controls the plant.
measurements of physical values of the plant. The strategy of the supervisor can
vary from simple boolean combinations of the input values to very complex
reasoning techniques [206].
(Figure: multi-controller architecture, in which the supervisor selects one of the controller modules CM 1 ... CM n through a multiplexer to control the plant.)
state vector of the controller. The matrices A, B, C and D are used for the
calculation of the outputs based on the inputs.
\[
\underbrace{\begin{pmatrix}
x_1(k+1) \\ \vdots \\ x_s(k+1) \\ y_1(k) \\ \vdots \\ y_q(k)
\end{pmatrix}}_{z}
=
\underbrace{\begin{pmatrix}
a_{11} & \cdots & a_{1s} & b_{11} & \cdots & b_{1p} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
a_{s1} & \cdots & a_{ss} & b_{s1} & \cdots & b_{sp} \\
c_{11} & \cdots & c_{1s} & d_{11} & \cdots & d_{1p} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
c_{q1} & \cdots & c_{qs} & d_{q1} & \cdots & d_{qp}
\end{pmatrix}}_{M}
\underbrace{\begin{pmatrix}
x_1(k) \\ \vdots \\ x_s(k) \\ u_1(k) \\ \vdots \\ u_p(k)
\end{pmatrix}}_{v}
\qquad (4.3)
\]
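One sampling step of equation (4.3) amounts to the matrix-vector product z = M v, i.e. x(k+1) = A x(k) + B u(k) and y(k) = C x(k) + D u(k); a plain-Python sketch of the computation each controller module performs (function names are ours, and since A, B, C, D are constant, each product is a candidate for distributed arithmetic):

```python
def controller_step(A, B, C, D, x, u):
    """One sampling step of a state-space controller:
    x(k+1) = A x(k) + B u(k),  y(k) = C x(k) + D u(k).
    A, B, C, D are lists of rows; x and u are the current state and
    input vectors."""
    def matvec(M, v):
        # dot product of each matrix row with the vector
        return [sum(m * vj for m, vj in zip(row, v)) for row in M]
    x_next = [a + b for a, b in zip(matvec(A, x), matvec(B, u))]
    y = [c + d for c, d in zip(matvec(C, x), matvec(D, u))]
    return x_next, y
```
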
Figure 9.14. Adaptive controller architecture. Left: the one-slot implementation; right: the
two-slot implementation
into the reconfigurable slot, which has predefined interfaces to the supervisor
and to the plant.
The problem with this approach is the vacuum which arises on switching
from one module to the next. Because only one slot is available, the re-
configuration of this slot will place the plant in a "floating state" in which
it is not really controlled. In order to avoid this situation, the
approach developed in [64] was to use two slots: one active slot and a pas-
sive one (figure 9.14 right). The active slot is in charge of controlling the plant,
while the reconfiguration takes place in the passive slot. Whenever the super-
visor decides to replace a controller module, the reconfiguration first takes
place in the passive slot. Thereafter, the control of the plant is given to
the newly configured module, which becomes the active one, while the formerly
active one becomes passive.
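The handover protocol of the two-slot approach can be sketched as follows (a behavioral Python model with illustrative names; loading a module object stands in for partially reconfiguring a slot):

```python
class TwoSlotController:
    """Sketch of the two-slot scheme of [64]: the active slot keeps
    controlling the plant while the passive slot is reconfigured, and the
    roles are swapped only once the new module is fully in place, so the
    plant is never left uncontrolled."""
    def __init__(self, initial_module):
        self.slots = [initial_module, None]   # slot 0 active, slot 1 passive
        self.active = 0

    def switch_to(self, new_module):
        passive = 1 - self.active
        self.slots[passive] = new_module      # reconfigure the passive slot
        self.active = passive                 # handover: passive becomes active

    def control(self, measurement):
        # the plant is always driven by the currently active slot
        return self.slots[self.active](measurement)
```
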
thought of a few years ago. Applications like e-commerce, e-government, virtual
private networks, and on-line banking must provide a high degree of security.
Over the years, a large variety of standards like Triple-DES, Advanced Encryp-
tion Standard (AES), Data Encryption Standard (DES), RSA, OpenPGP, Ci-
pherSaber, IPsec, Blowfish, ANSI X9.59, CAST, RC4 and RC6 have been
developed to provide high security. With this large variety of standards and the
customized implementation possibilities for each standard, cryptography can
be seen as one of the most versatile application domains of computer science.
Depending on criteria like speed, degree of flexibility, and degree of security,
single implementations of cryptographic applications were developed either as
software or as intellectual property components.
In [221] the advantages of using reconfigurable hardware in cryptography
are listed. The author focuses on the traditional advantages of flexibility and
performance. Flexibility is important because it offers the possi-
bility of using the same hardware to switch from one algorithm to the next
at run-time, according to factors like the degree of security, the compu-
tational speed, or the power consumption. A given algorithm can also be tuned
according to some parameters. Moreover, algorithms that have been broken
and whose security is no longer ensured can be changed by means of re-
configuration. The system can easily be upgraded to include new standards
developed after the system was deployed. The corresponding algo-
rithm can then be compiled and included in the library of bitstreams for
the device configuration. The second advantage provided by reconfigurable
hardware, namely performance, can be used to efficiently implement the
components, by exploiting the inherent parallelism and building efficient operators
for computing boolean operations on very large amounts of data. This results
in a high throughput at low cost. Experiments reported a throughput
of 12 Gbit/s for an implementation of the block cipher AES on a Virtex 1000
FPGA, using 12,600 slices and 80 RAMs [90], while an ASIC implementation,
the Amphion CS5240TK [150] clocked at 200 MHz, could provide twice
the throughput of the FPGA solution. The same algorithm, implemented on a
DSP, the TI TMS320C6201, provided a throughput of 112.3 Mbit/s [222], while a
throughput of 718.4 Mbit/s could be reached on a counterpart implementation
on a Pentium III [10].
The general architecture of an adaptive cryptographic engine proposed by
Prasanna and Dandalis [61] [178] basically consists of a database holding the
different configurations that can be downloaded at run-time onto the device,
an FPGA for instance, to perform the computation, and a configuration
controller to perform the reconfiguration, i.e. to download the corresponding
bitstream from the database into the FPGA. Each bitstream represents a given
algorithm implementing a given standard, tuned with some parameters ac-
cording to the current user's needs.
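The engine's control logic can be sketched as follows (a Python model with illustrative names; the bitstream database and the load operation are stand-ins for the real configuration-port access):

```python
class AdaptiveCryptoEngine:
    """Sketch of the engine of Prasanna and Dandalis [61] [178]: a database
    of bitstreams, each implementing one standard tuned with fixed
    parameters, and a configuration controller that loads the requested
    one into the device."""
    def __init__(self, database):
        self.database = database   # (standard, params) -> bitstream
        self.loaded = None         # key of the currently configured bitstream

    def configure(self, standard, params):
        key = (standard, params)
        if key not in self.database:
            raise KeyError("no bitstream for " + repr(key))
        if self.loaded != key:     # reconfigure only when a change is needed
            self.loaded = key      # stands in for writing the config port
        return self.database[key]
```
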
single chip. Also, in contrast to the adaptive architecture for a control system
presented in figure 9.14, the loader module resides within the device. Depend-
ing on the implementation chosen, the loader can reside inside or outside the
device. If a Xilinx Virtex chip is used and the configuration takes place via
the ICAP port, then the loader, which is the ICAP module in this case, is au-
tomatically available inside the device. However, if the configuration happens
through the normal SelectMap port, then an external loader module is needed
to collect configuration data from the database and copy them to the con-
figuration port.
In figure 9.15, the architecture is logically divided into two main blocks: a
fixed one, which remains continuously on the chip and consists of the parts which
are common to all the cryptographic algorithms in general, or common to the
algorithms of a given class only. In the figure, we show only one reconfig-
urable slot; however, it can be implemented as a set of configurable blocks,
each of which can be changed by means of reconfiguration to realize a given
customized standard.
The last point concerns the development of building blocks that will be com-
piled into bitstreams to be downloaded into the device at run-time. A designer
is no longer required to focus on the hardware implementation of the crypto-
graphic algorithms. A lot of work has been done in this direction and the results are
available. One mostly needs to focus on the architecture of the overall system
and find out how a viable partitioning can be done, according to the reconfigu-
ration scheme.
Most of the work has focused on various implementations of a given ap-
proach like RSA [185] in [155] [88] [154] [232] [50], or the implementa-
tions described in [78] [19] [20] [47] [145] [152] [15], mostly parameterizable
and based on the elliptic curve approach [138] [157]. Generators for pro-
ducing a customized description in a hardware description language have been
developed, for example in [47]. These can be used to generate the variety of
configurations that will be used at run-time to move from one implementation
to the next.
The ideal realization of a software defined radio would push the ADC and
DAC as close as possible to the antenna. The practical realization is however,
as explained in [217], very challenging.
high-level programming tools. The capacity of FPGAs was too small to allow a
real migration of software components into hardware. Furthermore, the perfor-
mance bottleneck posed by the slow communication between host processor
and FPGAs often nullified the speed-ups. Finally, there were no useful software-like
description languages that could have encouraged software designers to write
programs for such machines and to implement efficient libraries for reusable data
stream management. Programming CCMs meant tedious low-level hardware
design, which was not attractive at all.
The last couple of years brought a pivotal change. FPGA logic densities
increased dramatically, now allowing for massive bit-level parallelism. Avail-
able high-level languages like Handel-C [125] and ImpulseC [175], as well
as progress in compilation technology, make the logic resources available to
high-level application programmers.
Boosted by the recent engagement of companies such as Cray [58], SRC
Computers [59], ClearSpeed [51], Nallatech [165], and SGI [191], CCMs
are experiencing a renaissance. Despite the use of new high-speed interfaces
and efficient datastream management protocols between the components in the
new systems, the new architectures are built on the same models as the old
ones from the 90s. They usually consist of a set of FPGAs on the same board with
one or more processors. The processors control the whole system and con-
figure the FPGAs at run-time with the dedicated functions to be accelerated in
hardware.
While considerable progress has been made, the cost-effectiveness of the ma-
chines provided by the firms previously mentioned still has to be proven.
Cost-effectiveness cannot be evaluated only on the basis of speed-ups ob-
served in some classes of applications. It is not certain that a comparison made in
terms of performance/$, performance/J, and performance/sqm will favor fur-
ther deployment of those systems. Craven and Athanas recently provided in
[57] a study on the viability of FPGAs in supercomputers. Their study is
based on the cost of realizing floating-point arithmetic in FPGAs. The cost includes not
only the performance gain, but also the price to pay. The authors came to the
conclusion that the price to pay for an FPGA-based supercomputer is too high for
the marginal speed-ups.
High-precision floating-point arithmetic usually makes up the bulk of computa-
tion in high-performance computing. Despite the progress made in develop-
ing floating-point units in hardware, floating-point computation and FPGAs are
still a difficult union. The advantage provided by FPGAs in customized im-
plementations of floating-point arithmetic, as described by Shirazi [192], is unlikely to help
here, because the datapath must provide customization that consumes the largest
amount of resources. Coarse-grained elements like the embedded multipliers
in FPGAs, or coarse-grained reconfigurable devices, provide the best prerequi-
sites for the use of reconfigurable devices in supercomputers.
8. Conclusion
We have presented in this chapter a couple of applications that can take
advantage of the flexibility as well as the performance of reconfigurable devices.
For each application targeted, we placed the focus mostly on the architectural
organization, because the goal was not necessarily to present the detailed im-
plementation of an application. However, we have presented a comprehensive
description of the pattern matching application, while keeping the other pre-
sentations short. This is due to the low attention paid to pattern matching in
hardware in available textbooks. Given the large amount of literature on the other
presented topics, like image processing and control, we chose not to focus on the
details of those applications.
A lot of work has been done in the last two decades and a large number
of applications have been implemented on FPGAs, the main representative
of reconfigurable devices. We could neither present nor cite all available imple-
mentations here. We have rather focused on a few where we could show
the usefulness of reconfigurability. Finally, we would like to emphasize that
despite two decades of research, the existence of a "killer application" could
not be shown for reconfigurable devices, thus limiting the acceptance of recon-
figuration in the industry. The existence of a killer application would certainly
boost the development of reconfigurable devices, leading to new classes of de-
vices, tools and programmers. Despite this missing step, research keeps going,
and the support of the industry is needed more than it has ever been.
References
[1] G. R. AB, MicroBlaze Processor Reference Guide: Embedded Development Kit EDK
8.2i, 2005. [Online]. Available: http://www.gaisler.com
[2] A. Ahmadinia, C. Bobda, S. Fekete, J. Teich, and J. van der Veen, “Optimal routing-
conscious dynamic placement for reconfigurable computing,” in 14th International
Conference on Field-Programmable Logic and Application, ser. Lecture Notes in
Computer Science, vol. 3203. Springer-Verlag, 2004, pp. 847–851, available at
http://arxiv.org/abs/cs.DS/0406035.
[3] A. Ahmadinia, C. Bobda, M. Bednara, and J. Teich, “A new approach for on-line place-
ment on reconfigurable devices,” in Proceedings of the 18th International Parallel and
Distributed Processing Symposium (IPDPS) / Reconfigurable Architectures Workshop
(RAW), 2004.
[4] A. Ahmadinia, C. Bobda, J. Ding, M. Majer, J. Teich, S. P. Fekete, and J. C. van der
Veen, “A practical approach for circuit routing on dynamic reconfigurable devices,”
in RSP ’05: Proceedings of the 16th IEEE International Workshop on Rapid System
Prototyping (RSP’05). Washington, DC, USA: IEEE Computer Society, 2005, pp.
84–90.
[5] C. Alpert and A. Kahng, “Geometric embeddings for faster and better multi-way netlist
partitioning,” 1993.
[6] C. J. Alpert and A. B. Kahng, “Multi-way partitioning via spacefilling curves and dy-
namic programming,” in Design Automation Conference, 1994, pp. 652–657.
[7] C. J. Alpert and S.-Z. Yao, “Spectral partitioning: The more eigenvectors, the better,” in
Design Automation Conference, 1995, pp. 195–200.
[8] ADM-XRC-II Xilinx Virtex-II PMC, Alpha Data Ltd., 2002, http://www.alpha-data.com/adm-xrc-ii.html.
[18] K. Bazargan, R. Kastner, and M. Sarrafzadeh, “Fast template placement for reconfig-
urable computing systems,” In IEEE Design and Test - Special Issue on Reconfigurable
Computing, vol. January-March, pp. 68–83, 2000.
[19] M. Bednara, M. Daldrup, J. Shokrollahi, J. Teich, and J. von zur Gathen, “Tradeoff anal-
ysis of fpga based elliptic curve cryptography,” in Proc. of the IEEE International Sym-
posium on Circuits and Systems (ISCAS-02), Scottsdale, Arizona, U.S.A, May 2002.
[20] M. Bednara, M. Daldrup, J. von zur Gathen, J. Shokrollahi, and J. Teich, “Reconfig-
urable implementation of elliptic curve crypto algorithms.” in IPDPS, 2002.
[21] L. Benini and G. De Micheli, “Networks on chips: A new soc paradigm,” IEEE Computer,
January 2001.
[22] J. L. Bentley, “Multidimensional binary search trees used for associative searching,”
Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.
[24] J. L. Bentley and D. Wood, “An optimal worst case algorithm for reporting intersections
of rectangles,” IEEE Trans. Comput., vol. C-29, pp. 571–577, 1980.
[26] N. B. Bhat and D. D. Hill, “Routable technology mapping for lut fpgas,” in ICCD ’92:
Proceedings of the 1991 IEEE International Conference on Computer Design on VLSI
in Computer & Processors. Washington, DC, USA: IEEE Computer Society, 1992,
pp. 95–98.
[27] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan, “Linear time bounds for
median computations,” in Proc. Fourth Annual ACM Symposium on Theory of Comput-
ing, 1972, pp. 119–124.
[29] C. Bobda and N. Steenbock, “A rapid prototyping environment for distributed recon-
figurable systems,” in 13th IEEE International Workshop On Rapid System Prototyp-
ing(RSP’02), Darmstadt Germany. IEEE Computer Society, 2002.
[30] C. Bobda, “Synthesis of dataflow graphs for reconfigurable systems using temporal par-
titioning and temporal placement,” Dissertation, Universität Paderborn, Heinz Nixdorf
Institut, Entwurf Paralleler Systeme, 2003, ISBN 3-935433-37-9.
[34] C. Bobda and T. Lehmann, “Efficient building of word recognizer in fpgas for term-
document matrices construction,” in Field Programmable Logic and Applications FPL
2000, R. Hartenstein and H. Grünbacher, Eds. Villach, Austria: Springer, 2000, pp.
759–768.
[35] C. Bobda, M. Majer, D. Koch, A. Ahmadinia, and J. Teich, “A dynamic noc approach
for communication in reconfigurable devices,” in Proceedings of International Con-
ference on Field-Programmable Logic and Applications (FPL), ser. Lecture Notes in
Computer Science (LNCS), vol. 3203. Antwerp, Belgium: Springer, Aug. 2004, pp.
1032–1036.
[37] G. J. Brebner, “A virtual hardware operating system for the xilinx xc6200,” in FPL ’96:
Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart
Applications, New Paradigms and Compilers. London, UK: Springer-Verlag, 1996,
pp. 327–336.
[38] R. P. Brent and F. T. Luk, “The solution of singular-value and eigen-value problems on
multiprocessor arrays,” SIAM J. Sci. Stat. Comput., vol. 6, no. 1, pp. 69–84, 1985.
[41] J. M. P. Cardoso and H. C. Neto, “An enhanced static-list scheduling algorithm for tem-
poral partitioning onto rpus,” in IFIP TC10 WG10.5 10 Int. Conf. on Very Large Scale
Integration(VLSI’99). Lisboa, Portugal: IFIP, 1999, pp. 485 – 496.
[46] K.-C. Chen, J. Cong, Y. Ding, A. B. Kahng, and P. Trajmar, “Dag-map: Graph-based
fpga technology mapping for delay optimization,” IEEE Design and Test of Computers,
vol. 09, no. 3, pp. 7–20, 1992.
[50] M. Ciet, M. Neve, E. Peeters, and J.-J. Quisquater, “Parallel fpga implementation of
rsa with residue number systems - can side-channel threats be avoided? - extended
REFERENCES 323
[52] W. Cockshott and P. Foulk, “A low-cost text retrieval machine,” IEEE PROCEEDINGS,
vol. 136, no. 4, pp. 271–276, July 1989.
[53] J. Cong and Y. Ding, “Flowmap: an optimal technology mapping algorithm for delay
optimization in lookup-table based fpga designs.” IEEE Trans. on CAD of Integrated
Circuits and Systems, vol. 13, no. 1, pp. 1–12, 1994.
[54] J. Cong and Y. Ding, “Combinational logic synthesis for lut based field programmable
gate arrays,” ACM Trans. Des. Autom. Electron. Syst., vol. 1, no. 2, pp. 145–204, 1996.
[57] S. Craven and P. Athanas, “Examining the viability of fpga supercomputing,” EURASIP
Journal on Embedded Systems, vol. 2007, pp. Article ID 93 652, 8 pages, 2007,
doi:10.1155/2007/93652.
[58] Cray, “Cray xd1 supercomputer,” Cray Inc. home page. [Online]. Available: http:
//www.cray.com/products/xd1/
[60] W. J. Dally and B. Towles, “Route packets, not wires: on-chip interconnection net-
works,” in Proceedings of the Design Automation Conference, Las Vegas, NV, Jun.
2001, pp. 684–689.
[61] A. Dandalis and V. K. Prasanna, “An adaptive cryptographic engine for internet protocol
security architectures,” ACM Trans. Des. Autom. Electron. Syst., vol. 9, no. 3, pp. 333–
353, 2004.
[62] K. Danne, “Distributed arithmetic FPGA design with online scalable size and perfor-
mance,” in Proceedings of the 17th Symposium on Integrated Circuits and Systems
Design (SBCCI’04). ACM Press, New York, NY, USA, 7 - 11 Sep. 2004, pp.
135–140.
[68] A. DeHon, “DPGA-coupled microprocessors: Commodity ICs for the early 21st
century,” in IEEE Workshop on FPGAs for Custom Computing Machines, D. A. Buell
and K. L. Pocek, Eds. Los Alamitos, CA: IEEE Computer Society Press, 1994, pp.
31–39. [Online]. Available: citeseer.ist.psu.edu/dehon94dpgacoupled.html
[70] Digilent Inc., “The vdec1 video decoder board.” [Online]. Available: http:
//www.digilentinc.com/Products/Detail.cfm?Prod=VDEC1
[71] F. Dittmann, A. Rettberg, and F. Schulte, “A y-chart based tool for reconfigurable sys-
tem design,” in Workshop on Dynamically Reconfigurable Systems (DRS). Innsbruck,
Austria: VDE Verlag, 17 Mar. 2005, pp. 67–73.
[74] W. E. Donath and A. J. Hoffman, “Algorithms for partitioning of graphs and computer
logic based on eigenvectors of connection matrices,” IBM Technical disclosure Bulletin,
vol. 15, no. 3, pp. 938–944, 1972.
[82] S. P. Fekete, E. Köhler, and J. Teich, “Optimal fpga module placement with temporal
precedence constraints,” Technische Universität Berlin, Tech. Rep. 696.2000, 2000.
[83] C. M. Fiduccia and R. Mattheyses, “A linear-time heuristic for improving network par-
titions,” in Proceedings of the 19th Design Automation Conference, 1982, pp. 175–181.
[84] L. Ford and D. Fulkerson, Flows in Networks. Princeton University Press, 1962.
[87] R. J. Francis, J. Rose, and K. Chung, “Chortle: a technology mapping program for
lookup table-based field programmable gate arrays,” in DAC ’90: Proceedings of the
27th ACM/IEEE conference on Design automation. New York, NY, USA: ACM Press,
1990, pp. 613–619.
[88] J. Fry and M. Langhammer, “Rsa and public key cryptography in fpgas.” [Online].
Available: www.altera.com/literature/cp/rsa-public-key-cryptography.pdf
[90] K. Gaj and P. Chodowiec, “Comparison of the hardware performance of the aes candi-
dates using reconfigurable hardware.” in AES Candidate Conference, 2000, pp. 40–54.
[92] D. D. Gajski and M. Reshadi, “Application and advantages,” CECS, UC Irvine, Tech-
nical Report 04-12, 2004.
[94] L. Geppert, “Sun’s big splash,” IEEE Spectrum Magazine, pp. 21–29, January 2005.
[95] J. Gerling, K. Danne, C. Bobda, and J. Schrage, “Distributed arithmetics for recursive
convolution of optical interconnects,” in EOS Topical Meeting, Optics in Computing
(OIC), Engelberg (Switzerland), Apr. 2004, pp. 65–66.
[96] P. B. Gibbons and S. S. Muchnick, “Efficient instruction scheduling for a pipelined ar-
chitecture,” in SIGPLAN ’86: Proceedings of the 1986 SIGPLAN symposium on Com-
piler construction. New York, NY, USA: ACM Press, 1986, pp. 11–16.
[99] G. H. Golub and C. F. Van Loan, Matrix Computations. North Oxford Academic Pub-
lishing, 1983.
[102] S. Guccione, D. Levi, and P. Sundararajan, “Jbits: A java-based interface for reconfig-
urable computing,” 1999.
[104] B. Gunther and G. Milne, “Assessing document relevance with run-time reconfigurable
machines,” in IEEE Workshop on FPGAs for Custom Computing Machines. Napa
California: IEEE, 1996, pp. 9–16.
[105] B. Gunther and G. Milne, “Hardware-based solution for message filtering,” School of
Computer and Information Science, Tech. Rep., 1996.
[106] R. H. Güting, “An optimal contour algorithm for iso-oriented rectangles,” J. Algorithms,
vol. 5, pp. 303–326, 1984.
[111] R. Hartenstein, Morphware and Configware, A. Y. Zomaya, Ed. New York: Springer-
Verlag, 2006.
[113] R. Hartenstein, A. Hirschbiel, and M. Weber, “Xputers - an open family of non von neu-
mann architectures,” in 11th ITG/GI Conference on Architektur von Rechensystemen.
VDE-Verlag, 1990.
[115] S. D. Haynes, J. Stone, P. Y. K. Cheung, and W. Luk, “Video image processing with the
sonic architecture,” Computer, vol. 33, no. 4, pp. 50–57, 2000.
[116] P. Healy and M. Creavin, “An Optimal Algorithm for Rectangle Placement,” Dept. of
Computer Science and Information Systems, University of Limerick, Limerick, Ireland,
Tech. Rep. UL-CSIS-97-1, Aug. 1997.
[118] B. Hendrickson and R. Leland, “An improved spectral graph partitioning algorithm for
mapping parallel computations,” SIAM Journal on Scientific Computing, vol. 16, no. 2,
pp. 452–469, 1995. [Online]. Available: citeseer.nj.nec.com/hendrickson95improved.
html
[120] Hirata et al., “An elementary processor architecture with simultaneous instruction is-
suing from multiple threads,” in Proc. Intl. Symp. Computer Architecture, Assoc. of
Computing Machinery, 1992, pp. 136–145.
[122] J. Huie, P. D’Antonio, R. Pelt, and B. Jentz, “Synthesizing fpga cores for software-
defined radio.” [Online]. Available: www.altera.com/literature/cp/fpga-cores-for-sdr.
pdf
[123] S. Iman, M. Pedram, C. Fabian, and J. Cong, “Finding uni-directional cuts based on
physical partitioning and logic restructuring,” in Fourth International Workshop on
Physical Design. IEEE, 1993.
[124] Altera Inc., Nios II Processor Reference Handbook, November 2006. [Online]. Available:
http://www.altera.com/literature/lit-nio2.jsp
[126] Xilinx Inc., PowerPC 405 Processor Block Reference Guide: Embedded Development Kit,
July 20, 2005. [Online]. Available: www.xilinx.com/bvdocs/userguides/ug018.pdf
[127] Xilinx Inc., MicroBlaze Processor Reference Guide: Embedded Development Kit EDK 8.2i,
June 1, 2006. [Online]. Available: www.xilinx.com/ise/embedded/mb_ref_guide.pdf
[128] Xilinx Inc., XAPP290: Two Flows for Partial Reconfiguration: Module Based or Difference
Based, September 9, 2004. [Online]. Available: www.xilinx.com/bvdocs/appnotes/
xapp290.pdf
[130] S. Jung, “Entwurf eines verfahrens und einer umgebung zur durchgängigen konfigura-
tion adaptiver on-chip multiprozessoren,” Master’s thesis, 2006.
[131] J. Reichardt and B. Schwarz, VHDL-Synthese, 3rd ed. Oldenbourg-Verlag, Dec
2003. [Online]. Available: http://users.etech.fh-hamburg.de/users/reichardt/buch.html
[132] H. Kalte, M. Porrmann, and U. Rueckert, “Rapid prototyping system für dynamisch
rekonfigurierbare hardware strukturen,” in AES 2000, 2000, pp. 149–157.
[135] M. Kaul and R. Vemuri, “Optimal temporal partitioning and synthesis for reconfig-
urable architectures,” 1998.
[137] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,”
The Bell System Technical Journal, vol. 49, no. 2, pp. 291–307, 1970.
[139] P. Kongetira, K. Aingaran, and K. Olukotun, “The hydra chip,” IEEE MICRO
Magazine, pp. 21–29, March-April 2005. [Online]. Available: http://citeseer.ist.psu.
edu/287939.html
[140] H. Kropp, C. Reuter, M. Wiege, T.-T. Do, and P. Pirsch, “An fpga-based prototyping
system for real-time verification of video processing schemes.” in FPL, 1999, pp. 333–
338.
[142] P. Kurup and T. Abbasi, Logic Synthesis Using Synopsys. Kluwer Academic Publishers,
1997.
[147] W. Lipski, Jr. and F. P. Preparata, “Finding the contour of a union of iso-oriented rect-
angles,” J. Algorithms, vol. 1, pp. 235–246, 1980, errata in 2(1981), 105; corrigendum
in 3(1982), 301–302.
[148] H. Liu and D. F. Wong, “Network flow-based circuit partitioning for time-multiplexed
FPGAs,” in IEEE/ACM International Conference on Computer-Aided Design, 1998,
pp. 497–504.
[149] H. Liu and D. F. Wong, “Circuit partitioning for dynamically reconfigurable fpgas,” in
International Symposium on Field Programmable Gate Arrays(FPGA 98). Monterey,
California: ACM/SIGDA, 1999, pp. 187 – 194.
[151] S.-M. Ludwig, “Hades - fast hardware synthesis tools and a reconfigurable coprocessor,”
Ph.D. dissertation, Swiss Federal Institute of Technology, Zürich, 1997.
[152] J. Lutz and A. Hasan, “High performance fpga based elliptic curve cryptographic co-
processor,” in ITCC ’04: Proceedings of the International Conference on Information
Technology: Coding and Computing (ITCC’04) Volume 2. Washington, DC, USA:
IEEE Computer Society, 2004, p. 486.
[155] A. Michalski and D. Buell, “A scalable architecture for rsa cryptography on large
fpgas,” in FCCM ’06: Proceedings of the 14th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM’06). Washington, DC, USA:
IEEE Computer Society, 2006, pp. 331–332.
[157] V. S. Miller, “Use of elliptic curves in cryptography,” in Lecture notes in computer sci-
ences; 218 on Advances in cryptology—CRYPTO 85. New York, NY, USA: Springer-
Verlag New York, Inc., 1986, pp. 417–426.
[158] L. Minzer, “Programmable silicon for embedded signal processing,” Embedded Systems
Programming, pp. 110–133, March 2000.
[161] A. Morse, “Control using logic-based switching,” Trends in Control, Springer, London,
1995.
[162] A. Harrington and S. S. Jones, “Software-defined radio: The
revolution of wireless communication,” Annual Review of Communications, vol. 58,
2005.
[166] Narendra and Balakrishnan, “Adaptive control using multiple models: Switching and
tuning,” Yale Workshop on Adaptive and Learning Systems, 1994.
[167] M. Nikitovic and M. Brorsson, “An adaptive chip-multiprocessor architecture for future
mobile terminals.” in CASES, 2002, pp. 43–49.
[169] K. Olukotun and L. Hammond, “The future of microprocessors,” ACM Queue, vol. 3,
no. 7, pp. 26–29, September 2005. [Online]. Available: http://doi.acm.org/10.1145/
1095408.1095418
[172] Y. Pan and M. Hamdi, “Singular value decomposition on processor arrays with a
pipelined bus system,” Journal of Network and Computer Applications, vol. 19, pp.
235–248, 1996.
[173] A. Pandey and R. Vemuri, “Combined temporal partitioning and scheduling for recon-
figurable architectures,” in Reconfigurable Technology: FPGAs for Computing and Ap-
plications, Proc. SPIE 3844, J. Schewel, P. M. Athanas, S. A. Guccione, S. Ludwig, and
J. T. McHenry, Eds. Bellingham, WA: SPIE – The International Society for Optical
Engineering, 1999, pp. 93–103.
[174] P. G. Paulin and J. P. Knight, “Force-directed scheduling for the behavioral synthesis of
asic’s.” IEEE Transactions on CAD, vol. 8, no. 6, pp. 661–679, 1989.
[175] D. Pellerin and S. Thibault, Practical FPGA Programming in C. Prentice Hall, April
2005.
[176] M. Petrov, T. Murgan, F. May, M. Vorbach, P. Zipf, and M. Glesner, “The XPP ar-
chitecture and its co-simulation within the simulink environment.” in Proceedings of
International Conference on Field-Programmable Logic and Applications (FPL), ser.
Lecture Notes in Computer Science (LNCS), vol. 3203. Antwerp, Belgium: Springer,
Aug. 2004, pp. 761–770.
[180] D. Pryor, M. R. Thistle, and N. Shirazi, “Text searching on splash 2,” in IEEE Workshop
on FPGAs for Custom Computing Machines. IEEE, 1993, pp. 172–177.
[181] K. M. G. Purna and D. Bhatia, “Temporal partitioning and scheduling data flow graphs
for reconfigurable computers,” IEEE Transactions on Computers, vol. 48, no. 6, pp.
579–590, 1999.
[183] D. Rech, “Automatische generierung und konfiguration von adaptiven on-chip symmet-
rical multiprocessing (smp) systemen,” Master’s thesis, 2006.
[185] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and
public-key cryptosystems,” Communications of the ACM, vol. 21, no. 2, pp. 120–126,
1978.
[186] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. New York,
NY, USA: McGraw-Hill, Inc., 1990.
[187] C. Rowen and S. Leibson, “Flexible architectures for engineering successful socs.” in
DAC, 2004, pp. 692–697.
[188] S. M. Scalera and J. R. Vázquez, “The design and implementation of a context switch-
ing fpga,” in IEEE Symposium on FPGAs for Custom Computing Machines. Napa
Valley, CA: IEEE Computer Society Press, April 1998, pp. 78–85.
[190] P. Sedcole, “Reconfigurable platform-based design in fpgas for video image process-
ing,” Ph.D. dissertation, Imperial College, London, January 2006.
[192] N. Shirazi, A. Walters, and P. M. Athanas, “Quantitative analysis of floating point arith-
metic on fpga based custom computing machines.” in FCCM, 1995, pp. 155–163.
[196] C. Steiger, H. Walder, and M. Platzner, “Operating systems for reconfigurable embedded
platforms: Online scheduling of real-time tasks,” IEEE Trans. Comput., vol. 53, no. 11,
pp. 1393–1407, 2004.
[197] C. Steiger, H. Walder, M. Platzner, and L. Thiele, “Online Scheduling and Placement
of Real-time Tasks to Partially Reconfigurable Devices,” in Proceedings of the 24th
International Real-Time Systems Symposium (RTSS’03), December 2003.
[198] M. A. Tahir, A. Bouridane, and F. Kurugollu, “An fpga based coprocessor for glcm and
haralick texture features and their application in prostate cancer classification,” Analog
Integr. Circuits Signal Process., vol. 43, no. 2, pp. 205–215, 2005.
[200] G. R. Gao and S. J. Thomas, “An optimal parallel jacobi-like solution method for the singular
value decomposition,” in Proc. Int. Conf. on Parallel Processing, January 1988.
[203] F. Vahid and T. Givargis, Embedded System Design: A Unified Hardware/Software In-
troduction. New York, NY, USA: John Wiley & Sons, Inc., 2001.
[206] A. van Breemen and T. de Vries, “An agent-based framework for designing multi-
controller systems,” Proc. of the Fifth International Conference on The Practical Ap-
plications of Intelligent Agents and Multi-Agent Technology, pp. 219-235, Manchester,
U.K, Apr. 2000.
[209] J. E. Volder, “The birth of cordic,” J. VLSI Signal Process. Syst., vol. 25, no. 2, pp.
101–105, 2000.
[211] H. Walder, S. Nobs, and M. Platzner, “Xf-board: A prototyping platform for reconfig-
urable hardware operating systems.” in ERSA, 2004, p. 306.
[212] H. Walder and M. Platzner, “A runtime environment for reconfigurable hardware oper-
ating systems.” in FPL, 2004, pp. 831–835.
[214] K. Weiss, T. Steckstor, C. Ötker, I. Katchan, C. Nitsch, and J. Philipp, Spyder - Virtex -
X2 User’s Manual, 1999.
[216] G. Wigley and D. Kearney, “The development of an operating system for reconfigurable
computing,” in Proceedings of the 9th IEEE Symposium Field-Programmable Custom
Computing Machines(FCCM’01). IEEE-CS Press, April 2001.
[217] W. Lehr, F. Merino, and S. E. Gillett, “Software radio: Implications for wireless
services, industry structure, and public policy,” Massachusetts Institute of Technology,
Program on Internet and Telecoms Convergence, Tech. Rep., Aug. 2002.
[220] P. Lysaght and W. Rosenstiel, Eds., New Algorithms, Architectures and Applications for Re-
configurable Computing. Berlin: Springer, 2005.
[222] T. J. Wollinger, M. Wang, J. Guajardo, and C. Paar, “How well are high-end dsps suited
for the aes algorithms? aes algorithms on the tms320c6x dsp.” in AES Candidate Con-
ference, 2000, pp. 94–105.
[223] Xilinx, “A guide to using field programmable gate arrays (fpgas) for application-specific
digital signal processing performance,” http://www.xilinx.com, 1995.
[224] Xilinx, “The role of distributed arithmetic design in fpga-based signal processing,”
http://www.xilinx.com, 2000.
[226] Xilinx Inc., “The early access partial reconfiguration lounge,” (registration required).
[Online]. Available: http://www.xilinx.com/support/prealounge/protected/index.htm
[227] Xilinx Inc., “Early access partial reconfiguration user guide,” Xilinx User Guide
UG208, Version 1.1, March 6, 2006. [Online]. Available: http://www.xilinx.com/
bvdocs/appnotes/xapp290.pdf
[229] Xilinx Inc., “Xilinx ISE 8 Software Manuals and Help - PDF Collection,” 2005.
[Online]. Available: http://toolbox.xilinx.com/docsan/xilinx8/books/manuals.pdf
[230] Xilinx Inc., “Edk platform studio documentation,” 2007. [Online]. Available:
http://www.xilinx.com/ise/embedded/edk_docs.htm
[231] H. Yang and D. F. Wong, “Efficient network flow based min-cut balanced partitioning,”
in International Conference on Computer-Aided Design, 1994.
[232] Y. Yang, Z. Abid, and W. Wang, “Two-prime rsa immune cryptosystem and its fpga im-
plementation,” in GLSVLSI ’05: Proceedings of the 15th ACM Great Lakes symposium
on VLSI. New York, NY, USA: ACM Press, 2005, pp. 164–167.
Appendix A
Hints to Labs
This chapter gives a step-by-step guide, in note form, to creating a partially reconfigurable
system. It has been developed while reconstructing the Video8 example and is intended to
give the reader hints on how to create his or her own designs. A more detailed description of
the demonstration project can be found in 5.2 on page 236. The sources of the guide can be found
on the book’s web page. In the following, the directory and entity names refer to the Video8 project.
A working knowledge of VHDL, ISE and EDK is needed to follow the instructions.
Tutorial
Creation of partially reconfigurable designs
On the Example of Video8
Version 1.0
1. Prerequisites
The following basic conditions should be met when working through this guide. In case of
deviations, the given approach might not be feasible.
ISE 8.1.01i PR8 or PR12 (Early Access Partial Reconfiguration patch for ISE)
EDK 8.1.02i
Abbreviations:
PR partially reconfigurable
PRM partially reconfigurable module
TL top-level
TLM top-level module
DIR directory
<base> root directory of the project
1.1. In video_in.vhd: additional ports have to be added to the entity declaration of
video_in so that rgbfilter can be instantiated outside this entity; the rest may
stay the same as before.
entity video_in is
  port (
    ...
LLC_CLOCK_to_filter : out std_logic;
R_out_to_filter : out std_logic_vector(0 to 9);
G_out_to_filter : out std_logic_vector(0 to 9);
B_out_to_filter : out std_logic_vector(0 to 9);
h_counter_out_to_filter : out std_logic_vector(0 to 9);
v_counter_out_to_filter : out std_logic_vector(0 to 9);
valid_out_to_filter : out std_logic;
R_in_from_filter : in std_logic_vector(0 to 9);
G_in_from_filter : in std_logic_vector(0 to 9);
B_in_from_filter : in std_logic_vector(0 to 9);
h_counter_in_from_filter : in std_logic_vector(0 to 9);
v_counter_in_from_filter : in std_logic_vector(0 to 9);
valid_in_from_filter : in std_logic
    ...
  );
end video_in;
1.2. In video_in.vhd: modules that formerly accessed the rgbfilter instance have to be
redirected to the new outward ports, e.g. simply by rewiring the corresponding signals:
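Such a rewiring might look like the following sketch. The internal signal names (R_internal and so on) are assumptions for illustration; the actual names in Video8 may differ.

```vhdl
-- Inside the architecture of video_in: route the signals that formerly fed
-- the internal rgbfilter instance to the new outward ports instead.
-- Internal signal names (R_internal, ...) are assumptions.
R_out_to_filter         <= R_internal;
G_out_to_filter         <= G_internal;
B_out_to_filter         <= B_internal;
h_counter_out_to_filter <= h_counter_internal;
v_counter_out_to_filter <= v_counter_internal;
valid_out_to_filter     <= valid_internal;
-- ...and take the filtered values back in from the new input ports:
R_filtered     <= R_in_from_filter;
G_filtered     <= G_in_from_filter;
B_filtered     <= B_in_from_filter;
valid_filtered <= valid_in_from_filter;
```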
2. Making the PRM available to EDK (not strictly necessary, but useful for testing purposes)
2.1. adapt .mpd and .pao (PCORE definition file and synthesis directives) for the
video_in module:
create .mpd from .vhd with:
psfutil -hdl2mpd video_in_pr.vhd -bus opb m \\
-lang vhdl -o VIDEO_IN_PR_v2_1_0.mpd
rename video_in to video_in_pr
2.2. Create a dedicated PCORE for rgbfilter:
copy video_in
adapt .pao:
lib RGBFILTER_v1_00_a rgbfilter
participants on every side of the bridge. This is important because the data flows from
video_in to the frame buffer situated in the RAM. (→ Memory-Mapped I/O)
2.6. Synthesis in EDK (for testing). If errors occur like
ERROR:HDLParsers:3317 - "D:/murr/edkProjects/ \\
Video8_81_PR/hdl/video_in_pr_0_wrapper.vhd" Line 10.
Library VIDEO_IN_PR_v1_00_a cannot be found.
ERROR:HDLParsers:3014 - "D:/murr/edkProjects/ \\
Video8_81_PR/hdl/video_in_pr_0_wrapper.vhd" Line 11.
Library unit VIDEO_IN_PR_v1_00_a \\
is not available in library work.
then the .pao files have not been altered correctly as stated above (e.g. not all
occurrences of video_in have been changed to video_in_pr)
2.7. add additional filters if desired (e.g. a mean-value filter)
2.8. place the frame buffer at the end of the DDR RAM (at address 0x0FE00000, i.e.
binary 0000 1111 1110 0000 0000 0000 0000 0000) to allow the Linux system to start from
0x0
→ adapt the files, since Video8 unfortunately received hard-coded addresses from its
author:
In rgbstream2framebuffer.vhd:
next_wb_counter <= ram_out(24 to 33) +
    (ram_out(34 to 43) * "00000000001010000000"); -- binary 1010000000 = 640 (line width in pixels)
changes to:
next_wb_counter <= ram_out(24 to 33) +
    (ram_out(34 to 43) * "00000000001010000000") +
    "000011111110000000000000000000"; -- word offset of the new frame buffer base
→ if that doesn’t work out, there is probably something wrong (a workaround might be com-
piling with I/O buffers to get the system_stub_bd.bmm and then again without to generate
the appropriate .ngc files)
5. Changes to the script framework: in directory synth
The example Video8 comes with a bunch of scripts to support the PRM generation. This
part gives hints on how to change those appropriately.
Directories:
mod_* - directories corresponding to the modules (sobel, sharpen, mean-value, unfiltered);
adapt them
top - directory remains unchanged
static - not needed (in this case contained in edk)
edk - copy the previously generated design here
Files:
adapt doit.cmd → descend into the new DIRs and execute xst
in these DIRs: adapt .prj and .xst (they are the same in all the DIRs)
6. Changes to the project files
6.1. top.vhd (takes the longest time to accomplish):
adapt the entity definition (ports) that are connected to the I/Os and further to the
ucf
do not forget the I, O and I/O buffers → IBUFs resp. OBUFs remain simple,
IOBUFs are extended to I, O and T (tristate) lines (e.g. for Sda of the decoder: one pin
to the outside, but *_I, *_O and *_T towards the entity system_i) → very time and
work consuming!
add/remove components that are (not) needed:
→ adapt component system ⇒ corresponds to the system from the EDK project →
can be copied from <base>/synth/edk/hdl/system.vhd (system_stub.vhd
here is only a dummy to be able to instantiate system.vhd as a sub-module in
the ISE project)
→ route the PRM signals out as ports in entity system
add/remove corresponding instances
provide intermediate signals if signals have to run directly from component in-
stances to I/Os
insert bus macros for signals coming from/going to the PRM (here: R, G and B to
and from rgbfilter, v_counter and h_counter to and from rgbfilter, and valid to
and from rgbfilter)
→ again, add intermediate signals
additional instances that are to be static or reconfigurable later on have to be
named → the definition will be done in the ucf (e.g. recon_module: rgbfilter
port map( .... ))
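A bus macro instance in top.vhd might then look like the sketch below. The component name (busmacro_xc2vp_l2r_async_narrow) and all signal names are assumptions based on the usual naming scheme of the Early Access bus macro package; check the macros actually shipped with your PR patch.

```vhdl
-- Hypothetical bus macro instance bridging the static region and the PRM.
-- Component and signal names are assumptions; the real macro names depend
-- on the bus macro package of the EAPR patch (family, direction, width).
bm_valid : busmacro_xc2vp_l2r_async_narrow
  port map (
    input0  => valid_to_filter_static,  -- driven by the static side
    output0 => valid_to_filter_prm      -- consumed inside the PRM
    -- further inputN/outputN pins of the macro omitted in this sketch
  );
```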
6.2. top.ucf:
place bus macros (PlanAhead or the FPGA Designer can be a big help here)
assign areas for static and reconfigurable modules (→ in PlanAhead this can be
done in drag-and-drop manner!)
assign which instance from top.vhd is a member of which group (static ↔ recon-
figurable)
7. The Build Process:
7.1. synthesize all files:
if not done yet: synthesize the ISE project in
<base>/synth/edk/projnav
(system_stub_bd.bmm must not be omitted!)
execute <base>/synth/synthesize_all.cmd
7.2. begin with the Early Access Partial Reconfiguration Design Flow:
→ execute buildit.cmd ⇒ the full and partial bitstreams should be built if the script
has been adapted properly.
Appendix B
Example of a User Constraints File
This appendix gives an excerpt of a User Constraints File (UCF). It has been
created for the partial reconfiguration of the Video8 sample design. The complete file can be
found on the book’s web site.
# leds
net LED_0 loc = AC4 | IOSTANDARD = LVTTL;
net LED_1 loc = AC3 | IOSTANDARD = LVTTL;
net LED_2 loc = AA6 | IOSTANDARD = LVTTL;
net LED_3 loc = AA5 | IOSTANDARD = LVTTL;
# push buttons
net button_start loc = AH2 | IOSTANDARD = LVTTL; # Right
net button_stop loc = AH1 | IOSTANDARD = LVTTL; # Left
#### Digilent general purpose expansion port (where the VDEC1 is connected)
NET "YCrCb_in*" LOC = "AA8" | IOSTANDARD = LVTTL ;
.
.
.
#### VGA
NET "BLANK_Z" LOC = A8 | DRIVE = 12 | SLEW = SLOW | IOSTANDARD = LVTTL;
NET "COMP_SYNC" LOC = G12 | DRIVE = 12 | SLEW = SLOW | IOSTANDARD = LVTTL;
NET "H_SYNC_Z" LOC = B8 | DRIVE = 12 | SLEW = SLOW | IOSTANDARD = LVTTL;
NET "V_SYNC_Z" LOC = D11 | DRIVE = 12 | SLEW = SLOW | IOSTANDARD = LVTTL;
NET "PIXEL_CLOCK" LOC = H12 | DRIVE = 12 | SLEW = SLOW | IOSTANDARD = LVTTL;
INST "red_out_DAC*
INST "green_out_DAC*
INST "blue_out_DAC*
Appendix C
Quick Part-Y Tutorial
This quick tutorial gives an introduction to the basic features of the Part-Y [71] tool devel-
oped by Florian Dittmann from the Heinz Nixdorf Institute at the University of Paderborn.
It is intended to guide a novice in Part-Y and partial reconfiguration through a first example
of a PR design. The targeted system is the Xilinx XUP Development Board with a Virtex-II
Pro FPGA.
Prerequisites:
– a running version of Part-Y (this tutorial was created with version 1.2)
– the tutorial_sources.zip file
– ISE 6.1 or higher installed (the tutorial is known to work with 7.1.04i)
Notes:
– always hit OK before changing the tab in a view - changes won’t be saved otherwise
– StdOut will be directed to Miscellaneous|Standard Output
– you might have to extend the Part-Y project by the proper definition for your FPGA.
These can be found under
de.upb.cs.roichen.party.device.concrete
– if you encounter problems, you might find Part-Y FAQ.txt useful
Proceeding:
1. Open the level configuration window and ensure that the following items from the
different views are checked:
– BehavioralModuleSelectionController
– BehavioralTopLevelController
– BehavioralDownloadController
– StructureTopAssemblyController
– StructureTreeLevelController
– GeoSystemLevelController
– MiscStorageController
– MiscGenerateBitstreamsController
– MiscOutputController
2. In Module Selection import the modules rt1.vhd, rt2.vhd and fixed.vhd from
tutorial_sources.zip. Add them to the current modules and mark each of them as
reconfigurable. Hit OK!
3. In Top import global.vhd and global2.vhd. Add global.ucf to each of the Top-Level
Designs, marking one at a time and adding the .ucf separately. OK!
4. Top Assembly: mark Top0 global and Top1 global2 one at a time and add every
module to both of them. You should see the added modules in the ’module instances
of selected top’ window.
5. If everything worked out up to now, you will see a hierarchy like the one shown in
Figure C.1 on page 351 in Structural View|Tree.
6. In Geometrical View|System: set the correct target platform and a proper bus macro
file
7. Miscellaneous|Storage: set a name for the Part-Y project and a project path. Clicking
OK will create a file hierarchy at the given spot; INTEGRATE will copy the VHDL files
and others to the correct places. Be sure not to use spaces in path names, since Part-Y
as well as ISE cannot cope with them properly. If everything went right, there should
be a bunch of subfolders in your specified directory containing copies of the VHDL files
you declared and a couple of additional files.
8. In Behavioral View|Synthesis all of the VHDL-files have to be synthesized.
9. With Miscellaneous|Bitstream Generation you can conduct the initialization phase, the
activation phase, as well as the final assembly phase by clicking on the run buttons. For
some reason the buttons won’t look pressed when they are - this might be a bug in the
GUI. You can check whether the action is taking place in the ’standard output’ window.
10. After everything has been accomplished, the bitstreams can be downloaded to the de-
vice with Behavioral|Download using iMPACT.