
RECONFIGURABLE ARCHITECTURES
Simple Programmable Logic Devices
• Programmable logic arrays (PLA) and programmable array logic (PAL)
consist of a plane of AND-gates connected to a plane of OR-gates.
• The input signals as well as their negations are connected to the
inputs of the AND gates in the AND-plane.
• The outputs of the AND-gates are used as inputs for the OR-gates in
the OR-plane, whose outputs correspond to those of the PAL/PLA.
• The connections in the two planes are programmable by the user.
• Because every Boolean function can be written as a sum of products,
any such function can be implemented in these devices: the products
are implemented in the AND-plane by connecting the wires accordingly,
and the sums of the products are realized in the OR-plane.
• While in PLAs both planes can be programmed by the user, this is
not the case in PALs, where only the AND-plane is programmable.
• The OR-plane is fixed by the manufacturer.
• Therefore, the PALs can be seen as a subclass of the PLAs.
PLDs: PLA
• AND array feeding into an OR array
• User controls both the AND and OR arrays
• The number of AND functions is independent of the number of
inputs.
• The number of OR functions is independent of both the
number of inputs and the number of AND functions.
• Implement Sum-of-Products (SoP) expressions
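• To make the SoP mapping concrete, the following is a minimal behavioural sketch in Python (an illustration only; the encoding of the planes is an assumption, not a vendor format). It uses the example functions F1 = A·C + A·B and F2 = A·B + B·C from the figures below.

```python
# Behavioural sketch of a PLA. Each AND-plane term is a list of
# (input_index, polarity) pairs; each OR-plane entry lists the
# product terms feeding one output.
def pla(inputs, and_plane, or_plane):
    products = [all(inputs[i] == pol for i, pol in term) for term in and_plane]
    return [int(any(products[t] for t in terms)) for terms in or_plane]

# F1 = A·C + A·B and F2 = A·B + B·C over inputs (A, B, C)
and_plane = [[(0, 1), (2, 1)],   # product A·C
             [(0, 1), (1, 1)],   # product A·B
             [(1, 1), (2, 1)]]   # product B·C
or_plane = [[0, 1],              # F1 = A·C + A·B
            [1, 2]]              # F2 = A·B + B·C
print(pla((1, 0, 1), and_plane, or_plane))  # -> [1, 0]
```

• In a PAL, or_plane would be fixed by the manufacturer; only and_plane would remain programmable.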
PLA Principle
PLA
PROM
PAL
PAL: The OR-connections are fixed

PAL and PLA implementations of the functions F1 = A·C + A·B and F2 = A·B + B·C
PLA: Programmable OR-connections

PAL and PLA implementations of the functions F1 = A·C + A·B and F2 = A·B + B·C
Complex Programmable Logic Device
• As stated in the previous section, PALs and PLAs are only
available in small sizes, equivalent to a few hundred logic gates.
• A CPLD consists of a set of macro cells, input/output blocks and
an interconnection network.
• The connection between the input/output blocks and the
macro cells and those between macro cells and macro cells can
be made through the programmable interconnection network.
• A macro cell typically contains several PLAs and flip-flops.
• Despite their relatively large capacity (a few hundred thousand
logic gates) compared with PLAs, CPLDs are still too small for use
in reconfigurable computing.
• They are usually used as glue logic, or to implement small
functions.
Structure of a CPLD device
Field Programmable Gate Arrays
• FPGA is a programmable device consisting of three main parts.
• A set of programmable logic cells also called logic blocks or
configurable logic blocks, a programmable interconnection
network and a set of input and output cells around the device.
• A function to be implemented in an FPGA is partitioned into
modules, each of which can be implemented in a logic block.
• The logic blocks are then connected together using the
programmable interconnection.
• All three basic components of an FPGA (logic block,
interconnection and input output) can be programmed by the
user in the field.
• FPGAs can be programmed once or several times depending on
the technology used.
Structure of an FPGA
Technology
• The technology defines how the different blocks (logic blocks,
interconnect, input/output) are physically realized.
• Basically, two major technologies exist:
• Antifuse and
• Memory-based
• Whereas the antifuse paradigm is limited to the realization of
interconnections, the memory-based paradigm is used for the
computation as well as the interconnections.
• In the memory-based category, we can list the SRAM-, EEPROM-
and Flash-based FPGAs.
Antifuse
• An antifuse is normally an open circuit
• The two-terminal elements are connected to the upper and
lower layer of the antifuse, in the middle of which a dielectric is
placed.
• In its initial state, the high resistance of the dielectric does not
allow any current to flow between the two layers.
• Applying a high voltage causes large power dissipation in a
small area, which melts the dielectric.
• This operation drastically reduces the resistance and a link can
be built, which permanently connects the two layers.
Antifuse
• The two types of antifuses actually commercialized are:
• The Programmable Low-Impedance Circuit Element (PLICE),
which is manufactured by the company Actel
• The Metal Antifuse also called ViaLink made by the company
QuickLogic.

Q-Logic ViaLink antifuse; Actel PLICE antifuse


Antifuse
• The PLICE antifuse consists of an Oxide-Nitride-Oxide (ONO) dielectric
layer sandwiched between a polysilicon and an n+ diffusion layer that
serve as conductors.
• The ViaLink antifuse is composed of a sandwich of very high resistance layer
of programmable amorphous silicon between two metal layers.
• When a programming voltage is applied, a metal-to-metal link is formed by
permanently converting the silicon to a low resistance state.
• The main advantage of the antifuse chips is their small area and their
significantly lower resistance and parasitic capacitance compared with
transistors.
• This helps to reduce the RC delays in the routing.
• However, antifuse-based FPGAs are not suitable for devices that must be
frequently reprogrammed, as is the case in reconfigurable computing.
• Antifuse FPGAs are normally programmed once by the user and do not
change anymore.
• For this reason, they are also known as one-time programmable FPGAs.
SRAM
• A static RAM (SRAM) is used to configure the logic blocks as well
as the connections.

A Xilinx SRAM cell


Configuration Bits
• Uses of SRAM in FPGA configuration: function generation, wire
connection and MUX programming.
Advantages
• The major advantage of this technology is that FPGAs can be
programmed (configured) indefinitely.
• We just need to change the values in the SRAM cells to realize
a new connection or a new function.
• Moreover, programming the device can be done in-circuit very
quickly, allowing the reconfiguration to be done on the fly.
Disadvantages
• The main disadvantage of SRAM-based FPGAs is that the chip area
required by the SRAM approach is relatively large.
• The total size of an SRAM-configuration cell plus the transistor
switch that the SRAM-cell drives is also larger than the
programming devices used in the antifuse technology.
• Furthermore, the device is volatile, i.e. the configuration of the
device stored in the SRAM-cells is lost if the power is cut off.
• Therefore, external storage or non-volatile devices such as
CPLDs, EPROM or Flash devices, are required to store the
configuration and load it into the FPGA-device at power-on.
EPROM
• Erasable programmable read only memory (EPROM) devices
are based on a floating gate.
• The device can be permanently programmed by applying a high
voltage (10–21 V) between the control gate and the drain of
the transistor.
• This causes the floating gate to be permanently and negatively
charged.
• This negative potential on the floating gate compensates the
voltage on the control gate and keeps the transistor closed.
UV-EPROM
• In an ultraviolet (UV) erasable PROM, the programming process can be
reversed by exposing the floating gate to UV light.
• This process reduces the threshold voltage and makes the
transistor function normally.
• For this purpose, the device must be removed from the system
in which it operates and plugged into a special device.
EEPROMs
• In electrically erasable and programmable ROM (EEPROMs) as
well as in flash-EPROM, the erase operation is accomplished
electrically, rather than by exposure to ultraviolet light.
• A high negative voltage must therefore be applied at the
control gate.
• This process is faster than using a UV lamp, and the chip does
not have to be removed from the system.
• In EEPROM-based devices, two or more transistors are typically
used in a ROM cell: one access and one programmed transistor.
• The programmed transistor performs the same function as the
floating gate in an EPROM, with both charge and discharge
being done electrically.
flash-EEPROMs
• In the flash-EEPROMs that are used as logic tile cells in the Actel
ProASIC chips, two transistors share the floating gate, which
stores the programming information.
• The sensing transistor is only used for writing and for verification
of the floating-gate voltage, whereas the other is used as a switch.
• This can be used to connect or disconnect routing nets to or
from the configured logic.
• The switch is also used to erase the floating gate.
EEPROM Technology

EPROM cell; Actel Flash-EEPROM


Function generators
• A reconfigurable hardware device should provide the users
with the possibility to dynamically implement and reimplement
new functions.
• This is usually done by means of function generators that can
be seen as the basic computing unit in the device.
• Two types of function generators are in use in commercial
FPGAs:
• the multiplexers and
• the look-up tables.
Multiplexer
• A 2^n:1 (2^n-input, 1-output) multiplexer (MUX) is a selector circuit
with 2^n inputs and one output.
• Its function is to allow only one input line to be fed at the output.
• The line to be fed at the output can be selected using some selector
inputs.
• To select one of the 2^n possible inputs, n selector lines are required.
• A MUX can be used to implement a given function.
• The straightforward way is to place the possible results of the
function at the 2^n inputs of the MUX and to place the function
arguments at the selector inputs.
• Several possibilities exist to implement a function in a MUX, for
instance by using some arguments as inputs and others as selectors.
2- input MUX as programmable Logic block
Example
• Figure illustrates this case for the function f = a·b.
• The argument a is used as an input, in combination with a constant
second input 0, and the second argument b is used as the selector.

Implementation of f=ab in a 2-input MUX
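• A minimal Python sketch of this construction (behavioural, names illustrative):

```python
def mux2(d0, d1, sel):
    """2:1 multiplexer: outputs d0 when sel = 0, d1 when sel = 1."""
    return d1 if sel else d0

def f(a, b):
    # f = a·b: data input 0 tied to constant 0, data input 1 tied to a,
    # argument b used as the selector (as in the figure)
    return mux2(0, a, b)

assert all(f(a, b) == (a & b) for a in (0, 1) for b in (0, 1))
```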


• A complex function can be implemented in many multiplexers
connected together.
• The function is first broken down into small pieces.
• Each piece is then implemented on a multiplexer.
• The multiplexers will then be connected to build the given
function.
Shannon expansion theorem
• The Shannon expansion theorem can be used to decompose a
function and implement it into a MUX
• This theorem states that a given Boolean logic function F(x1, …, xn) of
n variables can be written as:

F(x1, …, xn) = xi · F(x1, …, xi = 1, …, xn) + x̄i · F(x1, …, xi = 0, …, xn)

• where F(x1, …, xi = 1, …, xn) is the function obtained by replacing xi
with one, and F(x1, …, xi = 0, …, xn) the function obtained by replacing
xi with zero in F.
• The functions F1 = F(x1, …, xi = 1, …, xn) and F2 = F(x1, …, xi = 0, …, xn)
are called cofactors.
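• The theorem maps directly onto a tree of 2:1 MUXes: each expansion step becomes one MUX with xi as selector and the two cofactors as data inputs. A small recursive sketch in Python (illustrative; mux2 is the 2:1 MUX model from above):

```python
def mux2(d0, d1, sel):
    return d1 if sel else d0

def eval_by_shannon(f, args):
    """Evaluate F(*args) using only 2:1 MUXes by expanding on the first
    remaining variable: F = x·F(x=1, ...) + x̄·F(x=0, ...)."""
    if not args:
        return f()                      # constant cofactor reached
    x, rest = args[0], args[1:]
    f1 = lambda *r: f(1, *r)            # cofactor F1 (x replaced by 1)
    f2 = lambda *r: f(0, *r)            # cofactor F2 (x replaced by 0)
    return mux2(eval_by_shannon(f2, rest), eval_by_shannon(f1, rest), x)

# check against the carry function of a full adder
carry = lambda a, b, cin: (a & b) | (a & cin) | (b & cin)
assert all(eval_by_shannon(carry, (a, b, c)) == carry(a, b, c)
           for a in (0, 1) for b in (0, 1) for c in (0, 1))
```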
full adder using 4:1 MUX
The Actel ACT X Logic module
• Multiplexer are used as function generators in the Actel FPGA
devices.
• In the Actel ACT1 device, the basic computing element is the logic
module.
• It is an 8-input 1-output logic circuit that can implement a wide range
of functions.
• Besides combinatorial functions, the logic module can also
implement a variety of D-latches.
• The C-modules present in the second generation of Actel devices, the
ACT2, are similar to the logic module.
• The S-modules, which are found in the second and third generations of
Actel devices, contain an additional dedicated flip-flop.
• This avoids building flip-flops from the combinatorial logic, as is
the case in the logic module.
Logic Module

Logic Module; C-Module


S Module (Act 2)
S Module (Act 3)
Look-Up Tables
• A look-up table (LUT) is a group of memory cells that contains
all the possible results of a given function for a given set of
input values.
• The values of the function are stored in such a way that they
can be retrieved by the corresponding input values.
• An n-input LUT can be used to implement up to 2^(2^n) different
functions, since a function is defined by its output for each of the
2^n possible input combinations.
• Therefore, an n-input LUT must provide 2^n cells for storing the
possible values of an n-input function.
• In FPGAs, an LUT physically consists of a set of SRAM-cells to
store the values and a decoder that is used to access the
correct SRAM location and retrieve the result of the function,
which corresponds to the input combination.
LUT
• To implement a complex function in an LUT-based FPGA, the
function must be divided into small pieces, each of which can
be implemented in a single LUT.
• The interconnections are used to connect small pieces together
and form the complete function
LUT
• Memory Cells + MUX = LUT

SRAM Cell
Example 1: 3-input LUT
Example
Full Adder
• Implementation of a full adder in two 3-input LUTs
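• A behavioural Python sketch of this implementation (the bit-ordering convention is an assumption made for the example): each 3-input LUT is eight stored bits addressed by the inputs, one LUT holding the sum and one the carry.

```python
def make_lut(truth_table):
    """Model an n-input LUT: 2**n stored bits addressed by the inputs
    (first input = most significant address bit)."""
    def lut(*inputs):
        addr = 0
        for bit in inputs:              # decoder: input combination -> cell
            addr = (addr << 1) | bit
        return truth_table[addr]
    return lut

# Full adder built from two 3-input LUTs, inputs ordered (a, b, cin)
sum_lut   = make_lut([0, 1, 1, 0, 1, 0, 0, 1])  # a XOR b XOR cin
carry_lut = make_lut([0, 0, 0, 1, 0, 1, 1, 1])  # majority(a, b, cin)

assert (sum_lut(1, 1, 0), carry_lut(1, 1, 0)) == (0, 1)  # 1 + 1 = 10b
```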
The Xilinx Configurable Logic Block
• The basic computing block in the Xilinx FPGAs consists of an
LUT with a variable number of inputs, a set of multiplexers,
arithmetic logic and a storage element.

Basic block of the Xilinx FPGAs


Configurable Logic Block (CLB)
• Several basic computing blocks are grouped in a coarse-grained
element called the configurable logic block (CLB)
• The number of basic blocks in a CLB varies from device to
device.
• In the older devices such as the 4000 series, the Virtex and
Virtex E and the Spartan devices, two basic blocks were
available in a CLB.
• In the newer devices such as the Spartan 3, the Virtex II, the
Virtex II-Pro and the Virtex 4, the CLBs are divided into four
slices each of which contains two basic blocks.
• The CLBs in the Virtex 5 devices contain only two slices, each of
which contains four basic blocks.
CLB in the newer Xilinx FPGAs
• CLB in the Spartan, Virtex II, Virtex II Pro and Virtex 4
CLB in the newer Xilinx FPGAs
• CLB in the Virtex 5
CLB – SLICEM & SLICEL
• The left-part slices of a CLB, also called SLICEM, can be
configured either as combinatorial logic, or can be used as a
16-bit SRAM or as a shift register.
• The right-hand slices, the SLICEL, can only be configured as
combinatorial logic.

• Except for the Virtex 5, all LUTs in Xilinx devices have four
inputs and one output.
• In the Virtex 5 each LUT has six inputs and two outputs.
• The LUT can be configured either as a 6-input LUT, in which
case only one output can be used, or
• As two 5-input LUTs, in which case each of the two outputs is
used as output of a 5-input LUT.
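• The fracturable LUT can be modelled as a 64-bit table either addressed by all six inputs or split into two 32-bit halves; a minimal sketch, assuming (consistent with the description above) that the two 5-input LUTs share their five inputs:

```python
def lut6(table64, *x6):
    """6-input mode: one output from a 64-entry truth table."""
    addr = 0
    for bit in x6:
        addr = (addr << 1) | bit
    return table64[addr]

def lut5x2(table64, *x5):
    """Fractured mode: two 5-input LUTs sharing the same five inputs;
    each half of the 64-entry table provides one of the two outputs."""
    addr = 0
    for bit in x5:
        addr = (addr << 1) | bit
    return table64[addr], table64[32 + addr]
```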
The Altera Logic Array Block
• Altera’s FPGAs (Cyclone, FLEX and Stratix) are LUT-based
• In the Cyclone II as well as in the FLEX architecture, the basic
unit of logic is the logic element (LE) that typically contains an
LUT, a flip flop, a multiplexer and additional logic for carry chain
and register chain.

Logic Element in the Cyclone II


Stratix II devices
• The basic computing unit is called the adaptive logic module (ALM).
• The ALM is built from a mixture of 4-input and 3-input LUTs
that can be used to implement logic functions with a variable
number of inputs.
• This ensures backward compatibility with 4-input-based
designs, while providing the possibility to implement coarse-
grained modules with a variable number of inputs (up to 8).
• Additional modules including flip flops, adders and carry logic
are also provided.
Stratix II Adaptive Logic Module
Logic Array Block - Altera
• Altera logic cells are grouped to form coarse-grained computing
elements called logic array blocks (LAB).
• The number of logic cells per LAB varies from device to device.
• The FLEX 6000 LABs contain ten logic elements while the FLEX
8000 LAB contains only eight.
• Sixteen LEs are available in each LAB in the Cyclone II while the
Stratix II LAB contains eight ALMs.
FPGA STRUCTURES
FPGA Structures
• FPGAs consist of a set of programmable logic cells placed on
the device so as to build an array of computing resources.
• The resulting structure is vendor-dependent.
• According to the arrangement of the logic blocks and their
interconnection paradigm on the device, FPGAs can be
classified into four categories:
• Symmetrical array,
• Row-based,
• Hierarchy-based and
• Sea of gates
FPGA structures

Symmetrical Array; Sea of Gates

FPGA Structures

Row-Based; Hierarchical
Symmetrical Array: The Xilinx Virtex and
Atmel AT40K Families
• A symmetrical array-based FPGA consists of a two-dimensional
array of logic blocks immersed in a set of vertical and horizontal
lines.
• Switch elements exist at the intersections of the vertical and
horizontal lines to allow for the connections of vertical and
horizontal lines.
• Examples of FPGAs arranged as a symmetrical array are the
Xilinx Virtex and the Atmel AT40K FPGAs.
The Xilinx Virtex II
• Symmetrical array arrangement in the Xilinx Virtex II
Atmel's symmetrical array arrangement
• The Atmel AT40K FPGAs
RV: Vertical Repeater; RH: Horizontal Repeater

Core Cell
Xilinx Devices
• CLBs are embedded in the routing structure that consists of
vertical and horizontal wires.
• Each CLB element is tied to a switch matrix to access the
general routing structure.
• The switch matrix provides programmable multiplexers, which
are used to select the signals in the given routing channel that
should be connected to the CLB terminals.
• The switch matrix can also connect vertical and horizontal lines,
thus making routing possible on the FPGA.
Virtex routing resource
• CLB connection to the switch matrix
Tri-state buffer connection to horizontal lines
• Each CLB has access to two tri-state drivers (TBUF) over the
switch matrix.
• Tri-state buffers are used to drive on-chip busses.
• Each tri-state buffer has its own tri-state control pin and its own
input pin that are controlled by the logic built in the CLB.
• Four horizontal routing resources per CLB are provided for on-
chip tri-state busses.
• Each tri-state buffer has access alternately to two horizontal
lines, which can be partitioned as shown in the figure.
• Besides the switch matrix, CLBs connect to their neighbours
using dedicated fast connection tracks.
Tri-state buffer connection to horizontal lines
Routing on Atmel Chips
• The routing is done on the Atmel chips using a set of busing
planes.
• Seven busing planes are available on the AT40K.
• The figure depicts a part of the device with five identical busing
planes.
• Each plane has three bus resources:
• A local-bus resource (the middle bus) and
• Two express-bus resources (both sides).
Local connection of an Atmel Cell
Star like Connection
• Repeaters are connected to two adjacent local-bus segments
and two express bus segments.
• Local-bus segments span four cells, whereas express-bus
segments span eight cells.
• A long tri-state bus can be created by bypassing a repeater.
• Locally, the Atmel chip provides a star-like connection resource
that allows each cell (which is the basic unit of computation) to
be connected directly to all its eight neighbours.
• Figure depicts direct connections between a cell and its eight
nearest neighbours.
Row-Based FPGAs: The Actel ACT3 Family
• A row-based FPGA consists of alternating rows of logic blocks or
macro cells and channels.
• The space between the logic blocks is called channel and is
used for signal routing.
• The routing is done via the horizontal direction using the
channels.
• In the vertical direction, dedicated vertical tracks are used.
• A channel consists of several routing tracks divided into
segments.
• The minimum length of a segment is the width of a module
pair and its maximum length is the length of a complete
channel, i.e. the width of the device.
Row-based arrangement on the Actel ACT3 FPGA family
• Any segment that spans more than one-third of the row length
is considered a long horizontal segment.
• Non-dedicated horizontal routing tracks are used to route signal
nets.
• Dedicated routing tracks are used for the global clock networks
and for power and ground tracks.
Vertical Tracks
• Vertical tracks are of three types: input, output and long.
• They are also divided into more segments.
• Each segment in an input track is dedicated to the input of a
particular module, and each segment in the output track is dedicated
to the output of a particular module.
• Long segments are uncommitted and can be assigned during routing.
• Each output segment spans four channels (two above and two
below) except near the top and the bottom of the array.
• Vertical input segments span only the channel above or the channel
below.
• The tracks dedicated to module inputs are segmented by pass
transistors in each module row.
• During normal user operation, the pass transistors are inactive,
which isolates the inputs of a module from the inputs of the module
above it.
Actel’s ACT3 FPGA horizontal and vertical
routing resources
Technology
• The connections inside Actel FPGAs are established using
antifuse.
• Four types of antifuse connections exist for the ACT3:
• horizontal-to-vertical (XF) connection,
• horizontal-to-horizontal (HF) connection,
• vertical-to-vertical (FF) connection and
• fast-vertical connection
Sea-of-gates: The Actel ProASIC family
• The macro cells are arranged in a two-dimensional array
structure such that an entry in the array corresponds to the
coordinate of a given macro cell.
• The difference between the symmetrical array and the sea of
gates is that no space is left aside between the macro cells
for routing.
• The interconnection wires are fabricated on top of the cells.
• The Actel ProASIC FPGA family is an implementation of the sea-
of-gate approach.
• The ProASIC core consists of a sea-of-gates called sea-of-tiles.
• The macro cells are the EEPROM-based tiles.
The Actel ProASIC family
• The device uses a four-level hierarchy of routing resources to connect
the logic tiles:
• The local resources,
• The long-line resources,
• The very long-line resources and
• The global networks.
• The local resources allow the output of the tile to be connected to
the inputs of one of the eight surrounding tiles.
• The long-line resources provide routing for longer distances and
higher fanout connections.
• These resources, which vary in length (spanning one, two, or four
tiles), run both vertically and horizontally and cover the entire device.
• The very long lines span the entire device.
• They are used to route very long or very high-fanout nets.
Actel ProASIC local routing resources
Hierarchical-based: The Altera Cyclone, Flex
and Stratix families
• In hierarchy-based FPGAs, macro cells are hierarchically
placed on the device.
• Elements with the lowest granularity are at the lowest level
hierarchy.
• They are grouped to form the elements of the next level.
• Each element of a level i consists of a given number of
elements from level i-1.
Altera FPGAs
• Altera FPGAs (FLEX, Cyclone II and Stratix II) have two
hierarchical levels.
• The logic cells (in the Cyclone II and FLEX) and the ALM in the
Stratix II are on the lowest level of the hierarchy.
• The logic array blocks (LABs) build the higher level.
• Each LAB contains a given number of logic elements (eight for
the FLEX8000, ten for the FLEX6000, sixteen for the Cyclone II
and eight ALMs for the Stratix II).
• The LABs in turn are arranged as array on the device.
Routing Tracks
• Signal connections to and from device pin are provided via a
routing structure called FastTrack in the FLEX and MultiTrack in
Cyclone II and Stratix II.
• The FastTrack as well as the MultiTrack interconnects consist of
a series of fast, continuous row and column channels that run
the entire length and width of the device.
Hierarchical arrangement
on the Altera Stratix II FPGA
LAB connection on the Altera Stratix devices
• Signals between LEs or ALMs in the same LAB and those in the
adjacent LABs are routed via local interconnect signals.
• Each row of a LAB is served by a dedicated row interconnect,
which routes signals between LABs in the same row.
• The column interconnect routes signals between rows and routes
signals from I/O pin rows.
LAB connection on the Altera Stratix devices
• A row channel can be driven by an LE (or ALM in the Stratix II)
or by one of two column channels.
• Each column of LABs is served by a dedicated column
interconnect.
• The LEs in an LAB can drive the column interconnect, which can
then drive another row’s interconnect, to route the signals to
other LABs in the device.
• A signal from the column interconnect must be routed to the
row interconnect before it can enter an LAB.
• LEs can drive global control signals.
• This is helpful for distributing the internally generated clock,
asynchronous clear and asynchronous preset signals and high-
fan-out data signals.
Programmable I/O
• Located around the periphery of the device, I/O components
allow for the communication of a design inside the FPGA with
off-chip modules.
• Like the logic cells and the interconnections, FPGA I/Os are
programmable, which allows designs inside the FPGA to
configure a single interface pin as input, output or bidirectional.
General structure of an I/O component
The general structure of an I/O component
• It consists of an input block, an output block and an output enable
block for driving the tri-state buffer.
• Two registers that are activated either by the falling or by the rising
edge of the clock are available in each block.
• The I/Os can be parameterized for a single data rate (SDR) or a
double data rate (DDR) operation mode.
• Whereas in the SDR-mode, data are copied into the I/O registers on
the rising clock edge only, the DDR mode exploits the falling clock
edge and the rising clock edge to copy data into the I/O registers.
• On each of the input, output and tri-state paths, one of the double
data rate (DDR) registers can be used.
• The double data rate is directly accomplished by the two registers on
each path, clocked by the rising edge (or the falling edge) from
different clock nets.
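• The difference between the two modes can be sketched behaviourally in Python (an illustration, assuming an idealized pin that presents a new bit on every clock edge, starting with a rising edge):

```python
def sample(edge_values, mode="SDR"):
    """edge_values: bit seen at the pin on successive clock edges.
    SDR keeps rising-edge samples only; DDR uses both the rising- and
    the falling-edge register, doubling the bits captured per clock."""
    rising  = edge_values[0::2]         # rising-edge register
    falling = edge_values[1::2]         # falling-edge register
    if mode == "SDR":
        return rising
    return [b for pair in zip(rising, falling) for b in pair]

bits = [1, 0, 1, 1, 0, 0]
print(sample(bits, "SDR"))  # [1, 1, 0]          - one bit per clock cycle
print(sample(bits, "DDR"))  # [1, 0, 1, 1, 0, 0] - two bits per clock cycle
```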
I/O Element
• DDR input can be done using both input registers whereas DDR
output will use both output registers.
• In the Altera Stratix II, the I/O component is called I/O element
(IOE).
• The IOEs are located in I/O blocks around the periphery of the
device.
• The I/O blocks, which contain up to four IOEs, are used to drive
the rows and columns interconnects.
• They are divided into two groups: the row I/O blocks, which drive
row, column or direct-link interconnects, and the column I/O blocks,
which drive column interconnects.
I/O Block
• The Xilinx Virtex I/O components are called I/O block (IOB), and
they are provided in groups of two or four on the device
boundary.
• The IOBs can be used independently of each other as input
and/or output, or they can be combined in groups of two to be
used as differential pairs directly connected to a switch matrix.
Hybrid FPGAs
• The process technology as well as the market demand is
pushing manufacturers to include more and more pre-designed
and well-tested hard macros in their chips.
• Resources, such as memory, that are used in almost all designs
can be found directly on the chip.
• This allows the designer to use well-tested and efficient
modules.
• Moreover, hard macros are more efficiently implemented and
are faster than macros implemented on the universal function
generators.
• The resources often available on hybrid FPGAs are RAMs, clock
managers, arithmetic modules, network interface modules and
processors.
Hybrid FPGAs
• Because the resources required by users vary with the application
class, some manufacturers, such as Xilinx, provide different classes of
FPGAs, each of which is suited for a given purpose.
• The main classes as defined by Xilinx are system on chip (SoC),
digital signal processing (DSP) and pure logic.
• The system on chip class to which the Virtex 4 FX belongs is
characterized by embedded processors directly available on the chip,
memory and dedicated bus resources.
• Reference designs also exist, which efficiently use the chip resource
to get the maximum performance.
• The DSP class, which contains the Virtex 4 SX, is characterized by an
abundance of multiplier macros, and the pure logic class, to which
the Virtex 4 LX belongs, is dominated by LUT function generators.
Xilinx Virtex II Pro FPGA with two PowerPC
405 Processor blocks
Xilinx Virtex II Pro
• The Xilinx Virtex II Pro contains up to four embedded IBM
Power PC 405 RISC hard core processors, which can be clocked
at more than 300 MHz.
• Embedded high-speed serial RocketIO transceivers with up to
3.125 Gb/s per channel
• Internal BlockRAM memory module to be used as dual-ported
RAM
• Embedded 18 × 18-bit multiplier blocks
• Digital clock manager (DCM) to provide self-calibrating, fully
digital solution for clock distribution delay compensation, clock
multiplication and division, and coarse-and fine-grained clock
phase shifting are available on the chip.
Coarse-Grained Reconfigurable Devices
• FPGAs allow for programming any kind of function as far as this can fit onto the
device.
• This is only possible because of the low granularity of the function generators
(LUT and MUX).
• However, the programmable interconnections used to connect the logic blocks
reduce the performance of FPGAs.
• A way to overcome this is to provide frequently used modules as hard macros,
as is the case in hybrid FPGAs, and therefore to allow programmable
interconnections only between processing elements available as hard macros on
the chip.
• Coarse-grained reconfigurable devices follow this approach.
• In general, those devices are built from a set of hard macros (an 8-bit, 16-bit
or even a 32-bit ALU), usually called processing elements (PEs).
• The PEs are able to carry out a few operations such as addition, subtraction or
even multiplication.
• The interconnection is realized either through switching matrices or dedicated
busses.
• The configuration is done by defining the operation mode of the PEs and
programming the interconnection between the processing elements.
coarse-grained reconfigurable devices Types
• A wide variety of coarse-grained reconfigurable devices exists; we
classify them into three categories:
• The dataflow machines, in which functions are usually built by
connecting some PEs to form a functional unit that is used to
compute on a stream of data.
• The network-based devices, in which the connection between
the PEs is done using messages instead of wires.
• The embedded FPGA devices, which consist of a processor core
that cohabits with programmable logic on the same chip.
Dataflow Machines
• Dataflow machines are the most dominating coarse-grained
reconfigurable devices.
• Three of those architectures are presented:
• The PACT-XPP,
• The NEC-DRP and
• The PicoChip devices
The PACT XPP device
• The idea behind the PACT XPP architecture is to efficiently
compute streams of data provided from different sources, such
as A/D converters, rather than single instructions as is the
case in Von Neumann computers.
• Because the computation should be done while data are
streaming through the processing elements, it is suitable to
configure the PEs to adapt to the natural computation
paradigm of a given application or part of it at a given time.
PACT
• The eXtreme Processing Platform (XPP) architecture of PACT
consists of:
• An array of processing array elements (PAE) grouped in
processing array (PA)
• A communication network
• A hierarchical configuration tree
• Memory elements aside the PAs
• A set of I/O elements on each side of the device.
processing array cluster
• One configuration manager (CM) attached to a local memory is
responsible for writing configuration onto a PA.
• The configuration manager together with PA build the
processing array cluster (PAC).
• An XPP chip contains many PACs arranged as grid array on the
device.
Structure of the PACT XPP device
Root CM
• An XPP device with four PACs, each of which contains 4 PAEs
and is surrounded by memory blocks.
• The CMs at a lower level are controlled by a CM at the next
higher level.
• The root CM at the highest level is attached to an external
configuration memory and supervises the whole device
configuration.
The Processing Array Element (PAE)
• There exist two different kinds of PAEs:
• The ALU PAE and the RAM-PAE.
• An ALU-PAE contains an ALU that can be configured to perform basic
arithmetic operations
• The RAM-PAE is used for storing data.
• The back-register (BREG) provides routing channels for data and
events from bottom to top, additional arithmetic and register
functions whereas the forward-register (FREG) is used for routing the
signals from top to bottom and for the control of dataflow using
event signals.
• All objects can be connected to horizontal routing channels using
switch objects.
• Dataflow register (DF-Registers) can be used at the object output for
data buffering in case of a pipeline stall.
• Input registers can be pre-loaded by configuration data and always
provide single cycle stall.
RAM-PAE
• A RAM-PAE is similar to an ALU-PAE.
• Instead of an ALU, a dual ported RAM is used for storing data.
• The RAM generates a data packet after an address was
received at the input.
• Writing to the RAM requires two data packets:
• One for the address and the other for the data to be written.
The XPP ALU Processing Array Element
• The structure of the RAM-PAE is similar.
Routing and Communication
• The XPP interconnection network consists of two independent
networks:
• One for data transmission and the other for event transmission.
• These two networks consist of horizontal and vertical channels.
• The vertical channels are controlled by the BREG and FREG whereas
connection to horizontal channel is done via switch elements.
• Besides the horizontal and vertical channels a configuration bus
exists, which allows the CMs to configure the PAEs.
• Horizontal buses are used to connect a PAE within a row whereas the
vertical buses are used to connect objects to a given horizontal bus.
• Vertical connections are done using configurable switch objects that
segment the vertical communication channels.
• The vertical routing is enabled using register-objects integrated into
the PAEs.
Interfaces
• XPP devices provide communication interfaces aside the chip.
• The XPP64-A1, for example, contains six external interfaces: four
identical general-purpose I/O interfaces at the chip corners
(bottom left, upper left, bottom right and upper right), one
configuration manager interface and a JTAG-compliant interface for
debugging and testing purposes.
• The I/O interfaces can operate independently from each other either
in RAM, or in streaming mode.
• In streaming mode, each I/O element provides two bidirectional
ports for data streaming.
• Handshake signals are used for synchronization of data packets to
external ports.
• In RAM mode, each port can access external synchronous SRAMs
with 24-bit addresses and 24-bit data.
• Control signals for the SRAM transactions are available such that no
extra logic is required.
Configuration Manager Interfaces
• The configuration manager interface consists of three
subgroups of signals:
• code, message send and message receive.
• The code group provides channels over which configuration
data can be downloaded to the device whereas the send and
receive groups provide communication channels with a host
processor.
The NEC DRP Architecture
• We now present the NEC dynamically reconfigurable processor
(DRP), which operates in a similar way as the PACT.
• The reconfiguration control is not done in a hierarchical fashion such
as in the PACT devices.
• Figure shows the overall structure of a NEC DRP.
• It consists of:
• An array of byte-oriented processing elements
• A programmable interconnection network to connect the processing
elements
• A sequencer (State Transition Controller) which can be programmed
as finite state machine to control the dynamic reconfiguration
process on the device
• Memory blocks to store configuration and computation data.
• The memory blocks are arranged around the device, together with
various interfaces and RAM controllers such as PCI, PLL and
SDRAM/SRAM controllers.
The DRP Processing Element
DRP Processing Element
• A DRP processing element contains an ALU for byte-oriented
arithmetic and logic operations
• A data management unit to handle byte selects, shifts, masks and
constant generation.
• The operand can be fetched from the PE’s register file or collected
directly from the PE’s inputs.
• The results can be stored in the PE’s register file or can be sent to the
output.
• The configuration is done by the state transition controller, which
sets a pointer to a corresponding instruction register according to the
operation mode of the ALU.
• Having many configuration registers allows for storing configuration
data directly in the proximity of the PEs and allows fast switching
from one configuration to the next.
The picoChip Reconfigurable Device
• A picoChip consists of a picoArray core
• A set of different interfaces for external connection of various modules such
as memory and processors.
• The picoArray core has a structure similar to those of the NEC DRP and the
PACT XPP.
• The connection between the PEs is done at the column and row
intersections.
• The PC102 chip, for example, contains hundreds of array elements (AEs),
each a versatile 16-bit processor with local data memory, connected by
a programmable interconnect structure.
• The architecture is heterogeneous, with four types of processing element all
having a common basic structure, but optimized for different tasks:
• The standard AE (STAN),
• The control AE (CTRL),
• The Memory AE (MEM) and
• The function accelerator unit (FAU).
• A standard AE includes a multiply-accumulate peripheral as well as
special instructions optimized for CDMA operations.
• The memory AE contains a multiply unit and additional memory.
• The function accelerator unit is a coprocessor optimized for specific signal-
processing tasks.
• The control AE is equipped with a multiply unit and larger amounts of data
and instruction memory, optimized for the implementation of base-station
control functionality.
• Multiple elements can be programmed together as a group to perform
a particular function.
• The device can be reconfigured at run-time to run different applications
such as wireless protocols.
• Several interfaces are available for seamless external connection:
• Each chip has four inter-processor communication links that can be used to
build an array of several picoChips to implement larger and more complex
functions that cannot fit in only one picoChip.
Structure of the picoChip device
• The communication on those links is based on a time division
multiplexing (TDM) protocol scheme.
• Besides the inter-processor communication links, a microprocessor
interface is available for connecting an external processor that
can be used to configure the device and stream data into it.
• External storage (EEPROM, FLASH) can also be connected to
this interface to allow self-reconfiguration of the device at
start-up.
• Other interfaces are provided, among which a memory
interface for the connection of external memory and a JTAG
interface for debugging purposes.
Network-oriented architectures
• Although interest in networks-on-chip as a communication
paradigm has grown recently, very few reconfigurable devices
rely on message passing for data exchange among the PEs.
• One of the few companies to have implemented this concept is
Quicksilver Tech.
The Quicksilver ACM Architecture
• The adaptive computing machine (ACM) is based on a revolutionary
network on chip paradigm and is one of the very few devices that
work on such a principle.
• The Quicksilver ACM consists of a set of heterogeneous computing
nodes hierarchically arranged on a device.
• At the lowest level, four computing nodes are placed in a cluster and
connected locally together.
• Many clusters at a given level are put together to build bigger
clusters at the next higher level.
• An ACM chip consists of the following elements:
• A set of heterogeneous processing nodes (PN)
• A homogeneous matrix interconnect network (MIN)
• A system controller
• Various I/O interfaces.
The Quicksilver ACM hierarchical structure
with 64 nodes
The ACM Processing Node structure
• An ACM processing node consists of:
• An algorithmic engine that defines the node type.
• The node type can be customized at compile-time or at run-time by
the user to match a given algorithm.
• Four types of nodes exist:
• The programmable scalar node (PSN) provides a standard 32-bit
RISC architecture with 32 general-purpose registers.
• The adaptive execution node (AXN) provides variable word-size
multiply-accumulate (MAC) and ALU operations.
• The domain bit manipulation (DBN) node provides bit-manipulation
and byte-oriented operations.
• The external memory controller node provides DDR RAM, SRAM,
memory random access and DMA control interfaces for off-chip
memory access.
• The node memory for data storage at node level
• A node wrapper that hides the complexity of the network
architecture.
• It contains an MIN interface to support communication, a
hardware task manager for task managements at node level
and a DMA engine.
• The wrapper envelops the algorithmic engine and presents an
identical interface to the neighbouring nodes.
• It also incorporates dedicated I/O circuitry, memory, memory
controllers and data distributors and aggregators.
The ACM node structure
The ACM QS2412 Resources
The Matrix Interconnect Network (MIN)
• The communication inside an ACM chip is done via an MIN,
which is organized hierarchically.
• At a given level, the MIN connects many lower level MINs.
• The top level MIN, the MIN root, is used to access the nodes
from outside and to control the configuration of the nodes.
• The communication among nodes is done via the MIN with the
help of the node wrapper.
• The MIN provides a diversity of services such as point-to-point
dataflow streaming, real-time broadcasting, direct memory
access and random memory access.
• The ACM chip also contains various I/O interfaces accessible via
the MIN for testing (JTAG) and communication with off-chip
devices
System Controller
• The system management is done via an embedded system
controller.
• The system controller loads tasks into the node’s ready-to-run
queue for execution and statically or dynamically sets the
communication channels between the processing nodes.
• Any node can be adapted (reconfigured) at run-time by the
system controller to perform a new function, in a cycle-by-cycle
manner.
Embedded FPGA
• Embedded programmable logic devices, also known under the
name embedded FPGA, usually integrate a processor core, a
programmable logic or FPGA and memory on the same chip.
• Strictly seen, several existing FPGAs, such as the Xilinx Virtex 4
FX, fall under this category, as they provide the same elements
as any other embedded FPGA device.
• However, the processor in the Xilinx Virtex 4 FPGAs is immersed
in the programmable logic, whereas it is strictly decoupled
from the logic in common embedded FPGAs.
• Two examples of such devices are the DAP/DNA from IPflex and
the S500 series from Stretch.
The IPflex DAP/DNA Reconfigurable
Processor
• The DAP/DNA consists of an integrated DAP RISC processor, a distributed
network architecture (DNA) matrix and some interfaces.
• The processor controls the system, configures the DNA, performs computations
in parallel to the DNA and manages the data exchange on the device.
• The DNA matrix is a dataflow accelerator with more than a hundred
dynamically reconfigurable processing elements.
• The wiring among elements can be changed dynamically, thus providing
the possibility to build and quickly change parallel/pipelined processing
systems tailored to each application.
• The DNA configuration data are stored in the configuration memory, from
where they can be downloaded to the DNA on a clock-by-clock basis.
• Like in other reconfigurable devices, several interfaces exist for connecting the
chip to external devices.
• Large applications can be partitioned and sequentially executed on a DAP/DNA
chip.
• While the processor controls the whole execution process, the DNA executes
critical parts of the application.
IPflex DAP/DNA reconfigurable processor
The Stretch Processor
• The second example of embedded FPGA that we present is the
S5000 series from Stretch.
• The device consists of a 32-bit Xtensa RISC processor from
Tensilica, operating at 300 MHz and featuring a single precision
floating point unit, an instruction set extension fabric (ISEF),
embedded memory and a set of peripheral control modules
The Stretch 5530 configurable processor
• Like the DAP/DNA, the processor controls the whole system and configures
the ISEF.
• The ISEF is used to augment the processor capacity by implementing
additional instructions that are directly accessible by a program running on
the processor.
• The S5 device uses thirty-two 128-bit wide registers coupled with 128-bit
wide access to memory to feed data to the ISEF.
• Several interfaces, connected to a peripheral bus exist for communication
with the external world.
• The S5530 peripherals are divided into three categories:
• The low-speed peripherals, which include UART, IrDA and the serial
peripheral interface (SPI).
• The mid-speed peripherals, which include the parallel generic interface
bus (GIB) for Flash and SRAM connection.
• The high-speed peripherals, which include a 64-bit PCI/PCI-X port and
one 64-bit SDRAM port.
The MATRIX Architecture
• The multiple ALU architecture with reconfigurable interconnect experiment (MATRIX)
[159] comprises an array of identical basic functional units (BFUs).
• Each BFU contains an 8-bit ALU, 256 words of 8-bit memory and control logic.
• The ALU features the standard set of arithmetic and logic functions and a multiplier.
• A configurable carry-chain between adjacent ALUs can be used to cascade ALUs for wide-
word operations.
• The control logic can generate local control signals from ALU output by a pattern matcher.
• A reduction network can be employed for control generated from neighbouring data.
• Finally, a 20-input, 8-output NOR block may be used as half of a PLA to produce control
signals.
• According to these features, a BFU can serve as an instruction memory, a data memory, a
register-file-ALU combination or an independent ALU function.
• Instructions can be routed over the array to several ALUs.
• The routing fabric provides three levels of 8-bit buses:
• Eight nearest neighbour and four second nearest neighbour connections of length four,
and global lines spanning an entire row or column.
RAW: Reconfigurable Architecture
Workstation
• The idea of RAW [210] is to provide a simple and highly parallel computing architecture composed of
several repeated tiles connected to each other by nearest neighbor connections.
• The tiles comprise computation facilities as well as memory, thus implementing a distributed memory
model.
• A RAW microprocessor is a homogeneous array of processing elements called tiles.
• The prototype chip features 16 tiles arranged in a 4 × 4 array.
• Each tile comprises a simple RISC-like processor consisting of ALU, register file and program counter,
SRAM-based instruction and data memories, and a programmable switch supplying point-to-point
connections to the nearest neighbours.
• The CPU in each tile of the prototype is a modified 32-bit MIPS R2000 processor with an extended 6
stage pipeline, a floating point unit and a register file of 32 general purpose and 16 floating point
registers.
• Both the data memory and the instruction memory consist of 32 kilobytes of SRAM. While the
instruction memory is uncached, the data memory can operate in cached and uncached modes.
• Interconnection of the RAW architecture is done over nearest-neighbour connections optimized
for single data word transfers.
• Communication between tiles is pipelined over these connections and appears at register level
between processors, making it different from multiprocessor systems.
• A programmable switch on each tile connects the four nearest neighbour links to each other and to
the processor.
• The RAW architecture provides both a static and a dynamic network with wormhole routing for the
forwarding of data.
REMARC: Reconfigurable Multimedia Array
Coprocessor
• REMARC [160] is a reconfigurable coprocessor that is tightly coupled to a main RISC processor.
• It consists of an 8 × 8 array of 16-bit programmable logic units called nanoprocessors, which is
attached to a global control unit.
• The control unit manages data transfers between the main processor and the reconfigurable array
and controls the execution on the nanoprocessors.
• It comprises an instruction RAM with 1024 entries, 64-bit data registers and four control registers.
• The nanoprocessor consists of a 32-entry local instruction RAM, an ALU with 16-bit datapath, a data
RAM with 16 entries, an instruction register, eight data registers, four 16-bit data input registers and a
16-bit data output register.
• The ALUs can execute 30 instructions, including addition, subtraction, logical operations and shift
instructions, as well as some operations often found in multimedia applications such as minimum,
maximum, average, absolute value and add.
• Each ALU can use data from the data output registers of the adjacent processors via nearest neighbor
connect, from a data register, a data input register, or from immediate values as operands.
• The result of the ALU operation is stored in the data output register.
• The communication lines consist of nearest neighbour connections between adjacent nanoprocessors
and additional horizontal and vertical buses in each row and column.
• The nearest neighbour connections allow the data in the data output register of a nanoprocessor to
be sent to any of its four adjacent neighbors.
• The horizontal and vertical buses have double width (32-bit) and allow data from a data output
register to be broadcasted to processors in the same row or column, respectively.
• Furthermore, the buses can be used to transfer data between processors, which are not adjacent to
each other.
MorphoSys
• The complete MorphoSys [194] chip comprises a control processor (TinyRISC), a frame
buffer (data buffer), a DMA controller, a context memory (configuration memory) and an
array of 64 reconfigurable cells (RC).
• Each RC comprises an ALU-multiplier, a shift unit, and two multiplexers at the RC inputs.
• Each RC also has an output register, a feedback register and a register file.
• A context word, loaded from the configuration memory and stored in the context register,
defines the functionality of the RC.
• Besides standard logic/arithmetic functions, the ALU has other functions such as
computation of absolute value of the difference of two operands and a single cycle
multiply accumulate operation.
• There are a total of 25 ALU functions.
• The RC interconnection network features three layers.
• In the first layer, all cells are connected to their four nearest neighbours.
• In the second layer, each cell can access data from any other cell in the same row or
column of the same array quadrant.
• The third layer of hierarchy consists of buses spanning the whole array and allowing
transfer of data from a cell in a row or column of a quadrant to any other cell in the same
row or column in the adjacent quadrant.
• In addition, two horizontal 128-bit buses connect the array to the frame buffer.
NISC Processor: No-Instruction-Set Computer
• The NISC [92] processor consists of a Controller and a Datapath on which any C program
can be executed.
• The datapath consists of a set of storage elements (registers, register files, memories),
functional units (ALUs, multipliers, shifters, custom functions) and a set of busses.
• Each component may take one or more clock cycles to execute, each component may be
pipelined and each component may have input or output latches or registers.
• The entire Datapath can be pipelined in several stages in addition to components being
pipelined themselves.
• The Controller defines the state of the processor and issues the control signals for the
Datapath.
• The Controller can be fixed or programmable, whereas the Datapath can be
reprogrammable and reconfigurable.
• Reprogrammable means that the Datapath can be extended or reduced by adding or
omitting some components, while reconfigurable means that the Datapath can be
reconnected with the same components.
• To speed up the NISC pipeline, a control register (CR) and a status register (SR) have been
inserted between the Controller and the Datapath.
Virtual Pipelines: The PipeRench
• Pipeline reconfiguration was proposed by Goldstein et al. [98] as a method of
virtualizing pipelined hardware application designs.
• The single static reconfiguration is broken into pieces that correspond to
pipeline stages in the application.
• The resulting configurations are then loaded into the device on a cycle-by-cycle
basis.
• With pipeline virtualization, an application is implemented as a pipeline on a
given amount of virtual resources.
• The virtual resources are then mapped to the physical resources in a final step.
• The physical resources are computing modules whose functionality can be
changed by reconfiguration.
• The process is somewhat similar to the implementation on parallel machines,
where virtual processors are first used to easily capture the inherent parallel
structure of an application.
• The virtual processors are then mapped to the physical ones in the next step.
• Virtualization gives the designer the freedom to better explore the parallelism
in the implementation, without facing the resource constraints of a given
platform.
Pipeline Reconfiguration: Mapping of a 5-stage virtual pipeline onto a
3-stage physical fabric
• The figure illustrates the virtualization process: the mapping of a
five-stage virtual pipeline onto a three-stage physical fabric.
• On the top part of this figure, we see the five-stage application
and the state of each of the stages of the pipeline in the five
consecutive cycles.
• The bottom half of the figure shows the mapping of the virtual
blocks on the physical modules of the device.
• Because the configuration of each single unit in the pipeline is
independent from the others, the reconfiguration process can
be broken down into a cycle-by-cycle configuration.
• In this way, part of the pipeline can be reconfigured while the
rest is computing.
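• A toy Python simulation of this schedule (purely illustrative: one physical stage is reconfigured per cycle, round-robin, while the others compute; the real PipeRench controller additionally manages pipeline state):

```python
V, P = 5, 3              # virtual pipeline stages, physical stages
loaded = [None] * P      # virtual stage currently configured in each physical stage
for t in range(8):
    loaded[t % P] = t % V                    # reconfigure one stage this cycle
    computing = [(p, v) for p, v in enumerate(loaded)
                 if v is not None and p != t % P]
    print(f"cycle {t}: phys {t % P} <- virt {t % V}, computing {computing}")
```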
• Based on the pipeline reconfiguration concept, a class of
reconfigurable devices called PipeRench was proposed in [98] as co-
processor in multimedia applications.
• A PipeRench device consists of a set of physical pipeline stages also
called stripes.
• A stripe is composed of interconnect and processing elements (PE),
each of which contains registers and LUT-based ALUs.
• The PEs access operands from registered outputs of the previous
stripe as well as registered or unregistered outputs of the other PEs
in the stripe.
• All PipeRench devices have four global busses, two of which are
dedicated to storing and restoring stripe state during hardware
virtualization (configuration).
• The other two are used for input and output.
RaPiD
• The last work that we consider in this section is that of Ebeling et al., whose goal was
to overcome the handicaps of FPGAs.
• A one-dimensional coarse-grained reconfigurable architecture called RaPiD, an acronym
for reconfigurable pipelined datapath is proposed for this purpose.
• The structure of RaPiD datapaths resembles that of systolic arrays.
• The structure is made upon linear arrays of functional units communicating in mostly a
nearest neighbour fashion.
• This can be used for example, to construct a hardware module, which comprises different
computations at different stages and at different times resulting in a linear array of
functional units that can be configured to form a linear computational pipeline.
• The resulting array of functional units is divided into identical cells that are replicated to
form a complete array.
• A RaPiD cell consists of an integer multiplier, two integer ALUs, six general-purpose
registers and three small local memories.
• Interconnections among the functional units are realized using a set of ten segmented
busses that run the length of the datapath.
• Many of the registers in a pipelined computation can be implemented using the bus
pipeline registers.
• Functional unit outputs are registered.
• However, the output registers can be bypassed via configuration control.
• Functional units may additionally be pipelined internally depending on their
complexity.
• The control of the datapath is done using two types of signals:
• The static control signals, which are defined by the configuration memory as in
ordinary FPGAs, and
• The dynamic control signals, which must be provided on every cycle.
• To program an application on the RaPiD, a mapping of functional blocks to
computing elements of the datapath must be done, which results in the
generation of a static programming bitstream.
• This bitstream is used to construct the pipeline, and the dynamic programming
bits are used to schedule the operations of the computation onto the
datapath over time.
• A controller is programmed to generate the dynamic information needed to
produce the dynamic programming bits.
Capacity / Chip Size
• The last point we deal with in the architecture is the definition of a measure of comparison between
the RPUs.
• The most used measure of comparison is the device capacity, which is usually provided by the
manufacturer as the number of gates that can be used to build a function into the device.
• On fine-grained reconfigurable devices, the number of gates is often used as unit of measure for the
capacity of an RPU.
• A gate equivalent corresponds to a two-input NAND gate, i.e. a circuit that performs the function
F = (A · B)′.
• The gate density defines the number of gates per unit area.
• Besides the capacity of an RPU, others factors such as the number of pins and the device speed may
also play a big role in the comparison.
• Coarse-grained reconfigurable devices do not have any established measure of comparison.
• However, the number of PEs on a device as well as their characteristics such as the granularity, the
speed of the communication link, the amount of memory on the device and the variety of peripherals
may provide an indication on the quality of the device.
• Although the capacity of the device defines the “amount of parallelism” that can be implemented in
the device, the speed gives an indication of the throughput of data in the device.
• Designers should however bear in mind that the real speed and real size that a design can achieve
depends on the implementation style and the compilers used to produce the designs.
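• As a concrete illustration of gate-equivalent bookkeeping, a small Python sketch follows; the per-primitive costs are rough textbook-style approximations assumed for the example, not vendor figures.

```python
# 1 gate equivalent (GE) = one two-input NAND; the other costs below are
# assumed approximations for illustration only.
GE = {"inv": 1, "nand2": 1, "and2": 2, "or2": 2, "xor2": 4, "dff": 6}

def capacity_used(netlist):
    """Total GE count of a design given as {primitive: count}."""
    return sum(GE[p] * n for p, n in netlist.items())

# Full adder: sum = a XOR b XOR cin; carry = a·b + cin·(a XOR b)
print(capacity_used({"xor2": 2, "and2": 2, "or2": 1}))  # -> 14 GE
```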
Reference
• Chapter 2, Introduction to Reconfigurable Computing: Architectures,
Algorithms, and Applications, by Christophe Bobda.
