RaPiD - Reconfigurable Pipelined Datapath
Carl Ebeling, Darren C. Cronquist, and Paul Franklin
Department of Computer Science and Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
1 Introduction
Configurable computing promises to deliver the high performance required by
computationally demanding applications while providing the flexibility and adapt-
ability of programmed processors. As such, configurable computing platforms lie
somewhere between ASIC solutions, which provide the highest performance/cost
at the expense of flexibility and adaptability, and programmable processors,
which provide the greatest flexibility at the expense of performance/cost. Un-
fortunately the promise of configurable computing has yet to be realized in spite
of some very successful examples[1, 5]. There are two main reasons for this.
First, configurable computing platforms are currently implemented using com-
mercial FPGAs, which are very efficient for implementing random logic functions
but much less so for general arithmetic functions. Building a multiplier using an
FPGA incurs a performance/cost penalty of at least 100. Second, current config-
urable platforms are extremely hard to program[5, 6]. Taking an application from
concept to a high-performance implementation is a time-consuming, designer-
intensive task. The dream of automatic compilation from high-level specification
to a fast and efficient implementation is still unattainable.
* This paper appeared in FPL '96: The 6th International Workshop on Field-
Programmable Logic and Applications, pages 126-135. Springer-Verlag, 1996.
† This work was supported in part by the Defense Advanced Research Projects Agency
under Contract DAAH04-94-G0272. D. Cronquist was supported in part by an IBM
fellowship. P. Franklin was supported by an NSF fellowship.
The RaPiD architecture takes aim at these two problems in the context
of computationally demanding tasks such as those found in signal processing
applications. RaPiD is a coarse-grained FPGA architecture that allows deeply
pipelined computational datapaths to be constructed dynamically from a mix
of ALUs, multipliers, registers and local memories. The goal of RaPiD is to
compile regular computations like those found in DSP applications into both
an application-specific datapath and the program for controlling that datapath.
The datapath is controlled using a combination of static and dynamic control
signals. The static control determines the underlying structure of the datapath
that remains constant for a particular application. The dynamic control signals
can change from cycle to cycle and specify the variable operations performed
and the data to be used by those operations. The static control signals are
generated by static RAM cells that are changed only between applications while
the dynamic control is provided by a control program.
The structure of the datapaths constructed in RaPiD is biased strongly to-
wards linear arrays of functional units communicating in mostly a nearest neigh-
bor fashion. Systolic arrays[2], for example, map very well into RaPiD datapaths,
which allows the considerable amount of research on compiling to systolic ar-
rays to be applied to compiling computations to RaPiD[4, 3]. RaPiD is not
limited to implementing systolic arrays, however. For example, a pipeline can be
constructed which comprises different computations at different stages and at
different times.
The computational bandwidth provided by a RaPiD array is extremely high
and scales with the size of the array. The input and output data bandwidth,
however, is limited to the data memory bandwidth which does not scale. Thus
the amount of computation performed per I/O operation bounds the amount of
parallelism and thus the speedup an application can exhibit when implemented
using RaPiD. The RaPiD architecture assumes that at most three memory ac-
cesses are made per cycle. Providing even this much bandwidth requires a very
high-performance memory architecture.
RaPiD is also not suited for tasks that are unstructured, not highly repetitive,
or whose control flow depends strongly on the data. The assumption is that
RaPiD will be integrated closely with a RISC engine on the same chip. The
RISC would control the overall computational flow, farming out to RaPiD the
heavy-duty computations that require brute force.
The concept of RaPiD can in theory be extended to 2-D arrays of functional
units. However, dynamically configuring 2-D arrays is much more difficult, and
the underlying communication structure is much more costly. Since most 2-D
computations can be computed efficiently using a linear array, RaPiD is currently
restricted to linear arrays.
The paper begins with a description of the datapath architecture and how
computations are configured. This is followed by a description of the way dy-
namic control signals are generated. Next, a FIR filter example is used to illus-
trate how computations are mapped to the RaPiD architecture. The paper ends
with a discussion of the performance of RaPiD-I and future work.
2 RaPiD Architecture
This section describes the version of the RaPiD architecture, called RaPiD-I,
which is currently being implemented at the University of Washington. Variants
of this architecture with a different data width and data format, different func-
tional units, different number and configuration of busses, and so on, could be
defined for different application domains. The RaPiD-I architecture contains all
the salient features of RaPiD and will allow us to describe RaPiD computations
for a variety of applications.
[Figure 1: a cell's RAMs, ALUs, multiplier, and registers arranged over segmented busses, with bus connectors between adjacent bus segments.]
Fig. 1. The basic cell of RaPiD-I. This cell is replicated left to right to form a
complete RaPiD array.
RaPiD-I is a linear array of functional units which can be configured to form
a (mostly) linear computational pipeline. This array of functional units is divided
into identical cells which are replicated to form a complete array. One cell for
RaPiD-I is shown in Figure 1. This cell comprises an integer multiplier, two
integer ALUs, six general-purpose registers and three small local memories. The
complete RaPiD-I array contains 16 of these cells. Although the array is divided
into cells, this division is invisible when it comes to mapping an application to
the functional units and busses.
The functional units are interconnected using a set of ten segmented busses
that run the length of the datapath. Each input of the functional units is at-
tached to a multiplexer which is configured to select one of eight busses. Each
output of the functional units is attached to a demultiplexer comprised of tristate
drivers, each driving one of eight busses. Each output driver can be configured
independently, which allows an output to fan out to several busses, or none at
all if the functional unit is not being used. The assignment of operations to func-
tional units must be done so there is a bus segment available to connect units
that communicate.
The busses in different tracks are segmented into different lengths so that
bus tracks are used efficiently. In several tracks, adjacent bus segments can be
connected together using either a buffer or a register. This bus connector is
shown in Figure 2 and is represented in Figure 1 as a pair of lines between bus
segments. The connection is active and can drive in either direction but not both
at once. Many of the registers in a pipelined computation can be implemented
using these bus pipeline registers. In theory, all the bus segments in one track
could be connected together by bus connectors configured as bypass buffers to
provide a broadcast signal the length of the array. In practice, the delay is much
too long and all signals are pipelined to some degree.
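For concreteness, a bus connector's behavior can be sketched in a few lines of Python. This is our illustrative model, not the hardware description; the configuration names are ours.

    # Sketch (ours) of a bus connector joining two adjacent bus segments.
    class BusConnector:
        def __init__(self, left_to_right=True, registered=True):
            self.left_to_right = left_to_right  # static: drive direction
            self.registered = registered        # static: register vs. bypass buffer
            self.reg = 0                        # pipeline register contents

        def step(self, left_seg, right_seg):
            """One clock cycle: return the value driven onto the output segment."""
            src = left_seg if self.left_to_right else right_seg
            out = self.reg if self.registered else src  # bypass is combinational
            self.reg = src
            return out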
[Figure 2: a connector between a left and a right bus segment, configurable as left-to-right or right-to-left, through either a register or a bypass buffer.]
Fig. 2. Bus connectors can be used to connect adjacent bus segments via a buffer
or a register.
Functional unit outputs are registered, although this output register can
be bypassed via configuration control. Functional units may additionally be
pipelined internally depending on their complexity. These pipeline registers can
also be bypassed if appropriate.
RaPiD-I operates on 16-bit signed or unsigned fixed-point data, whose fixed-
point format is maintained via shifters in the multipliers. Different fixed-point
representations can be used in the same application by appropriately configuring
the different shifters in the datapath. An extra tag bit is associated with each
data value to indicate whether an overflow has occurred. Once set, the overflow
tag is propagated to all results. The datapath thus generates no exceptions during
operation, but incorporates them into the data produced.
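The tagged-overflow discipline can be illustrated with a short sketch. This is our own encoding of a value as a (value, tag) pair, for illustration only:

    # Illustrative tagged 16-bit add (encoding ours): any out-of-range result
    # sets the tag, and a set tag sticks to every downstream result.
    MIN16, MAX16 = -(1 << 15), (1 << 15) - 1

    def wrap16(x):
        """Reinterpret the low 16 bits as a signed value."""
        x &= 0xFFFF
        return x - 0x10000 if x & 0x8000 else x

    def tagged_add(a, b):
        (av, a_ovf), (bv, b_ovf) = a, b
        s = av + bv
        overflow = a_ovf or b_ovf or not (MIN16 <= s <= MAX16)
        return (wrap16(s), overflow)  # no exception raised; tag travels with data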
The ALUs perform the usual logical and arithmetic operations on 16-bit
data. The two ALUs in a cell can be combined to perform a pipelined 32-bit
operation, most typically as a 32-bit add for multiply-accumulate computations.
The ALU output register can be used as the accumulator for multiply-accumulate
operations.
The multiplier multiplies two 16-bit numbers and produces a 32-bit result,
shifted by a statically programmed amount to maintain the appropriate fixed-
point representation. Both 16-bit halves of the result are available as output via
separate bus drivers. Either driver can be turned off to drop the corresponding
output if it is not needed. The multiplier uses a modified Booth's algorithm and
includes one configurable pipeline register.
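A behavioral sketch (ours) of the multiplier's fixed-point product; the shift amount and the Q15 example are illustrative assumptions:

    # Sketch of the multiplier's statically shifted product (parameters ours).
    def fixed_mul(x, y, shift):
        """16-bit x 16-bit -> 32-bit product, realigned by a static shift.

        For Q15 operands (15 fractional bits), shift=15 keeps the result in Q15.
        """
        p = (x * y) >> shift          # statically programmed realignment
        lo = p & 0xFFFF               # low half, on one bus driver
        hi = (p >> 16) & 0xFFFF       # high half, on a separate bus driver
        return hi, lo                 # either driver may be turned off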
[Figure 3: a datapath register with clr, load, and bypass controls, reading from and driving the busses.]
The registers in the datapath are used to store constants and temporary val-
ues as well as to create pipelines of different lengths. These registers are completely
general, unlike the registers found in the bus connectors and functional units,
which are used only for pipelining. Figure 3 shows the design of the datapath
registers. The datapath register inputs and outputs are connected to the busses
just like other functional units. One configuration signal controls whether the
output is driven by the register or the bypass path. This bypass is used to con-
nect a bus segment on one track to a bus segment in a different track. The load
and clear signals control the operation of the register. As discussed in Section 3,
these control signals must be set dynamically. While datapath registers are very
general, they are expensive in terms of both area and bus utilization. While
the datapath registers themselves are relatively small, their input multiplexer
and output drivers are quite large. Wiring the input and output of a datapath
register usually requires bus segments in two different tracks, which consumes
extra routing resources. Thus the bus pipeline registers and the functional unit
registers are used whenever possible.
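The register's behavior under its static bypass bit and dynamic load and clear signals can be summarized as follows. The modeling is ours; the signal names follow Figure 3:

    # Behavioral sketch (ours) of a datapath register (Figure 3).
    class DatapathRegister:
        def __init__(self, bypass=False):
            self.bypass = bypass   # static: drive output from register or bypass
            self.value = 0

        def step(self, from_bus, load=False, clr=False):
            if clr:                # dynamic control, set by the control path
                self.value = 0
            elif load:
                self.value = from_bus
            # Bypass connects a segment on one track to one on another track.
            return from_bus if self.bypass else self.value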
A limited amount of local memory is provided in the datapath for saving
and reusing data over many cycles. In many applications, the input or output
data is segmented into blocks that are accessed once, saved locally and reused as
needed, and then discarded. Local memory can also be used for constant arrays.
RaPiD-I includes three local memories per cell. The input and output data lines
are connected to busses as in other functional units. Because of the time needed
to read and write memory, configurable registers are included on both the input
and output data ports. The memory address is supplied either by a data bus or
by a local address generator, shown in Figure 4, that supports simple sequential
memory access. If values are read and written to consecutive addresses, which
is the most common case, then the memory address generator can supply the
addresses without using datapath resources.
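A behavioral sketch (ours) of the address generator; the priority among load, clear, and increment is our assumption:

    # Sketch (ours) of the local-memory address generator (Figure 4).
    class AddressGenerator:
        def __init__(self):
            self.addr = 0

        def step(self, load=False, clear=False, inc=True, address_in=0):
            if clear:
                self.addr = 0
            elif load:
                self.addr = address_in  # address supplied from a data bus
            elif inc:
                self.addr += 1          # the common sequential-access case
            return self.addr            # otherwise hold the current address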
[Figure 4: a local memory with DataIn/DataOut ports; its address comes either from AddressIn or from an address generator (a +1 incrementer with inc/hold, load/clear/count, and R/W controls).]
Input and output data enter and exit the datapath via I/O streams at each
end of the datapath. These streams act as the interface to external memory. Each
stream contains a FIFO which is filled with data required by the computation or
with results produced by the computation. The data for each stream is associated
with a predetermined block of memory from which it is read or to which it is
written. The datapath reads from an input stream to obtain the next input data
value and writes to an output stream to store a result. Address generation and
memory reads and writes are handled entirely by the I/O streams themselves.
The I/O stream FIFOs operate asynchronously: if the datapath reads a value
from an empty FIFO or writes a value to a full FIFO, the datapath is stalled
until the FIFO is ready.
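The stall semantics are those of a bounded FIFO, as the following sketch (ours) illustrates; returning None/False stands in for holding the datapath:

    from collections import deque

    # Sketch (ours) of an I/O stream FIFO: the datapath stalls on an empty
    # read or a full write until the memory side catches up.
    class StreamFIFO:
        def __init__(self, depth):
            self.q, self.depth = deque(), depth

        def datapath_read(self):
            if not self.q:
                return None       # stall: hold until data arrives from memory
            return self.q.popleft()

        def datapath_write(self, v):
            if len(self.q) >= self.depth:
                return False      # stall: hold until space frees up
            self.q.append(v)
            return True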
3 Datapath Control
For the most part, the signals that control the operation of the functional units
and their interconnection can be static over an entire application. However,
there are almost always some control signals that must be dynamic. For ex-
ample, constants are loaded into datapath registers during initialization but
then remain unchanged. The load signals of the datapath registers thus take
on different values during initialization and computation. More complex exam-
ples include double-buffering the local memories and performing data-dependent
calculations.
The control signals are thus divided into static control signals provided by
configuration memory as in ordinary FPGAs, and dynamic control signals which
must be provided on every cycle. RaPiD is programmed for a particular application
by first mapping the computation onto a datapath pipeline. The static program-
ming bits are used to construct this pipeline and the dynamic programming bits
are used to schedule the operations of the computation onto the datapath over
time. A controller is programmed to generate the dynamic information needed
to produce the dynamic programming bits.
Of the 230 control signals in a RaPiD-I cell, 80 are dynamic. Thus there is a
total of over 1200 dynamic control signals for the entire datapath. While config-
uration memory is relatively cheap, producing and communicating the dynamic
control signals on every cycle, using a standard microprogram for example, would
be very expensive.
The problem of generating dynamic control signals is solved using a control path
which parallels the datapath. RaPiD applications map into pipelines of similar,
if not identical, repeating pipeline stages. The control signals of these stages are
thus similar as well, except that their values are skewed in time in the same way
the data passing through the pipeline is skewed in time.
The control path is thus a set of segmented busses containing configurable
pipeline registers through which control signal values are sent from one end of the
datapath to the other. Control values are inserted at one end of the control path
and are passed from stage to stage where they are applied to the appropriate
control signals. The configurable pipeline registers allow different control signals
to travel at different rates through the control path.
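A small simulation (ours) makes the skew concrete: a control value inserted at one end reaches stage s after s register delays, assuming one pipeline register per stage:

    # Illustrative control-path skew, one register per stage (modeling ours).
    def run_control_path(inserted_values, num_stages):
        """inserted_values[t] enters stage 0 at cycle t; returns, per cycle,
        the control value visible at each stage."""
        regs = [0] * num_stages
        seen = []
        for v in inserted_values:
            regs = [v] + regs[:-1]   # shift control values one stage onward
            seen.append(list(regs))
        return seen

    # A pulse inserted at cycle 0 reaches stage 3 at cycle 3:
    # run_control_path([1, 0, 0, 0, 0], 4) ->
    # [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [0,0,0,0]]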
Generating the dynamic control signals is then accomplished by connecting
each dynamic control signal to a bus in the control path that carries the ap-
propriate value each cycle. The number of busses required in the control path
varies by application, but it is kept manageable because many control signals
have identical values. The values inserted into the control path are generated by
a simple microprogrammed controller whose microinstructions contain the dat-
apath control information in addition to looping constructs that allow datapath
instructions or instruction sequences to be repeated many times.
4 Example: FIR Filter
A finite impulse response (FIR) filter computes each output value as a weighted
sum of NumTaps input values; Figure 5 gives the algorithm and the computation
it induces.
[Figure 5: a loop over j = 0 to NumTaps-1 accumulating weight-input products; the unrolled computation for weights W0-W3.]
Fig. 5. (a) Algorithm for FIR filter. (b) Computation for NumTaps=4 and i=6.
As with most applications, there are a variety of ways to map a FIR filter to
RaPiD. The choice of mapping is driven by the parameters of both the RaPiD
array and the application. For example, if the number of taps is less than the
number of RaPiD multipliers, then each multiplier is assigned to multiply a
specific weight. The weights are first preloaded into datapath registers whose
outputs drive the input of a specific multiplier. Pipeline registers are used to
stream the X inputs and Y outputs. Since each Y output must see NumTaps
inputs, the X and Y busses are pipelined at different rates. Figure 6a shows a
schematic diagram for this implementation of a four-tap FIR filter. The X input
bus was chosen to be doubly pipelined and the Y bus singly pipelined.
Wires are annotated with the weight, input, and output values from a single
point in time during the computation phase.
[Figure 6(a): the IN bus feeds X values X9 X8 X7 X6 X5 X4 X3 X2 through a doubly pipelined X bus past multipliers holding W0-W3; an ALU chain accumulates the partial sums 0, Y8, Y7, Y6, Y5 toward OUT. Figure 6(b): two taps of the same structure placed on RaPiD cells.]
Fig. 6. (a) Schematic diagram for four-tap FIR filter, labeled at a point in time
(computing four parallel computations for y5, y6, y7, and y8). (b) Two taps of
the FIR filter mapped to the RaPiD array (this is replicated to form more taps).
This implementation maps easily to the RaPiD array, as shown for two taps in
Figure 6b. For clarity, all unused functional units are removed, and used busses
are highlighted. The bus connectors from Figure 1 are left open to represent
no connection and boxed to represent a register. The control for this mapping
consists of two phases of execution: loading the weights and computing the out-
put results. In the first phase, the weights are sent down the IN double pipeline
along with a singly pipelined control bit (not shown) which sets the state of each
datapath register to "LOAD". When the final weight is inserted, the control bit
is switched to "HOLD". Since the control bit travels twice as fast as the weights,
each datapath register will hold a unique weight. No special signals are required
to begin the computation; hence, the second phase is started the moment the
control bit is set to "HOLD".
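The mapping can be checked against a direct FIR implementation with a short simulation. The sketch below (ours) models the doubly pipelined X bus as two registers per tap and the singly pipelined Y bus as one; the indexing convention y_i = sum_j w_j * x_{i-j} is our assumption, chosen to be consistent with the snapshot in Figure 6a.

    # Simulation sketch (ours) of the Figure 6 mapping: weights stationary,
    # X doubly pipelined, Y singly pipelined, one output per cycle.
    def systolic_fir(weights, xs):
        T = len(weights)
        xa, xb, yreg = [0] * T, [0] * T, [0] * T   # per-tap pipeline registers
        outputs = []
        for t in range(len(xs) + 2 * T):           # extra cycles drain the pipe
            x_in = xs[t] if t < len(xs) else 0
            # Every stage adds its tap product to the partial sum from the left.
            y_new = [(yreg[k - 1] if k else 0) + weights[k] * xa[k]
                     for k in range(T)]
            xa, xb, yreg = [x_in] + xb[:-1], xa, y_new  # X moves 2 regs, Y moves 1
            outputs.append(yreg[-1])
        return outputs[T:]   # after a T-cycle latency, one output per cycle

    def direct_fir(weights, xs):
        return [sum(w * xs[i - k] for k, w in enumerate(weights) if i - k >= 0)
                for i in range(len(xs))]

    ws, data = [3, 1, 4, 1], list(range(10))
    assert systolic_fir(ws, data)[:len(data)] == direct_fir(ws, data)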
5 Performance
This section evaluates the sustained computation rate (ignoring initialization
and finalization) of mapping the FIR filter and matrix multiply to the RaPiD array.
These results are a function of both the RaPiD array parameters and the algo-
rithmic parameters. The parameters associated with the RaPiD array are the
clock rate in MHz ($\nu$), the number of cells ($S$), and the number of addressable
memory locations per cell ($M$). Because RaPiD by its very nature is heavily
pipelined, a conservative estimate of the RaPiD-I clock rate for a mapped appli-
cation is 100 MHz. In addition, conservative estimates of the number of RaPiD-I
cells and memory locations per cell are 16 and 96, respectively. Results will be
measured in MOPS or GOPS, where an operation is a single multiply-accumulate
combination. The maximum rate on RaPiD-I is 1.6 GOPS.
5.1 FIR Filter
The only algorithmic parameter affecting the sustained computation rate of the
FIR filter is the number of taps, $T$. The mapping described in Section 4 produces
one output per cycle and thus $\nu T$ MOPS, with the constraint that $T \le S$. For a
more general mapping restricting the taps to $T \le \frac{1}{3}MS$, the RaPiD array can
produce $\min(1, S/T)$ outputs per cycle and $\nu \min(T, S)$ MOPS (a simplified version
of a more complex formulation which is beyond the scope of this paper). For example,
with $\nu = 100$, $S = 16$, $M = 96$, and $T \ge 16$, RaPiD can sustain a rate of
1.6 GOPS on a FIR filter with up to 512 taps (and an unbounded number of
input values).
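A transcription (ours) of this rate model; the function name and argument order are our own:

    # Sustained FIR rate under the section's model (our transcription).
    def fir_mops(nu_mhz, S, M, T):
        assert T <= M * S / 3          # taps must fit the general mapping's bound
        return nu_mhz * min(T, S)      # min(1, S/T) outputs/cycle * T MACs each

    # fir_mops(100, 16, 96, 512) == 1600, i.e. 1.6 GOPS at 512 taps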
5.2 Matrix Multiply
Matrix multiply takes an X Y matrix A P and a Y Z matrix B and computes
the X Z matrix C = A B as cij = Yk=0 ?1 a b . Many dierent RaPiD
ik kj
mappings exist, each producing slightly dierent performance results. In one
implementation, the RaPiD array can produce min(1; YS ; 3YM ) operations per
1
cycle and min(Y; 13 M; S ) MOPS. With = 100, S = 16, M = 96, and Y 16,
RaPiD can perform a sustained rate of 1:6 GOPS (X and Z are unbounded).
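And the corresponding transcription (ours) of the matrix-multiply model:

    # Sustained matrix-multiply rate under the section's model (ours).
    def matmul_mops(nu_mhz, S, M, Y):
        results_per_cycle = min(1, S / Y, M / (3 * Y))
        return nu_mhz * Y * results_per_cycle   # = nu_mhz * min(Y, S, M/3)

    # matmul_mops(100, 16, 96, 64) == 1600.0, i.e. 1.6 GOPS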
6 Conclusions and Future Work
The RaPiD architecture potentially provides a very efficient reconfigurable plat-
form for implementing computationally intensive applications. Many applica-
tions have been mapped successfully by hand to RaPiD and simulated with very
promising results. However, there are several open problems which need to be
solved to make RaPiD truly successful.
- The domain of applicability must be explored by mapping more problems
from different domains to RaPiD.
- Thus far all RaPiD applications have been designed by hand. The next
step will be to apply compiler technology, particularly loop-transformation
theory[7] and systolic array compiling methods[4], to build a compiler for
RaPiD.
- A memory architecture must be designed which can support the I/O band-
width required by RaPiD over a wide range of applications.
- Although it is clear that RaPiD should be closely coupled to a generic RISC
processor, it is not clear exactly how this should be done. This is a problem
being faced by other reconfigurable computers.
Acknowledgments
We would like to thank the rest of the RaPiD team, Chris Fisher, Larry Mc-
Murchie, and Jeffrey Weener, for their contributions to the project.
References
1. J. M. Arnold, D. A. Buell, D. T. Hoang, D. V. Pryor, N. Shirazi, and M. R. This-
tle. The Splash 2 processor and applications. In Proceedings IEEE International
Conference on Computer Design: VLSI in Computers and Processors, pages 482-485.
IEEE Comput. Soc. Press, 1993.
2. H. T. Kung. Let's design algorithms for VLSI systems. Technical Report CMU-CS-
79-151, Carnegie-Mellon University, January 1979.
3. P. Lee and Z. M. Kedem. Synthesizing linear array algorithms from nested FOR
loop algorithms. IEEE Transactions on Computers, 37(12):1578-1598, 1988.
4. D. I. Moldovan and J. A. B. Fortes. Partitioning and mapping algorithms into fixed
size systolic arrays. IEEE Transactions on Computers, C-35(1):1-12, 1986.
5. J. E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. H. Touati, and P. Boucard.
Programmable active memories: reconfigurable systems come of age. IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, 4(1):56-69, 1996.
6. M. Wazlowski, L. Agarwal, T. Lee, A. Smith, E. Lam, P. Athanas, H. Silverman,
and S. Ghosh. PRISM-II compiler and architecture. In Proceedings IEEE Workshop
on FPGAs for Custom Computing Machines, pages 9-16. IEEE Comput. Soc. Press,
1993.
7. M. E. Wolf and M. S. Lam. A loop transformation theory and an algorithm to maxi-
mize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452-
471, 1991.