Digital Signal Processing With Field Programmable Gate Arrays
Fixed-point:
• Two's complement
• One's complement
• Signed magnitude
• Diminished by one
• Signed digit (SD)
• Logarithmic Number System (LNS)
• Residue Number System (RNS)

Floating-point:
• IEEE 754 format
  - Single precision: 8-bit exponent, 23-bit mantissa
  - Double precision: 11-bit exponent, 52-bit mantissa
• 19-bit Splash II format
Fixed point vs Floating point

                Advantages        Disadvantages
Floating point  High precision    Expensive in terms of area and power
                                  (computationally intensive)
Conventional coding of signed binary numbers
(2C = two's complement, 1C = one's complement, D1 = diminished by one, SM = signed magnitude)

Binary   2C   1C   D1   SM
011       3    3    4    3
010       2    2    3    2
001       1    1    2    1
000       0    0    1    0
111      -1   -0   -1   -3
110      -2   -1   -2   -2
101      -3   -2   -3   -1
100      -4   -3   -4   -0
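The four codings in the table can be checked with a small decoding sketch (a software model, not from the slides; `n` is the word length in bits):

```python
# Decode a 3-bit pattern under each coding in the table (n = word length).

def two_comp(bits, n=3):
    v = int(bits, 2)
    return v - (1 << n) if bits[0] == '1' else v        # MSB has weight -2^(n-1)

def one_comp(bits, n=3):
    v = int(bits, 2)
    return v - ((1 << n) - 1) if bits[0] == '1' else v  # note the -0 code (111)

def dim_one(bits, n=3):
    v = int(bits, 2)
    # Diminished by one: a positive pattern represents one more than its
    # unsigned value; negative patterns read as two's complement.
    return v + 1 if bits[0] == '0' else v - (1 << n)

def sign_mag(bits):
    m = int(bits[1:], 2)                                # magnitude bits
    return -m if bits[0] == '1' else m
```

Running each decoder over all eight 3-bit patterns reproduces the four columns above.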
Which representation is the best? Why?
• Virtually all modern computers operate on two's complement
representation.
• Why?
1. The hardware for the most common operations is faster
(the most common operation is addition).
2. The hardware is simpler.
Canonical Signed Digit (CSD)
• Cost of an adder and a subtractor is identical in hardware
• CSD notation uses the digits 1, 0 and -1 to represent numbers
• 7 = 0111 can also be represented as 100(-1), i.e., 8 - 1
• Reduces the number of non-zero bits in a number
• Reduces the size of multipliers
• Useful only for hardwired (constant) multiplications
• Great for communication and signal processing applications (filters, FFT, etc.)
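As a sketch of the recoding itself, this routine (one common CSD algorithm, an assumption rather than anything from the slides) converts a non-negative integer into CSD digits:

```python
def to_csd(n):
    """Convert a non-negative integer to canonical signed digit form:
    a list of digits in {-1, 0, 1}, LSB first, with no two adjacent
    non-zero digits."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)   # n % 4 == 3 -> -1 (inside a run of 1s), == 1 -> +1
            n -= d            # removing the digit leaves an even remainder
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits
```

For 7 this yields `[-1, 0, 0, 1]`, i.e. the 100(-1) form (8 - 1) from the bullet above.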
Carry-free Adder
[Truth table for signed-digit carry-free addition of inputs xk, yk]
• If we stop to consider that the scaling accumulator multiplier is really just a sum of vectors, then it
becomes obvious that we can rearrange the circuit.
Distributed arithmetic
• Here, the adder tree combines the 1 bit partial products before they are accumulated by the scaling
accumulator. All we have done is rearranged the order in which the 1xN partial products are summed. Now
instead of individually accumulating each partial product and then summing the results, we postpone the
accumulate function until after we’ve summed all the 1xN partials at a particular bit time. This simple
rearrangement of the order of the adds has effectively replaced N multiplies followed by an N input add
with a series of N input adds followed by a multiply. This arithmetic manipulation directly eliminates N-1
adders in an N product term multiply-accumulate function. For larger numbers of product terms, the
savings become significant.
Distributed arithmetic
• Further hardware savings are available when the coefficients Cn are constants. If that is true, then the
adder tree shown above becomes a boolean logic function of the 4 serial inputs. The combined 1xN products
and adder tree is reduced to a four input look-up table. The sixteen entries in the table are sums of the
constant coefficients for all the possible serial input combinations. The table is made wide enough to
accommodate the largest sum without overflow. Negative table values are sign extended to the width of the
table, and the input to the scaling accumulator should be sign extended to maintain negative sums.
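To make the LUT construction concrete, here is a small software model of a 4-term distributed-arithmetic sum of products. The coefficients and bit width are illustrative assumptions, and the inputs are kept unsigned to sidestep the sign-extension details described above:

```python
# Software model of a 4-term distributed-arithmetic sum of products
# y = C[0]*x[0] + ... + C[3]*x[3].  The coefficients are arbitrary examples.
C = [3, -5, 7, 2]

# 16-entry LUT: for each combination of one bit from each of the four
# serial inputs, store the sum of the selected coefficients.
LUT = [sum(c for k, c in enumerate(C) if (addr >> k) & 1)
       for addr in range(16)]

def da_mac(x, nbits=8):
    """Clock in one bit of each input per step, LSB first, and let a
    scaling accumulator weight each LUT output by its bit position.
    Inputs are unsigned here; signed inputs need sign extension."""
    acc = 0
    for t in range(nbits):
        addr = 0
        for k in range(4):
            addr |= ((x[k] >> t) & 1) << k   # one bit from each serial input
        acc += LUT[addr] << t                # scaling accumulator
    return acc
```

With `x = [10, 20, 30, 40]` this returns 220, matching the direct dot product with `C`.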
Distributed arithmetic
• Obviously the serial inputs limit the performance of such a circuit. As with most hardware applications, we
can obtain more performance by using more hardware. In this case, more than one bit sum can be
computed at a time by duplicating the LUT and adder tree as shown here. The second bit computed will
have a different weight than the first, so some shifting is required before the bit sums are combined. In this
2 bit at a time implementation, the odd bits are fed to one LUT and adder tree, while the even bits are
simultaneously fed to an identical tree. The odd bit partials are left shifted to properly weight the result and
added to the even partials before accumulating the aggregate. Since two bits are taken at a time, the
scaling accumulator has to shift the feedback by 2 places.
Distributed arithmetic
• This paralleling scheme can be extended to compute more than two bits at a time. In the extreme case, all
input bits can be computed in parallel and then combined in a shifting adder tree. No scaling accumulator
is needed in this case, since the output from the adder tree is the entire sum of products. This fully
parallel implementation has a data rate that matches the serial clock, which can be greater than 100 MS/S
in today's FPGAs.
Distributed arithmetic
• Most often, we have more than 4 product terms to accumulate. Increasing the size of the LUT might look
attractive until you consider that the LUT size grows exponentially. Considering the construction of the
logic we stuffed into the LUT, it becomes obvious that we can combine the results from the LUTs in an adder
tree. The area of the circuit grows roughly linearly (2n-1) using adder trees to expand it, rather than the
2^n growth experienced by increasing LUT size. For FPGAs, the most efficient use of the logic occurs when
we use the natural LUT size (usually a 4-LUT, although an 8-LUT would make sense if we were using an 8 input
block RAM) for the LUTs and then add the outputs of the LUTs together in an adder tree, as shown below.
Coordinate Rotation Digital Computer (CORDIC)
CORDIC is an iterative algorithm for calculating trig functions including
sine, cosine, magnitude and phase. It is particularly suited to hardware
implementations because it does not require any multiplies.
• Birth of CORDIC
CORDIC was introduced by Volder in 1959 to calculate trigonometric values such as sine and cosine.
In 1971, Walther extended the algorithm to calculate hyperbolic, logarithmic and other functions.
• Advantages
This algorithm uses only minimal hardware (adders and shifters) to compute trigonometric and other
function values, so it consumes fewer resources than alternative techniques. Thus, almost all
scientific calculators use the CORDIC algorithm in their calculations.
CORDIC Principle
• Principle
CORDIC works by rotating the coordinate system through a sequence of fixed angles until the residual angle
is reduced to zero. At each step the remaining angle is reduced using only addition, subtraction and shift
operations. Now let us see how we can calculate sine and cosine values using CORDIC. Consider a vector C
with coordinates (X, Y) that is to be rotated through an angle σ. The new coordinates (X′, Y′) after
rotation are

    X′ = X cos σ - Y sin σ
    Y′ = Y cos σ + X sin σ
1. Get the angle and store it in Angle. Store the pre-calculated arctan(2^-i)
values in an array
2. Assign X = 0.607252935 (i.e., X = T, the inverse of the CORDIC gain), Y = 0
3. At iteration i, find X′ = X·2^-i and Y′ = Y·2^-i (shift operations)
4. If sign of Angle is positive then
X = X - Y′
Y = Y + X′
Angle = Angle - arctan(2^-i)
else (if sign of Angle is negative)
X = X + Y′
Y = Y - X′
Angle = Angle + arctan(2^-i)
5. Repeat steps (3) and (4) until Angle approaches 0
6. Print "Value of cos =" X
7. Print "Value of sin =" Y
8. Exit
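The steps above can be sketched in software (the 24-iteration count and the double-precision arctan table are illustrative choices, not from the slides):

```python
import math

def cordic_sin_cos(angle, iterations=24):
    """Rotation-mode CORDIC following the steps above.  X starts at
    0.607252935 (the inverse of the CORDIC gain), so no final scaling
    is needed.  Valid for angles within CORDIC's convergence range
    (about +/- 1.74 rad)."""
    x, y = 0.607252935, 0.0
    for i in range(iterations):
        # Step 3: X' and Y' are the shifted terms X*2^-i and Y*2^-i
        dx, dy = x / (1 << i), y / (1 << i)
        if angle >= 0:                       # step 4, positive branch
            x, y = x - dy, y + dx
            angle -= math.atan(2.0 ** -i)
        else:                                # step 4, negative branch
            x, y = x + dy, y - dx
            angle += math.atan(2.0 ** -i)
    return y, x                              # (sin, cos)
```

A hardware version replaces the divisions by arithmetic right shifts and reads the arctan values from a small ROM.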
Scaling Accumulator Multipliers
• Parallel by serial algorithm
• Iterative shift add routine
• N clock cycles to complete
• Very compact design
• Serial input can be MSB or LSB first depending on direction of shift in accumulator
• Parallel output
1 1011001
0 0000000
1 1011001
1 +1011001
10010000101
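A software model of the shift-add iteration (the bit-serial hardware examines one multiplier bit per clock; here the loop plays that role):

```python
def scaling_accumulator_multiply(a, b, nbits=4):
    """Parallel-by-serial shift-add multiply: one multiplier bit of b
    is examined per clock (LSB first); when it is 1, the parallel
    input a is added at the current bit weight."""
    acc = 0
    for i in range(nbits):
        if (b >> i) & 1:
            acc += a << i   # the hardware shifts the accumulator instead
    return acc
```

With `a = 0b1011001` and `b = 0b1101` this reproduces the example above: `0b10010000101`.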
Serial by Parallel Booth Multipliers
• Bit serial adds eliminate need for carry chain
• Well suited for FPGAs without fast carry logic
• Serial input LSB first
• Serial output
• Routing is all nearest neighbor except serial input which is broadcast
• Latency is one bit time
The simple serial by parallel Booth multiplier is particularly well suited for bit serial processors
implemented in FPGAs without carry chains because all of its routing is to nearest neighbors with the
exception of the input. The serial input must be sign extended to a length equal to the sum of the lengths
of the serial input and parallel input to avoid overflow, which means this multiplier takes more clocks to
complete than the scaling accumulator version. This is the structure used in the venerable TTL serial by
parallel multiplier.
Ripple Carry Array Multipliers
• Row ripple form
• Unrolled shift-add algorithm
• Delay is proportional to N
A ripple carry array multiplier (also called row ripple form) is an unrolled embodiment of the classic
shift-add multiplication algorithm. The illustration shows the adder structure used to combine all the bit
products in a 4x4 multiplier. The bit products are the logical AND of the bits from each input. They are
shown in the form x,y in the drawing. The maximum delay is the path from either LSB input to the MSB of the
product, and is the same (ignoring routing delays) regardless of the path taken. The delay is approximately
2*n.
This basic structure is simple to implement in FPGAs, but does not make efficient use of the logic in many
FPGAs, and is therefore larger and slower than other implementations.
Row Adder Tree Multipliers
• Optimized Row Ripple Form
• Fundamentally same gate count as row ripple form
• Row Adders arranged in tree to reduce delay
• Routing more difficult, but workable in most FPGAs
• Delay proportional to log2(N)
Row adder tree multipliers rearrange the adders of the row ripple multiplier to equalize the number of
adders the results from each partial product must pass through. The result uses the same number of adders,
but the worst case path is through log2(n) adders instead of through n adders. In strictly combinatorial
multipliers, this reduces the delay. For pipelined multipliers, the clock latency is reduced. The tree
structure of the routing means some of the individual wires are longer than in the row ripple form. As a
result, a pipelined row ripple multiplier can have a higher throughput in an FPGA (shorter clock cycle)
even though the latency is increased.
Carry Save Array Multipliers
• Column ripple form
• Fundamentally same delay and gate count as row ripple form
• Gate level speed ups available for ASICs
• Ripple adder can be replaced with faster carry tree adder
• Regular routing pattern
Look-Up Table (LUT) Multipliers
• Complete times table of all possible input combinations
• One address bit for each bit in each input
• Table size grows exponentially
• Very limited use
• Fast - result is just a memory access away
Look-Up Table multipliers are simply a block of memory containing a complete multiplication table of all
possible input combinations. The large table sizes needed for even modest input widths make these
impractical for FPGAs.
The following table is the contents of a 6 input LUT for a 3 bit by 3 bit multiplication table.
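A software sketch of that 3x3 times table (the address is simply the concatenation of the two 3-bit inputs):

```python
# Complete times table for a 3-bit x 3-bit multiplier: 64 entries,
# addressed by the concatenation of the two inputs.
LUT = [(addr >> 3) * (addr & 0b111) for addr in range(64)]

def lut_multiply(a, b):
    # The product is just a memory access away
    return LUT[(a << 3) | b]
```

Doubling either input width quadruples the table, which is the exponential growth the bullets warn about.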
• The LUT could be replaced by any other multiplier implementation, since LUT is being used as a multiplier.
This gives the insight into how to combine multipliers of an arbitrary size to obtain a larger multiplier.
• The LUT multipliers shown have matched radices (both inputs are octal). The partial products can also
have mixed radices on the inputs provided care is taken to make sure the partial products are shifted
properly before summing. Where the partial products are obtained with small LUTs, the most efficient
implementation occurs when the LUT is square (i.e., the input radices are the same). For 8 bit LUTs, such
as might be found in an Altera 10K FPGA, this means the LUT radix is hexadecimal. For 4 bit LUTs, found in
many FPGA logic cells, the ideal radix is 2 bits (this is really the only option for a 4-LUT: a 1 bit input
reduces the LUT to an AND gate, and since each LUT cell has 1 output, it can only use one bit on the other
input).
• A more compact but slower version is possible by computing the partial products sequentially using one
LUT and accumulating the results in a scaling accumulator. Note that in this case, the shifter would need a
special control to obtain the proper shift on all the partials.
Computed Partial Product Multipliers
• Partial product optimization for FPGAs having small LUTs
• Fewer partial products decrease depth of adder tree
• 2 x n bit partial products generated by logic rather than LUT
• Smaller and faster than 4 LUT partial product multipliers
A partial product multiplier constructed from the 4-LUTs found in many FPGAs is not very efficient because
of the large number of partial products that need to be summed (and the large number of LUTs required).
A more efficient multiplier can be made by recognizing that a 2 bit input to a multiplier produces a
product 0, 1, 2 or 3 times the other input. All four of these products are easily generated in one step
using just an adder and shifter. A multiplexer controlled by the 2 bit multiplicand selects the appropriate
product as shown below. Unlike the LUT solution, there is no restriction on the width of the A input to the
partial product. This structure greatly reduces the number of partial products and the depth of the adder
tree. Since the 0x, 1x, 2x and 3x inputs to the multiplexers for all the partial products are the same, one
adder can be shared by all the partial product generators. This structure works well in coarser grained
FPGAs like the Xilinx 4K series.
2 x n bit computed partial product implemented in Xilinx Virtex using special MULT
AND gate in carry chain logic
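A software model of the radix-4 scheme described above (the shared adder forms the 3x term once; the widths are illustrative):

```python
def computed_partial_product_multiply(a, b, nbits=8):
    """Radix-4 multiply sketch: consume 2 bits of b per step; a mux
    picks 0x, 1x, 2x or 3x of a, with 3x from one shared adder."""
    x3 = a + (a << 1)              # the single shared '3x' adder
    products = [0, a, a << 1, x3]  # mux inputs: 0x, 1x, 2x, 3x
    acc = 0
    for i in range(0, nbits, 2):
        acc += products[(b >> i) & 3] << i   # weight the selected partial
    return acc
```

Compared to forming one partial product per bit of `b`, this halves the number of partials to be summed.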
Constant Coefficient Multipliers
• Multiplies input by a constant
• LUT contains a custom times table
• Width of constants does not affect depth of adder tree
• All LUT inputs available for multiplicand input
• More efficient than full multiplier
• Size is constant regardless of value of constant (assuming equal constant bit widths)

A full multiplier accepts the full range of inputs for each multiplicand. If one of the multiplicands is a
constant, then it is far more efficient to construct a times table that only has the column corresponding
to the constant value. These are known as constant (K) Coefficient Multipliers or KCMs. The example below
multiplies a 5 bit input (values 0 to 31) by a constant 67. Note that with a constant multiplier, all of
the LUT inputs are available for the variable multiplicand. This makes the KCM more efficient than a full
multiplier (fewer partial products for a given width).

When the LUT does not offer enough inputs to accommodate the desired variable width, several identical
LUTs may be combined using the partial products techniques discussed above. In this case, the constant
multiplicand is full width, so the partial products will be m x n where m is the number of LUT inputs and
n is the width of the constant.

5 bit input * 67 (rows: low 3 bits of input, columns: high 2 bits)

       00    01    10    11
000     0   536  1072  1608
001    67   603  1139  1675
010   134   670  1206  1742
011   201   737  1273  1809
100   268   804  1340  1876
101   335   871  1407  1943
110   402   938  1474  2010
111   469  1005  1541  2077
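The times table above can be generated programmatically (the hi/lo split mirrors the table: 2 high bits select the column, 3 low bits the row):

```python
# Generate the KCM times table for a 5-bit input times the constant 67.
K = 67
table = {(hi, lo): ((hi << 3) | lo) * K
         for hi in range(4) for lo in range(8)}
```

Each entry is just (input value) * 67, so the table has exactly one "column" of the full times table, which is where the KCM's efficiency comes from.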
Limited Set LUT Multipliers
• Multiplies input by one of a small set of constants
• Similar to KCM multiplier
• LUT input bit(s) select which constant to use
• Useful in modulators and other signal processing applications

In signal processing, there are often instances where one multiplicand is taken from a small set of
constant values. In these cases, the KCM multiplier can be extended so that the LUT contains the times
tables for each constant. One or more of the LUT inputs select which constant is used, while the remaining
inputs are for the variable multiplicand. The example below is a 6 LUT containing times tables for the
constants 67 and 85. One bit of the input selects which times table is used. The remaining inputs are the
5 bit variable multiplicand (values from 0 to 31). Again, the input width can be extended using the partial
product techniques discussed previously.

(rows: low 3 bits of input; columns: select bit + high 2 bits)

        5 bit input * 67            5 bit input * 85
      000   001   010   011      100   101   110   111
000     0   536  1072  1608        0   680  1360  2040
001    67   603  1139  1675       85   765  1445  2125
010   134   670  1206  1742      170   850  1530  2210
011   201   737  1273  1809      255   935  1615  2295
100   268   804  1340  1876      340  1020  1700  2380
101   335   871  1407  1943      425  1105  1785  2465
110   402   938  1474  2010      510  1190  1870  2550
111   469  1005  1541  2077      595  1275  1955  2635
Constant Multipliers from Adders
• Adder for each '1' bit in constant
• Subtractor replaces strings of '1' bits using Booth recoding
• Efficiency, size depend on value of constant
• KCM multipliers are usually more efficient for arbitrary constant values
The shift-add multiply algorithm essentially produces m 1xn partial products and sums them together with
appropriate shifting. The partial products corresponding to '0' bits in the constant are zero, and
therefore do not have to be included in the sum. If the number of '1' bits in a constant coefficient
multiplier is small, then a constant multiplier may be realized with wired shifts and a few adders as shown
in the 'times 10' example below.

0  0000000
1  1011001
0  0000000
1 +1011001
----------
1101111010
In cases where there are strings of '1' bits in the constant, adders can be eliminated by using Booth
recoding methods with subtractors. The 'times 14' example below illustrates this technique. Note that
14 = 2+4+8 can be expressed as 14 = 16-2, which reduces the number of partial products.

14 = 2 + 4 + 8 (three partials):     14 = 16 - 2 (two partials):

 0  0000000                           0  0000000
 1  1011001                          -1  1110100111
 1  1011001                           0  0000000
 1 +1011001                           0  0000000
-----------                           1 +1011001
10011011110                          -----------
                                     10011011110
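The two ways of building the 'times 14' multiplier can be modeled directly (wired shifts plus adders versus Booth's subtractor form):

```python
def times14_adders(x):
    # 14 = 2 + 4 + 8: three wired shifts summed with two adders
    return (x << 1) + (x << 2) + (x << 3)

def times14_booth(x):
    # Booth recoding: 14 = 16 - 2, so one subtractor replaces two adders
    return (x << 4) - (x << 1)
```

Both produce the same product; the Booth form simply trades two adders for one subtractor, which the first bullet of this section noted costs the same as an adder in hardware.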
Constant Multipliers from Adders
• Combinations of partial products can sometimes also be shifted and added to reduce the number of
partials, although this may not necessarily reduce the depth of a tree. For example, the 'times 1/3'
approximation (85/256 = 0.332) below uses fewer adders than would be necessary if all the partial products
were summed directly. Note that the shifts are in the opposite direction to obtain the fractional partial
products.
• Clearly, the complexity of a constant multiplier constructed from adders is dependent upon the constant.
For an arbitrary constant, the KCM multiplier discussed above is a better choice. For certain 'quick and
dirty' scaling applications, this multiplier works nicely.
Wallace Trees
• Optimized column adder tree
• Combines all partial products into 2 vectors (carry and sum)
• Carry and sum outputs combined using a conventional adder
• Delay is log(n)
• Wallace tree multiplier uses Wallace tree to combine 1 x n partial products
• Irregular routing
• Not optimum in many FPGAs
A Wallace tree is an implementation of an adder tree designed for minimum propagation delay. Rather than
completely adding the partial products in pairs like the ripple adder tree does, the Wallace tree sums up
all the bits of the same weights in a merged tree. Usually full adders are used, so that 3 equally weighted
bits are combined to produce two bits: one (the carry) with weight of n+1 and the other (the sum) with
weight n. Each layer of the tree therefore reduces the number of vectors by a factor of 3:2 (another
popular scheme obtains a 4:2 reduction using a different adder style that adds little delay in an ASIC
implementation). The tree has as many layers as is necessary to reduce the number of vectors to two (a
carry and a sum). A conventional adder is used to combine these to obtain the final product. The structure
of the tree is shown below. The red numbers after each full adder in the illustration indicate the bit
weights of each signal. For a multiplier, this tree is pruned because the input partial products are
shifted by varying amounts.
Wallace Trees
A section of an 8 input Wallace tree. The Wallace tree combines the 8 partial product inputs into two
output vectors corresponding to a sum and a carry. A conventional adder is used to combine these outputs
to obtain the complete product.
Wallace Trees
• If you trace the bits in the tree (the tree in the illustration is color coded to help in this regard),
you will find that the Wallace tree is a tree of carry-save adders arranged as shown to the left. A carry
save adder consists of full adders like the more familiar ripple adders, but the carry output from each bit
is brought out to form a second result vector rather than being wired to the next most significant bit.
The carry vector is 'saved' to be combined with the sum later, hence the carry-save moniker.
• A Wallace tree multiplier is one that uses a Wallace tree to combine the partial products from a field of
1xn multipliers (made of AND gates). It turns out that the number of carry save adders in a Wallace tree
multiplier is exactly the same as used in the carry save version of the array multiplier. The Wallace tree
rearranges the wiring, however, so that the partial product bits with the longest delays are wired closer
to the root of the tree. This changes the delay characteristic from O(n*n) to O(n*log(n)) at no gate cost.
Unfortunately the nice regular routing of the array multiplier is also replaced with a rat's nest.
Wallace Trees
• A Wallace tree by itself offers no performance advantage over a ripple adder tree
• To the casual observer, it may appear the propagation delay through a ripple adder tree is the carry
propagation multiplied by the number of levels, or O(n*log(n)). In fact, the ripple adder tree delay is
really only O(n + log(n)) because the delays through the adders' carry chains overlap. This becomes
obvious if you consider that the value of a bit can only affect bits of the same or higher significance
further down the tree. The worst case delay is then from the LSB input to the MSB output (and disregarding
routing delays is the same no matter which path is taken). The depth of the ripple tree is log(n), which is
about the same as the depth of the Wallace tree. This means that the ripple carry adder tree's delay
characteristic is similar to that of a Wallace tree followed by a ripple adder! If an adder with a faster
carry tree scheme is used to sum the Wallace tree outputs, the result is faster than a ripple adder tree.
The fast carry tree schemes use more gates than the equivalent ripple carry structure, so the Wallace tree
normally winds up being faster than a ripple adder tree, and less logic than an adder tree constructed of
fast carry tree adders. An ASIC implementation may also benefit from some 'go faster' tricks in carry save
adders, so a Wallace tree combined with a fast adder can offer a significant advantage there.
• A Wallace tree is often slower than a ripple adder tree in an FPGA
• Many FPGAs have a highly optimized ripple carry chain connection. Regular logic connections are several
times slower than the optimized carry chain, making it nearly impossible to improve on the performance of
the ripple carry adders for reasonable data widths (at least 16 bits). Even in FPGAs without optimized
carry chains, the delays caused by the complex routing can overshadow any gains attributed to the Wallace
tree structure. For this reason, Wallace trees do not provide any advantage over ripple adder trees in many
FPGAs. In fact, due to the irregular routing, they may actually be slower and are certainly more difficult
to route.
Booth Recoding
• Booth recoding is a method of reducing the
number of partial products to be summed.
Booth observed that when strings of '1' bits
occur in the multiplier, the number of partial
products can be reduced by using subtraction.
For example, the multiplication of 89 by 15 shown
below has four 1xn partial products that must be
summed. This is equivalent to the subtraction
shown in the right panel.
15 = 1111 (four partials):     15 = 16 - 1 (two partials):

 1  1011001                    -1  1110100111
 1  1011001                     0  0000000
 1  1011001                     0  0000000
 1 +1011001                     0  0000000
-----------                     1 +1011001
10100110111                    ------------
                               10100110111
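A sketch of the recoding rule itself (digit_i = b(i-1) - b(i), a common formulation of Booth's observation; digits come out LSB first):

```python
def booth_recode(n):
    """Recode a non-negative multiplier into digits {-1, 0, 1}, LSB
    first, using digit_i = b(i-1) - b(i): each run of '1' bits becomes
    a -1 at its low end and a +1 just above its high end."""
    digits, prev = [], 0
    for i in range(n.bit_length() + 1):
        cur = (n >> i) & 1
        digits.append(prev - cur)   # compare each bit with its neighbor
        prev = cur
    return digits
```

For 15 (binary 1111) this yields `[-1, 0, 0, 0, 1]`, i.e. 16 - 1, matching the right-hand panel of the example above.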