Digital Signal Processing With Field Programmable Gate Arrays

This document discusses various number representation systems used in digital signal processing, including fixed-point, floating-point, logarithmic, residue number systems, and canonical signed digit representations. It also covers computer arithmetic topics like addition, multiplication, and distributed arithmetic techniques to optimize hardware implementations for signal processing applications.

Uploaded by

daruri_sumanth
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
353 views

Digital Signal Processing With Field Programmable Gate Arrays

This document discusses various number representation systems used in digital signal processing, including fixed-point, floating-point, logarithmic, residue number systems, and canonical signed digit representations. It also covers computer arithmetic topics like addition, multiplication, and distributed arithmetic techniques to optimize hardware implementations for signal processing applications.

Uploaded by

daruri_sumanth
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 42

Digital Signal Processing with Field Programmable Gate Arrays

2. Computer Arithmetic
Survey of number representations

NUMBER SYSTEMS
• Fixed-point
  – conventional: two's complement, one's complement, signed magnitude, diminished-by-one
  – unconventional: signed digit (SD), logarithmic number system (LNS), residue number system (RNS)
• Floating-point
  – conventional: IEEE 754
      single precision: 8-bit exponent, 23-bit mantissa
      double precision: 11-bit exponent, 52-bit mantissa
  – unconventional: 19-bit Splash II format
Fixed point vs. floating point

                  Advantages                  Disadvantages
Fixed point       Area- and power-efficient   Limited dynamic range and accuracy
Floating point    High precision              Expensive in area and power (computationally intensive)
Conventional coding of signed binary numbers
(2C = two's complement, 1C = one's complement, D1 = diminished-by-one, SM = signed magnitude)

Binary   2C   1C   D1   SM
011       3    3    4    3
010       2    2    3    2
001       1    1    2    1
000       0    0    1    0
111      -1   -0   -1   -3
110      -2   -1   -2   -2
101      -3   -2   -3   -1
100      -4   -3   -4   -0
Which representation is the best? Why?
• Virtually all modern computers operate on two's complement representation
• Why?
1. the hardware for the most common operations is faster (the most common operation is addition)
2. the hardware is simpler
Canonical Signed Digit (CSD)
• The cost of an adder and a subtractor is identical in hardware
• CSD notation uses the digits 1, 0 and -1 to represent numbers
• 7 = 0111 can also be represented as 100-1 (i.e., 8 - 1)
• Reduces the number of non-zero bits in a number
• Reduces the size of multipliers
• Useful only for hardwired (constant) multiplications
• Great for communication and signal processing applications (filters, FFT, etc.)
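The recoding itself can be sketched in a few lines; this is a standard CSD algorithm written here as an illustration, not taken from the slides:

```python
def to_csd(n):
    """Canonical signed digit recoding with digits {-1, 0, 1}, LSB first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d            # removing the digit guarantees the next bit is 0
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

# 7 = 0111 becomes 100-1: only two non-zero digits instead of three
print(to_csd(7))   # [-1, 0, 0, 1], LSB first
```

Each non-zero digit costs one adder (for +1) or subtractor (for -1) in a hardwired constant multiplier, which is why fewer non-zero digits means smaller hardware.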
Carry-free Adder

Signed-digit addition rules (the digit -1 is written explicitly; each column satisfies x_k + y_k = 2·c_k + u_k):

x_k y_k                    00    01        01         0-1       0-1        11    -1-1
x_k-1 y_k-1                -     neither   at least   neither   at least   -     -
                                 is -1     one is -1  is -1     one is -1
c_k (carry of kth bit)     0     1         0          0         -1         1     -1
u_k (interim sum)          0     -1        1          -1        1          0     0
Logarithmic Number Systems (LNS)
• Multiplication, division, roots and powers are simple
• Reduced operations: multiplication → addition, division → subtraction, powers → multiplication, roots → division
• More complex: addition and subtraction
• Major problem: the logarithm and antilogarithm (the X in logX) must convert to and from conventional numbers quickly and reliably, and this conversion always yields an approximate result. Binary logarithms are therefore useful only in arithmetic units dedicated to special applications (e.g., real-time digital filters) in which very few conversions are required but many multiplications and divisions are performed.
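The core idea — multiply in the log domain by adding fixed-point exponents — can be sketched as follows; the 16-bit fractional precision is an assumption for the example, and the conversions are the approximate step the slide warns about:

```python
import math

FRAC = 16  # assumed fractional bits for the fixed-point base-2 logarithm

def to_lns(x):
    """Encode a positive value as a fixed-point base-2 logarithm (the lossy conversion)."""
    return round(math.log2(x) * (1 << FRAC))

def from_lns(e):
    """Antilogarithm: decode back to a conventional number (also approximate)."""
    return 2.0 ** (e / (1 << FRAC))

# Multiplication reduces to a single fixed-point addition in the log domain
a, b = to_lns(3.0), to_lns(7.0)
product = from_lns(a + b)   # close to 21, up to conversion error
```

Note that the result is only approximately 21: the rounding in `to_lns` is exactly the inherent conversion error described above.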
Residue Number System (RNS)
• An integer number system whose outstanding inherent property is carry-free addition, subtraction and multiplication
• Numbers of any word length can be added, subtracted or multiplied in roughly the same time
• Unfortunately, other arithmetic operations such as division, comparison and sign detection are very complex and slow
• Fractional representation is very awkward
• It is therefore not seriously considered for general-purpose use
• However, for some special-purpose applications, such as many types of digital filters, the number of additions and multiplications is substantially greater than the number of invocations of magnitude comparison, overflow detection, division and the like
• For those applications it is a very interesting representation
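A small model shows the carry-free property: each residue channel is independent, so addition and multiplication never propagate information between channels. The moduli below are an illustrative choice (their product, 1001, sets the representable range); reconstruction uses the Chinese remainder theorem:

```python
from math import prod

MODULI = (7, 11, 13)          # pairwise coprime example set; range = 7*11*13 = 1001

def to_rns(x):
    return tuple(x % m for m in MODULI)

def add_rns(a, b):
    # Each channel adds independently: no carries between channels
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def mul_rns(a, b):
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(r):
    """Chinese remainder reconstruction (the slow, complex direction)."""
    M = prod(MODULI)
    total = 0
    for x, m in zip(r, MODULI):
        Mi = M // m
        total += x * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return total % M
```

Comparing two RNS values, by contrast, requires reconstructing them first, which is exactly why comparison and sign detection are slow in this system.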
Adder
• To reduce carry delays: carry-skip, carry-lookahead, conditional-sum and carry-select adders can speed up addition and can be used with older-generation FPGA families that do not provide internal fast carry logic
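As a sketch of one of the listed techniques, a carry-select adder computes each block's sum twice — once per possible carry-in — and then picks the right one, so the blocks work in parallel. The 16-bit width and 4-bit block size here are illustrative assumptions:

```python
def carry_select_add(a, b, width=16, block=4):
    """Carry-select addition: per block, precompute sums for carry-in 0 and 1."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for i in range(0, width, block):
        x, y = (a >> i) & mask, (b >> i) & mask
        s0 = x + y          # candidate sum assuming carry-in = 0
        s1 = x + y + 1      # candidate sum assuming carry-in = 1 (computed in parallel)
        s = s1 if carry else s0   # the arriving carry selects, instead of rippling through
        result |= (s & mask) << i
        carry = s >> block
    return result, carry
```

In hardware the two candidate sums exist simultaneously and the carry only drives a multiplexer per block, shortening the critical path compared with a plain ripple adder.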
Pipelined Adders
Modulo Adders
Multiplier
Distributed arithmetic isn't magic. Let's demystify it:
• Distributed arithmetic is a bit-level rearrangement of a multiply-accumulate that hides the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate that is well suited to FPGA designs. It can also be extended to other sum functions such as complex multiplies, Fourier transforms and so on.
Distributed arithmetic
• In most of the multiply accumulate applications in signal processing, one of the multiplicands for each
product is a constant. Usually each multiplication uses a different constant.
• Using our most compact multiplier, the scaling accumulator, we can construct a multiple product term
parallel multiply-accumulate function in a relatively small space if we are willing to accept a serial input. In
this case, we feed four parallel scaling accumulators with unique serialized data. Each multiplies that data
by a possibly unique constant, and the resulting products are summed in an adder tree as shown below.

• If we stop to consider that the scaling accumulator multiplier is really just a sum of vectors, then it
becomes obvious that we can rearrange the circuit.
Distributed arithmetic
• Here, the adder tree combines the 1 bit partial products before they are accumulated by the scaling
accumulator. All we have done is rearranged the order in which the 1xN partial products are summed. Now
instead of individually accumulating each partial product and then summing the results, we postpone the
accumulate function until after we’ve summed all the 1xN partials at a particular bit time. This simple
rearrangement of the order of the adds has effectively replaced N multiplies followed by an N input add
with a series of N input adds followed by a multiply. This arithmetic manipulation directly eliminates N-1
Adders in an N product term multiply-accumulate function. For larger numbers of product terms, the
savings becomes significant.
Distributed arithmetic
• Further hardware savings are available when the coefficients Cn are constants. If that is true, then the adder tree shown above becomes a boolean logic function of the 4 serial inputs. The combined 1xN products and adder tree are reduced to a four-input look-up table. The sixteen entries in the table are sums of the constant coefficients for all the possible serial input combinations. The table is made wide enough to accommodate the largest sum without overflow. Negative table values are sign extended to the width of the table, and the input to the scaling accumulator should be sign extended to maintain negative sums.
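The LUT construction and the bit-serial shift-and-add can be modeled directly; the four coefficients and the 8-bit two's complement word length below are assumptions chosen for illustration:

```python
def da_dot(xs, coefs, bits=8):
    """Bit-serial distributed arithmetic: dot(xs, coefs) for two's complement xs.

    The LUT holds, for every combination of one bit taken from each serial
    input, the sum of the corresponding constant coefficients.
    """
    lut = [sum(c for j, c in enumerate(coefs) if (addr >> j) & 1)
           for addr in range(1 << len(coefs))]
    acc = 0
    for k in range(bits):
        # Gather bit k of every input to form the LUT address
        addr = 0
        for j, x in enumerate(xs):
            addr |= ((x >> k) & 1) << j
        if k == bits - 1:
            acc -= lut[addr] << k   # MSB has negative weight in two's complement
        else:
            acc += lut[addr] << k   # scaling accumulator: shift, then add
    return acc
```

For example, `da_dot([3, -2, 5, 1], [3, -5, 7, 2])` equals the directly computed dot product 56 — no multiplier was ever invoked, only table look-ups, shifts and adds.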
Distributed arithmetic
• Obviously the serial inputs limit the performance of such a circuit.  As with most hardware applications, we
can obtain more performance by using more hardware.  In this case, more than one bit sum can be
computed at a time by duplicating the LUT and adder tree as shown here. The second bit computed will
have a different weight than the first, so some shifting is required before the bit sums are combined. In this
2 bit at a time implementation, the odd bits are fed to one LUT and adder tree, while the even bits are
simultaneously fed to an identical tree. The odd bit partials are left shifted to properly weight the result and
added to the even partials before accumulating the aggregate. Since two bits are taken at a time, the
scaling accumulator has to shift the feedback by 2 places.
Distributed arithmetic
• This paralleling scheme can be extended to compute more than two bits at a time. In the extreme case, all input bits can be computed in parallel and then combined in a shifting adder tree. No scaling accumulator is needed in this case, since the output from the adder tree is the entire sum of products. This fully parallel implementation has a data rate that matches the serial clock, which can be greater than 100 MS/s in today's FPGAs.
Distributed arithmetic
• Most often, we have more than 4 product terms to accumulate. Increasing the size of the LUT might look attractive until you consider that the LUT size grows exponentially. Considering the construction of the logic we stuffed into the LUT, it becomes obvious that we can combine the results from several LUTs in an adder tree. The area of the circuit grows by roughly 2n-1 using adder trees to expand it, rather than the 2^n growth experienced by increasing the LUT size. For FPGAs, the most efficient use of the logic occurs when we use the natural LUT size (usually a 4-LUT, although an 8-LUT would make sense if we were using an 8-input block RAM) and then add the outputs of the LUTs together in an adder tree, as shown below.
Coordinate Rotation Digital Computer (CORDIC)
CORDIC is an iterative algorithm for calculating trig functions including sine, cosine, magnitude and phase. It is particularly suited to hardware implementations because it does not require any multiplies.

• Birth of CORDIC
CORDIC was introduced by Volder in 1959 to calculate trigonometric values like sine, cosine, etc. In 1971, Walther extended this algorithm to calculate hyperbolic, logarithmic and other functions.

• Advantages
This algorithm uses only minimal hardware (adders and shifters) to compute all trigonometric and other function values. It consumes fewer resources than other techniques, so performance is high. Thus, almost all scientific calculators use the CORDIC algorithm in their calculations.
CORDIC Principle
• Principle

CORDIC works by rotating the coordinate system through a sequence of fixed angles until the remaining angle is reduced to zero; only addition, subtraction and shifts are needed to calculate the function values. Now let us see how we can calculate sine and cosine values using CORDIC. Consider a vector C with coordinates (X, Y) that is to be rotated through an angle σ. The new coordinates (X′, Y′) after rotation are

X′ = X cos(σ) − Y sin(σ)
Y′ = Y cos(σ) + X sin(σ)

These equations can be written in tangent form as

X′ = cos(σ) × (X − Y tan(σ))
Y′ = cos(σ) × (Y + X tan(σ))

The angle is broken into smaller and smaller pieces such that the tangent of each piece is always a power of 2, i.e. tan(σ_i) = 1/2^i, so the multiplications by tan(σ_i) become shifts. The pre-calculated angles arctan(1/2^i) are accumulated toward the total angle, and the above equations become the iteration

X(i+1) = t(i) × (X(i) − Y(i)/2^i)
Y(i+1) = t(i) × (Y(i) + X(i)/2^i)

where t(i) = cos(arctan(1/2^i)) and i varies from 0 to n. The running product of the t(i) factors converges to a constant after the first few iterations, so it is better to pre-calculate this constant for a large value of n as

T = cos(arctan(1/2^0)) × cos(arctan(1/2^1)) × ... × cos(arctan(1/2^n))

The value of T converges to 0.60725293500888. We can use any precision for T, but the accuracy of the calculated sine and cosine depends on the precision used, so it is recommended to use at least 6 decimal digits in your calculation. When we program, the values of the angles arctan(1/2^i) can be pre-calculated and stored in an array; these stored values are used in the iterations, avoiding their calculation at run time.
CORDIC algorithm
• The steps for the CORDIC algorithm are:

1. Get the angle and store it in Angle. Store the pre-calculated arctan(1/2^i) values in an array
2. Assign X = 0.607252935 (i.e., X = T), Y = 0, i = 0
3. Compute the shifted values X′ = X/2^i and Y′ = Y/2^i
4. If the sign of Angle is positive then
   X = X − Y′
   Y = Y + X′
   Angle = Angle − arctan(1/2^i)
   else (if the sign of Angle is negative)
   X = X + Y′
   Y = Y − X′
   Angle = Angle + arctan(1/2^i)
5. Increment i and repeat steps (3) and (4) until Angle approaches 0
6. Print "Value of cos =" X
7. Print "Value of sin =" Y
8. Exit
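The steps above can be sketched as a floating-point model; the iteration count n = 32 is an assumption, and the multiplications by 2^-i stand in for the hardware's right shifts:

```python
import math

def cordic(angle, n=32):
    """Return (cos(angle), sin(angle)) for angle in radians.

    Valid within the basic convergence range |angle| <= sum(arctan(2^-i)),
    roughly 1.74 rad.
    """
    # Pre-computed tables, as in step 1 of the algorithm
    atans = [math.atan(2.0 ** -i) for i in range(n)]
    T = 1.0
    for a in atans:
        T *= math.cos(a)            # converges to 0.60725293500888...

    x, y, z = T, 0.0, angle         # step 2: start at (T, 0)
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0  # rotate toward zero remaining angle
        # In hardware, * 2**-i is a wired right shift by i places
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return x, y
```

Note that the simultaneous update of x and y (the tuple assignment) matters: both new values are formed from the old pair, just as steps 3 and 4 use X′ and Y′ computed before either register changes.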
Scaling Accumulator Multipliers
• Parallel by serial algorithm
• Iterative shift-add routine
• N clock cycles to complete
• Very compact design
• Serial input can be MSB or LSB first depending on direction of shift in accumulator
• Parallel output

A scaling accumulator multiplier performs multiplication using an iterative shift-add routine. One input is presented in bit-parallel form while the other is in bit-serial form. Each bit in the serial input multiplies the parallel input by either 0 or 1. The parallel input is held constant while each bit of the serial input is presented. Note that the one-bit multiplication either passes the parallel input unchanged or substitutes zero. The result from each bit is added to an accumulated sum. That sum is shifted one bit before the result of the next bit multiplication is added to it.

Example: 1011001 (89) multiplied by the serial input 1101 (13), LSB first:

1        1011001
0       0000000
1      1011001
1    +1011001
     10010000101      (89 × 13 = 1157)
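The shift-add routine can be modeled directly; each loop iteration corresponds to one clock cycle of the serial input:

```python
def scaling_accumulator_mult(parallel, serial, bits):
    """Multiply by presenting `serial` one bit per clock, LSB first."""
    acc = 0
    for k in range(bits):
        bit = (serial >> k) & 1
        partial = parallel if bit else 0   # 1-bit multiply: pass through or zero
        acc += partial << k                # accumulator shifts between bit times
    return acc
```

Note that N clocks are needed for an N-bit serial input, which is the compactness-for-throughput trade the bullet list describes.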
Serial by Parallel Booth Multipliers
• Bit serial adds eliminate need for carry chain
• Well suited for FPGAs without fast carry logic
• Serial input LSB first
• Serial output
• Routing is all nearest neighbor except serial input which is broadcast
• Latency is one bit time

The simple serial by parallel Booth multiplier is particularly well suited for bit-serial processors implemented in FPGAs without carry chains, because all of its routing is to nearest neighbors with the exception of the input. The serial input must be sign extended to a length equal to the sum of the lengths of the serial input and parallel input to avoid overflow, which means this multiplier takes more clocks to complete than the scaling accumulator version. This is the structure used in the venerable TTL serial by parallel multiplier.
Ripple Carry Array Multipliers
• Row ripple form
• Unrolled shift-add algorithm
• Delay is proportional to N

A ripple carry array multiplier (also called row ripple form) is an unrolled embodiment of the classic shift-add multiplication algorithm. The illustration shows the adder structure used to combine all the bit products in a 4x4 multiplier. The bit products are the logical AND of the bits from each input. They are shown in the form x,y in the drawing. The maximum delay is the path from either LSB input to the MSB of the product, and it is the same (ignoring routing delays) regardless of the path taken. The delay is approximately 2*n.

This basic structure is simple to implement in FPGAs, but does not make efficient use of the logic in many FPGAs, and is therefore larger and slower than other implementations.
Row Adder Tree Multipliers
• Optimized row ripple form
• Fundamentally same gate count as row ripple form
• Row adders arranged in tree to reduce delay
• Routing more difficult, but workable in most FPGAs
• Delay proportional to log2(N)

Row adder tree multipliers rearrange the adders of the row ripple multiplier to equalize the number of adders the results from each partial product must pass through. The result uses the same number of adders, but the worst-case path is through log2(n) adders instead of through n adders. In strictly combinatorial multipliers, this reduces the delay. For pipelined multipliers, the clock latency is reduced. The tree structure of the routing means some of the individual wires are longer than in the row ripple form. As a result, a pipelined row ripple multiplier can have a higher throughput in an FPGA (shorter clock cycle) even though the latency is increased.
Carry Save Array Multipliers
• Column ripple form
• Fundamentally same delay and gate count as row ripple form
• Gate level speed ups available for ASICs
• Ripple adder can be replaced with faster carry tree adder
• Regular routing pattern
Look-Up Table (LUT) Multipliers
• Complete times table of all possible input combinations
• One address bit for each bit in each input
• Table size grows exponentially
• Very limited use
• Fast - result is just a memory access away
Look-Up Table multipliers are simply a block of memory containing a complete multiplication table of all possible input combinations. The large table sizes needed for even modest input widths make these impractical for FPGAs. The following table is the contents of a 6-input LUT for a 3-bit by 3-bit multiplication table.

000 001 010 011 100 101 110 111


000 000000 000000 000000 000000 000000 000000 000000 000000
001 000000 000001 000010 000011 000100 000101 000110 000111
010 000000 000010 000100 000110 001000 001010 001100 001110
011 000000 000011 000110 001001 001100 001111 010010 010101
100 000000 000100 001000 001100 010000 010100 011000 011100
101 000000 000101 001010 001111 010100 011001 011110 100011
110 000000 000110 001100 010010 011000 011110 100100 101010
111 000000 000111 001110 010101 011100 100011 101010 110001
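The table above can be generated programmatically and spot-checked against its entries; the address here is taken as the concatenation of the two 3-bit inputs:

```python
# 64-entry times table: address = {a2 a1 a0 b2 b1 b0}
lut = [(addr >> 3) * (addr & 0b111) for addr in range(64)]

print(lut[0b011_011])   # row 011 x column 011: 9, i.e. 001001 in the table
print(lut[0b111_111])   # row 111 x column 111: 49, i.e. 110001
```

One memory access yields the full product, which is why the result is "just a memory access away" — but a 2n-bit address for two n-bit inputs means the table doubles in size for every added input bit.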
Partial Product LUT Multipliers
• Works like longhand multiplication
• LUT used to obtain products of digits
• Partial products combined with adder tree

Partial Products LUT multipliers use partial product techniques similar to those used in longhand multiplication (like you learned in 3rd grade) to extend the usefulness of LUT multiplication. Consider the longhand multiplication:

      67
   x  54
   -----
      28      (4 × 7)
     240      (4 × 60)
     350      (50 × 7)
   +3000      (50 × 60)
   -----
    3618
Partial Product LUT Multipliers
• By performing the multiplication one digit at a time and then shifting and summing the individual partial products, the size of the memorized times table is greatly reduced. While this example is decimal, the technique works for any radix. The order in which the partial products are obtained or summed is not important. The proper weighting by shifting must be maintained, however.
• The example below shows how this technique is applied in hardware to obtain a 6x6 multiplier using the 3x3 LUT multiplier shown above. The LUT (which performs multiplication of a pair of octal digits) is duplicated so that all of the partial products are obtained simultaneously. The partial products are then shifted as needed and summed together. An adder tree is used to obtain the sum with minimum delay.
• The LUT could be replaced by any other multiplier implementation, since the LUT is being used as a multiplier. This gives insight into how to combine multipliers of an arbitrary size to obtain a larger multiplier.
• The LUT multipliers shown have matched radices (both inputs are octal). The partial products can also have mixed radices on the inputs, provided care is taken to make sure the partial products are shifted properly before summing. Where the partial products are obtained with small LUTs, the most efficient implementation occurs when the LUT is square (i.e., the input radices are the same). For 8-bit LUTs, such as might be found in an Altera 10K FPGA, this means the LUT radix is hexadecimal. For 4-bit LUTs, found in many FPGA logic cells, the ideal radix is 2 bits (this is really the only option for a 4-LUT: a 1-bit input reduces the LUT to an AND gate, and since each LUT cell has 1 output, it can only use one bit on the other input).
• A more compact but slower version is possible by computing the partial products sequentially using one LUT and accumulating the results in a scaling accumulator. Note that in this case, the shifter would need a special control to obtain the proper shift on all the partials.
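The 6x6-from-3x3 construction described above can be sketched as follows — four copies of the same octal-digit LUT, with the partial products shifted to their weights and summed in a tree:

```python
def mult6(a, b):
    """6x6 multiply built from four 3x3 LUT partial products plus an adder tree."""
    lut = [(addr >> 3) * (addr & 0b111) for addr in range(64)]
    ah, al = a >> 3, a & 0b111    # split each 6-bit input into two octal digits
    bh, bl = b >> 3, b & 0b111
    # Four partial products; the two middle ones share the same weight (2^3)
    return ((lut[(ah << 3) | bh] << 6)
            + ((lut[(ah << 3) | bl] + lut[(al << 3) | bh]) << 3)
            + lut[(al << 3) | bl])
```

This is exactly longhand multiplication in radix 8: digit products from the memorized table, weighted by shifts, then added.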
Computed Partial Product Multipliers
• Partial product optimization for FPGAs having small LUTs
• Fewer partial products decrease depth of adder tree
• 2 x n bit partial products generated by logic rather than LUT
• Smaller and faster than 4-LUT partial product multipliers

A partial product multiplier constructed from the 4-LUTs found in many FPGAs is not very efficient because of the large number of partial products that need to be summed (and the large number of LUTs required). A more efficient multiplier can be made by recognizing that a 2-bit input to a multiplier produces a product 0, 1, 2 or 3 times the other input. All four of these products are easily generated in one step using just an adder and a shifter. A multiplexer controlled by the 2-bit multiplicand selects the appropriate product as shown below. Unlike the LUT solution, there is no restriction on the width of the A input to the partial product. This structure greatly reduces the number of partial products and the depth of the adder tree. Since the 0x, 1x, 2x and 3x inputs to the multiplexers for all the partial products are the same, one adder can be shared by all the partial product generators. This structure works well in coarser-grained FPGAs like the Xilinx 4K series.

2 x n bit partial product generated with adder and multiplexer


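The shared-adder-plus-multiplexer scheme can be modeled like this; the 0x/1x/2x/3x candidates are formed once (3x by the single shared adder) and each 2-bit slice of the multiplier just selects one:

```python
def cpp_mult(a, b, bits=8):
    """Multiply using 2-bit computed partial products (0x/1x/2x/3x mux)."""
    x1 = a
    x2 = a << 1          # 2x: a wired shift
    x3 = x2 + x1         # 3x: the one adder shared by all partial product generators
    options = (0, x1, x2, x3)
    acc = 0
    for k in range(0, bits, 2):
        # Each 2-bit slice of b drives a multiplexer selecting its partial product
        acc += options[(b >> k) & 0b11] << k
    return acc
```

Halving the number of partial products (one per 2 bits instead of one per bit) also halves the depth of the adder tree that must sum them.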
Computed Partial Product Multipliers
• The Xilinx Virtex device includes an extra gate in the carry chain logic that allows a 4-input LUT plus the carry chain to perform a 2xN partial product, thereby achieving twice the density otherwise attainable. In this case, the adder (consisting of the XOR gates and muxes in the carry chain) adds a pair of 1xN partial products obtained with AND gates. The extra AND gate in the carry logic allows you to put AND gates on both inputs of the adder while maintaining a 4-input function.

2 x n bit computed partial product implemented in Xilinx Virtex using special MULT AND gate in carry chain logic
Constant Coefficient Multipliers
• Multiplies input by a constant
• LUT contains a custom times table
• Width of the constant does not affect the depth of the adder tree
• All LUT inputs are available for the multiplicand input
• More efficient than a full multiplier
• Size is constant regardless of the value of the constant (assuming equal constant bit widths)

A full multiplier accepts the full range of inputs for each multiplicand. If one of the multiplicands is a constant, then it is far more efficient to construct a times table that only has the column corresponding to the constant value. These are known as constant (K) coefficient multipliers, or KCMs. The example below multiplies a 5-bit input (values 0 to 31) by the constant 67. Note that with a constant multiplier, all of the LUT inputs are available for the variable multiplicand. This makes the KCM more efficient than a full multiplier (fewer partial products for a given width).

When the LUT does not offer enough inputs to accommodate the desired variable width, several identical LUTs may be combined using the partial products techniques discussed above. In this case, the constant multiplicand is full width, so the partial products will be m x n, where m is the number of LUT inputs and n is the width of the constant.

5 bit input × 67 (rows: lower 3 input bits; columns: upper 2 input bits)

        00    01    10    11
000      0   536  1072  1608
001     67   603  1139  1675
010    134   670  1206  1742
011    201   737  1273  1809
100    268   804  1340  1876
101    335   871  1407  1943
110    402   938  1474  2010
111    469  1005  1541  2077
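Building the KCM table and extending it to a wider input with partial products can be sketched as follows (the 10-bit extension reuses the same 32-entry table twice):

```python
K = 67                                # the constant from the example table
lut = [n * K for n in range(32)]      # 32-entry custom times table for a 5-bit input

def kcm5(x):
    """5-bit input: the whole product is one table look-up."""
    return lut[x]

def kcm10(x):
    """10-bit input: two copies of the same LUT, combined as partial products."""
    return lut[x & 0b11111] + (lut[x >> 5] << 5)
```

For example, `kcm5(0b01_001)` reads row 001, column 01 of the table above (603 = 9 × 67), and `kcm10(1000)` gives 67000.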
Limited Set LUT Multipliers
• Multiplies input by one of a small set of constants
• Similar to KCM multiplier
• LUT input bit(s) select which constant to use
• Useful in modulators and other signal processing applications

In signal processing, there are often instances where one multiplicand is taken from a small set of constant values. In these cases, the KCM multiplier can be extended so that the LUT contains the times tables for each constant. One or more of the LUT inputs select which constant is used, while the remaining inputs are for the variable multiplicand. The example below is a 6-input LUT containing times tables for the constants 67 and 85. One bit of the input selects which times table is used. The remaining inputs are the 5-bit variable multiplicand (values from 0 to 31). Again, the input width can be extended using the partial product techniques discussed previously.

5 bit input × 67 (select bit = 0; rows: lower 3 input bits; columns: upper 2 input bits)

        00    01    10    11
000      0   536  1072  1608
001     67   603  1139  1675
010    134   670  1206  1742
011    201   737  1273  1809
100    268   804  1340  1876
101    335   871  1407  1943
110    402   938  1474  2010
111    469  1005  1541  2077

5 bit input × 85 (select bit = 1)

        00    01    10    11
000      0   680  1360  2040
001     85   765  1445  2125
010    170   850  1530  2210
011    255   935  1615  2295
100    340  1020  1700  2380
101    425  1105  1785  2465
110    510  1190  1870  2550
111    595  1275  1955  2635
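The 6-input LUT with a select bit can be generated like this; the top address bit picks the constant, the low five bits are the variable multiplicand:

```python
CONSTANTS = (67, 85)   # the two times tables from the example

# 64-entry LUT: address = {select, x4 x3 x2 x1 x0}
lut = [CONSTANTS[addr >> 5] * (addr & 0b11111) for addr in range(64)]

def limited_set_mult(select, x):
    """One input bit chooses the constant; the rest address its times table."""
    return lut[(select << 5) | x]
```

For example, `limited_set_mult(0, 9)` gives 603 (9 × 67) and `limited_set_mult(1, 31)` gives 2635 (31 × 85), matching the tables above.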
Constant Multipliers from Adders
• Adder for each '1' bit in constant
• Subtractor replaces strings of '1' bits using Booth recoding
• Efficiency and size depend on the value of the constant
• KCM multipliers are usually more efficient for arbitrary constant values

The shift-add multiply algorithm essentially produces m 1xn partial products and sums them together with appropriate shifting. The partial products corresponding to '0' bits in the 1-bit input are zero, and therefore do not have to be included in the sum. If the number of '1' bits in a constant coefficient multiplier is small, then a constant multiplier may be realized with wired shifts and a few adders, as shown in the 'times 10' example below (89 × 10 = 890):

0       0000000
1      1011001
0     0000000
1   +1011001
    1101111010

In cases where there are strings of '1' bits in the constant, adders can be eliminated by using Booth recoding methods with subtractors. The 'times 14' example below illustrates this technique (89 × 14 = 1246). Note that 14 = 8 + 4 + 2, which needs three partial products, can be expressed as 14 = 16 − 2, which needs only two:

 0        0000000
-1       1011001      (subtract 89 × 2)
 0      0000000
 0     0000000
+1   1011001          (add 89 × 16)
     10011011110
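Both examples reduce to a couple of wired shifts and one add or subtract:

```python
def times10(x):
    return (x << 3) + (x << 1)   # one adder per '1' bit: 10 = 8 + 2

def times14(x):
    # Booth recoding: 14 = 16 - 2, one subtractor instead of two adders (14 = 8+4+2)
    return (x << 4) - (x << 1)
```

The shifts cost nothing in hardware (they are just wiring), so the entire multiplier is the single adder or subtractor.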
Constant Multipliers from Adders
• Combinations of partial products can sometimes also be shifted and added in order to reduce the number of partials, although this may not necessarily reduce the depth of a tree. For example, the 'times 1/3' approximation (85/256 = 0.332) below uses fewer adders than would be necessary if all the partial products were summed directly. Note that the shifts are in the opposite direction to obtain the fractional partial products.

• Clearly, the complexity of a constant multiplier constructed from adders is dependent upon the constant. For an arbitrary constant, the KCM multiplier discussed above is a better choice. For certain 'quick and dirty' scaling applications, this multiplier works nicely.
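The 'times 1/3' approximation can be sketched by factoring 85 = 5 × 17, so one partial product (5x) is reused instead of summing four independent partials (85 = 64 + 16 + 4 + 1):

```python
def approx_third(x):
    x5 = x + (x << 2)       # 5x   (binary 101)
    x85 = x5 + (x5 << 4)    # 85x, reusing the 5x partial (85 = 5 * 17)
    return x85 >> 8         # 85/256 = 0.332, roughly 1/3; the final shift is fractional weighting
```

Only two adders are needed. The result is an approximation: `approx_third(3000)` gives 996 rather than 1000, reflecting the 0.332 scale factor.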
Wallace Trees
• Optimized column adder tree
• Combines all partial products into 2 vectors (carry and sum)
• Carry and sum outputs combined using a conventional adder
• Delay is log(n)
• Wallace tree multiplier uses a Wallace tree to combine 1 x n partial products
• Irregular routing
• Not optimum in many FPGAs

A Wallace tree is an implementation of an adder tree designed for minimum propagation delay. Rather than completely adding the partial products in pairs like the ripple adder tree does, the Wallace tree sums up all the bits of the same weights in a merged tree. Usually full adders are used, so that 3 equally weighted bits are combined to produce two bits: one (the carry) with weight n+1 and the other (the sum) with weight n. Each layer of the tree therefore reduces the number of vectors by a factor of 3:2 (another popular scheme obtains a 4:2 reduction using a different adder style that adds little delay in an ASIC implementation). The tree has as many layers as is necessary to reduce the number of vectors to two (a carry and a sum). A conventional adder is used to combine these to obtain the final product. The structure of the tree is shown below. The red numbers after each full adder in the illustration indicate the bit weights of each signal. For a multiplier, this tree is pruned because the input partial products are shifted by varying amounts.
Wallace Trees

A section of an 8-input Wallace tree. The Wallace tree combines the 8 partial product inputs into two output vectors corresponding to a sum and a carry. A conventional adder is used to combine these outputs to obtain the complete product.
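The 3:2 reduction can be modeled on whole bit vectors: a carry-save adder is just a per-bit full adder whose carries form a second output vector. This sketch reduces any number of vectors to a sum/carry pair, then does the one conventional add:

```python
def csa(a, b, c):
    """Carry-save (3:2) adder on integer bit vectors: a + b + c == s + cy."""
    s = a ^ b ^ c                              # per-bit sum
    cy = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, weighted one place up
    return s, cy

def wallace_sum(vectors):
    vectors = list(vectors)
    while len(vectors) > 2:            # each layer reduces 3 vectors to 2
        nxt, i = [], 0
        while i + 3 <= len(vectors):
            nxt += list(csa(vectors[i], vectors[i + 1], vectors[i + 2]))
            i += 3
        nxt += vectors[i:]             # leftover vectors pass straight to the next layer
        vectors = nxt
    return sum(vectors)                # final conventional add of sum and carry
```

Feeding it the 1xN partial products of, say, 89 × 13 reproduces the product; note that no carry ever propagates inside the tree — propagation is deferred entirely to the final add.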
Wallace Trees
• If you trace the bits in the tree (the tree in the illustration is color coded to help in this regard), you will find that the Wallace tree is a tree of carry-save adders arranged as shown to the left. A carry-save adder consists of full adders like the more familiar ripple adders, but the carry output from each bit is brought out to form a second result vector rather than being wired to the next most significant bit. The carry vector is 'saved' to be combined with the sum later, hence the carry-save moniker.
• A Wallace tree multiplier is one that uses a Wallace tree to combine the partial products from a field of 1 x n multipliers (made of AND gates). It turns out that the number of carry-save adders in a Wallace tree multiplier is exactly the same as used in the carry-save version of the array multiplier. The Wallace tree rearranges the wiring, however, so that the partial product bits with the longest delays are wired closer to the root of the tree. This improves the delay characteristic at no gate cost. Unfortunately, the nice regular routing of the array multiplier is also replaced with a ratsnest.
Wallace Trees
• A Wallace tree by itself offers no performance advantage over a ripple adder tree
• To the casual observer, it may appear that the propagation delay through a ripple adder tree is the carry propagation multiplied by the number of levels, or O(n*log(n)). In fact, the ripple adder tree delay is really only O(n + log(n)) because the delays through the adders' carry chains overlap. This becomes obvious if you consider that the value of a bit can only affect bits of the same or higher significance further down the tree. The worst-case delay is then from the LSB input to the MSB output (and disregarding routing delays, it is the same no matter which path is taken). The depth of the ripple tree is log(n), which is about the same as the depth of the Wallace tree. This means that the ripple carry adder tree's delay characteristic is similar to that of a Wallace tree followed by a ripple adder! If an adder with a faster carry tree scheme is used to sum the Wallace tree outputs, the result is faster than a ripple adder tree. The fast carry tree schemes use more gates than the equivalent ripple carry structure, so the Wallace tree normally winds up being faster than a ripple adder tree, and less logic than an adder tree constructed of fast carry tree adders. An ASIC implementation may also benefit from some 'go faster' tricks in carry-save adders, so a Wallace tree combined with a fast adder can offer a significant advantage there.
• A Wallace tree is often slower than a ripple adder tree in an FPGA
• Many FPGAs have a highly optimized ripple carry chain connection. Regular logic connections are several times slower than the optimized carry chain, making it nearly impossible to improve on the performance of the ripple carry adders for reasonable data widths (at least 16 bits). Even in FPGAs without optimized carry chains, the delays caused by the complex routing can overshadow any gains attributed to the Wallace tree structure. For this reason, Wallace trees do not provide any advantage over ripple adder trees in many FPGAs. In fact, due to the irregular routing, they may actually be slower and are certainly more difficult to route.
Booth Recoding
• Booth recoding is a method of reducing the number of partial products to be summed. Booth observed that when strings of '1' bits occur in the multiplicand, the number of partial products can be reduced by using subtraction. For example, the multiplication of 89 by 15 shown below has four 1xn partial products that must be summed. Since 15 = 16 − 1, it is equivalent to the recoded version with one subtraction and one addition.

Direct (15 = 01111, four '1' bits):

1       1011001
1      1011001
1     1011001
1    1011001
0  +0000000
   10100110111

Recoded (15 = 16 − 1):

-1       1011001      (subtract 89 × 1)
 0      0000000
 0     0000000
 0    0000000
+1  1011001           (add 89 × 16)
    10100110111
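Radix-2 Booth recoding can be sketched as follows; the digit at position k is b(k-1) − b(k), which turns every run of '1' bits into a single −1 at its start and a single +1 just past its end (this sketch assumes a nonnegative multiplier):

```python
def booth_recode(n):
    """Radix-2 Booth digits {-1, 0, 1}, LSB first, for nonnegative n."""
    digits, prev = [], 0
    while n or prev:
        bit = n & 1
        digits.append(prev - bit)   # b(k-1) - b(k)
        prev, n = bit, n >> 1
    return digits

def booth_mult(a, b):
    """Sum the shifted partial products selected by the Booth digits."""
    return sum(d * (a << k) for k, d in enumerate(booth_recode(b)))
```

For 15 the digits come out as [-1, 0, 0, 0, 1] (LSB first), i.e. 16 − 1: exactly the two-partial-product recoding in the example above.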
