Parallel Square and Cube Computations
Article in Circuits, Systems and Computers, 1977. Conference Record. 1977 11th Asilomar Conference on · February 1970
DOI: 10.1109/ACSSC.2000.911207 · Source: CiteSeer
2 authors, including:
Michael J. Flynn
Stanford University
All content following this page was uploaded by Michael J. Flynn on 20 November 2014.
$$a^2 = \sum_{i=0}^{n-1} a_i 2^{2i} + \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} a_i a_j 2^{i+j+1} \qquad (1)$$

The height of the squaring unit partial product array can be expressed as:

$$PPA_{height} = \left\lfloor \frac{n}{2} \right\rfloor + 1 \approx \frac{n}{2} \qquad (2)$$

The number of input bits for the squaring unit partial product array is:

$$PPA_{bits} = \sum_{i=1}^{n} i = \frac{n^2 + n}{2} \qquad (3)$$

[...] multiplier.

The interconnect delay and number of routing channels required for the squaring unit are less than the interconnect delay and number of routing channels for a multiplier, since only one operand needs to be distributed throughout the partial product array.

When comparing the partial product array from the squaring unit to a Booth-2 multiplier, the PPA height and number of input bits are comparable. However, the squaring unit does not require the logic and delay for the Booth recoding nor the multiplicand multiple selection. Therefore, the [...]
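The three squaring-unit expressions above can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: it builds the folded array of equation 1 term by term, then measures its tallest column and its term count for comparison with equations 2 and 3. The function names (`square_ppa`, `evaluate`, `ppa_height`, `ppa_bits`) are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def square_ppa(n):
    """Folded squaring array of equation 1: column weight -> list of AND terms.

    Each term is a tuple of operand bit indices that are ANDed together.
    """
    columns = defaultdict(list)
    for i in range(n):
        columns[2 * i].append((i,))            # a_i * 2^(2i)
    for i in range(n - 1):
        for j in range(i + 1, n):
            columns[i + j + 1].append((i, j))  # a_i a_j * 2^(i+j+1)
    return columns


def evaluate(columns, a):
    """Add up every term of the array for the operand value a."""
    total = 0
    for weight, terms in columns.items():
        for term in terms:
            if all((a >> idx) & 1 for idx in term):
                total += 1 << weight
    return total


def ppa_height(columns):
    """Number of terms in the tallest column."""
    return max(len(terms) for terms in columns.values())


def ppa_bits(columns):
    """Total number of terms in the array."""
    return sum(len(terms) for terms in columns.values())


for n in (4, 8):
    cols = square_ppa(n)
    # Exhaustive check of equation 1 for every n-bit operand.
    assert all(evaluate(cols, a) == a * a for a in range(1 << n))
    print(n, ppa_height(cols), n // 2 + 1, ppa_bits(cols), (n * n + n) // 2)
```

Each printed row lists the operand length, the measured height next to the value from equation 2, and the measured term count next to the value from equation 3.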
[Figure: Relative Area and Delay; series: PPA Delay, PPA Area]

3.1 Partial Product Array

To compute a precise cube of an n-bit operand using two serial multiplications requires an $n \times n$ bit multiplication followed by a $2n \times n$ bit multiplication. The cube of an operand can be computed in parallel similar to a multiply. The middle portion of figure 3 shows the expanded parallel partial product array used to compute the cube of a 4-bit operand. There are $n^3$ bits in the non-reduced cubing unit partial product array.

The boxes in figure 3 identify three reduction techniques that can be applied to the cubing unit partial product array. These reduction techniques eliminate terms from the cube partial product array and reduce the overall height of the partial product array. Therefore, the latency and hardware needed to sum the partial products are significantly reduced.

The first reduction technique is performed on the partial product terms that include three identical bits such as $a_0 a_0 a_0$. These terms can be replaced by single-bit terms such as $a_0$. For an n-bit operand there are n three-bit terms that can be replaced by n single-bit terms.

The second reduction technique can be applied to the partial product terms that include two identical bits. Box 2 in figure 3 indicates the three terms with two $a_0$-bits and one $a_1$-bit. The three terms in box 2 can be replaced by one $a_0 a_1$ term with a weighting of 3.

The third reduction technique can be applied to the partial product terms that have three different input bits. Box 3 in figure 3 indicates six boolean terms that each have one $a_0$-bit, one $a_1$-bit, and one $a_2$-bit. The six terms in box 3 [...]

Figure 3. Cubing Unit PPA Reduction
[Reduced array for a 4-bit operand; each row gives the box group, its weighting, and the remaining terms]
1  1X:  a3  -  -  a2  -  -  a1  -  -  a0
2  3X:  a3a2  a3a1  a3a0  a3a1  a2a0  a3a0  a2a0  a1a0
        a3a2  a2a1  a2a1  a1a0
3  3X:  a3a2a1  a3a2a0  a3a1a0  a2a1a0

The hardware required to compute the reduced parallel cube of an operand is similar to that of a multiplier. The terms with a weighting of three can be summed to a carry save result using a Wallace tree. Then, using a carry free (5,5,4) counter stage, the three times multiple of the carry save result may be computed and summed with the n one-bit terms. The final carry save result may then be summed using a carry propagate adder.

3.2 Analysis of the Reduced Cubing Unit

After applying the three reduction techniques, the cube can be represented by equation 4.

$$a^3 = \sum_{i=0}^{n-1} a_i 2^{3i} + 3 \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} a_i a_j \left(2^{2i+j} + 2^{i+2j}\right) + 3 \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-2} \sum_{k=j+1}^{n-1} a_i a_j a_k \, 2^{i+j+k+1} \qquad (4)$$
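Equation 4 and the summation order described in section 3.1 can be exercised with a short Python sketch. This is an illustration under stated assumptions, not the paper's circuit: `reduced_cube` evaluates equation 4 directly, while `reduced_cube_carry_save` first sums the weight-three terms to a carry-save pair, forms the three-times multiple as the pair plus its left-shifted copy, adds the single-bit terms, and finishes with an ordinary addition standing in for the carry propagate adder. The linear chain of 3:2 carry-save adders is a simplification of a Wallace tree, and the (5,5,4) counter stage is not modeled. All function names are hypothetical.

```python
from itertools import combinations


def bit(a, i):
    """Bit i of the operand a."""
    return (a >> i) & 1


def cube_terms(a, n):
    """Return (single-bit sum, list of weight-three term values) per equation 4."""
    singles = sum(bit(a, i) << (3 * i) for i in range(n))
    weight_three = []
    for i, j in combinations(range(n), 2):      # pairs i < j, two positions each
        p = bit(a, i) & bit(a, j)
        weight_three.append(p << (2 * i + j))
        weight_three.append(p << (i + 2 * j))
    for i, j, k in combinations(range(n), 3):   # triples i < j < k, one position
        t = bit(a, i) & bit(a, j) & bit(a, k)
        weight_three.append(t << (i + j + k + 1))
    return singles, weight_three


def reduced_cube(a, n):
    """Direct evaluation of equation 4."""
    singles, weight_three = cube_terms(a, n)
    return singles + 3 * sum(weight_three)


def csa(x, y, z):
    """3:2 carry-save adder on integers: x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def carry_save_sum(operands):
    """Reduce operands to two values with a chain of CSAs (not a true Wallace tree)."""
    ops = list(operands)
    while len(ops) > 2:
        z, y, x = ops.pop(), ops.pop(), ops.pop()
        ops.extend(csa(x, y, z))
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]


def reduced_cube_carry_save(a, n):
    """Equation 4 using the summation order described in section 3.1."""
    singles, weight_three = cube_terms(a, n)
    s, c = carry_save_sum(weight_three)               # carry-save result
    # Three-times multiple: (s + c) + 2 * (s + c), then add the single-bit terms.
    s2, c2 = carry_save_sum([singles, s, c, s << 1, c << 1])
    return s2 + c2                                     # carry propagate addition


for a in range(16):
    assert reduced_cube(a, 4) == a ** 3 == reduced_cube_carry_save(a, 4)
```

The final loop checks every 4-bit operand, the operand length used in figure 3.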
The height of the reduced cubing unit partial product array for the terms with a weighting of three is approximated by equation 5.

$$PPA_{height} \approx \frac{1}{8} n^2 + \frac{1}{4} n \qquad (5)$$

The number of bits in the parallel cube unit after the reductions have been applied is expressed in equation 6. Recall that the number of bits in the non-reduced parallel cube unit is $n^3$ and the number of bits in the partial product array for the multipliers that are required to compute the exact cube is $n^2$.

$$PPA_{bits} = \frac{1}{6} n^3 + \frac{1}{2} n^2 + \frac{1}{3} n \qquad (6)$$

[...] multiply-multiply unit required to perform the cubing function. The cubing unit requires fewer PPA bits than a [...]

[Figure: Bits in PPA (1000 bits) versus Operand Length (bits); series: Parallel Cube, Reduced Parallel Cube, Multiplier]

[...] delay to sum the partial products, the carry propagate additions, and the three times multiple that is required by the reduced parallel cube technique. The area analysis compares the number of CSAs required to sum the partial product arrays. The Wallace tree circuitry constitutes the majority of the unit area.

Figure 5 shows the number of gate delays each unit requires for various operand lengths. The reduced parallel cubing unit achieves the best performance for all operand lengths.

[Figure 5. Performance of Cubing Units: Gate Delay versus Operand Length (bits); series: Reduced Cube, Multiply-Multiply, Square-Multiply]

[Figure 6. CSA Area for Cubing Units: CSAs in PPA Reduction (1000 CSAs) versus Operand Length (bits)]

[Figure 7. Parallel Cube and Square-Multiply Relative to a Multiply-Multiply, versus Operand Length (bits)]
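The bit count of equation 6 follows from counting the terms left by the three reduction techniques: n single-bit terms, two terms for every pair of distinct bits, and one term for every triple of distinct bits. The Python sketch below is an illustrative check under that reading, with hypothetical function names; it compares the count with the closed form of equation 6 and with the $n^3$ bits of the non-reduced array.

```python
from math import comb


def reduced_cube_bits(n):
    """Count the terms that remain after the three reduction techniques."""
    singles = n               # one a_i term per input bit
    pairs = 2 * comb(n, 2)    # each pair i < j appears at two weights
    triples = comb(n, 3)      # each triple i < j < k appears once
    return singles + pairs + triples


def equation_6(n):
    """Closed form n^3/6 + n^2/2 + n/3, written over a common denominator."""
    return (n ** 3 + 3 * n ** 2 + 2 * n) // 6


for n in (4, 8, 16, 32, 54):
    reduced = reduced_cube_bits(n)
    assert reduced == equation_6(n)
    # Operand length, reduced bit count, non-reduced bit count, and their ratio.
    print(n, reduced, n ** 3, round(reduced / n ** 3, 3))
```

For large operands the ratio approaches one sixth, which is the leading coefficient of equation 6.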
4 Conclusions