
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/2462716

Parallel Square and Cube Computations

Article in the Conference Record of the Asilomar Conference on Signals, Systems and Computers.
DOI: 10.1109/ACSSC.2000.911207 · Source: CiteSeer


Parallel Square and Cube Computations

Albert A. Liddicoat and Michael J. Flynn


Computer Systems Laboratory, Department of Electrical Engineering, Stanford University
Gates Building 353 Serra Mall, Stanford, CA 94305, USA
[email protected] and [email protected]

Abstract

Typically multipliers are used to compute the square and cube of an operand. A squaring unit can be used to compute the square of an operand faster and more efficiently than a multiplier. This paper proposes a parallel cubing unit that computes the cube of an operand 25 to 30% faster than it can be computed using multipliers. Furthermore, the reduced squaring and cubing units are mathematically modeled, and the performance and area requirements are studied for operands up to 54 bits in length. The applicability of the proposed cubing circuit is discussed in relation to current Newton-Raphson and Taylor series function evaluation units.

1. Introduction

Iterative techniques such as Newton-Raphson and Taylor series expansion can be used to compute the reciprocal, square root, inverse square root, and other elementary functions. Using higher-order function approximations decreases the number of iterations required to achieve a desired precision. Using fast and efficient parallel squaring and cubing units reduces the number of iterations and the overall latency of the computation of elementary functions.

The typical Newton-Raphson method converges to the reciprocal quadratically using the iteration X_{i+1} = X_i(2 − bX_i) [6]. The error in the reciprocal approximation decreases as E_{i+1} = b E_i^2 for each iteration (with E_0 < 1). Rabinowitz [7] proposed a higher-order Newton-Raphson reciprocal approximation, X_{i+1} = X_i + X_i(1 − bX_i) + X_i(1 − bX_i)^2 + ... + X_i(1 − bX_i)^n. Flynn [2] shows that for the nth-order Newton-Raphson iteration the error decreases as E_{i+1} = b^n E_i^{n+1}.

Ito [4] proposed fast converging methods for division and square root based on the higher-order Newton-Raphson iteration. To approximate the third-order Newton-Raphson iteration, an estimate of the cube of (1 − bX_i) is determined using an l × l-bit lookup table. They report that the precision of the approximated third-order iteration exceeds that of the second-order iteration by l − 2 bits.

If the cube may be computed directly and in parallel with the computation of the square, then the latency of the third-order iteration is reduced. Furthermore, if the desired precision may be obtained in a single iteration, then the unit may be readily pipelined as described by Liddicoat [5].

Wong [9] proposes high-radix fast division using accurate quotient approximations. The following Taylor series approximation is used to compute an accurate reciprocal approximation: 1/Y = 1/Y_h − Y'/Y_h^2 + Y'^2/Y_h^3 − ... Here, Y_h is the most significant bits of Y extended with 1's such that Y_h ≥ Y, and Y' = Y − Y_h. The reciprocal approximation is then used to successively compute quotient approximations. In Wong's proposed technique, separate lookup tables are used to store the powers of 1/Y_h, while the powers of Y' are computed using serial multiplications. The reciprocal approximation is on the critical path in the divide unit. The overall latency of the division can be reduced by the parallel computation of the square and cube of the powers of Y'. Additionally, the powers of 1/Y_h may also be computed using parallel squaring and cubing units to reduce the table requirements.

Ercegovac et al. [1] propose a method to compute the reciprocal, square root, inverse square root, and other elementary functions based on argument reduction and series expansion. In this scheme, small multiplications are used to compute the Taylor series expansion for each function approximation. The cube is computed using two serial k-bit multiplications, where k is between one quarter and one third of the operand length. The cubing computation is on the critical path and, therefore, the overall latency may be reduced by using parallel squaring and cubing units.

Ienne [3] proposed a circuit that serially computes the square of a variable. This approach has been extended to compute the square of an operand in parallel using a partial product array (PPA) for the reduction. The PPA required for the parallel squaring unit is about half the height and size of a direct multiplier partial product array. This work analyzes the reduced parallel squaring unit. Then a parallel cubing unit is proposed and analyzed. The parallel squaring and cubing units are then compared with standard direct multipliers for both latency and area.
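As a behavioral check (not part of the paper), the reciprocal iterations above can be sketched in a few lines of Python. The function name `nr_reciprocal_step` is illustrative; `order=1` gives the classic second-order iteration X_i(2 − bX_i), and `order=2` appends the X_i(1 − bX_i)^2 term to form the third-order iteration of the Rabinowitz series.

```python
def nr_reciprocal_step(x, b, order=1):
    """One higher-order Newton-Raphson step toward 1/b.

    order=1 is the classic second-order iteration X_{i+1} = X_i(2 - b X_i);
    order=2 adds the X_i(1 - b X_i)^2 term, giving the third-order iteration.
    """
    r = 1.0 - b * x          # residual (1 - b X_i); the error scales with r
    acc = term = x
    for _ in range(order):
        term *= r            # X_i (1 - b X_i)^k
        acc += term
    return acc

b = 3.0
x = 0.3                      # coarse initial guess for 1/b
for _ in range(4):
    x = nr_reciprocal_step(x, b, order=1)
assert abs(x - 1.0 / b) < 1e-12      # quadratic convergence in a few steps

# A single third-order step lands much closer to 1/b than a single
# second-order step from the same starting guess.
e2 = abs(nr_reciprocal_step(0.3, b, order=1) - 1.0 / b)
e3 = abs(nr_reciprocal_step(0.3, b, order=2) - 1.0 / b)
assert e3 < e2
```

This numerically mirrors the error models quoted above: each second-order step roughly squares the error, while the higher-order forms raise it to higher powers per step.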

2. Parallel Squaring Unit

2.1 Partial Product Array

Generally a multiplier unit is used to compute the square of an operand. However, the square of an operand can be computed directly with a specialized unit that has a lower latency and a smaller area implementation.

The middle portion of figure 1 shows the partial product array for a parallel square using a multiplier PPA. The boxes indicate which terms may be combined using the equivalence a_i a_j + a_j a_i = 2 a_i a_j. In the lower portion of figure 1, 2 a_i a_j is represented by placing a_i a_j one column to the left, which has a weighting of two times that of the current column. The square of operand a can be computed with the reduced partial product array shown in the lower portion of figure 1.

Figure 1. PPA Reduction for Squaring Unit (the full 6-bit multiplier partial product array for a5..a0 and the reduced array with the folded a_i a_j terms)

2.2 Analysis

The partial product array for the squaring unit can be expressed mathematically for an n-bit number as:

a^2 = Σ_{i=0}^{n−1} a_i 2^{2i} + Σ_{i=0}^{n−2} Σ_{j=i+1}^{n−1} a_i a_j 2^{i+j+1}    (1)

The height of the squaring unit partial product array can be expressed as:

PPA_height = ⌊n/2⌋ + 1 ≈ n/2    (2)

The number of input bits for the squaring unit partial product array is:

PPA_bits = Σ_{i=1}^{n} i = (n^2 + n)/2    (3)

2.3 Comparison to Direct Multiplier

The squaring unit partial product array height is approximately one half the height of the direct multiplier partial product array. Additionally, the number of input bits in the squaring unit partial product array is approximately one half that of the direct multiplier.

Both the direct multiplier and the squaring unit generate each partial product bit with a single two-input "AND" gate. The input logic for the direct multiplier requires n^2 "AND" gates, while the input logic for the squaring unit requires (n^2 − n)/2 "AND" gates. Less than half of the input logic gates are required by the squaring unit as compared to the direct multiplier. The partial product array reduction for the direct multiplier requires more than twice the number of CSAs required to reduce the partial product array terms for the squaring unit. The final CPA is the same for both units.

The delay differs only in the partial product array reduction. A Wallace tree [8] was used to reduce the partial product array for both the squaring unit and the direct multiplier. For most operand widths, the squaring unit latency is two CSA delays faster than the direct multiplier. This is because a Wallace tree can reduce twice as many partial products by doubling the tree hardware, computing the carry-save sums on both halves of the partial product array in parallel, and then combining the two carry-save sums using two additional CSAs.

Figure 2 shows the latency and area of the squaring unit relative to the direct multiplier. The squaring unit partial product array latency is generally 20-35% faster for operands of 10 to 54 bit widths, while the partial product array reduction area for the squaring unit is about 50-70% less than that of the direct multiplier. Since the area required for the input logic and partial product array is much larger than the area required for the CPA, the squaring unit can be implemented in less than half the area required by a direct multiplier.

The interconnect delay and number of routing channels required for the squaring unit are less than those of a multiplier, since only one operand needs to be distributed throughout the partial product array.

When comparing the partial product array of the squaring unit to a booth-2 multiplier, the PPA height and number of input bits are comparable. However, the squaring unit does not require the logic and delay for the booth recoding nor the multiplicand multiple selection. Therefore, the squaring unit would require less area and have better performance than a booth-2 multiplier.
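As a quick sanity check on equation (1) and the folding identity a_i a_j + a_j a_i = 2 a_i a_j of figure 1, the following Python sketch (illustrative only, not a model of the hardware) sums the reduced partial product array and compares it against a direct square:

```python
def square_via_reduced_ppa(a, n):
    """Square an n-bit value by summing the reduced PPA of equation (1)."""
    bits = [(a >> i) & 1 for i in range(n)]
    total = 0
    for i in range(n):
        total += bits[i] << (2 * i)              # diagonal terms: a_i a_i = a_i
    for i in range(n):
        for j in range(i + 1, n):
            # folded pair a_i a_j + a_j a_i, placed one column left (weight 2)
            total += (bits[i] & bits[j]) << (i + j + 1)
    return total

n = 6
assert all(square_via_reduced_ppa(a, n) == a * a for a in range(2 ** n))
# term count matches equation (3): n + n(n-1)/2 = (n^2 + n)/2 = 21 for n = 6
```

The inner loop visits exactly the n(n−1)/2 folded pairs plus the n diagonal bits, matching the PPA_bits count of equation (3).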
Figure 2. PPA Area and Delay for the Squaring Unit relative to a Direct Multiplier (relative area and delay versus operand length, 0 to 60 bits)

3 Parallel Cubing Unit

3.1 Partial Product Array

To compute a precise cube of an n-bit operand using two serial multiplications requires an n × n-bit multiplication followed by an n × n^2-bit multiplication. The cube of an operand can be computed in parallel in a manner similar to a multiply. The middle portion of figure 3 shows the expanded parallel partial product array used to compute the cube of a 4-bit operand. There are n^3 bits in the non-reduced cubing unit partial product array.

The boxes in figure 3 identify three reduction techniques that can be applied to the cubing unit partial product array. These reduction techniques eliminate terms from the cube partial product array and reduce the overall height of the partial product array. Therefore, the latency and hardware needed to sum the partial products are significantly reduced.

The first reduction technique is performed on the partial product terms that include three identical bits, such as a0 a0 a0. These terms can be replaced by single-bit terms such as a0. For an n-bit operand there are n three-bit terms that can be replaced by n single-bit terms.

The second reduction technique can be applied to the partial product terms that include two identical bits. Box 2 in figure 3 indicates the three terms with two a0-bits and one a1-bit. The three terms in box 2 can be replaced by one a0 a1 term with a weighting of 3.

The third reduction technique can be applied to the partial product terms that have three different input bits. Box 3 in figure 3 indicates six boolean terms that each have one a0-bit, one a1-bit, and one a2-bit. The six terms in box 3 can be replaced with one a0 a1 a2 term with a weighting of 3, shifted one column to the left. Figure 3 shows the reduced partial product array to compute the cube of a 4-bit operand after applying the three reduction techniques.

Figure 3. Cubing Unit PPA Reduction (the expanded n^3-term partial product array for the cube of a 4-bit operand a3..a0 and the reduced array after the three reduction techniques)

The hardware required to compute the reduced parallel cube of an operand is similar to that of a multiplier. The terms with a weighting of three can be summed to a carry-save result using a Wallace tree. Then, using a carry-free (5,5,4) counter stage, the three-times multiple of the carry-save result may be computed and summed with the n one-bit terms. The final carry-save result may then be summed using a carry propagate adder.

3.2 Analysis of the Reduced Cubing Unit

After applying the three reduction techniques, the cube can be represented by equation 4:

a^3 = Σ_{i=0}^{n−1} a_i 2^{3i} + 3 Σ_{i=0}^{n−2} Σ_{j=i+1}^{n−1} a_i a_j (2^{2i+j} + 2^{i+2j}) + 3 Σ_{i=0}^{n−3} Σ_{j=i+1}^{n−2} Σ_{k=j+1}^{n−1} a_i a_j a_k 2^{i+j+k+1}    (4)
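Equation (4) can be verified exhaustively for small operands. The sketch below (illustrative, not the hardware datapath) sums the three groups of reduced terms and checks the result against a direct cube:

```python
def cube_via_reduced_ppa(a, n):
    """Cube an n-bit value by summing the reduced PPA of equation (4)."""
    bits = [(a >> i) & 1 for i in range(n)]
    total = 0
    for i in range(n):
        total += bits[i] << (3 * i)                  # reduction 1: a_i^3 = a_i
    for i in range(n):
        for j in range(i + 1, n):
            pair = bits[i] & bits[j]                 # reduction 2: weight-3 pair terms
            total += 3 * pair * ((1 << (2 * i + j)) + (1 << (i + 2 * j)))
            for k in range(j + 1, n):
                # reduction 3: weight-3 triple term, shifted one column left
                triple = pair & bits[k]
                total += (3 * triple) << (i + j + k + 1)
    return total

n = 4
assert all(cube_via_reduced_ppa(a, n) == a ** 3 for a in range(2 ** n))
```

The three loops correspond directly to the three reduction techniques: n single-bit terms, n(n−1) pair terms of weight 3, and C(n, 3) triple terms of weight 3 shifted one column left (the 2^{i+j+k+1} factor absorbs the multinomial coefficient 6 as 3 × 2).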
The height of the reduced cubing unit partial product array for the terms with a weighting of three is approximated by equation 5:

PPA_height ≈ (1/8)n^2 + (1/4)n    (5)

The number of bits in the parallel cube unit after the reductions have been applied is expressed in equation 6. Recall that the number of bits in the non-reduced parallel cube unit is n^3, and the number of bits in the partial product arrays of the multipliers required to compute the exact cube is 3n^2.

PPA_bits = (1/6)n^3 + (1/2)n^2 + (1/3)n    (6)

Figure 4 plots the number of bits required for the non-reduced cubing unit, the reduced cubing unit, and the multiply-multiply unit required to perform the cubing function. The cubing unit requires fewer PPA bits than a multiply-multiply for operand lengths of less than 15 bits.

Figure 4. PPA bits for the Cube Computation (bits in the PPA versus operand length for the parallel cube, reduced parallel cube, and multiplier approaches)

3.3 Cubing Unit Comparison

The reduced parallel cube is compared to the traditional approach of two serial direct multiplications. In addition, the reduced parallel cube is compared to a method of forming the cube by squaring an operand using the previously described reduced squaring unit followed by a serial multiplication. All of the cubing methods produce the exact result.

Both the performance and the area of the three methods are compared. The performance is measured by the number of gate delays required to produce the final result. The gate delays include the PPA input gate delay, the Wallace tree delay to sum the partial products, the carry propagate additions, and the three-times multiple that is required by the reduced parallel cube technique. The area analysis compares the number of CSAs required to sum the partial product arrays; the Wallace tree circuitry constitutes the majority of the unit area.

Figure 5 shows the number of gate delays each unit requires for various operand lengths. The reduced parallel cubing unit achieves the best performance for all operand lengths.

Figure 5. Performance of Cubing Units (gate delay versus operand length for the reduced cube, multiply-multiply, and square-multiply units)

Figure 6 graphs the number of CSAs required to implement each of the cubing units, plotted as a function of the input operand length. For operand lengths of less than 13 bits, the cubing unit requires fewer CSAs in the PPA reduction than the multiply-multiply unit. However, for a 54-bit operand, the cubing unit requires a factor of 3.3 more CSAs than the multiply-multiply unit. For operands of length less than 20 bits, the cubing unit provides a higher-performance unit with a comparable area implementation.

Figure 7 shows the area and delay of the reduced parallel cubing unit and the square-multiply unit relative to the multiply-multiply unit. The parallel cubing unit performs better than both the square-multiply unit and the multiply-multiply unit for all operand lengths. However, the reduced parallel cube unit's hardware requirement grows faster with increased operand length than the other methods, and for larger operand sizes the performance improvement has to be considered along with the increase in the area required to implement the reduced parallel cubing unit.

We see that the square-multiply unit performs about 10% faster than the multiply-multiply unit and requires only 83% of the area for implementation. Therefore, the square-multiply unit is better suited than the multiply-multiply unit to compute the cube for all operand lengths.
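The bit-count crossover described in this section can be reproduced numerically. The sketch below (assumed bit-count models only, not a gate-level count) evaluates equation (6) by directly counting the reduced terms and compares it against the 3n^2 bits of the two serial multiplications:

```python
def reduced_cube_bits(n):
    # equation (6): n singles + n(n-1) weight-3 pairs + C(n,3) triple terms
    return n + n * (n - 1) + n * (n - 1) * (n - 2) // 6

def multiply_multiply_bits(n):
    # assumed model: n x n multiply (n^2 bits) then n x 2n multiply (2n^2 bits)
    return 3 * n * n

# direct counting agrees with the closed form (1/6)n^3 + (1/2)n^2 + (1/3)n
assert all(reduced_cube_bits(n) == (n**3 + 3 * n**2 + 2 * n) // 6
           for n in range(1, 60))

# the reduced cube uses fewer PPA bits only below roughly 15-bit operands
assert reduced_cube_bits(14) < multiply_multiply_bits(14)
assert reduced_cube_bits(15) > multiply_multiply_bits(15)
```

Under these models the crossover falls between 14 and 15 bits, consistent with the figure 4 discussion above.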
Figure 6. CSA Area for Cubing Units (CSAs in the PPA reduction versus operand length, 0 to 20 bits, for the reduced cube, multiply-multiply, and square-multiply units)

Figure 7. Parallel Cube and Square-Multiply Relative to a Multiply-Multiply (relative area and delay versus operand length, 0 to 60 bits)
4 Conclusions

The reduced squaring unit generally computes the square of an operand 12-30% faster than a direct multiplier for 10 to 54 bit operands. Additionally, the squaring unit requires less than half of the area needed to implement a direct multiplier. Therefore, when the square of an operand must be computed on a dedicated hardware unit, the reduced squaring unit provides 2.4 times higher performance per area as compared to a direct multiplier.

The reduced cubing unit is the fastest method studied to compute the cube of an operand. The reduced cubing unit is 25-30% faster than the direct multiply-multiply. However, the area required to implement the reduced cube grows more rapidly than the area required to implement the cube using multipliers. For operands with length less than 15 bits, the reduced cubing unit requires less area to implement the cube than direct multipliers. However, for 54-bit operands the reduced cubing unit requires approximately three times the area of the cube implementation using direct multipliers. Therefore, both performance and area requirements must be considered.

Alternately, the cube can be computed with a reduced squaring unit followed by a multiplier. This method performs 10% faster and requires 83% of the area compared with implementing the cube using direct multipliers.

In section 1, several higher-order iterative function evaluation techniques were discussed. These techniques included Newton-Raphson and Taylor series techniques to compute functions such as the reciprocal, square root, inverse square root, and other elementary functions. The squaring unit and proposed cubing unit should decrease the latency of higher-order function evaluation and may reduce the area needed for implementation of such units.

References

[1] M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand. Reciprocal, Square Root, Inverse Square Root, and Some Elementary Functions Using Small Multipliers. IEEE Transactions on Computers, vol. 49, pp. 628–637, July 2000.
[2] M. Flynn. On Division by Functional Iteration. IEEE Transactions on Computers, vol. C-19, pp. 702–706, August 1970.
[3] P. Ienne and M. Viredaz. Bit-Serial Multipliers and Squarers. IEEE Transactions on Computers, vol. 43, pp. 1445–1450, December 1994.
[4] M. Ito, N. Takagi, and S. Yajima. Efficient Initial Approximations and Fast Converging Methods for Division and Square Root. In Proc. 12th IEEE Symposium on Computer Arithmetic, pp. 2–9, July 1995.
[5] A. A. Liddicoat and M. J. Flynn. Pipelineable Division Unit. Technical Report CSL-TR-00-809, Computer Systems Laboratory, Stanford University, 2000.
[6] S. Oberman. Division Algorithms and Implementations. IEEE Transactions on Computers, vol. 46, pp. 833–854, August 1997.
[7] P. Rabinowitz. Multiple-Precision Division. Communications of the ACM, vol. 4, p. 98, February 1961.
[8] C. S. Wallace. A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers, vol. EC-13, pp. 14–17, February 1964.
[9] D. Wong. Fast Division Using Accurate Quotient Approximations to Reduce the Number of Iterations. IEEE Transactions on Computers, vol. 41, pp. 981–995, August 1992.
