Switching Characteristics of Generalized Array Multiplier Architectures and Their Applications To Low Power Design
Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering
3-1-1999
Khurram Muhammad
Purdue University School of Electrical and Computer Engineering
Dinesh Somasekhar
Purdue University School of Electrical and Computer Engineering
Kaushik Roy
Purdue University School of Electrical and Computer Engineering
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries.
Khurram Muhammad, Dinesh Somasekhar and Kaushik Roy
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907
February 22, 1999
Abstract This paper presents several new array multiplier architectures for reducing the switching activity in general digital signal processing applications. A general cellular structure is described which can be used to obtain any array multiplier suitable for a given application. The switching activity at the output nodes of the cells in this structure is analyzed and compared with a tree multiplier based on 4:2 compressors. It is shown that the relative improvement in power is a function of the statistical properties of the signal. It is also shown that selection of an appropriate array architecture can give up to 40% reduction in switching activity compared to a tree multiplier, and more than 3 times less switching activity compared to the widely used least-significant-bit-first array multiplier for commonly occurring situations. We also outline applications of the proposed multipliers to the areas of low power quantization, reconfigurable computing and high-level synthesis for low power.
This work was supported in part by DARPA (F33615-95-C-1625), NSF CAREER award (9501869-MIP), Rockwell, AT&T and Lucent foundation.
With the recent trend toward increasing mobility and performance in small hand-held mobile communication and portable computing equipment, low power has become an important design factor. New features are continually provided using DSP algorithms, which are dominated by three basic operations: add, shift and multiply. Many DSP algorithms can be implemented such that the data is processed in carry save (CS) format, as this format yields zero cost of accumulation [1] in a multiply-and-accumulate (MAC) operation. The conversion of the result to normal binary form can be delayed for as long as possible for the given algorithm, since this results in a significantly faster implementation. Consider, for example, a digital filter implementation. In such an application, the intermediate result, which is the accumulation of a given inner product of the data and the coefficients, can be kept stored in CS format, with the CS to binary conversion taking place only after the final result is computed in CS form. Consequently, multiplier architectures processing data in CS format are of particular interest.
Multiplication operations are considered to be the dominant computation in DSP algorithms [2], [3]. Since computation directly results in dynamic power consumption [4], multiplication is an equally important factor when considering the dynamic power dissipation of such algorithms. In general, high-performance DSP architectures are required in mobile units which process data at high transmission rates, or in a portable computer providing advanced multimedia features. For this reason, such units are generally constructed with pipelined array multipliers. If the latency of the pipelined architecture is an important consideration, a pipelined tree multiplier can be used. Both types of multipliers can be easily pipelined using the conventional register based approach, or by using wave pipelining.
Over the past few years, a number of papers have addressed multiplier topologies for a variety of applications [1], [6], [7]. In particular, the array structures proposed in [6] address pipelining of recursive digital filters using most significant bit (MSB) first digit serial arithmetic. However, to the best of our knowledge, no work has been reported in the literature which addresses dynamic switching activity trade-offs between popular multiplier architectures as a function of the statistical properties of the inputs.
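The carry save accumulation described earlier in this section can be illustrated with a minimal Python sketch. It assumes non-negative integer operands and, for brevity, forms each product with ordinary integer multiplication; only the accumulation is kept as a (sum, carry) pair. The function names are hypothetical and are not taken from the paper.

```python
def carry_save_add(s, c, x):
    """3:2 compression of non-negative integers: new_s + new_c == s + c + x."""
    new_s = s ^ c ^ x                            # bitwise sum, no carry propagation
    new_c = ((s & c) | (s & x) | (c & x)) << 1   # majority bits, moved one place left
    return new_s, new_c


def mac_carry_save(data, coeffs):
    """Inner product of two sequences, accumulated as a (sum, carry) pair."""
    s, c = 0, 0
    for d, h in zip(data, coeffs):
        # Simplification: the product itself is an ordinary integer product;
        # the accumulation never propagates a carry across the word.
        s, c = carry_save_add(s, c, d * h)
    return s, c


def cs_to_binary(s, c):
    """The single carry-propagate addition, performed once at the end."""
    return s + c


s, c = mac_carry_save([3, 5, 7], [2, 4, 6])
print(cs_to_binary(s, c))
```

The point of the sketch is that the carry-propagate addition in `cs_to_binary` runs once per inner product rather than once per accumulated term.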
In this paper, we explore array structures from the point of view of dynamic power dissipation. Contrary to the expectation that any ordering of an array multiplier would yield similar dynamic power dissipation, we will show that more than 3 times reduction in switching activity may be possible compared to the commonly used least significant bit (LSB) first array multipliers (also known as right-left multipliers), depending on the characteristics of the input signals. This is because a salient feature of computation in DSP algorithms is that the computations are governed by the statistical properties of the underlying process generating the data. In general, data signals are correlated and, consequently, rapidly changing data is seldom processed. Hence, we will explore the effects of signal statistics on the output switching activity in various array structures in order to assess the feasibility of using a given structure under the condition of known or predictable signal statistics. We will show that re-ordering of partial product addition can result in a significant reduction in switching activity (hence, dynamic power) if the signal statistics are known a priori. This observation leads to new array multiplier architectures which form hybrids of MSB-first and LSB-first structures. We also discuss the application of such multipliers to low power implementation of DSP algorithms and to the general area of reconfigurable computing. The main objective of this work is to identify which types of architectures are best suited for processing signals with known statistical properties at reduced dynamic power dissipation. There are three major contributions of this work:
We propose hybrid-array structures which combine LSB-first and MSB-first types of array multipliers. For appropriate signal conditions, these structures are shown to significantly reduce dynamic power dissipation.
The switching characteristics of array multipliers are compared with a tree multiplier based on 4:2 compressors as well as the most commonly used LSB-first multiplier to show the region of strength of each architecture. Hence, this work can be used to formulate an appropriate strategy for selecting the best order of partial product addition for reducing power dissipation in a given DSP task. Alternatively, when processing signals with known statistical properties, one can formulate a strategy for applying signals to the multiplier inputs in an order which most effectively reduces dynamic power dissipation.
The architectures presented in this paper provide new insights into the general area of low power design and reconfigurable computing.
This paper is organized into five sections. Section II describes the array multiplier architectures considered in this work. Section III presents a simulation based study of the switching characteristics of output nodes in the architectures considered. The signal models used to compute the performance of these multipliers are also explained in this section. Section IV discusses the applications of these structures to general signal processing algorithms. Finally, section V concludes this paper.
We will first present a simple approach for obtaining various types of array multipliers. Figure 1 shows a template for a cellular array structure which serves as the basis for generating different types of 8-bit array multipliers. Each location in this matrix can be occupied by a cell which can be an AND gate (AND), a half adder (HA) or a full adder (FA). In the sequel, the cell at location $i, j$ will be referred to as $c_{i,j}$. As an example, the cells on the four corners are shown labeled in the figure. Let $A = a_0, a_1, \ldots, a_{N-1}$ and $B = b_0, b_1, \ldots, b_{N-1}$ represent the input vectors applied at the right and top, respectively. The output is represented by $P = p_0, p_1, \ldots, p_{2N-1}$. Then each partial product $a_i \cdot b_j$, where $i, j = 0, 1, \ldots, N-1$, must be added in the appropriate relative position to obtain the correct value of $P$. In figure 1 we have shown the structure of the LSB-first type array multiplier by the colored cells comprising a parallelogram. In this figure, the continuous lines show the presence of connections, while the dashed lines show their absence. Hence, the active connections in a CS type of array multiplier are shown using the continuous lines. The connections from primary inputs to appropriate cells are not shown explicitly, and are assumed implicit to reduce clutter. By counting the number of active inputs, one can determine the type of cell. Hence, the cells in row #0 are all AND gates, whereas the seven rightmost cells in row #1 are HAs. The cells accepting three active inputs are FAs. Note that the inputs are counted by considering the implicit input $a_i \cdot b_j$, which is not shown. The resulting CS array multiplier structure is shown on the right in figure 1 for clarity.
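The requirement that each partial product be added in its appropriate relative position can be made concrete with a short Python sketch. It uses the standard fact that the partial product $a_i \cdot b_j$ carries weight $2^{i+j}$, so all partial products with the same $i + j$ belong to one column. The function names are illustrative only.

```python
def partial_product_columns(a, b, n):
    """Group the partial products a_i * b_j of two n-bit operands by column i + j."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    columns = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(a_bits[i] & b_bits[j])  # one AND gate per cell
    return columns


def product_from_columns(columns):
    """Add every column with its weight 2**k to recover the product P."""
    return sum(sum(col) << k for k, col in enumerate(columns))


cols = partial_product_columns(13, 11, 4)
print([len(col) for col in cols])   # number of partial products in each column
print(product_from_columns(cols))   # equals 13 * 11
```

Because only the column index matters for the final value, the order in which the entries of a column are summed is free, which is the degree of freedom the array orderings in this section exploit.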
Now, the goal of an array multiplier is to add the partial products from cells which occupy the same column in the cellular array structure shown in figure 1. The order in which these partial products are added is not important; we only need to ensure that only the products in the same column are added (in addition to the carries generated from the cells in the adjacent column on the right). Hence, one can exchange rows #3 and #7 as shown in figure 1. Cells in row #3 after moving to row #7 are shown by cells shaded by circles. The cells in row #7 after moving to row #3 are shown by dark colored cells. Now, we only need to ensure that carries generated from the next rows are correctly added, which may require extra cells. Let $R = (r_0, r_1, \ldots, r_{N-1})$ denote the ordering of the rows, where $r_i$ is the index of the bit of $A$ whose partial products are placed in row #$i$. The ordering $r_i = i$, for $i = 0, 1, \ldots, N-1$, gives the LSB-first multiplier shown in figure 1. The MSB-first multiplier can also be expressed similarly by the ordering $r_i = N - 1 - i$, for $i = 0, 1, \ldots, N-1$. Clearly, there are $N!$ ways to construct carry save array multipliers. Each of these multipliers may be constructed using propagation of carry in either ripple form or CS form or a combination of these. This formulation is the basis of generating the various architectures of interest which are evaluated for their switching activity performance in this paper.
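The claim that any of the $N!$ row orderings produces the correct product can be checked with a small word-level Python sketch. It assumes an ordering is encoded as a list whose entry $i$ names the bit of $A$ handled by row $i$, and it models each row as one carry save addition of the shifted operand $B$. This is a behavioral model with hypothetical names, not the cell-level array of figure 1.

```python
from itertools import permutations


def lsb_first(n):
    """Row i handles bit i of A."""
    return list(range(n))


def msb_first(n):
    """Row i handles bit N - 1 - i of A."""
    return [n - 1 - i for i in range(n)]


def multiply_with_ordering(a, b, order):
    """Add the rows of partial products in the given order, in carry save form."""
    s, c = 0, 0
    for r in order:
        pp = (b << r) if (a >> r) & 1 else 0          # row of partial products a_r * B
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
    return s + c                                       # vector merge (final addition)


n = 4
orders = list(permutations(range(n)))
print(len(orders))                                     # N! orderings
print(all(multiply_with_ordering(a, b, order) == a * b
          for order in orders
          for a in range(1 << n)
          for b in range(1 << n)))
```

The exhaustive check is only practical for small N; it is meant to illustrate the argument, not to replace it.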
A. LSB-First Multipliers
The LSB-first multiplier can be constructed either using the CS format shown in figure 1, or by using a ripple carry structure. We will refer to the former as the LSB-first CS multiplier and the latter as the LSB-first RP multiplier, respectively. The LSB-first RP multiplier is the most well-known and widely used array structure for multiplication and is obtained from the cellular array of figure 1 by turning off the diagonal lines (by making them dashed) and turning on the horizontal dashed lines (by making them continuous) which connect cell $c_{i,j}$ to $c_{i,j+1}$ for the cells $c_{i,j}$, $i = 0, 1, \ldots, N-1$ and $j = N-i, N-i+1, \ldots, 2N-i-2$ (right-most cell excepted), in the LSB-first CS multiplier of figure 1. The vector merge row (row #$N+1$) is no longer required. The advantage of using CS format is the reduction in propagation delay through the multiplier. The LSB-first RP multiplier has a 30% longer critical path as compared to the LSB-first CS multiplier. In this work, we consider both since our objective is to highlight the switching characteristics of various array multipliers.
B. MSB-First Multipliers
An MSB-first multiplier places the MSBs of the A input at the top row positions as shown in figure 2. The main idea is to flip the cells in the cellular array of figure 1 along a horizontal axis such that row #$i$ is moved to row #$(N-1-i)$, for $i = 0, 1, \ldots, N-1$. This results in an MSB-first multiplier [6]. The multiplier can be constructed by propagating the carry in CS form, or the carry can be rippled in a fashion identical to the LSB-first RP multiplier. The multiplier using CS format has been presented in [6] for pipelining recursive digital filters. A major advantage of the MSB-first CS multiplier is that the delay through the vector merge stage can be reduced by taking advantage of the fact that the MSB-first array produces the MSBs before the LSBs. Hence, a carry-select structure can be constructed in the region occupied by cells $c_{i,j}$ for $i > j$ to improve the vector merge delay. Consequently, the MSB-first CS array multiplier can improve the speed of multiplication [6]. The observation that the MSBs of the product are available before the LSBs is fundamental to the construction of the MSB-first RP multiplier shown in figure 2(b). In contrast to the LSB-first RP multiplier, it has the same propagation delay as the LSB-first CS multiplier and offers an attractive alternative to it.
Fig. 2. Structures for MSB-first multipliers; (left) MSB-first CS multiplier, and, (right) MSB-first RP multiplier.
C. Hybrid Multipliers
A hybrid multiplier is obtained by any ordering of the elements of $R$ which is not monotone. Note that there is only one monotonically increasing ordering of the elements of $R$ and it leads to the LSB-first structure. Similarly, the only monotonically decreasing ordering leads to the MSB-first structure. Any ordering other than these two leads to a hybrid array multiplier. In this paper, we consider only two types of hybrid structures. The first structure places L consecutive LSB bits of operand A as the L top-most rows. This structure is shown on the left in figure 3. The second structure places L consecutive MSB bits of operand A as the L top-most rows and is shown on the right in figure 3. We will refer to the former as the hybrid LSB-first multiplier and the latter as the hybrid MSB-first multiplier, respectively. Both of these can be
constructed either by using ripple carry or by using CS format. Hence, there are four ways to implement a hybrid multiplier which puts the L top-most rows of one type of multiplier above the other (i.e. LSB-first over MSB-first or vice versa). The multiplier on the left in figure 3 puts the L = 3 top rows of the LSB-first CS multiplier over the N - L = 5 top rows of the MSB-first CS multiplier. We will refer to such a multiplier as a hybrid LSB-first CS/CS multiplier with L = 3. Similarly, the multiplier on the right in figure 3 puts the L = 3 top-most rows of the MSB-first RP multiplier over the N - L top-most rows of the LSB-first CS multiplier. This multiplier will be referred to as a hybrid MSB-first RP/CS multiplier with L = 3. We can obtain three more types of L = 3 hybrid multipliers for each of these cases by considering the remaining three combinations of adding carries in the two parts of the multiplier. Each type of hybrid multiplier implementation requires a different overhead and has a different length of critical path. Since the goal of this work is to develop an understanding of the switching trade-offs in various multipliers, we will only consider implementations which place L consecutive rows of one type of multiplier over the other. The reason for focusing on such architectures is that DSP applications, in general, process data streams whose properties can only be predicted or controlled over a part of the word-length. For example, if the signal strength reduces, consecutive MSBs of the data stream become zeros (assuming a sign-magnitude representation). Similarly, "less important" data values may be further quantized by truncating some LSBs, thereby resulting in the data stream having zeros at the corresponding locations.
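The two hybrid row orderings considered here, and the monotonicity rule that separates LSB-first, MSB-first and hybrid structures, can be sketched in Python as follows. The sketch assumes an ordering is a list whose entry $i$ names the bit of operand A placed in row $i$; the function names are hypothetical, and the choice between ripple and CS carry propagation is not modeled.

```python
def hybrid_lsb_first(n, l):
    """L LSB rows on top (LSB-first order), then the N - L top rows of an MSB-first array."""
    return list(range(l)) + list(range(n - 1, l - 1, -1))


def hybrid_msb_first(n, l):
    """L MSB rows on top (MSB-first order), then the N - L top rows of an LSB-first array."""
    return list(range(n - 1, n - 1 - l, -1)) + list(range(0, n - l))


def classify(order):
    """Name the structure implied by a row ordering."""
    increasing = all(x < y for x, y in zip(order, order[1:]))
    decreasing = all(x > y for x, y in zip(order, order[1:]))
    if increasing:
        return "LSB-first"
    if decreasing:
        return "MSB-first"
    return "hybrid"


print(hybrid_lsb_first(8, 3))
print(hybrid_msb_first(8, 3))
print(classify(hybrid_lsb_first(8, 3)), classify(list(range(8))))
```

For N = 8 and L = 3 the first ordering places bits 0, 1, 2 of A above bits 7 down to 3, and the second places bits 7, 6, 5 above bits 0 up to 4, matching the two structures described for figure 3.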
It will be shown that the proposed hybrid multipliers yield substantial improvement in switching activity reduction compared to a tree multiplier (constructed using 4:2 compressors) as well as the simple LSB-first or MSB-first multipliers under appropriate signal conditions. The multiplier structure shown on the left in figure 3 is entirely a CS structure, and its speed can be increased by using a carry select structure similar to the one proposed in [6]. The multiplier on the right in figure 3 has the same delay as an LSB-first CS array multiplier despite the fact that the MSB-first part ripples the carry. The reason for considering this structure is that it requires a smaller overhead of cells required to ensure that all partial product sums and
Fig. 3. Structures for hybrid multipliers; (left) hybrid LSB-first CS/CS multiplier, and, (right) hybrid MSB-first RP/CS multiplier.
We first investigate the switching characteristics of the multipliers presented in the previous section qualitatively. Let us first consider the LSB-first multipliers. A close observation of the multiplier in figure 1 shows that if successive inputs are applied such that their LSBs are zeros in operand A, the corresponding top rows of the multiplier will be turned off, as the evaluated partial products would all be zeros. Any input which has $a_0 = 1$ will place the vector $B$ at the output of the first row of partial product outputs. These values will propagate downwards even if the next LSBs in A are all zeros. Hence, switching activity can only be reduced if successive inputs applied at the input A ensure that when a bit $a_j$ is 1, all $a_i$'s are zeros for $i < j$. If the B inputs are such that M MSB bits are zeros, then the cells $c_{i,j}$ such that $j = i + k$ for $k = 0, 1, \ldots, L-1$ along the diagonal (columns of partial product generators) in the cellular array are all turned off. Hence, there are no sum or carry output transitions in these cells. Consequently, low overall switching activity can be ensured if the inputs applied to this multiplier are ordered so that they cause less switching. Similar observations are made for the MSB-first and hybrid multipliers. The "best" input conditions for these multipliers are summarized in table I and can be verified by a careful study of figures 1-3.
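The qualitative argument about rows being turned off can be explored with a word-level Python sketch that records the (sum, carry) words at the output of every row and counts bit flips between successive inputs. This is a simplified behavioral model under stated assumptions: each row is one carry save addition of the shifted operand B, an ordering is a list naming the bit of A handled by each row, and no capacitance weighting is applied. It is not the cell-level simulation used in the paper, and the names are hypothetical.

```python
def row_states(a, b, order):
    """(sum, carry) words at the output of each row when rows are added in `order`."""
    s, c, states = 0, 0, []
    for r in order:
        pp = (b << r) if (a >> r) & 1 else 0
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
        states.append((s, c))
    return states


def toggles_per_row(pairs, order):
    """Bit flips of the sum and carry words of each row over successive (a, b) inputs."""
    counts = [0] * len(order)
    prev = None
    for a, b in pairs:
        cur = row_states(a, b, order)
        if prev is not None:
            for k, ((s0, c0), (s1, c1)) in enumerate(zip(prev, cur)):
                counts[k] += bin(s0 ^ s1).count("1") + bin(c0 ^ c1).count("1")
        prev = cur
    return counts


# Operand A has its three LSBs equal to zero in every sample.
pairs = [(0b10101000, 0x5A), (0b01111000, 0xC3), (0b11001000, 0x3C)]
print(toggles_per_row(pairs, list(range(8))))           # LSB-first ordering
print(toggles_per_row(pairs, list(range(7, -1, -1))))   # MSB-first ordering
```

In the LSB-first ordering the three top rows never switch for these inputs, whereas in the MSB-first ordering the zero-valued LSB rows sit at the bottom and still see the accumulated sum and carry words change, which mirrors the observation that values propagate downwards once an upper row is active.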
To characterize the behavior of these multipliers quantitatively, we use two signal models, which are described next.
A. Signal Models
[Table I: "best" input signal conditions for the multipliers presented, together with their overhead cells and wiring. The table body did not survive extraction.]
In the first model we only vary the signal strength to determine the switching characteristics. Hence, successive samples of the signals are assumed to be uncorrelated and drawn from a uniform distribution. It has been shown in [10] that the switching activity in the LSB-first RP multiplier primarily depends on the input signal strength. Hence, we apply all possible combinations of fixed signal strengths in an N-bit multiplier by sweeping the space of possible signal strengths at the two inputs. We obtain data for these points by generating samples comprising i bits from a uniform distribution, where i is varied from 1 to N. The N x N possible combinations of signal strengths of the two operands are obtained by applying a signal of strength i bits as operand A and j bits as operand B, where $i, j = 1, 2, \ldots, N$. This model will be referred to as the U model, and it can be used to assess the merits of using the presented multipliers for signals which can be represented by N or fewer bits and/or which can be re-quantized by discarding some LSBs without significantly degrading the system performance.
The second model generates correlated signals from a zero mean Gaussian distribution. These samples are represented using sign-magnitude (SM) number representation and only the magnitude of the number is applied at the inputs of the multiplier. The signal correlation in operand A is represented by $\rho_A$ and the correlation in B is represented by $\rho_B$. Four situations arise by considering all possible combinations of high and low correlations in the signals at the two inputs. The high correlation value is considered to be 0.95, and the low correlation equal to 0. This model will be referred to as the G model.
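The two signal models can be sketched in Python as follows. Several details are assumptions made only for illustration, since the text does not fix them: an i-bit uniform sample is taken to be uniform on 0 to 2^i - 1, the correlated Gaussian sequence is produced by a first-order autoregressive recursion with lag-one correlation rho, the standard deviation is set to one third of full scale, and magnitudes are clipped to the available bits. The function names are hypothetical.

```python
import math
import random


def u_model(bits, count, rng):
    """Uncorrelated samples, assumed uniform on [0, 2**bits - 1]."""
    return [rng.randrange(1 << bits) for _ in range(count)]


def g_model(mag_bits, rho, count, rng):
    """Magnitudes of a zero mean Gaussian sequence with lag-one correlation rho.

    Assumption: an AR(1) recursion x[k] = rho * x[k-1] + sqrt(1 - rho**2) * w[k]
    is used, which keeps the variance constant for |rho| < 1.
    """
    sigma = (1 << mag_bits) / 3.0            # assumed scaling, so clipping is rare
    limit = (1 << mag_bits) - 1              # largest representable magnitude
    x = rng.gauss(0.0, sigma)
    samples = []
    for _ in range(count):
        samples.append(min(int(abs(x)), limit))   # sign dropped, magnitude only
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, sigma)
    return samples


rng = random.Random(1999)
a_samples = u_model(5, 1000, rng)            # operand A with 5-bit signal strength
b_samples = g_model(7, 0.95, 1000, rng)      # highly correlated operand B
print(max(a_samples), max(b_samples))
```

Sweeping `bits` from 1 to N for both operands would give the N x N grid of signal strengths of the first model, and the four combinations of rho equal to 0 or 0.95 at the two inputs give the four situations of the second.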
driven by the gate output). These relative weighting factors were obtained by considering the pin loading of a typical module in the array configuration. In addition, the switches at the input pins were counted separately for the given simulation and multiplied by N to account for input buffer drivers. The total switch counts at all outputs (including input pins), weighted by the corresponding factors, were summed to obtain the switching metric for the multiplier. These weightings yield a metric which expresses the total switched capacitance in the multiplier for the given input conditions.
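The weighted switch count described above can be sketched in Python. The node names and weight values below are hypothetical placeholders, since the actual weighting factors are not recoverable from the text; only the structure of the computation (toggles per output weighted by load, plus input toggles multiplied by N) follows the description.

```python
def count_toggles(trace):
    """Number of value changes in a sequence of logic values observed at one node."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)


def switching_metric(output_traces, output_weights, input_traces, n):
    """Weighted toggle count over cell outputs plus N times the input pin toggles."""
    total = sum(output_weights[name] * count_toggles(trace)
                for name, trace in output_traces.items())
    total += n * sum(count_toggles(trace) for trace in input_traces.values())
    return total


# Hypothetical traces for one cell and one input pin of an 8-bit multiplier.
outputs = {"sum": [0, 1, 1, 0], "carry": [0, 0, 1, 1]}
weights = {"sum": 2, "carry": 3}             # assumed relative load factors
inputs = {"a0": [0, 1, 0, 1]}
print(switching_metric(outputs, weights, inputs, 8))
```

Keeping the weights in a separate mapping makes it easy to substitute load factors taken from an actual cell library.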
A similar metric was obtained for the tree multiplier by using the same input signals. We will let $S_{Array}$ and $S_{Tree}$ denote the switching metrics for the array and tree multipliers, respectively, for the given input signal conditions. Then the relative advantage of using the array multiplier is defined as
$$\eta_{Tree} = \frac{S_{Tree} - S_{Array}}{S_{Array}} \times 100. \qquad (1)$$
The above quantity is expressed as a percentage and shows the advantage of using the array multiplier over a tree for the given signal condition. We will refer to this quantity as percentage switching reduction. The rationale behind this normalization is to clearly indicate the relative performance of each type of array multiplier with respect to the tree structure and to quantify the percentage reduction in switching activity for a given signal condition. A similar quantity can be obtained for comparing the relative performance of any two multipliers. Figure 4 shows one such metric computed using the LSB-first CS multiplier as the reference for normalization. The figure shows the relative advantage of using the indicated hybrid multipliers in comparison to the LSB-first CS multiplier by using $S_{LSB\text{-}First\ CS}$ in place of $S_{Tree}$ in equation 1. This quantity will be represented by $\eta_{Array}$, with the LSB-first CS multiplier serving as the base-line for comparison in array multipliers. It is noted that switching reduction of up to 200% (3X smaller) is possible when using a hybrid multiplier in comparison to the LSB-first CS multiplier, under appropriate signal conditions. The results presented in this section were obtained by using 1000 randomly generated vectors using the U model. These results give rise to a surface as a function of the number of bits in the applied inputs.
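The percentage switching reduction and the surface over the grid of signal strengths can be sketched in Python. The sketch assumes the difference of the two metrics is normalized by the array multiplier's metric, which is the reading consistent with the statement that a 200% reduction corresponds to 3X smaller switching. Returning `None` when the array metric is zero is an assumption introduced here to represent operating points where nothing switches. The metric callables are hypothetical stand-ins for simulation results.

```python
def switching_reduction_percent(s_ref, s_array):
    """Percentage by which the reference metric exceeds the array metric."""
    if s_array == 0:
        return None                      # assumed convention: no switching, no bar
    return 100.0 * (s_ref - s_array) / s_array


def reduction_surface(metric_ref, metric_array, n):
    """Grid of reductions for signal strengths i (operand A) and j (operand B), 1..N."""
    return [[switching_reduction_percent(metric_ref(i, j), metric_array(i, j))
             for j in range(1, n + 1)]
            for i in range(1, n + 1)]


print(switching_reduction_percent(300, 100))   # reference switches 3X more
print(switching_reduction_percent(100, 125))   # array switches more than reference

# Hypothetical metrics: the reference always switches three times as much.
surface = reduction_surface(lambda i, j: 3 * i * j, lambda i, j: i * j, 4)
print(surface[0])
```

Each row of `surface` corresponds to one group of bars (fixed strength at A), and each entry within a row to one bar (strength at B).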
Fig. 4. $\eta_{Array}$ for 16-bit hybrid array multipliers. Figure above: shows the percentage switching reduction for Hybrid LSB-first CS/CS with L = 1, and, figure below: shows this surface for Hybrid MSB-First RP/CS multiplier with L = 1 (normalization is performed with respect to LSB-first CS multiplier). Axes: signal strength of A and signal strength of B.
The switching reduction surface is best shown by slicing it into different regions and showing every slice individually, as in figure 4. An even better representation is to place the slices alongside each other as a bar chart, as shown in the remaining figures. The abscissa in these figures shows the number of bits in the samples (drawn from a uniform distribution) applied at the A input. The data samples were applied at the multiplier inputs by aligning their LSB with the zeroth indexed row/column. Hence, the successive simulations computed the switching metrics for inputs with increasing widths until metrics for all the grid points of the switching metric surface were computed. The metrics were normalized to obtain the relative switching reduction shown in the figures. The bars in each figure are composed of N groups. The position of a group corresponds to the number of bits in the samples applied at A. Each group, in turn, is composed of N bars. The position of a bar inside a group indicates the number of bits in the samples applied at the B input. Hence, as we scan a figure from left towards right, the strength of the input signal at the B input repeatedly increases and
falls, while the strength of the signal applied at A continually increases.
B.1 Results Using the U Signal Model
Figures 5-7 show $\eta_{Tree}$ as a function of signal strength in the LSB-first and MSB-first multipliers for N = 8, 16 and 32, respectively. We observe a consistent trend in the relative performance of each of these multipliers. Each of these multipliers gives gains in switching reduction for different operating conditions. As pointed out in table I, LSB-first multipliers would give an improvement when the LSBs of the A input, or the MSBs of the B input, are zeros. The first situation does not arise with this signal model, because it would require the MSBs to be 1s and the LSBs to be 0s. Such a signal can only be generated by quantizing (rounding/truncating) the LSBs. However, the second condition is more realistic, and we note that up to 25% reduction in switching activity is possible over the tree multiplier when the signal strength of A is high and that of B is small, as this results in the left-most columns of the multiplier turning off. Despite the overhead of the vector merge stage, the CS multiplier out-performs the RP multiplier, as is evident from a close inspection of these figures.
The MSB-first multiplier shows gains in switching activity reduction when the signal strength at the A input is low. Hence, the top-most rows do not switch. Larger gains are observed when the signal strength at the B input is large. The RP type multiplier clearly outperforms the CS multiplier because of smaller overhead cells. Further, the relative gains under favorable signal conditions are higher as compared to the LSB-first multipliers. Finally, both favorable situations appear at the inputs in the U signal model because the MSB-first multipliers reduce switching when the MSBs of both inputs are 0s (a situation which frequently arises in DSP applications). It is seen that close to 40% reduction in switching
activity is possible in the MSB-first RP multiplier when the signal strength of A is very small and B is very strong. The savings are consistent across 8, 16 and 32 bit multipliers.
We can also compare the relative performance of MSB-first and LSB-first multipliers. Figure 4 shown earlier indicates that the LSB-first multiplier out-performs the MSB-first multiplier by up to 30% when the signal strengths at both A and B inputs are very small. However, the MSB-first multiplier out-performs the LSB-first multiplier for most situations, giving a larger relative advantage in switching reduction. Note that these results favor the MSB-first type multiplier from the switching activity point of view for most common signal conditions. One may notice that many multipliers used in DSP do not need all 2N product bits (especially in floating point units), and the MSB-first multiplier is an attractive choice since by construction it also furnishes the MSB part of the product very quickly.
Fig. 5. $\eta_{Tree}$ for (left) 8-bit LSB-first array multiplier, and, (right) 8-bit MSB-first array multiplier as a function of the signal strength in the operands.
Fig. 6. $\eta_{Tree}$ for (left) 16-bit LSB-first array multiplier, and, (right) 16-bit MSB-first array multiplier as a function of the signal strength in the operands.
Fig. 7. $\eta_{Tree}$ for (left) 32-bit LSB-first array multiplier, and, (right) 32-bit MSB-first array multiplier as a function of the signal strength in the operands.
We next consider hybrid multipliers. The main objective of employing the hybrid multipliers presented in this paper is to take advantage of a signal whose L LSB bits are zeros. Such signal values may arise in many ways in typical DSP applications. As an example, computations may be organized as floating point type operations in which the normalized mantissas of the operands are multiplied using an array multiplier and values are expressed by using only a few MSB bits in the mantissa, depending on the accuracy required (variable precision arithmetic). An example of such a system is a digital filter implementation employing scaled coefficients for reducing performance degradation due to coefficient quantization [3]. Another example of truncation of a signal's L LSB bits is a situation where the resulting degradation in accuracy can be tolerated for the application at hand. Again, an example of such a system is an FIR filter whose objective is to meet given filter specifications, but whose implementation uses a multiplier which is bigger than the least number of bits required to meet these specifications [8], [9]. This situation can easily arise in general DSP implementations where shared multipliers are used for more than one application and resources are not exclusively dedicated to only one task. For these reasons the switching performance of the hybrid multipliers was computed by truncating L LSB bits of the signal and setting them to zeros. If the L LSB bits are not set to zeros, the hybrid multiplier's switching performance will lie between that of the LSB-first and MSB-first multipliers.
Next, we analyze the results shown in figures 8-10 for 8, 16 and 32-bit multipliers, respectively. The figures on the left show the results for multipliers with L = 1, and the figures on the right show the results obtained for multipliers with L = 2. It is seen that hybrid MSB-first and hybrid LSB-first multipliers show improvement in performance for different signal conditions. The former shows most improvement when the signal strength is small for A and large for B. The latter shows most gains when the converse is true. The reduction in switching activity is more pronounced in the hybrid LSB-first multiplier despite the overhead of cells required to ensure correct operation. This is due to the fact that
the L LSB bit truncated signals obtained through the U model are more effective in turning off a larger part of the multiplier, since the LSB-first part of the multiplier precedes the MSB-first part in the former case. Significant reduction in switching activity is achieved in both cases. Further, the trends are consistent for all sizes of multiplier.
Fig. 8. $\eta_{Tree}$ for 8-bit hybrid array multipliers as a function of the signal strength in the operands. (left) L = 1, and, (right) L = 2.
Fig. 9. $\eta_{Tree}$ for 16-bit hybrid array multipliers as a function of the signal strength in the operands. (left) L = 1, and, (right) L = 2.
Figure 11 shows $\eta_{Tree}$ for 8 and 16-bit multipliers, respectively, with L = 3. Figure 12 shows $\eta_{Tree}$ for 16 and 32-bit multipliers, respectively, for hybrid multipliers with L = 4. The missing bars indicate that the A operands under the indicated signal conditions were zeros (small power, large truncation). Hence, no operations are necessary. However, the region of switching reduction moves to the mid-region of A signal power. The relative switching activity reduction becomes larger as L increases. The trends are consistent for all hybrid LSB-first CS/CS and hybrid MSB-first RP/CS multipliers for all sizes and values
of L. The reduction in switching activity under favorable signal conditions is as large as 40%.
Fig. 10. $\eta_{Tree}$ for 32-bit hybrid array multipliers as a function of the signal strength in the operands. (left) L = 1, and, (right) L = 2.
Fig. 11. $\eta_{Tree}$ for 8-bit (left) and 16-bit (right) hybrid array multipliers with L = 3 as a function of the signal strength in the operands.
The relative performance of a large multiplier for small and large values of L is shown in figure 13. This figure shows $\eta_{Tree}$ for 32-bit hybrid multipliers for L = 1 (figures on left) and L = 8 (figures on right), respectively. Since the indicated truncation for small signal strength completely annihilates its value, the first seven groups of bars are missing in the figures on the right. No operations are necessary in this region of operation and no switching activity results in the hybrid multiplier if such operands are applied. Switching reductions of up to 35% are achievable in the L = 8 case in comparison to about 30% for the L = 1 case. Although the results shown in figures 8-13 suggest the superiority of the hybrid LSB-first multiplier over hybrid MSB-first multipliers, one must remember that the decision to choose the best multiplication scheme is dependent on the input signal conditions. The relative switching
Fig. 12. η_Tree for 16-bit (left) and 32-bit (right) hybrid array multipliers with L = 4 as a function of the signal strength in the operands.
Fig. 13. η_Tree for 32-bit hybrid array multipliers with L = 1 (left) and L = 8 (right) as a function of the signal strength in the operands.
activity reduction also depends on the choice of multiplier used in normalization in equation 1. Figure 14 demonstrates this point by showing η for 32-bit hybrid array multipliers with L = 2 and L = 4, respectively. Notice that the switching activity reduction in the hybrid MSB-first multiplier, although smaller than its counterpart, is more consistent as the signal strength of A varies. Hence, for a given application, the hybrid MSB-first multiplier may be preferred despite its generally inferior performance relative to the hybrid LSB-first multiplier.
B.2 Switching Activity for Correlated Signals
We now consider the performance of the presented multipliers using the model. For this purpose we applied data samples obtained from a Gaussian distribution for different signal strengths varying from 1 to - 1 bits. Four situations were chosen to reflect the effect of correlation in the signal by considering
Fig. 16. η_Tree for (left) 8-bit MSB-first CS, and (right) 8-bit MSB-first RP array multipliers as a function of the signal strength in the operands. Values of ρ_A and ρ_B are shown with each plot.
strength of A increases. The effect of increasing p~ is an "equalization" of η_Tree at A. However, the differences are very small.
In the case of MSB-first multipliers shown in figure 16, we notice that higher p~ causes an "equalized" η_Tree for small signal strengths of A. Hence, better gains are obtained as signal strength of A increases, and these gains drop quickly as A becomes stronger. The effect of p~ is not discernible. Similar results are seen in figures 17 and 18, which show the effect of correlated signals on the performance of hybrid multipliers. In all these examples, the effect of p~ is negligible; however, high p~ causes the gains to equalize in the region where the hybrid multiplier out-performs the tree multiplier.
It is noted that considering extremely high correlations does not make much sense, because a better approach in this case is to difference the data and reduce its dynamic range. Hence, by adding the overhead of an add operation, one can significantly reduce the size of the operands in multiplication. The results shown in this section clearly indicate that signal correlations have a small effect on the switching activity for all multipliers. It is actually the signal strength at the inputs which almost completely determines the switching in the multiplier. This is confirmed by a similar observation made for LSB-first RP multipliers in [10].
B.3 Area Comparison
The LSB-first CS and MSB-first CS multipliers were implemented in CMOS using 0.6 μm technology. Both of these structures were implemented after inverter elimination simplifications for the partial product generator rows [4]. Cells were implemented for both non-inverted and inverted outputs [4], and the bottom-most row constituted a vector merge adder for converting CS format to regular representation. The layout areas of the two multipliers are shown in table II for purpose of comparison. MSB-first CS adds a wiring overhead which results in an increased area. This is because the carry signal must be propagated one cell
Fig. 17. η_Tree for hybrid array multipliers as a function of the signal strength in the operands. (left) Hybrid LSB-first ... Values of ρ_A and ρ_B are shown with each plot.
Fig. 18. η_Tree for 8-bit hybrid array multipliers as a function of the signal strength in the operands. (left) Hybrid LSB-first CS/CS with L = 2, and (right) Hybrid MSB-first RP/CS with L = 2. Values of ρ_A and ρ_B are shown with each plot.
further in a rectangular layout. These values can be used to approximately estimate the area overhead of using hybrid multipliers.
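The differencing remark made for highly correlated signals in Section B.2 can be illustrated with a small numerical sketch. It is not taken from the paper's experiments: the first-order autoregressive (AR(1)) signal model, the parameter values, and the helper names `ar1_samples`, `magnitude_bits`, and `undo_difference` are all assumptions chosen for illustration. The sketch generates a strongly correlated Gaussian sequence, differences it, and compares the number of magnitude bits needed before and after.

```python
import math
import random


def ar1_samples(n, rho, sigma, seed=0):
    """Integer-rounded Gaussian AR(1) samples (assumed signal model).

    rho is the lag-1 correlation and sigma the stationary standard deviation.
    """
    rng = random.Random(seed)
    innovation_sigma = sigma * math.sqrt(1.0 - rho * rho)
    x = rng.gauss(0.0, sigma)
    samples = []
    for _ in range(n):
        samples.append(round(x))
        x = rho * x + rng.gauss(0.0, innovation_sigma)
    return samples


def magnitude_bits(values):
    """Bits needed to hold the largest absolute value in the sequence."""
    return max(abs(v) for v in values).bit_length()


def undo_difference(first, diffs):
    """Rebuild the original sequence with one add per sample."""
    rebuilt = [first]
    for d in diffs:
        rebuilt.append(rebuilt[-1] + d)
    return rebuilt


# Hypothetical parameters: a very highly correlated signal.
samples = ar1_samples(n=5000, rho=0.999, sigma=1000.0)
diffs = [cur - prev for prev, cur in zip(samples, samples[1:])]

print("bits before differencing:", magnitude_bits(samples))
print("bits after differencing: ", magnitude_bits(diffs))
```

Under these assumptions the differenced sequence occupies fewer magnitude bits than the original, and `undo_difference` shows the single add per sample that the text refers to as the overhead of this approach.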
IV. APPLICATION TO
In the previous sections, we have provided a qualitative as well as quantitative assessment of the switching activity reduction which can be obtained by using the proposed multiplier structures for various signal conditions. These results can assist in the design for low power, as they show the relative strengths and weaknesses of different multiplier architectures. In this section, we will briefly discuss the application of this work to low power quantization, reconfigurable computing and high-level synthesis for low power.
TABLE II
LAYOUT AREA IN (μm)² OF LSB-FIRST AND MSB-FIRST CS MULTIPLIERS FOR VARIOUS N x N-BIT MULTIPLIERS IN 0.6 μm TECHNOLOGY.
N      LSB-first CS    MSB-first CS    Overhead
8            63,508          84,948       33.8%
12          138,040         178,406       29.3%
16          241,073         306,009       26.9%
24          532,982         663,954       24.6%
32          939,007       1,158,672       23.3%
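The percentage column of Table II can be cross-checked from the two area columns. The sketch below is only an illustration: it assumes that the smaller area in each row belongs to the LSB-first CS multiplier and the larger to the MSB-first CS multiplier (consistent with the statement that MSB-first CS adds wiring overhead), and the names `AREAS` and `overhead_percent` are hypothetical.

```python
# Layout areas transcribed from Table II, in square micrometres.
# Assumed column order: (LSB-first CS area, MSB-first CS area).
AREAS = {
    8: (63_508, 84_948),
    12: (138_040, 178_406),
    16: (241_073, 306_009),
    24: (532_982, 663_954),
    32: (939_007, 1_158_672),
}


def overhead_percent(lsb_area, msb_area):
    """Relative area increase of MSB-first CS over LSB-first CS, in percent."""
    return 100.0 * (msb_area - lsb_area) / lsb_area


overheads = {n: overhead_percent(*pair) for n, pair in AREAS.items()}

for n, pct in overheads.items():
    print(f"N = {n:2d}: area overhead = {pct:.1f}%")
```

The recomputed values agree with the tabulated percentages to within about 0.1 percentage point, and the relative overhead shrinks as N grows.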
B. Reconfigurable Computing
The cellular array structure presented in section II is the most general template using which any array multiplier can be formed. In applications where reconfigurability is sought, one may use the underlying structure proposed in this paper to form any of the N! possible multiplier architectures. It is noted that reconfigurability desired specifically for reduction of switching activity may not achieve that goal because of the overheads involved. In general, these overheads reduce the speed of the application as well as increase the overhead power. However, for specific applications where the structure of the data stream is well known, a reconfigurable multiplier may be employed which eliminates the undesired rows of the multiplier (to form an appropriate hybrid multiplier) in order to increase the speed of multiplication. In such a case, the interpretation of array multipliers presented in section II and the template described in figure 1 can prove to be extremely useful.
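A behavioural sketch may help make the N! count and the row-elimination idea concrete. It rests on an assumption not spelled out in this section: that each array architecture corresponds to one order in which the N partial-product rows are accumulated. The function `array_multiply` is a hypothetical software model for unsigned operands, not a description of the cell-level template of section II.

```python
from itertools import permutations
from math import factorial


def array_multiply(a, b, row_order):
    """Accumulate the partial-product rows of a * b in the given row order.

    Row i is (a AND bit i of b) shifted left by i. Rows that are absent from
    row_order are treated as eliminated from the array.
    """
    acc = 0
    for i in row_order:
        bit = (b >> i) & 1
        acc += (a * bit) << i
    return acc


N = 4
orders = list(permutations(range(N)))  # one entry per assumed architecture

# Every row ordering produces the same product; only the internal
# signal flow (and hence the switching activity) would differ in hardware.
products = {array_multiply(11, 13, order) for order in orders}
print(len(orders), products)

# If the two LSBs of b are known to be zero, rows 0 and 1 can be dropped.
reduced = array_multiply(11, 12, row_order=(3, 2))
print(reduced)
```

The model only checks functional equivalence. It says nothing about delay or switching activity, which depend on the cell-level structure analysed earlier in the paper.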
We presented several new array multiplier architectures for reducing switching activity in general digital signal processing applications. A general cellular structure was presented which can be used to obtain any array multiplier suitable for the given application. This structure provides a unified view of all N! possible N-bit array multipliers. The switching activity at the output nodes of the cells in various multiplier structures was analyzed and compared with a tree multiplier based on 4:2 compressors as well as an LSB-first CS array multiplier. It was shown that the relative improvement in power is a function of the statistical properties of the input signals. It was also shown that selection of an appropriate array architecture can give up to 40% reduction in switching activity compared to a tree multiplier, and more than 3 times reduction in switching activity compared to the widely used LSB-first array multiplier for commonly occurring situations. We also outlined applications of the proposed multipliers and the presented results to the areas of low power quantization, reconfigurable computing and high-level synthesis for low power. Hence, the proposed architectures can prove to be extremely useful structures for low power DSP system design.
[1] E. E. Swartzlander, "Computer Arithmetic," IEEE Computer Society Press, 1990.
[2] S. Haykin, "Adaptive Filter Theory," Prentice Hall, NJ, 1996.
[3] J. G. Proakis and D. G. Manolakis, "Digital Signal Processing: Principles, Algorithms, and Applications," McMillan
[7] J. K. Jain, L. Song and K. K. Parhi, "Efficient Semisystolic Architectures for Finite-Field Arithmetic," IEEE Trans.
In Proc. of 1997 IEEE International Conference on Computer Design (ICCD '97), pp. 196-201, Austin, Texas.
[9] K. Muhammad and K. Roy, "Low Power Digital Filters Based On Constrained Least Squares Solution," In Proc. of the 31st Asilomar Conference on Signals, Systems and Computers, 1997, Monterey, California. Invited Paper.
[10] M. Lundberg, K. Muhammad, K. Roy and S. K. Wilson, "High-level Modeling of Switching Activity With Application to Low-power DSP System Synthesis," To appear in the 1999 Proc. IEEE International Conference On Acoustics,