A Reconfigurable Architecture of A High Performance 32-Bit MAC Unit For Embedded DSP
Abstract:
This paper describes a reconfigurable architecture of a high-performance
pipelined 32-bit Multiply-Accumulate (MAC) unit, which is designed for a
powerful embedded Digital Signal Processor (DSP). The MAC unit we design
can carry out two 16-bit multiplications in one clock cycle. The 32x16,
32x32, 32x16+80 and 32x32+80 operations can be implemented in two clock
cycles. These characteristics allow the DSP to be applied efficiently in
different situations. A 2-stage pipeline is designed for this MAC unit to
reach high throughput. This MAC is synthesizable and has already been used
in an embedded DSP core.

I. Introduction
Nowadays, real-time multimedia systems for speech, video, and image
processing are in dire need of powerful DSPs. In speech codecs such as
G.729b, GSM-AMR, MP3 and AC-3, 16-bit MAC operations are required. But in
wireless/handheld devices and many other applications of DSP chips, 32-bit
and even wider computation should be available. As the key building block
of a DSP, the multiply-accumulate (MAC) unit must be designed with new
methods to reach not only the high speed and low power requirements, but
also high efficiency for different applications. In this paper, an
efficient reconfigurable architecture of a high-performance 32-bit MAC
unit is described. This MAC can carry out two 16-bit multiplications in
one cycle. The 32x16, 32x32, 32x16+80 and 32x32+80 operations are also
supported and can be implemented in two clock cycles.
Compared with a traditional MAC unit, this reconfigurable architecture is
innovative in the following two aspects:
1. Two 16-bit multipliers with modified Booth arithmetic and Wallace Tree
structure are combined to implement 32-bit multiplication. Thus 16-bit and
32-bit computations can both be handled by the MAC efficiently.
2. A 2-stage global pipeline is added in the MAC unit to reach a high data
throughput. In the first global pipeline stage, a self-timed local
pipeline is designed to accelerate the partial product generation of
32-bit multiplication and MAC operation, which can decrease the latency by
25%.
This paper is organized as follows: Section 2 gives the architecture
specification and operation of the reconfigurable MAC unit. Section 3
describes the components of the MAC unit in detail, including the
Booth-Wallace tree multiplier and the accumulator. Section 4 shows how the
global and local pipelines work in this MAC, including the self-timed
clock generation system. Section 5 presents the summary of this MAC unit.

II. Architecture
A reconfigurable architecture is designed for handling 16-bit and 32-bit
computation simultaneously. Fig 1 gives its complete architecture. The
main building blocks are two 16-bit Booth-Wallace Tree multipliers and an
accumulator, containing a 16-bit carry look-ahead adder (CLA) and a
four-input adder with Wallace Tree compressor, which uses carry select
adders (CSA). The inputs of the MAC are a 32-bit multiplicand A, a 32-bit
multiplier B and an 80-bit accumulator C. The results are the IO-bit MP1
and MP2 and an 80-bit R that comes from the MAC operation. MP1 and MP2 are
the two multiply results when the MAC unit carries out two 16-bit
multiplications at the same time. R is the result of the 32x16, 32x32,
32x16+80 and 32x32+80 operations. The
dash line in the middle of Fig 1 breaks the two pipeline stages.

The N-bit operands A and B can each be split into a high half and a low
half:

$$A = (A_{N-1} \ldots A_{N/2} A_{N/2-1} \ldots A_0)_2 = A_1 \cdot 2^{N/2} + A_0$$
$$B = (B_{N-1} \ldots B_{N/2} B_{N/2-1} \ldots B_0)_2 = B_1 \cdot 2^{N/2} + B_0 \qquad (1)$$

Where:

$$A_1 = (A_{N-1} \ldots A_{N/2})_2, \quad A_0 = (A_{N/2-1} \ldots A_0)_2$$
$$B_1 = (B_{N-1} \ldots B_{N/2})_2, \quad B_0 = (B_{N/2-1} \ldots B_0)_2$$

Then, the multiplication can be denoted as:

$$A \cdot B = A_1 B_1 \cdot 2^{N} + 2^{N/2} (A_1 B_0 + A_0 B_1) + A_0 B_0 \qquad (2)$$

As equation 2 shows, an N x N operation can be achieved mainly by four
N/2 x N/2 multiplications, together with additional shift and add
operations. So, we can use two 16-bit multipliers twice to generate the
partial products (PPs) needed. These partial products are shifted and
added to achieve the final result. To balance the pipeline, the two passes
of 16-bit multiplications, A1*B1, A0*B0, A1*B0 and A0*B1, are implemented
in the first pipeline stage. The final addition of the 4 products and the
accumulator is completed in the second pipeline stage.
Because an n-stage pipeline can speed up the circuit n times, a 2-stage
local pipeline is used in the first global pipeline stage. We use 2-stage
pipelined 16-bit multipliers and set them working in pipeline mode. This
method can decrease the PPs generation delay by 25% while adding just a
little internal register area. The local pipeline control will be
described in detail in Section 4.

$$A = (-1) A_{N-1} 2^{N-1} + \sum_{i=0}^{N-2} A_i 2^i = \sum_{i=0}^{N/2-1} (A_{2i-1} + A_{2i} - 2 A_{2i+1}) \cdot 2^{2i} \qquad (3)$$

where $A_{-1} = 0$.
According to equation 3, the Booth algorithm divides the multiplicand
every 3 bits (with one bit overlapped) and encodes them as table 1 shows.
The five outputs {0, B, -B, 2B, -2B} denote the 5 different multiples of
the multiplier B. So, the number of PPs of the 16x16 unsigned/signed
multiplication can be reduced by half.
The Wallace Tree is a reduction architecture that uses the carry save
adder technique to handle the addition of the PPs. A carry save adder
accepts three inputs with the same weight, $x_i$, $y_i$ and $z_i$, and
generates two outputs: $s_i$, with the same weight, and $c_{i+1}$, with
double the weight. A row of these adders constitutes a 3-2 compressor, so
the addition time can be reduced considerably. For an n-bit multiplier,
there will be about $\log_{3/2} n$ levels of carry save adders in it.
In the 16-bit unsigned/signed multiplier, 9 partial products are sent into
a 9-to-2 Wallace Tree Compressor, and its 2 outputs are taken by the Final
Adder to achieve the final product.

[...] extension of sign bits of the PPs of A1*B0 and A0*B1, which is only
one row according to the MAC operation mode (with the operands being
unsigned-unsigned, unsigned-signed, signed-signed and signed-unsigned
respectively). The low 16 significant bits of the PP of A0*B0 and those of
C are sent to the 16-bit CLA adder as two addends. As shown in Fig 2, the
16-bit CLA adder produces a carry out, and its sum forms the low 16
significant bits of the final result.

[...] lines represents the first global pipeline stage. The two thin dash
lines in this zone divide this global stage into 3 local pipeline stages,
which are controlled by the self-timed clock from the sub-region clock
generator.
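
As a rough illustration of the decomposition in equation 2, the following Python sketch builds an unsigned 32x32+80 operation from four 16x16 products, using a 16-bit multiplier model twice per pair. The names `mul16` and `mac32`, the restriction to unsigned operands, and the pairing of products in each pass are assumptions made for illustration; they are not taken from the design itself.

```python
MASK16 = (1 << 16) - 1
MASK80 = (1 << 80) - 1


def mul16(x, y):
    """Hypothetical stand-in for one 16x16 unsigned multiplier."""
    assert 0 <= x <= MASK16 and 0 <= y <= MASK16
    return x * y


def mac32(a, b, c=0):
    """Unsigned 32x32+80 built from four 16x16 products (equation 2, N = 32)."""
    a1, a0 = a >> 16, a & MASK16   # A = A1 * 2^16 + A0
    b1, b0 = b >> 16, b & MASK16   # B = B1 * 2^16 + B0

    # First pass through the two 16-bit multipliers (pairing is an assumption).
    p11, p00 = mul16(a1, b1), mul16(a0, b0)
    # Second pass through the same two multipliers.
    p10, p01 = mul16(a1, b0), mul16(a0, b1)

    # Shift and add the four partial products, then accumulate C.
    product = (p11 << 32) + ((p10 + p01) << 16) + p00
    return (product + c) & MASK80  # the result R wraps at 80 bits
```

For example, `mac32(3, 5, 7)` evaluates 3*5 + 7. Signed operand modes would additionally need sign extension of the cross products, which this sketch leaves out.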
Figure 3. The pipeline of 32x32 and 32x32+80 operations

The sub-region clock generator is described in figure 4.
The sub-region clock is generated by dummy cells which are connected like
a ring oscillator. The delay of the dummy cell, closely matched to 1/2 of
the multiplier and internal register delay, decides the period of the
local clock. It also decides the global clock frequency. Because the [...]

[...] period and stop the clock after the final products carry out.
Synthesized with the SMIC 0.18um process library, and after placement and
routing, this self-timed sub-region clock generator works stably and
reaches about 660MHz.

V. Summary
To meet the requirement of high throughput and high efficiency of a
powerful SIMD DSP, a high-performance and multi-function 32-bit
multiply-accumulate unit has been designed. It can implement two 16x16
operations simultaneously, or one 32x16, one 32x32, one 32x16+80 or one
32x32+80 operation, on unsigned/signed operands. In the multiplier, Booth
encoding and Wallace Tree partial product compression have been used, and
a 2-stage local pipeline and sub-region clock generator are embedded in
the global pipeline. This MAC unit has been synthesized and simulated to
confirm the correctness of this architecture. After synthesis, placement
and routing, this reconfigurable MAC can work stably at 220MHz.
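
The Booth encoding and Wallace Tree compression used in the multiplier can be modeled at word level as follows. This is only a sketch under stated assumptions: all function names are hypothetical, the operand A is extended by two bits so that signed and unsigned inputs both yield nine Booth digits (consistent with the nine partial products mentioned in the text), and the compressor is a simple layering of 3-2 carry save adders on whole words rather than the gate-level tree of the actual design.

```python
def to_value(x, width, signed):
    """Interpret the low `width` bits of x as a signed or unsigned integer."""
    x &= (1 << width) - 1
    if signed and x >> (width - 1):
        x -= 1 << width
    return x


def booth_radix4_digits(a, width=16, signed=True):
    """Recode A into radix-4 Booth digits, each in {-2, -1, 0, 1, 2}."""
    ext = width + 2                        # extra bits so unsigned inputs fit
    bits = to_value(a, width, signed) & ((1 << ext) - 1)
    bits <<= 1                             # append A_{-1} = 0 below the LSB
    digits = []
    for i in range(ext // 2):
        group = (bits >> (2 * i)) & 0b111  # A_{2i+1}, A_{2i}, A_{2i-1}
        hi, mid, lo = group >> 2, (group >> 1) & 1, group & 1
        digits.append(-2 * hi + mid + lo)
    return digits


def carry_save_add(x, y, z):
    """3-2 compression: x + y + z == s + c, with no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1
    return s, c


def compress_to_two(rows, mask):
    """Reduce a list of rows to two rows using layers of carry save adders."""
    rows = list(rows)
    while len(rows) > 2:
        nxt = []
        full = len(rows) - len(rows) % 3   # rows that form complete triples
        for i in range(0, full, 3):
            s, c = carry_save_add(*rows[i:i + 3])
            nxt += [s, c & mask]
        rows = nxt + rows[full:]           # leftover rows pass through
    return rows


def booth_wallace_mul16(a, b, a_signed=True, b_signed=True):
    """16x16 multiply via Booth partial products and carry-save compression."""
    mask = (1 << 32) - 1
    b_val = to_value(b, 16, b_signed)
    digits = booth_radix4_digits(a, 16, a_signed)
    # Each digit selects 0, +/-B or +/-2B, weighted by 4^i.
    rows = [((d * b_val) << (2 * i)) & mask for i, d in enumerate(digits)]
    x, y = compress_to_two(rows, mask)
    total = (x + y) & mask                 # final carry-propagate addition
    return to_value(total, 32, a_signed or b_signed)
```

With nine rows, `compress_to_two` goes through 9, 6, 4, 3 and finally 2 rows, so only the last addition needs a carry-propagate adder.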
