Low Power Datapath Architecture For Multiply - Accumulate MAC Unit


2019 4th International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT-2019), May 17th & 18th, 2019

Low Power Datapath Architecture for Multiply-Accumulate (MAC) Unit
SPOORTHI H R
Dept. of Electronics and Communication Engineering
Bangalore Institute of Technology
Bangalore, India
[email protected]

Dr. NARENDRA C P
Dept. of Electronics and Communication Engineering
Bangalore Institute of Technology
Bangalore, India
[email protected]

CHANDRA MOHAN U
Chief Technology Officer
Banashree Renewable Energy System Pvt. Ltd.
Bangalore, India
[email protected]

Abstract: This paper presents a high-speed, low-power datapath architecture for a Multiply-Accumulate (MAC) unit using a Baugh-Wooley multiplier with 4:2 compressors. Generally, a MAC unit contains a partial product generation unit (PP unit) and a reduction tree in the first stage, followed by an accumulation stage consisting of an adder with sign extension. Operand sizes of 8, 16 and 32 bits are used with a MAC architecture that performs a multiplication and an accumulation operation. A lower operating frequency for the proposed architecture can be used to downsize the gates within the available timing slack, resulting in reductions in power, delay, and area. We propose a new MAC architecture with compressors that efficiently performs either multiply-accumulate or multiply operations on N-bit operands. The proposed architecture was realized using the Xilinx ISE PlanAhead tool 14.7 and the Cadence RTL Compiler, and a comparison is given between the ripple carry adder, the carry save adder and the Brent-Kung adder.

Keywords: Low power, multiply-accumulate, compressor, Brent-Kung adder, Cadence RTL Compiler.

I. INTRODUCTION

A DSP processor is essentially intended to support fast execution of the repetitive, numerically intensive computations of digital signal processing algorithms. The increasing prominence of portable systems and the need to limit power consumption in high-density Very Large Scale Integration (VLSI) chips have led to rapid and novel developments in low power design [1]. Low power design is becoming a primary concern in high performance digital systems such as microprocessors, digital signal processors, and other data-intensive applications. Hence, the power-efficient multiplier plays a major role in VLSI system design.

The MAC unit is a common, power-efficient digital block used in such applications. Generally, a MAC architecture can be used efficiently in many filters, OFDM algorithms, and other algorithms that require FIR or FFT/IFFT computations [6]. A MAC architecture basically contains a multiplier and an accumulate adder stage, as shown in Fig. 1. Here, the inputs are fed into the multiplier, and their consecutive products are summed by the accumulate adder [2].

To achieve a two-cycle MAC operation in a VLSI system design, we have chosen a Baugh-Wooley multiplier in order to increase the computational speed of the DSP processor. We designed this MAC unit with the Baugh-Wooley multiplier and with its modified form (with compressors) in order to reduce the critical path delay by inserting an extra pipeline register, either inside the Partial Product (PP) unit or between the PP unit and its adder. With this MAC unit design, we obtain some improvement in parameters such as area, delay, speed and power compared to the reference design [3]. The most important concern for portable devices is battery life, which influences real-time processing applications.

Fig. 1: Two-cycle MAC architecture [2].

It is now time to explore the demanding criteria of low-power, small-area, high-performance DSP chips. An efficient 4:2 compressor is used in the proposed MAC architecture in order to minimize the area, delay and power consumption of the multiplier design. To optimize the datapath of the MAC unit, the reduction tree of the PP unit in the Baugh-Wooley algorithm can be replaced with high-speed compressors. In addition, the MAC architecture has a special feature: the product sign extension is located in the subsequent stage, together with the accumulate adder and the

978-1-7281-0630-4/19/$31.00 ©2019 IEEE


Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on August 11,2024 at 17:29:20 UTC from IEEE Xplore. Restrictions apply.
saturation unit. Therefore, the feedback of the product result is obtained within the second pipeline stage.

II. METHODOLOGY

Here, we give an overview of the Baugh-Wooley algorithm and its architecture for 8, 16 and 32 bits.

The Baugh-Wooley multiplication method was developed for the direct multiplication of two's complement numbers. When two's complement numbers are multiplied directly, each of the partial products to be added is a signed number. The essential point of this multiplier is that the sign bits of the multiplier and multiplicand are handled so that all partial products can be treated as unsigned. The algorithm is implemented using conventional full adders. As an example, two 4-bit two's complement numbers are multiplied to obtain the product bits (p0-p7), as shown in Fig. 2.

Fig. 2: 4-bit Baugh-Wooley multiplier algorithm.

Many algorithms have been designed for the multiplication of two's complement numbers. One of the best is the Baugh-Wooley multiplier algorithm [9], because it allows the highest speed for the multiplier logic and has all partial products with positive sign bits only. The partial products are added together with the sign-extension terms, and reduction in carry save adders gives the sum part along with the carry part of the final product. Handling the negative sign-extension bits in this way requires less area. The Baugh-Wooley algorithm for signed 8-bit operands is shown in Fig. 3.

Fig. 3: 8-bit Baugh-Wooley multiplier algorithm.

We have designed architectures for the 16-bit and 32-bit Baugh-Wooley algorithm using a full adder unit cell, written in Verilog and implemented in the Xilinx tool. The Baugh-Wooley multiplier is used significantly in applications such as DSP processors and signals and systems, where signal density is required. This Verilog implementation is applicable to VLSI system design for obtaining the results of different mathematical operations. A multiplier can also be designed through a Graphical User Interface (GUI) at different CMOS technology levels, which allows optimization at the transistor level as a result of advances in complexity reduction [5]. The architecture of a 16-bit Baugh-Wooley multiplier [4] is shown in Fig. 4.

Fig. 4: Architecture for 16-bit Baugh-Wooley multiplier [5].

III. PROPOSED MAC ARCHITECTURE

Fig. 5: Proposed Multiply-Accumulate (MAC) architecture.

The proposed compressor-based MAC architecture is shown in Fig. 5. It performs two operations, multiplication and multiply-accumulate: the inputs are fed into REG_a and REG_b to generate the partial product terms, and these terms are stored in a pipeline register for further processing. The adder in the second stage is then replaced with a low power 4:2 compressor to obtain high-speed computation on the partial products. That stage is also combined with an RCA, a CSA or a Brent-Kung adder to perform the multiply-accumulate operation. Still, the critical path

delay depends on the Partial Product (PP) unit, but the delays of the two stages are comparable. Here, both stages remain fast because, with larger operand sizes, the accumulate adder can be extended to accommodate extra guard bits in the second stage.

The proposed MAC architecture offers a number of advantages in terms of power, area, speed, delay and energy.
• Minimal interconnection between the gates in the multiplier that produce the sum path reduces the interconnect delay.
• The proposed MAC architecture needs no carry propagate adder to compute.
Compared to a basic existing three-cycle architecture, our architecture allows us to insert the compressor while eliminating both an adder and one pipeline register level without degrading the speed, because it uses short interconnects.

A. 4:2 Compressor.
The 4:2 compressors have been used extensively in high speed multipliers. Because of their regular interconnection, 4:2 compressors are ideal for constructing regular architectures. The main purpose of the compressor is to reduce the delay of the partial product accumulation stage of the multiplier. Compressors are generally called column compressors, i.e., these circuits are able to add 3, 5 or 7 bits at a time. A regular 4:2 compressor, shown in Fig. 6, takes four inputs and one intermediate Cin input and generates one Sum bit, one Cout bit and one intermediate Carry bit. Hence, high speed multipliers use 4:2 compressors to lower the latency of the PP reduction stage; they help decrease delay, area and power, which enhances the performance of the overall system.

Fig. 6: 4:2 Compressor using Full adder [13].

B. Ripple Carry Adder.
The ripple carry adder (RCA) is formed by connecting N full adders in cascade and performs the parallel addition of N-bit operands. The delay of the RCA varies with the length of the carry propagation path.

C. Carry Save Adder.
A carry save adder (CSA) calculates the sum of two or more numbers, and its advantage is that it does not propagate the carry. It uses only a set of full adders and half adders to compute. This avoids the delay that would otherwise be induced in the carry-out path by the rippling of carries through the full adder stages.

D. Brent-Kung Adder.
The Brent-Kung adder is a parallel prefix adder built from gray cells and black cells. It is a high performance carry tree adder in which the propagate and generate signals are pre-computed. Because of the reduced delay through the carry path, tree adders are more favorable in terms of speed than other adders. The Brent-Kung adder consumes less area and has maximum depth. A parallel prefix adder uses gray cells and black cells (generate and propagate signals are used in both cells) in addition to the half adders. These cells compute the carry-out of each bit stage, and the carries are then used to find the sum of that stage. Since we have designed a 32-bit adder, the number of stages is 9. Fig. 7 and Fig. 8 show the 4-bit and 16-bit Brent-Kung adders; here the fan-out is minimized at each bit stage, and the number of further stages is also reduced.

Fig. 7: 4-bit Brent-Kung adder [9].

Fig. 8: 16-bit Brent-Kung adder [12].

IV. RESULTS AND DISCUSSIONS
The proposed MAC architecture using 4:2 compressors was designed and verified with the simulation and synthesis tools of Xilinx ISE version 14.7. The results of the compressor-based MAC units were also verified following the standard design methodology of the Cadence RTL Compiler. The Baugh-Wooley multiplier designs with compressors and the multiplier with different adders were synthesized for the Spartan 3 family to tabulate the area, delay, speed and power. The simulation and synthesis results are shown below, and the compressor-based MAC results are compared with those of the adders in terms of the above parameters.
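
The building blocks compared in this section can be cross-checked at the bit level before synthesis. The following Python sketch is a behavioral model only, not the authors' Verilog. It assumes the common construction of a 4:2 compressor from two cascaded full adders, and it models a ripple carry adder and a Brent-Kung prefix adder without a carry-in. All function names are hypothetical, and the Brent-Kung model assumes that the operand width is a power of two.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two cascaded full adders (assumed structure).

    Returns (sum, carry, cout): sum has weight 1, carry and cout have weight 2.
    """
    s1, cout = full_adder(x1, x2, x3)  # cout does not depend on cin
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


def ripple_carry_add(a, b, n):
    """n-bit ripple carry adder: the carry ripples through n full adders."""
    carry, total = 0, 0
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def brent_kung_add(a, b, n):
    """n-bit Brent-Kung prefix adder; n is assumed to be a power of two.

    Returns (sum, carry_out, number_of_prefix_levels).
    """
    # Pre-processing: per-bit generate and propagate signals.
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    levels = 0
    # Up-sweep: combine groups of 2, 4, 8, ... bits.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            G[i] |= P[i] & G[i - d]
            P[i] &= P[i - d]
        d *= 2
        levels += 1
    # Down-sweep: fill in the prefixes that the up-sweep left incomplete.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            G[i] |= P[i] & G[i - d]
            P[i] &= P[i - d]
        d //= 2
        levels += 1
    # Post-processing: sum bit i uses the carry out of bits 0..i-1.
    total = p[0]
    for i in range(1, n):
        total |= (p[i] ^ G[i - 1]) << i
    return total, G[n - 1], levels


if __name__ == "__main__":
    # Exhaustive check of the compressor identity over all 32 input patterns.
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        assert sum(bits) == s + 2 * (carry + cout)
    total, carry_out, levels = brent_kung_add(0x7FFFFFFF, 1, 32)
    print(hex(total), carry_out, levels)
```

For n = 32 the model counts 9 prefix levels (2*log2(32) - 1), which agrees with the stage count quoted for the 32-bit Brent-Kung adder in Section III-D.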
Fig. 9: Simulation result of 32-bit Baugh-Wooley multiplier with compressor.

Fig. 10: RTL schematic representation for 32-bit MAC with compressor.

Fig. 9 and Fig. 10 show the simulation result and the RTL schematic of the 32-bit Baugh-Wooley multiplier with the compressor-based MAC unit. Here, we obtained both signed and unsigned results from Xilinx, and the RTL schematic from the Cadence tool. Table I shows the comparison results of the MAC unit with and without the compressor in the Baugh-Wooley multiplier. By using the 4:2 compressor, we can reduce the delay, area and power compared to the conventional Baugh-Wooley multiplier.

Fig. 11: RTL schematic representation for Baugh-Wooley multiplier with Brent-Kung adder.

Fig. 12: Simulation result of Baugh-Wooley multiplier with Brent-Kung adder.

Fig. 11 and Fig. 12 show the RTL schematic and the simulation result of the 32-bit Baugh-Wooley multiplier with the compressor-based MAC unit and different adders. Table II shows the comparison results of the MAC unit with and without the compressor in the Baugh-Wooley multiplier, for the different adders.

TABLE I. SYNTHESIS AND COMPARISON RESULTS OF MAC UNIT IN TERMS OF AREA, DELAY, POWER.

TABLE II. SYNTHESIS AND COMPARISON RESULTS OF MAC UNIT WITH DIFFERENT ADDERS.

V. CONCLUSION
An efficient adder- and compressor-based datapath MAC architecture has been verified for 8, 16 and 32 bits. We thus propose a new high-speed, low-power MAC architecture that is efficient in area and delay and that improves on the existing conventional architecture by replacing adders with 4:2 compressors. The proposed low power datapath MAC architecture performs both operations and has yielded better results in terms of area, delay and power in synthesis with Xilinx and the Cadence RTL Compiler.
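
The two operations of the MAC unit, multiply and multiply-accumulate, can be illustrated with a short behavioral model. The Python sketch below is an assumption-laden illustration rather than the authors' design: it uses the widely used modified Baugh-Wooley formulation, in which the partial products involving exactly one sign bit are inverted and two correction constants are added. The function names and the default of 8 guard bits are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def baugh_wooley_multiply(a, b, n):
    """Signed n x n multiplication using only positive partial-product bits.

    a and b are n-bit two's complement values given as Python ints.
    Returns the 2n-bit product as a signed Python int.
    """
    a &= (1 << n) - 1
    b &= (1 << n) - 1
    total = 0
    for i in range(n):
        for j in range(n):
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            # Partial products that involve exactly one sign bit are inverted.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)
    # Correction constants of the modified Baugh-Wooley scheme.
    total += (1 << n) + (1 << (2 * n - 1))
    return to_signed(total, 2 * n)


def mac(acc, a, b, n, accumulate=True, guard_bits=8):
    """One MAC step: acc + a*b in accumulate mode, a*b in multiply mode.

    The accumulator is 2n + guard_bits wide (guard_bits is an assumed value)
    and wraps around on overflow; saturation is not modeled.
    """
    width = 2 * n + guard_bits
    product = baugh_wooley_multiply(a, b, n)
    result = acc + product if accumulate else product
    return to_signed(result, width)


if __name__ == "__main__":
    acc = 0
    for a, b in [(3, -4), (-7, -8), (100, 27)]:
        acc = mac(acc, a, b, 8)
    print(acc)
```

The model adds the partial-product bits with ordinary integer addition; in hardware, this summation is what the reduction tree (here, the 4:2 compressors) and the final adder carry out.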

VI. FUTURE SCOPE
As a future enhancement, the above compressor-based MAC architecture can be verified at the circuit level in both the FPGA and ASIC domains, which would further optimize the computational results. In the FPGA domain, the logic gates are mapped to Look-Up Tables (LUTs), and the place and route of the logic can then be evaluated.

ACKNOWLEDGMENT
I would like to thank and acknowledge the unbridled enthusiasm of my guide Dr. Narendra C P, Associate Professor, Department of ECE, BIT, Bangalore, for his encouragement and valuable guidance throughout my project paper. I would like to thank my external guide Chandra Mohan U, Chief Technology Officer, Banashree Renewable Energy System Pvt. Ltd., Bangalore, for his continuous support and suggestions. I would also like to thank my parents and friends for their support and advice.

REFERENCES

[1] Pramod S. Aswale, Mukesh P. Mahajan, Manjul V. Nikumbh, Omkar S. Vaidya, "Implementation of Baugh-Wooely Multiplier and Modified Baugh Wooely Multiplier Using Cadence (Encounter) RTL", International Journal of Science, Engineering and Technology Research (IJSETR), Volume 4, Issue 2, February 2015.
[2] T. T. Hoang, M. Själander, and P. Larsson-Edefors, "A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 57, no. 12, pp. 3073-3081, Dec. 2010.
[3] Kothakonda Lavanya, Keerthi Srilekha, "Design of Efficient Compressor & Adder Based MAC Architecture for DSP Applications", IJIRT, Volume 3, Issue 4, ISSN: 2349-6002, September 2016.
[4] Aditi Pandey, Avinash Rai, "16 Bit Implementation of Asynchronous Twos Complement Array Multiplier Using Modified Baugh-Wooley Algorithm and Architecture", Indian Journal of Computer Science and Engineering (IJCSE), Vol. 8, No. 3, Jun-Jul 2017.
[5] Shaik Noorjahan Begum, M. Rambabu, "Design and Implementation of 16-Bit Baugh-Wooley Multiplier with GDI Technology", International Journal of Advanced Scientific Technologies in Engineering and Management Sciences (IJASTEMS, ISSN: 2454-356X), Volume 2, Issue 4, April 2016.
[6] T. T. Hoang, M. Själander, and P. Larsson-Edefors, "High-speed, energy-efficient 2-cycle multiply-accumulate architecture," in Proc. IEEE Int. SOC Conf. (SOC), Sep. 2009, pp. 119-122.
[7] Yangbo Wu, Weijiang Zhang, Jianping Hu, "Adiabatic 4-2 compressors for low-power multiplier," 48th Midwest Symposium on Circuits and Systems, 2005, vol. 2, pp. 1473-1476, 7-10 Aug. 2005.
[8] A. Abdelgawad and M. Bayoumi, "High speed and area-efficient multiply accumulate (MAC) unit for digital signal processing applications," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2007, pp. 3199-3202.
[9] Rakesh Kumar, Pradeep Kumar, "An Efficient Baugh-Wooley Multiplication Algorithm for 32-bit Synchronous Multiplication", International Journal of Advanced Engineering Research and Science (IJAERS), Vol. 1, Issue 2, July 2014.
[10] Pramodini Mohanty, "An Efficient Baugh-Wooley Architecture for Signed & Unsigned Fast Multiplication", NIET Journal of Engineering & Technology, Vol. 1, Issue 2, 2013.
[11] D. Jaina, K. Sethi, R. Panda, "Vedic Mathematics Based Multiply Accumulate Unit," 2011 International Conference on Computational Intelligence and Communication Networks (CICN), pp. 754-757, 7-9 Oct. 2011.
[12] Noel Daniel Gundi, "Implementation of 32 bit Brent Kung Adder using Complementary Pass Transistor Logic", thesis approved June 2008.
[13] Shubham Gogoria, Karthikeyan A, "Implementation of Baugh-Wooley Algorithm and Compressors in signed Multipliers", International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 4, Issue 8, August 2015.