Low Power Datapath Architecture for Multiply-Accumulate (MAC) Unit
18th 2019
Abstract: This paper presents a high-speed, low-power datapath architecture for a Multiply-Accumulate (MAC) unit built from a Baugh-Wooley multiplier and 4:2 compressors. A MAC generally contains a Partial Product (PP) generation unit and a reduction tree in the first stage, followed by an accumulation stage consisting of an adder with sign extension. Operand sizes of 8, 16 and 32 bits are evaluated on a MAC architecture that performs a multiplication and an accumulation operation. A lower operating frequency for the proposed architecture allows the gates to be downsized within the available timing slack, resulting in reductions in power, delay, and area. We propose a new compressor-based MAC architecture that efficiently performs either multiply-accumulate or multiply operations on N-bit operands. The proposed architecture was implemented with the Xilinx ISE PlanAhead 14.7 tool and Cadence RTL Compiler, and results are compared for the ripple carry adder, carry save adder and Brent-Kung adder.
I. INTRODUCTION
The MAC unit is a common digital block and a power-efficient architecture widely used in digital signal processing (DSP) applications. A MAC architecture can be used efficiently in many filters, OFDM algorithms, and other algorithms that require FIR or FFT/IFFT computations [6]. A MAC architecture basically contains a multiplier and an accumulate adder stage, as shown in Fig. 1. Here, the inputs are fed into the multiplier, and their consecutive products are summed by the accumulate adder [2].
To achieve a two-cycle MAC operation in a VLSI system design, we chose a Baugh-Wooley multiplier in order to increase the computational speed of the DSP processor. We designed this MAC unit with a Baugh-Wooley multiplier and with a modified version of it (with compressors), in order to reduce the critical path delay by inserting an extra pipeline register either inside the Partial Product (PP) unit or between the PP unit and its adder. With this MAC design we obtain some improvement in area, delay, speed and power compared to the reference design [3]. The most important concern for portable devices is battery life, which influences real-time processing applications.
It is therefore time to explore the demanding criteria of low power, low area and high performance for DSP chips. An efficient 4:2 compressor is used in the proposed MAC architecture in order to minimize the area, delay and power consumption of the multiplier design. To optimize the MAC datapath, the reduction tree of the PP unit in the Baugh-Wooley algorithm can be replaced with high-speed compressors. On the other hand, the MAC architecture has a special feature, namely product sign extension, located in the subsequent stage together with the accumulate adder, and the
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on August 11,2024 at 17:29:20 UTC from IEEE Xplore. Restrictions apply.
delay depends on the Partial Product (PP) unit, but the delays of these two stages are comparable. Here, both stages remain fast with larger operand sizes, which allows the accumulate adder to accommodate extra guard bits in the second stage.
The proposed MAC architecture offers a number of advantages in terms of power, area, speed, delay and energy. Minimal interconnection between the multiplier gates that produce the sum path reduces the interconnect delay. The proposed MAC architecture needs no carry-propagate adder for the computation.
Compared with a basic existing three-cycle architecture, our architecture allows us to insert the compressor while eliminating both an adder and one pipeline register level without degrading the speed, because it uses short interconnects.
A. 4:2 Compressor.
The 4:2 compressors have been used extensively in high-speed multipliers. Because of their regular interconnection, 4:2 compressors are ideal for building regular architectures. The main function of the compressor is to reduce the delay of partial-product accumulation in the multiplier. Compressors are generally called column compressors, i.e. these circuits can add 3, 5 or 7 bits at a time. A regular 4:2 compressor, as shown in Fig. 6, takes four inputs and one intermediate Cin input and generates one Sum bit, one Cout bit and one more intermediate Carry bit. Hence, high-speed multipliers use 4:2 compressors to lower the latency of the PP reduction stage; the compressors also help to decrease delay, area and power, which enhances the performance of the overall system.
carry-out path due to the rippling of carries through the full adder stages.
D. Brent-Kung Adder.
The Brent-Kung adder is a parallel prefix adder built from gray cells and black cells. It is a high-performance carry-tree adder in which the propagate and generate signals are pre-computed. Because the carry tree shortens the delay through the carry path, these tree adders are faster than other adders. The Brent-Kung adder consumes less area and has the maximum logic depth. A parallel prefix adder uses gray cells and black cells (generate and propagate signals are used in both cells) operating on the half-adder outputs. These cells compute the carry-out of each bit stage, and the carries are then used to find the sum of that stage. Since we designed a 32-bit adder, the number of stages is 9. Fig. 7 and Fig. 8 show the 4-bit and 16-bit Brent-Kung adders; the fan-out is minimized at each bit stage and is also reduced in further stages.
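For the Brent-Kung adder of subsection D, the prefix structure and the stage count quoted for 32 bits can be illustrated in software. The Python sketch below is only a behavioural illustration under stated assumptions, not the RTL used in this work: the function name `brent_kung_add` is hypothetical, the operand width is assumed to be a power of two, and no carry-in is modelled. Every prefix cell here updates both the group generate and group propagate signals (black-cell behaviour); in hardware, cells whose propagate output is unused would be gray cells.

```python
def brent_kung_add(a, b, n=32):
    """Add two n-bit unsigned integers with a Brent-Kung style prefix tree.

    Returns (sum mod 2**n, carry_out, number_of_prefix_levels).
    Assumes n is a power of two and that there is no carry-in.
    """
    assert n >= 2 and n & (n - 1) == 0, "n must be a power of two"
    # Pre-processing (half-adder outputs): bitwise propagate and generate.
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    levels = 0

    def combine(i, d):
        # Prefix cell: merge the group ending at bit i with the group ending at i - d.
        G[i] |= P[i] & G[i - d]
        P[i] &= P[i - d]

    d = 1
    while d < n:  # up-sweep: build group signals over 2, 4, 8, ... bits
        for i in range(2 * d - 1, n, 2 * d):
            combine(i, d)
        levels += 1
        d *= 2

    d = n // 4
    while d >= 1:  # down-sweep: fill in the remaining prefixes
        for i in range(3 * d - 1, n, 2 * d):
            combine(i, d)
        levels += 1
        d //= 2

    # The carry into bit i is the group generate of bits 0..i-1.
    carries = [0] + G[:-1]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[n - 1], levels
```

With `n=32` the sketch uses five up-sweep and four down-sweep levels, which matches the nine stages stated for the 32-bit design.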
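The behaviour of the 4:2 compressor of subsection A (four inputs plus Cin, producing Sum, Carry and Cout) can be sketched as two cascaded full adders. This is a common textbook realisation and is assumed here for illustration; the gate-level structure of Fig. 6 may differ, and the names `full_adder`, `compressor_4_2` and `check_compressor` are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor modelled as two cascaded full adders.

    Returns (sum, carry, cout). In this model cout does not depend on cin,
    so a row of compressors does not ripple a carry along the row.
    """
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def check_compressor():
    """Check that the five input bits equal sum + 2 * (carry + cout)."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True
```

The check expresses the defining property of the cell: the five input bits of one column are reduced to one bit of the same weight and two bits of the next higher weight.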
compressor based MAC units are also verified as per the standard design methodology of Cadence RTL Compiler. The Baugh-Wooley multiplier with compressor and the multiplier with different adders are synthesized for the Spartan 3 family to tabulate area, delay, speed and power. The simulation and synthesis results are shown below, and the compressor-based MAC results are compared with those of the adders in terms of the above parameters.
Fig. 10: RTL schematic representation for 32-bit MAC with compressor.
Fig. 12: Simulation result of Baugh-Wooley multiplier with Brent-Kung adder.
TABLE II. SYNTHESIS AND COMPARISON RESULTS OF MAC UNIT WITH DIFFERENT ADDERS.
Fig. 9 and Fig. 10 show the simulation results and the RTL schematic of the 32-bit MAC unit based on the Baugh-Wooley multiplier with compressor. Both signed and unsigned results were obtained from Xilinx, and the RTL schematic from the Cadence tool. Table I shows the comparison of the MAC unit with and without the compressor in the Baugh-Wooley multiplier. By using the 4:2 compressor, we can reduce the delay, area and power compared to the conventional Baugh-Wooley multiplier.
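When checking signed simulation results of this kind, a small software reference model can be helpful. The Python sketch below is an illustration under assumptions, not the authors' design: it forms the two's-complement product using the usual Baugh-Wooley arrangement of partial products (complemented cross terms involving one sign bit, plus two constant correction bits), then adds the product to an accumulator that is wider than the product. The function names and the choice of 8 guard bits are hypothetical; the source does not state the guard-bit count.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def baugh_wooley_multiply(a, b, n):
    """Signed n-bit by n-bit product built from Baugh-Wooley partial products."""
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]
    total = 0
    # Ordinary AND terms between the magnitude bits.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (a_bits[i] & b_bits[j]) << (i + j)
    # Sign bit times sign bit.
    total += (a_bits[n - 1] & b_bits[n - 1]) << (2 * n - 2)
    # Cross terms with exactly one sign bit enter complemented.
    for i in range(n - 1):
        total += (1 - (a_bits[i] & b_bits[n - 1])) << (i + n - 1)
        total += (1 - (a_bits[n - 1] & b_bits[i])) << (i + n - 1)
    # Constant correction bits at positions n and 2n - 1.
    total += (1 << n) + (1 << (2 * n - 1))
    return to_signed(total, 2 * n)


def mac(acc, a, b, n, guard_bits=8):
    """One multiply-accumulate step: acc + a * b, wrapped to 2n + guard_bits bits.

    guard_bits=8 is an assumed value chosen only for this example.
    """
    product = baugh_wooley_multiply(a, b, n)
    return to_signed(acc + product, 2 * n + guard_bits)
```

For example, starting from `acc = 0` and calling `mac(acc, a, b, 8)` for successive operand pairs reproduces the running sum of signed products as long as it fits in the 24-bit accumulator assumed here.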
Fig. 11: RTL schematic representation for Baugh-Wooley multiplier with Brent-Kung adder.
V. CONCLUSION
An efficient adder- and compressor-based datapath MAC architecture has been verified for 8, 16 and 32 bits. We thus propose a new high-speed, low-power MAC architecture that is efficient in area and delay and that improves on the existing conventional architecture by replacing the reduction tree with 4:2 compressors. The proposed low-power datapath MAC architecture performs both operations and yields better results in terms of area, delay and power in synthesis with Xilinx and Cadence RTL Compiler.
VI. FUTURE SCOPE
In future work, the above compressor-based MAC architecture can be verified at circuit level in both the FPGA and ASIC domains, which would help to optimize the computational results. In the FPGA domain, the logic gates are mapped to Look-Up Tables (LUTs), so that the placement and routing of the logic can be evaluated.
ACKNOWLEDGMENT
I would like to thank and acknowledge the unbridled enthusiasm of my guide Dr. Narendra C P, Associate Professor, Department of ECE, BIT, Bangalore, for his encouragement and valuable guidance throughout my project paper. I would like to thank my external guide Chandra Mohan U, Chief Technology Officer, Banashree Renewable Energy System PVT, Ltd., Bangalore, for providing his continuous support and suggestions. I would also like to thank my parents and friends for their support and advice.
REFERENCES
Networks (CICN), 2011 International Conference on, vol., no., pp.754,757, 7-9 Oct. 2011.
[12] Noel Daniel Gundi, “Implementation of 32 bit Brent Kung Adder using Complementary Pass Transistor Logic”, thesis approved - June 2008.
[13] Shubham Gogoria, Karthikeyan A, “Implementation of Baugh-Wooley Algorithm and Compressors in signed Multipliers”, International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 4, Issue 8, August 2015.