Design of High-Speed Area-Efficient MAC Unit Using Reversible Logic
V Guravaiah
Assistant Professor, Department of ECE, VLSI System Design, Avanthi Institute of Engg. & Tech., India.
Email: [email protected]
Abstract: We propose a low-power, high-speed pipelined multiply-accumulate (MAC) architecture. In a conventional MAC, carry propagations of additions (including additions in multiplications and additions in accumulations) often lead to large power consumption and large path delay. To resolve this problem, we integrate a part of the additions into the partial product reduction (PPR) process. In the proposed MAC architecture, the addition and accumulation of the higher significance bits are not performed until the PPR process of the next multiplication. To correctly deal with the overflow in the PPR process, a small-size adder is designed to accumulate the total number of carries. Compared with previous works, experimental results show that the proposed MAC architecture can greatly reduce both power consumption and circuit area under the same timing constraint.

Index Terms: MAC architecture, bits, adder, power consumption

Manuscript received Oct 10, 2022; Revised Oct 25, 2022; Accepted Nov 4, 2022

I. INTRODUCTION

The multiply-accumulate (MAC) unit is a fundamental block for digital signal processing (DSP) applications. In recent years especially, the development of real-time edge applications has become a design trend. Thus, there is a strong demand for high-speed, low-power MAC units. A conventional MAC unit is composed of two individual blocks: a multiplier and an accumulator (i.e., an accumulate adder). An N-bit MAC unit includes an N-bit multiplier and a (2N+α-1)-bit accumulator (adder), where α is the number of guard bits used to avoid overflow (caused by long sequences of multiply-accumulate operations). Many previous works have optimized the multiplier and the adder separately.

A multiplier usually has three steps. The first step is the partial product generation (PPG) process. For example, AND gates can be used to generate a partial product matrix (PPM) for an unsigned multiplication. The second step is the partial product reduction (PPR) process. By using the Dadda tree approach or the Wallace tree approach, the PPM can be reduced to two rows. The third step is the final addition. An adder (called the final adder) is used to perform the summation of the final two rows.

A. Multiplier

Multiplication is a fundamental operation in most signal processing algorithms. Multipliers have large area and long latency, and they consume considerable power. Therefore, low-power multiplier design is an important part of low-power VLSI system design. The performance of a system is generally determined by the performance of the multiplier, because the multiplier is generally the slowest and most area-consuming element in the system. Hence, optimizing the speed and area of the multiplier is one of the major design issues. However, area and speed are usually conflicting constraints so
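The three multiplier steps named in the introduction (PPG, PPR, final addition) can be illustrated with a small bit-level model. The following is only a sketch under stated assumptions: it handles unsigned operands, it groups rows three at a time in a Wallace-like fashion rather than following an exact Wallace or Dadda schedule, and the function names `partial_products`, `reduce_to_two_rows` and `multiply` are hypothetical, not taken from the paper.

```python
def partial_products(x: int, y: int, n: int) -> list[int]:
    """PPG: row i is (x AND bit i of y), shifted left by i positions."""
    return [(x if (y >> i) & 1 else 0) << i for i in range(n)]


def reduce_to_two_rows(rows: list[int]) -> tuple[int, int]:
    """PPR: apply 3:2 carry-save compression until at most two rows remain.

    Each group of three rows is replaced by a sum row and a carry row
    whose total is unchanged, so no carry is propagated along a row.
    """
    rows = list(rows)
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3      # rows covered by whole groups
        nxt = []
        for j in range(0, full, 3):
            a, b, c = rows[j:j + 3]
            nxt.append(a ^ b ^ c)                           # sum bits
            nxt.append(((a & b) | (a & c) | (b & c)) << 1)  # carry bits
        nxt.extend(rows[full:])               # leftover rows pass through
        rows = nxt
    rows += [0] * (2 - len(rows))             # pad when n < 2
    return rows[0], rows[1]


def multiply(x: int, y: int, n: int) -> int:
    """Unsigned n-bit multiply: PPG, PPR, then one final addition."""
    row_a, row_b = reduce_to_two_rows(partial_products(x, y, n))
    return (row_a + row_b) & ((1 << (2 * n)) - 1)   # final adder
```

Only the last line performs a carry-propagate addition; everything before it works on independent bit columns, which is why the final adder dominates the carry chain in this organization.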
International Journal of Engineering Innovations in Advanced Technology
ISSN: 2582-1431 (Online), Volume-4 Issue-4, December 2022
3. Final addition
Figure 1. MAC architecture
II. EXISTING METHOD

In this section, we present the proposed two-stage (i.e., two-cycle) MAC architecture. The first stage performs the PPG process, the PPR process (based on the PPM that combines the PPG result and the accumulation result), the (2N-k-1)-bit addition (i.e., a part of the final addition) and the α-bit addition (for dealing with the overflow in the PPR process). Then, the second stage performs the (k+α)-bit addition to produce the accumulation result.

The main features of the proposed architecture are as follows. First, to reduce the lengths of carry propagations, we integrate a part of the additions into the PPR process. Second, to handle overflow in the PPR process, an α-bit adder is used to count the total number of carries. Third, by applying the gating technique, the second stage needs to be executed only in the last cycle (of the entire sequence of multiply-accumulate operations), which saves power.

The proposed two-stage pipeline MAC unit is displayed in Fig. 2. Our PPM (for the PPR process) is composed of two PPMs: one PPM is derived from the PPG and the other PPM is derived from the accumulation. For an unsigned MAC unit, AND gates can be used directly in the PPG process to generate the PPM. For a signed MAC unit, because the influence of the sign bit must be taken into account, several PPG algorithms have been proposed to generate the signed PPM. In the proposed architecture, the Baugh-Wooley algorithm is adopted in the PPG process to generate the signed PPM.

We have implemented a tool (a C++ program) to automatically generate the proposed N-bit MAC as a Verilog RTL description. The user can specify the value of N and the value of k for automatic generation, where k denotes the number of higher significance bits whose additions (accumulation) are not performed in the final addition. Note that the value of k is equal to the bit width of register REG2. In our experiments, we set N to 16 (i.e., a 16-bit MAC). We also assume that the maximum number of multiplications in each multiply-accumulate operation is 256. Thus, the number of guard bits (i.e., the value of α) is set to 8. We have implemented several different configurations of the proposed MAC architecture. For convenience of presentation, we name each configuration Ours_k, where k represents the bit width of register REG2. In our experiments, these Verilog RTL descriptions are synthesized to gate-level netlists targeting the TSMC 40 nm cell library by using Synopsys Design Compiler.

For comparison, we also implemented the following two MAC architectures: the conventional MAC architecture and the state-of-the-art MAC architecture. In the conventional MAC architecture, the MAC unit is composed of two individual blocks (i.e., a multiplier and an accumulator). In the state-of-the-art MAC architecture, on the other hand, the multiplier and the accumulator are tightly integrated (i.e., the product is sent to the accumulator in carry-save format without first being added into a single vector).
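The bit widths quoted in this section follow from N, k and the maximum sequence length, and the arithmetic can be checked with a few lines of code. The sketch below assumes that α is the ceiling of the base-2 logarithm of the maximum number of accumulations, which matches the stated case of 256 multiplications giving α = 8. The helper names and the value k = 4 in the usage note are hypothetical, since the text does not list the k values that were evaluated.

```python
def guard_bits(max_accumulations: int) -> int:
    """Assumed rule: alpha = ceil(log2(max_accumulations)), in exact integer math."""
    if max_accumulations < 1:
        raise ValueError("max_accumulations must be at least 1")
    return (max_accumulations - 1).bit_length()


def mac_widths(n: int, k: int, max_accumulations: int) -> dict[str, int]:
    """Adder widths of the two-stage MAC, using the formulas stated in the text."""
    alpha = guard_bits(max_accumulations)
    return {
        "alpha": alpha,
        "accumulator": 2 * n + alpha - 1,        # (2N + alpha - 1)-bit accumulator
        "stage1_final_adder": 2 * n - k - 1,     # (2N - k - 1)-bit addition
        "stage1_carry_counter": alpha,           # alpha-bit addition for PPR overflow
        "stage2_adder": k + alpha,               # (k + alpha)-bit addition
    }
```

For example, `mac_widths(16, 4, 256)` reports a 39-bit accumulator, a 27-bit first-stage addition and a 12-bit second-stage addition. The first-stage and second-stage adder widths always sum to the accumulator width, which shows that the two stages together cover the whole accumulation result.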
The systolic array has been widely used in hardware acceleration for matrix multiplication. In recent years, several research efforts have been made to map the inference of a convolutional neural network to a systolic array. Note that a systolic array is composed of multiple processing elements (PEs). Each PE corresponds to a MAC unit. In this section, we address the application of the proposed MAC architecture to a systolic array. The figure gives the block diagram of the PE based on the conventional MAC architecture. Note that the PE is a two-stage (i.e., two-cycle) pipeline design. The inputs of the PE are x and y. The block MUL denotes the multiplier. In the first stage, the multiplier performs the multiplication. Then, the output of the multiplier is stored in a register. In the second stage, the accumulator performs the accumulation. Then, the accumulation result is stored in register result.

going for this algorithm. Compared with all previous algorithms, this algorithm produces the optimized output. There are two cases: semi-carry-save addition and full-carry-save addition. In semi-carry-save addition, the given inputs are in binary and only the intermediate outputs are in carry-save format. In full-carry-save addition, both the inputs and the intermediate outputs are in carry-save format. On comparison, the semi-carry-save approach is the more advantageous one because it has only one carry-save level and hence offers the smaller area and higher speed required for designing VLSI-based multipliers.

Consider the modulus N to be a k-bit odd number, and define an extra factor R as $2^k \bmod N$, where $2^{k-1} \le N < 2^k$. Given two integers a and b, where a, b
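The definitions of the k-bit odd modulus N and the factor R are those used by the radix-2 Montgomery algorithm. Since the text breaks off before stating the algorithm itself, the sketch below assumes the standard textbook bit-serial form with operands in the range 0 to N-1; the function name `montgomery_multiply` is hypothetical.

```python
def montgomery_multiply(a: int, b: int, n: int) -> int:
    """Return a * b * 2^(-k) mod n, where k is the bit length of the odd modulus n.

    Assumes 0 <= a, b < n (textbook radix-2 form, not taken from the paper).
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be an odd number of at least 3")
    if not (0 <= a < n and 0 <= b < n):
        raise ValueError("operands must lie in the range 0..n-1")
    k = n.bit_length()                 # guarantees 2^(k-1) <= n < 2^k
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1             # scan a from the least significant bit
        q = (s + a_i * b) & 1          # quotient bit: makes the next sum even
        s = (s + a_i * b + q * n) >> 1 # exact division by 2, no remainder
    if s >= n:                         # s stays below 2n, so one subtraction suffices
        s -= n
    return s
```

Because each call introduces a factor of 2^(-k), operands are usually first mapped to the Montgomery domain by multiplying them with R. A product of two mapped operands then stays in that domain, and a final call with 1 as the second operand maps the result back.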
III. PROPOSED METHOD

In data transmission applications, public-key cryptosystems are widely used, and the Montgomery multiplication algorithm offers a simple, efficient, low-cost and high-performance way to carry out their encryption and decryption processes. The Montgomery multiplier receives and outputs data in binary representation and uses only a one-level carry-save adder (CSA) to avoid carry propagation at each addition operation. This CSA is also used to perform operand pre-computation and format conversion from the carry-save format to the binary representation, leading to a low hardware cost and a short critical path delay at the expense of extra clock cycles for completing one modular multiplication. To overcome this weakness, a configurable CSA (CCSA), which can act as one full adder or two serial half adders, is proposed to halve the extra clock cycles for operand pre-computation and format conversion. A modular multiplier built with the CCSA technique still has some drawbacks, namely its critical path delay and high power consumption. To overcome these drawbacks, the CCSA is replaced with a PASTA (Parallel Self-Timed Adder) in the Montgomery modular multiplier. The PASTA adder can achieve lower power consumption.

In the existing system, carry-save addition with the semi-carry-save approach is used. The multiplicands are not recycled; that is, only the multiplicand that needs to be multiplied at a given time is used to determine the output. The carry-save approach is highly beneficial because it is the basis of a Montgomery modular multiplier. With the semi-carry-save type, only a one-level carry-save adder is implemented, which can be two serial half adders or one full adder, depending on the requirement. This reduces the number of clock cycles and hence the delay. The output is thus optimized, and the design can be implemented in Verilog.
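To make the semi-carry-save idea concrete, the sketch below models a Montgomery multiplier whose inputs and output are binary while the running result is kept as a sum word and a carry word updated by a one-level CSA. Several details are assumptions made for illustration and are not confirmed by the text: the value b + n is pre-computed with an ordinary addition, the operand added in each iteration is chosen from four candidates (as a 4-to-1 multiplexer would do), and format conversion repeats half-adder steps until the carry word is zero. The names `csa` and `scs_montgomery` are hypothetical.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """One-level carry-save adder: sum_word + carry_word == x + y + z."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def scs_montgomery(a: int, b: int, n: int) -> int:
    """a * b * 2^(-k) mod n with the intermediate result held in carry-save form.

    Assumes an odd modulus n >= 3 and binary operands with 0 <= a, b < n.
    """
    if n < 3 or n % 2 == 0 or not (0 <= a < n and 0 <= b < n):
        raise ValueError("need odd n >= 3 and operands in 0..n-1")
    k = n.bit_length()
    d = b + n                              # assumed pre-computation, done once
    ss, sc = 0, 0                          # sum word and carry word
    for i in range(k):
        a_i = (a >> i) & 1
        q = (ss ^ sc ^ (a_i & b)) & 1      # parity of the three least significant bits
        x = (0, n, b, d)[2 * a_i + q]      # equals a_i * b + q * n
        ss, sc = csa(ss, sc, x)            # no carry propagation here
        ss, sc = ss >> 1, sc >> 1          # both words have a zero LSB, so this halves exactly
    while sc:                              # format conversion back to binary
        ss, sc = ss ^ sc, (ss & sc) << 1   # one half-adder step per pass
    return ss - n if ss >= n else ss
```

The carry-propagate work is confined to the conversion loop at the end, which is the source of the extra clock cycles mentioned above.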
The above architecture is the semi-carry-save based Montgomery multiplier, in which the loop is reduced compared with the existing one. It consists of two multiplexers, one multiplier, one configurable carry-save adder, flip-flops, a skip detector and a zero detector.

The figure illustrates the block diagram of the proposed semi-carry-save multiplier. It is first used to precompute the four-to-two carry-save additions. Then the required multiplication can be performed. The modulus N and the inputs are fed into the two multiplexers. The resulting partial product is then passed to the multiplier. The partial outputs then enter the configurable carry-save adder, where the carry-save addition is performed, and they are stored temporarily in the flip-flops. When another partial output is produced, it is stored in the flip-flops as well. The skip detector skips multiplications that are not required, so as to reduce the number of clock cycles. The partial product from SM3 is passed to the multiplexers M4 and M5, then to the flip-flops for temporary storage, and then to the skip detector. The output is obtained from the semi-carry-save stage. This process is repeated until the output is obtained. The zero detector is used to detect zero wherever that is required. The complexity is much lower than that of the previous design.

Quotient pre-computation: $A_{i+1}$, $A_{i+2}$ and $q_{i+1}$, $q_{i+2}$ should be known in advance, so that the unwanted steps in iteration (i+1) can be reduced while iteration i is being determined; the quotients are therefore pre-computed. Another method is to use the skip detector, which pre-computes these values. Because this lengthens the path through the multiplier, that path has to be minimized. As the modulus N is an odd number, it can be used directly in the multiplication, which is time-consuming.

To increase the speed of operation, we replace the CSA with a PASTA (Parallel Self-Timed Adder) in the proposed architecture.
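The operating principle commonly associated with a parallel self-timed adder can be modelled as a bank of half adders that is iterated until no carry remains. The sketch below is a behavioural model only, written under that assumption: it works on non-negative Python integers of unbounded width, it ignores the multiplexers and completion-detection circuitry of a real asynchronous design, and the function name `pasta_add` is hypothetical.

```python
def pasta_add(a: int, b: int) -> tuple[int, int]:
    """Add two non-negative integers using only half-adder steps.

    Returns (sum, iterations), where iterations counts the feedback
    passes needed after the initial half-adder phase.
    """
    if a < 0 or b < 0:
        raise ValueError("operands must be non-negative")
    s = a ^ b                  # initial phase: sum bit of each half adder
    c = (a & b) << 1           # carry out of bit i feeds bit i + 1
    iterations = 0
    while c != 0:              # completion: all carries have died out
        s, c = s ^ c, (s & c) << 1
        iterations += 1
    return s, iterations
```

Independent operands with short carry chains finish in few passes, while a long carry chain such as 7 + 1 needs one pass per rippling position; `pasta_add(7, 1)` returns the sum 8 after three passes. In every pass a bit position produces either a sum bit or a carry out, never both, because a half adder cannot output 1 on both lines.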
proposed architecture. Each state is represented by a $(C_{i+1}, S_i)$ pair, where $C_{i+1}$ and $S_i$ represent the carry-out and sum values, respectively, from the i-th bit adder block. During the initial phase, the circuit works merely as a combinational half adder (HA) operating in fundamental mode. Because HAs are used instead of full adders (FAs), state (11) cannot appear.

The proposed architecture for Montgomery modular multiplication using the PASTA adder consists of one one-level Parallel Self-Timed Adder (PASTA), two 4-to-1 multiplexers (M1 and M2), one simplified multiplier SM3, one skip detector Skip_D, one zero detector Zero_D, and six registers. The zero detector Zero_D is used to detect whether SC is equal to zero. Skip_D is composed of four XOR gates, three AND gates, one NOR gate, and two 2-to-1 multiplexers; it is used to detect unnecessary multiplication operations.

The design has been implemented in Verilog using Xilinx tools. For further verification, the design can also be carried out using Cadence. Its behaviour can be clearly understood from the waveform shown below. The design has reduced area and delay compared with the other multipliers. The method has been implemented using a configurable carry-save adder in order to show that its maximum delay is lower than that of the others. The delay and area are minimized as far as possible in comparison with previous architectures.

IV. CONCLUSION

This paper presents a low-power, high-speed two-stage pipeline MAC architecture for real-time DSP applications. Our basic idea is to integrate a part of the additions (including a part of the final addition in the multiplication and a part of the addition in the accumulation) into the PPR process. As a result, the critical path delays and power dissipation caused by carry propagations can be reduced. To correctly deal with the overflow during the PPR process, an α-bit accumulator is used to count the total number of carries. Experimental results consistently show that the proposed approach works well in practice. The proposed MAC architecture is applicable to the design of both unsigned and signed MAC units. Note that the only differences between the unsigned MAC unit and the signed MAC unit are the PPM structure and the α-bit addition mechanism. Moreover, the proposed MAC architecture is also applicable to the systolic array (for performing matrix multiplication). Implementation data show that, compared with the systolic array based on the conventional PE (i.e., the conventional MAC architecture), the systolic array based on the proposed PE (i.e., the proposed MAC architecture) can greatly reduce both circuit area and power consumption under the same timing constraint.

FCS-based multipliers maintain the input and output operands of the Montgomery MM in the carry-save format to avoid the format conversion, leading to fewer clock cycles but a larger area than SCS-based multipliers. The disadvantages of the existing architecture are carry propagation delay and extra clock cycles. To overcome these disadvantages, we adopt the PASTA adder. Using the PASTA adder in the Montgomery modular multiplier gives a low hardware cost, a short critical path delay, and fewer clock cycles for completing one MM operation.