Optimization of Area and Power in Feed Forward Cut Set Free MAC Unit Using EXOR Full Adder and 4:2 Compressor
V. Mohanapriya1, S. Purushothaman2, S. Tamilarasi3, P. Vinitha4
1PG Student, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu (India).
2Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu (India).
3Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu (India).
4Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu (India).
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract: MAC (Multiply Accumulate) computation plays an important role in Digital Signal Processing (DSP). The MAC is a common step that computes the product of two numbers and adds that product to an accumulator. Generally, a pipelined architecture is used to improve performance by reducing the length of the critical path. However, pipelining requires many additional flip-flops, which reduces the efficiency of the MAC and increases its power consumption. Based on a property of machine learning algorithms, this paper proposes a feedforward-cutset-free (FCF) pipelined MAC architecture specialized for a high-performance machine learning accelerator, and also proposes the new design concept of MFCF_PA using a column addition stage with the 4:2 compressor. The proposed design method reduces the area and the power consumption by decreasing the number of flip-flops inserted for pipelining, compared with the existing pipelined architecture for MAC computation. Finally, the proposed feedforward-cutset-free pipelined MAC architecture is implemented in VHDL, synthesized in Xilinx, and compared in terms of area, power, and delay reports.

Keywords: Hardware accelerator, Machine Learning, Multiply–Accumulate (MAC) unit, Pipelining.

1. INTRODUCTION

In a machine learning accelerator, a large number of multiply–accumulate (MAC) units are included for parallel computations, and the timing-critical paths of the system are often found in these units. A multiplier typically consists of several computational parts, including partial product generation, column addition, and a final addition. An accumulator consists of a carry-propagation adder. Long critical paths through these stages degrade the performance of the overall system. To minimize this problem, various methods have been studied. The Wallace [5] and Dadda [6] multipliers are well-known examples for achieving fast column addition, and the carry-lookahead (CLA) adder is often used to reduce the critical path in the accumulator or in the final addition stage of the multiplier. Meanwhile, a MAC operation is performed in the machine learning algorithm to compute a partial sum, i.e., the accumulation of the inputs multiplied by the weights. In a MAC unit, the multiply and accumulate operations are usually merged to reduce the number of carry-propagation steps from two to one [7]. Such a structure, however, still has a long critical path delay, approximately equal to the critical path delay of a multiplier.

It is well known that pipelining is one of the most popular approaches for increasing the operating clock frequency. Although pipelining is an efficient way to reduce critical path delays, it increases the area and the power consumption because of the many inserted flip-flops. In particular, the number of flip-flops tends to be large because flip-flops must be inserted along a feedforward cutset to ensure functional equality before and after pipelining. The problem worsens as the number of pipeline stages increases.

The main idea of this paper is that the feedforward-cutset rule can be relaxed in a MAC design for machine learning applications, because only the final value of the large number of multiply–accumulations is used. In other words, unlike the conventional usage of a MAC unit, intermediate accumulation values are not used here; hence, they do not need to be correct as long as the final value is correct. Under this condition, the final value is correct if each binary input of the adders inside the MAC participates in the calculation once and only once, irrespective of the cycle. Therefore, it is not necessary to set an exact pipeline boundary. Based on this idea, this paper proposes a feedforward-cutset-free (FCF) pipelined MAC architecture for a high-performance machine learning accelerator.
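As a hedged illustration of this "only the final value matters" property (a behavioral model of our own, not the paper's VHDL design), the sketch below accumulates products in redundant carry-save form. The intermediate sum/carry words are not the true running total, yet the final carry-propagate step recovers the exact result, because every operand enters the 3:2 reduction exactly once:

```python
# Behavioral sketch: carry-save accumulation. The intermediate pair (s, c)
# is NOT the true running sum, but (s + c) mod 2**width always is, because
# each input participates in the reduction once and only once. This mirrors
# the FCF argument that intermediate values need not be individually correct.

def carry_save_accumulate(values, width=32):
    """Accumulate integers in redundant sum/carry form; resolve at the end."""
    mask = (1 << width) - 1
    s = c = 0
    for v in values:
        # 3:2 carry-save step: XOR gives the sum bits, majority gives carries.
        s, c = (s ^ c ^ v) & mask, (((s & c) | (s & v) | (c & v)) << 1) & mask
    # Final carry-propagate addition resolves the redundant representation.
    return (s + c) & mask

data = [(3, 5), (7, 2), (4, 9), (6, 6)]
assert carry_save_accumulate([a * b for a, b in data]) == sum(a * b for a, b in data)
```

Note that after each loop iteration, `s` alone may differ from the true partial sum; only the combination `s + c` is meaningful, which is exactly why exact pipeline boundaries between the redundant stages can be relaxed.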
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 943
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 05 | May 2020 www.irjet.net p-ISSN: 2395-0072
[Figure: Partial-product array of the 8×8 multiplier, where each term a_i b_j is the AND of bit i of input A and bit j of input B; the array is divided into pipeline stages for the column addition.]
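The 8×8 partial-product array can be modeled behaviorally. The following sketch (illustrative, with our own function names, not the paper's VHDL) generates the a_i b_j terms and reduces them by column addition, confirming that summing each weight column reproduces the full product:

```python
# Illustrative model of partial-product generation and column addition
# for an 8x8 multiplier (function names are ours, not the paper's).

def partial_product_array(a, b, n=8):
    """Row j holds the bits a_i AND b_j, each of weight 2**(i + j)."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for i in range(n)]
            for j in range(n)]

def column_addition(rows, n=8):
    """Add each weight column; integer addition propagates the carries."""
    product = 0
    for col in range(2 * n - 1):
        # Bits of weight 2**col come from row j, bit i with i + j == col.
        column_sum = sum(rows[j][col - j] for j in range(n) if 0 <= col - j < n)
        product += column_sum << col
    return product

a, b = 183, 105  # 0b10110111 * 0b01101001
assert column_addition(partial_product_array(a, b)) == a * b
```

In hardware, the column sums are of course not computed with integer adders but reduced stage by stage with full adders and 4:2 compressors, which is where the Dadda-style reduction saves area and delay.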
3.5 Advantages

- Less area and a shorter critical path delay when using the concept of the Dadda multiplier.
- Low power consumption.
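The building blocks behind these advantages can be sketched behaviorally. Below is one common way to construct a 4:2 compressor from two cascaded full adders, with the full adder's carry expressed as a MUX on the XOR (propagate) signal; this is a generic model under our own naming, and the paper's exact EXOR/MUX circuit may differ:

```python
# Hedged gate-level sketch: XOR/MUX-style full adder and a 4:2 compressor
# built from two full adders (one common construction). Bits are 0/1 ints.
from itertools import product

def full_adder(a, b, cin):
    """Sum from two XORs; carry from a MUX selected by the propagate bit."""
    p = a ^ b
    return p ^ cin, (cin if p else a)   # (sum, carry)

def compressor_4_2(x1, x2, x3, x4, cin):
    """Two cascaded full adders, satisfying the weight identity
    x1 + x2 + x3 + x4 + cin == s + 2*(carry + cout)."""
    s1, cout = full_adder(x1, x2, x3)   # cout does not depend on cin,
    s, carry = full_adder(s1, x4, cin)  # so compressor chains don't ripple
    return s, carry, cout

# Exhaustive check of the weight identity over all 32 input patterns.
for bits in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```

The key property, visible in the code, is that `cout` depends only on x1..x3 and never on `cin`, so a row of 4:2 compressors has no horizontal carry ripple; this is what makes the compressor attractive for the column addition stage.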
4.3 Area Report

4.4 Simulation Output

Fig -10: Simulation output of the MAC unit using the 4:2 compressor

5. CONCLUSION

We introduced the FCF pipelining method in this paper. In the proposed scheme, the number of flip-flops in a pipeline can be reduced by relaxing the feedforward-cutset constraint, thanks to the unique characteristic of the machine learning algorithm. We applied the FCF pipelining method to the accumulator (FCF-PA) design, and then optimized the power dissipation of FCF-PA by reducing the chance of undesired data transitions (MFCF-PA). The proposed scheme was also expanded and applied to the MAC unit (FCF-MAC). For the evaluation, the conventional and proposed MAC architectures were synthesized in a 65-nm CMOS technology. The proposed accumulator reduced the area and the power consumption by 17% and 19%, respectively, compared with the accumulator based on the conventional CLA adder. In the case of the MAC architecture, the proposed scheme reduced both the area and the power by 20%. In the future, we will design a MAC unit using MFCF_PA with the 4:2 compressor and the XOR-MUX full adder, and compare it with conventional full adder designs. We believe that the proposed idea of utilizing the unique characteristic of 4:2 compressor computation for a more efficient MAC design can be adopted in many hardware accelerator designs.

6. REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[3] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2013, pp. 6645–6649.
[4] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[5] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electron. Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964.
[6] L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza, vol. 34, no. 5, pp. 349–356, Mar. 1965.
[7] P. F. Stelling and V. G. Oklobdzija, "Implementing multiply-accumulate operation in multiplication time," in Proc. 13th IEEE Symp. Comput. Arithmetic, Jul. 1997, pp. 99–106.
[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New Delhi, India: Wiley, 1999.
[9] T. T. Hoang, M. Själander, and P. Larsson-Edefors, "A high-speed, energy-efficient two-cycle multiply-accumulate (MAC) architecture and its application to a double-throughput MAC unit," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 12, pp. 3073–3081, Dec. 2010.
[10] W. J. Townsend, E. E. Swartzlander, and J. A. Abraham, "A comparison of Dadda and Wallace multiplier delays," Proc. SPIE, Adv. Signal Process. Algorithms, Archit., Implement. XIII, vol. 5205, pp. 552–560, Dec. 2003, doi: 10.1117/12.507012.