
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 07 Issue: 05 | May 2020 www.irjet.net p-ISSN: 2395-0072

Optimization of Area and Power in Feed Forward Cut Set Free MAC Unit
using EXOR Full Adder and 4:2 Compressor
V. Mohanapriya1, S. Purushothaman2, S. Tamilarasi3, P. Vinitha4
1PG Student, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu, India.
2Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu, India.
3Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu, India.
4Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu, India.

---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract: MAC (multiply–accumulate) computation plays an important role in digital signal processing (DSP). The MAC is a common step that computes the product of two numbers and adds that product to an accumulator. Generally, a pipelined architecture is used to improve performance by reducing the length of the critical path. However, the pipelined architecture requires many additional flip-flops, which reduces the efficiency of the MAC and increases the power consumption. On the basis of the machine learning algorithm, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture that is specialized for a high-performance machine learning accelerator, and also proposes the new design concept of the MFCF-PA using a column addition stage with the 4:2 compressor. The proposed design method reduces the area and the power consumption by decreasing the number of flip-flops inserted for pipelining, compared with the existing pipelined architecture for MAC computation. Finally, the proposed feed forward-cutset-free pipelined architecture for the MAC is implemented in VHDL, synthesized in Xilinx, and compared in terms of area, power, and delay reports.

Keywords: Hardware accelerator, Machine Learning, Multiply–Accumulate (MAC) unit, Pipelining.

1. INTRODUCTION

In a machine learning accelerator, a large number of multiply–accumulate (MAC) units are included for parallel computations, and the timing-critical paths of the system are often found in these units. A multiplier typically consists of several computational parts, including partial product generation, column addition, and a final addition. An accumulator consists of a carry-propagation adder. Long critical paths through these stages degrade the performance of the overall system. To minimize this problem, various methods have been studied. The Wallace [5] and Dadda [6] multipliers are well-known examples for achieving fast column addition, and the carry-lookahead (CLA) adder is often used to reduce the critical path in the accumulator or the final addition stage of the multiplier. Meanwhile, a MAC operation is performed in the machine learning algorithm to compute a partial sum that is the accumulation of the input multiplied by the weight. In a MAC unit, the multiply and accumulate operations are usually merged to reduce the number of carry-propagation steps from two to one [10]. Such a structure, however, still has a long critical path delay that is approximately equal to the critical path delay of a multiplier.

It is well known that pipelining is one of the most popular approaches for increasing the operating clock frequency. Although pipelining is an efficient way to reduce critical path delays, it increases the area and the power consumption due to the insertion of many flip-flops. In particular, the number of flip-flops tends to be large because the flip-flops must be inserted along the feed forward-cutset to ensure functional equality before and after the pipelining. The problem worsens as the number of pipeline stages is increased.

The main idea of this paper is the ability to relax the feed forward-cutset rule in the MAC design for machine learning applications, because only the final value is used out of the large number of multiply–accumulations. In other words, unlike in the conventional MAC unit, intermediate accumulation values are not used here, and hence they do not need to be correct as long as the final value is correct. Under such a condition, the final value is correct if each binary input of the adders inside the MAC participates in the calculation once and only once, irrespective of the cycle. Therefore, it is not necessary to set an accurate pipeline boundary. Based on this idea, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture for a high-performance machine learning accelerator.
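As a behavioral illustration of the MAC step described above (a Python sketch, not the paper's VHDL), the partial sum in a machine learning workload is simply the running accumulation of input-times-weight products:

```python
def mac(inputs, weights):
    """Behavioral model of the MAC: accumulate input x weight products."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w  # one multiply-accumulate per cycle
    return acc

print(mac([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

In hardware, each `acc += x * w` step is the merged multiply-and-accumulate whose carry-propagation delay the proposed pipelining targets.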
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 943

2. EXISTING SYSTEM

Recently, the deep neural network (DNN) has emerged as a powerful tool for various applications, including image classification and speech recognition. Since an enormous amount of vector-matrix multiplication is required in a typical DNN application, a variety of dedicated hardware accelerators for machine learning have been proposed to accelerate the computations. In a machine learning accelerator, a large number of multiply–accumulate (MAC) units are included for parallel computations, and the timing-critical paths of the system are often found in these units.

The main idea of this paper is the ability to relax the feed forward-cutset rule in the MAC design for machine learning applications, because only the final value is used out of the large number of multiply–accumulations. In other words, unlike in the conventional MAC unit, intermediate accumulation values are not used here, and hence they do not need to be correct as long as the final value is correct. Under such a condition, the final value is correct if each binary input of the adders inside the MAC participates in the calculation once and only once, irrespective of the cycle. Therefore, it is not necessary to set an accurate pipeline boundary. Based on this idea, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture that is specialized for a high-performance machine learning accelerator. The proposed design method reduces the area and the power consumption by decreasing the number of flip-flops inserted for pipelining.

2.1 Preliminary: Feed forward-Cutset Rule for Pipelining

It is well known that pipelining is one of the most effective ways to reduce the critical path delay, thereby increasing the clock frequency. This reduction is achieved through the insertion of flip-flops into the data path. In addition to reducing critical path delays through pipelining, it is also important to satisfy functional equality before and after pipelining. The point at which the flip-flops are inserted to ensure functional equality is called the feed forward-cutset.

Cutset: A set of the edges of a graph such that, if these edges are removed, the graph becomes disjoint.

Feed forward-cutset: A cutset where the data move in the forward direction on all of the cutset edges.

2.2 Disadvantages

 The number of inserted flip-flops increases with the number of pipeline stages.
 Larger area and high critical path delay.
 High power consumption.

3. PROPOSED SYSTEM

MAC (multiply–accumulate) computation plays an important role in digital signal processing (DSP). The MAC is a common step that computes the product of two numbers and adds that product to an accumulator. Generally, a pipelined architecture is used to improve performance by reducing the length of the critical path. However, the pipelined architecture requires many additional flip-flops, which reduces the efficiency of the MAC and increases the power consumption. On the basis of the machine learning algorithm, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture that is specialized for a high-performance machine learning accelerator. The proposed design method reduces the area and the power consumption by decreasing the number of flip-flops inserted for pipelining, compared with the existing pipelined architecture for MAC computation. Finally, the proposed feed forward-cutset-free pipelined architecture for the MAC is implemented in VHDL, synthesized in Xilinx, and compared in terms of area, power, and delay reports.

3.1 Proposed FCF Pipelining

Fig. 1 shows examples of the two-stage 32-bit pipelined accumulator (PA) that is based on the ripple carry adder (RCA). A[31:0] represents data that move from the outside to the input buffer register. AReg[31:0] represents the data that are stored in the input buffer. S[31:0] represents the data that are stored in the output buffer register as a result of the accumulation. In the conventional PA structure [Fig. 1(a)], the flip-flops must be inserted along the feed forward-cutset to ensure functional equality. Since the accumulator in Fig. 1(a) comprises two pipeline stages, the number of additional flip-flops for the pipelining is 33 (gray-colored flip-flops). If the accumulator is pipelined to n stages, the number of inserted flip-flops becomes 33(n−1), which confirms that the number of flip-flops for the pipelining increases significantly as the number of pipeline stages is increased.

Fig. 1(b) shows the proposed FCF-PA. For the FCF-PA, only one flip-flop is inserted for the two-stage pipelining. Therefore, the number of additional flip-flops for the n-stage pipeline is only n − 1.
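The FCF idea can be illustrated with a behavioral sketch (Python, not the paper's VHDL; the 16/16 split of the 32-bit RCA is the two-stage case of Fig. 1). Only the lower-half carry is registered, i.e., the single inserted flip-flop, so it joins the upper half one cycle late; the intermediate sums differ from a conventional PA, but after one flush cycle the final value matches an ordinary 32-bit accumulation:

```python
def fcf_accumulate(samples):
    """Two-stage FCF-PA model: a 32-bit RCA split into 16-bit halves.
    The lower-half carry is registered (the one inserted flip-flop)
    and is absorbed by the upper half one clock cycle later."""
    lo = hi = 0       # halves of the output buffer S
    carry_ff = 0      # the single pipeline flip-flop
    for a in samples + [0]:                         # extra cycle flushes the carry
        t = lo + (a & 0xFFFF)                       # lower-half addition
        hi = (hi + (a >> 16) + carry_ff) & 0xFFFF   # upper half uses LAST cycle's carry
        carry_ff = t >> 16
        lo = t & 0xFFFF
    return (hi << 16) | lo

data = [0x0001FFFF, 0x00020003, 0x7000FFFF]
assert fcf_accumulate(data) == sum(data) & 0xFFFFFFFF  # final value is exact
```

Each input bit still participates in exactly one addition, which is why the final result is correct even though no full feed forward-cutset is registered.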


Fig -1: Schematics and timing diagrams of two-stage 32-bit accumulators. (a) Conventional PA. (b) Proposed FCF-PA.

In the conventional PA, the correct accumulation values of all the inputs up to the corresponding clock cycle are produced in each clock cycle, as shown in the timing diagram of Fig. 1(a). A two-cycle difference exists between the input and the corresponding output due to the two-stage pipeline. On the other hand, in the proposed architecture, only the final accumulation result is valid, as shown in the timing diagram of Fig. 1(b).

Fig. 2 shows examples of the ways that the conventional PA and the proposed method (FCF-PA) work. In the conventional two-stage PA, the accumulation output (S) is produced two clock cycles after the corresponding input is stored in the input buffer. In the proposed structure, on the other hand, the output is generated one clock cycle after the input arrives. Moreover, in the proposed scheme, the carry generated by the lower half of the 32-bit adder is involved in the accumulation one clock cycle later than in the conventional pipelining. For example, in the conventional case, the carry generated by the lower half and the corresponding inputs are fed into the upper-half adder in the same clock cycle, as shown in cycles 4 and 5 of Fig. 2 (left). In the proposed FCF-PA, however, the carry from the lower half is fed into the upper half one cycle later than the corresponding input for the upper half, as depicted in clock cycles 3-5 of Fig. 2 (right). This characteristic makes the intermediate result stored in the output buffer of the proposed accumulator different from the result of the conventional pipelining case.

Fig -2: Two-stage 32-bit pipelined-accumulation examples with the conventional pipelining (left) and the proposed FCF-PA (right). The binary number "1" between the two 16-bit hexadecimal numbers is a carry from the lower half.

The proposed accumulator, however, produces the same final output (cycle 5) as the conventional one. In addition, for the two architectures, the number of cycles from the initial input to the final output is the same. The characteristic of the proposed FCF pipelining method can be summarized as follows: when adders are used to process data in an accumulator, the final accumulation result is the same even if the binary inputs are fed to the adders in arbitrary clock cycles, as long as each is fed once and only once.

Meanwhile, the CLA adder has mostly been used to reduce the critical path delay of the accumulator. The carry prediction logic in the CLA, however, causes a significant increase in the area and the power consumption. For the same critical path delay, the FCF-PA can be implemented with less area and lower power consumption than the CLA-based accumulator.

3.2 Full adder designs using XNOR and XOR gates for sum logic

A full adder design employing two successive stages of XNOR gates for the sum logic, and one employing two successive stages of XOR gates, are depicted in Fig. 3.

Fig -3: Full adder using XOR gates and a MUX.

Fig -4: Pipelined column addition structure with the Dadda multiplier. (a) Conventional pipelining. (b) Proposed FCF pipelining. HA: half adder. FA: full adder.
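The XOR-and-MUX full adder of Fig. 3 can be checked behaviorally. The sketch below (Python; it assumes the usual two-XOR-plus-MUX structure, since the paper gives no netlist) verifies the sum and carry against the full-adder truth table:

```python
def full_adder_xor_mux(a, b, cin):
    """Full adder with two XOR stages for the sum and a 2:1 MUX for the carry."""
    p = a ^ b                  # first XOR stage (propagate signal)
    s = p ^ cin                # second XOR stage gives the sum
    cout = cin if p else a     # MUX: carry = cin when a != b, else a (= b)
    return s, cout

# Exhaustive check against the arithmetic definition a + b + cin = s + 2*cout
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, c = full_adder_xor_mux(a, b, cin)
            assert a + b + cin == s + 2 * c
```

The MUX works because when the propagate signal is 0 the two inputs are equal, so the carry equals either input; otherwise the carry is simply the incoming carry.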

3.3 Modified FCF-PA for Further Power Reductions

Although the proposed FCF-PA can reduce the area and the power consumption by replacing the CLA, there are certain input conditions under which an undesired data transition occurs in the output buffer, reducing the power efficiency when 2's complement numbers are used. Fig. 4 shows an example of the undesired data transition. The inputs are 4-bit 2's complement binary numbers. AReg[7:4] is the sign extension of AReg[3], which is the sign bit of AReg[3:0]. In the conventional pipelining [Fig. 4 (left)], the accumulation result (S) in cycle 3 and the data stored in the input buffer (AReg) in cycle 2 are added and stored in the output buffer (S). In this case, the "1" in AReg[2] in cycle 2 and the "1" in S[2] in cycle 3 are added, thereby generating a carry. The carry is transmitted to the upper half of S, and hence S[7:4] remains "0000".

Fig -5: Proposed (a) FCF-PA and (b) MFCF-PA for the improvement of the power efficiency.

3.4 4:2 Compressor Design

The 4:2 compressor, which is used to reduce the number of device computations and thereby the area and power of a MAC unit, is depicted in Fig. 6.
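A common construction of the 4:2 compressor cascades two full adders; five input bits of weight 1 are compressed into one sum bit and two carry bits of weight 2. The sketch below (Python, a generic model rather than the paper's specific circuit) verifies the defining identity x1 + x2 + x3 + x4 + cin = sum + 2·(carry + cout) exhaustively:

```python
import itertools

def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor from two cascaded full adders.
    Note: cout depends only on x1..x3, so the cout->cin chain
    between neighboring compressors does not ripple."""
    s1 = x1 ^ x2 ^ x3
    cout = (x1 & x2) | (x2 & x3) | (x1 & x3)     # first full-adder carry
    s = s1 ^ x4 ^ cin
    carry = (s1 & x4) | (x4 & cin) | (s1 & cin)  # second full-adder carry
    return s, carry, cout

# Exhaustive check of the compression identity over all 32 input patterns
for bits in itertools.product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```

Because `cout` is independent of `cin`, a row of these compressors adds four partial-product rows in constant time per column, which is what makes it attractive for the column addition stage.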


Fig -6: MAC unit using 4:2 compressor. The figure shows the 8×8 partial-product array (bits a_i b_j) being reduced with half adders, full adders, and 4:2 compressors, with feed forward-cutset pipeline stage boundaries.

3.5 Advantages

 The feed forward-cutset-free technique decreases the number of flip-flops in the pipeline stages.
 Less area and a shorter critical path delay when using the Dadda multiplier.
 Low power consumption.

4. RESULT AND DISCUSSION

4.1 Power Report

Fig -7: Power report of MAC unit using 4:2 compressor.

4.2 Delay Report

Fig -8: Delay report of MAC unit using 4:2 compressor.

4.3 Area Report

Fig -9: Area report of MAC unit using 4:2 compressor.
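The column-addition principle behind Fig. 6 can be modeled behaviorally. The Python sketch below is a simplified reduction that uses only 3:2 full-adder counters (not the exact half-adder/4:2-compressor scheduling of the figure): partial-product bits are dropped into weighted columns, each column is compressed until at most two bits remain, and a final addition produces the product:

```python
def column_addition_multiply(a, b, width=8):
    """Multiply via partial-product generation, column addition with
    full adders (3:2 counters), and a final carry-propagate addition."""
    # 1) partial-product generation: bit (a_i & b_j) lands in column i + j
    cols = {}
    for i in range(width):
        for j in range(width):
            cols.setdefault(i + j, []).append((a >> i & 1) & (b >> j & 1))
    # 2) column addition: compress until every column holds at most 2 bits
    while max(len(bits) for bits in cols.values()) > 2:
        nxt = {}
        for c in sorted(cols):
            bits = list(cols[c])
            while len(bits) >= 3:                  # one full adder per 3 bits
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                nxt.setdefault(c, []).append(x ^ y ^ z)                      # sum stays
                nxt.setdefault(c + 1, []).append((x & y) | (y & z) | (x & z))  # carry moves up
            nxt.setdefault(c, []).extend(bits)     # 0-2 leftover bits pass through
        cols = nxt
    # 3) final addition of the two remaining rows
    return sum(bit << c for c, bits in cols.items() for bit in bits)

assert column_addition_multiply(200, 171) == 200 * 171
```

Replacing groups of the 3:2 counters with 4:2 compressors, as in Fig. 6, reduces the number of reduction stages and hence the devices on the critical path, which is the source of the area and power savings reported below.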


4.4 Simulation Output

Fig -10: Simulation output of MAC unit using 4:2 compressor.

5. CONCLUSION

We introduced the FCF pipelining method in this paper. In the proposed scheme, the number of flip-flops in a pipeline can be reduced by relaxing the feed forward-cutset constraint, thanks to the unique characteristic of the machine learning algorithm. We applied the FCF pipelining method to the accumulator (FCF-PA) design, and then optimized the power dissipation of the FCF-PA by reducing the chance of undesired data transitions (MFCF-PA). The proposed scheme was also expanded and applied to the MAC unit (FCF-MAC). For the evaluation, the conventional and proposed MAC architectures were synthesized in a 65-nm CMOS technology. The proposed accumulator showed reductions in area and power consumption of 17% and 19%, respectively, compared with the accumulator based on the conventional CLA adder. In the case of the MAC architecture, the proposed scheme reduced both the area and the power by 20%. In the future, we will design the MAC unit using the MFCF-PA with a 4:2 compressor and an XOR-MUX full adder, and compare it with conventional full adder designs. We believe that the proposed idea of utilizing the unique characteristic of 4:2 compressor computation for more efficient MAC design can be adopted in many hardware accelerator designs.

6. REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolution neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolution networks for large-scale image recognition," 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[3] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2013, pp. 6645–6649.
[4] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolution neural networks," IEEE J. Solid-State Circuits, vol. 52, pp. 127–138, 2017.
[5] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electron. Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964.
[6] L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza, vol. 34, no. 5, pp. 349–356, Mar. 1965.
[7] P. F. Stelling and V. G. Oklobdzija, "Implementing multiply-accumulate operation in multiplication time," in Proc. 13th IEEE Symp. Comput. Arithmetic, Jul. 1997, pp. 99–106.
[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New Delhi, India: Wiley, 1999.
[9] T. T. Hoang, M. Sjalander, and P. Larsson-Edefors, "A high-speed, energy-efficient two-cycle multiply-accumulate (MAC) architecture and its application to a double-throughput MAC unit," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 12, pp. 3073–3081, Dec. 2010.
[10] W. J. Townsend, E. E. Swartzlander, and J. A. Abraham, "A comparison of Dadda and Wallace multiplier delays," Proc. SPIE, Adv. Signal Process. Algorithms, Archit., Implement. XIII, vol. 5205, pp. 552–560, Dec. 2003, doi: 10.1117/12.507012.
