
A PROJECT REPORT

on

OPTIMIZATION OF FEED FORWARD CUTSET FREE PIPELINED


MULTIPLY ACCUMULATE UNIT FOR MACHINE LEARNING
ACCELERATOR
submitted to

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR,


ANANTHAPURAMU
For the partial fulfillment of the requirement for the degree of

BACHELOR OF TECHNOLOGY
in

ELECTRONICS AND COMMUNICATION ENGINEERING


submitted by
Ms. D. NAGAVENI 174M1A0422
Mr. B. GIRI 174M1A0410
Ms. G. SWETHA 174M1A0430
Ms. G. RAMYA SRI 174M1A0431
under the esteemed guidance
of
Dr. S. MURALI MOHAN, M. Tech, Ph. D.
Professor, Dept. of E.C.E.

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING


VEMU INSTITUTE OF TECHNOLOGY: P. KOTHAKOTA
(Affiliated to JNTUA, Ananthapuramu, Approved by AICTE, New Delhi)
(An ISO 9001:2015 Certified Institution, Accredited by NAAC and NBA)
Tirupati – Chittoor Road, Near Pakala, Chittoor (Dt.), A.P-517 112.
2020-2021
VEMU INSTITUTE OF TECHNOLOGY: P. KOTHAKOTA
(Affiliated to JNTUA, Ananthapuramu, Approved by AICTE, New Delhi)
(An ISO 9001:2015 Certified Institution, Accredited by NAAC and NBA) Tirupati – Chittoor Road, Near Pakala, Chittoor (Dt.), A.P-517 112.

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

BONAFIDE CERTIFICATE
This is to certify that the project report entitled “OPTIMIZATION OF FEED FORWARD
CUTSET FREE PIPELINED MULTIPLY ACCUMULATE UNIT FOR MACHINE
LEARNING ACCELERATOR”, is being submitted by Ms. D. NAGAVENI (174M1A0422),
Mr. B. GIRI (174M1A0410), Ms. G. SWETHA (174M1A0430) and Ms. G. RAMYA SRI
(174M1A0431) in partial fulfillment of the degree of BACHELOR OF TECHNOLOGY in

ELECTRONICS & COMMUNICATION ENGINEERING, to the JNTUA,


Ananthapuramu. This record is a bonafide work carried out by them under my guidance and
supervision during the academic year 2020-2021.

INTERNAL GUIDE HEAD OF THE DEPARTMENT


Dr. S. MURALI MOHAN, M. Tech, Ph. D. Dr. S. MUNI RATHNAM, M. Tech, Ph.D.
Professor Professor & HOD
Department of ECE, Department of ECE,
Vemu Institute of Technology, Vemu Institute of Technology,
P. Kothakota, -517 112. P. Kothakota, -517 112.

Internal Examiner External Examiner

Date: Date:
DECLARATION

We hereby declare that the project report entitled “OPTIMIZATION OF
FEED FORWARD CUTSET FREE PIPELINED MULTIPLY ACCUMULATE UNIT
FOR MACHINE LEARNING ACCELERATOR” has been done by us under the guidance
of Dr. S. MURALI MOHAN, PROFESSOR, VEMU INSTITUTE OF TECHNOLOGY,
P. KOTHAKOTA. This project work has been submitted to VEMU INSTITUTE OF
TECHNOLOGY, P. KOTHAKOTA, in partial fulfillment of the requirements for the award
of the Undergraduate Program in Electronics and Communication Engineering.

Place: P. KOTHAKOTA
Date: Ms. D. NAGAVENI 174M1A0422
Mr. B. GIRI 174M1A0410
Ms. G. SWETHA 174M1A0430
Ms. G. RAMYA SRI 174M1A0431
ACKNOWLEDGEMENT
Grateful thanks to Dr. K. Chandra Sekhar Naidu, Chairman, Vemu Institute of Technology,
for providing education in his esteemed institution.

I express my sincere thanks to Dr. Naveen Kilari, Principal, Vemu Institute of Technology for
his good administrative support during my course of study.

I express my deep sense of gratitude to Dr. S. Munirathnam, HOD, Dept. of ECE, Vemu Institute
of Technology, for his unconditional help during my study period.

I express my sincere thanks to Project Co-ordinator Dr. G. Elairaja, Professor, Dept. of ECE,
for his valuable encouragement to the project.

I would like to acknowledge Project guide Dr. S. Murali Mohan, Professor, Dept. of ECE, for
his supervision and valuable guidance with constant monitoring in completing the project
successfully.

This project work would not have been possible without the inspiration, moral and emotional
support of my parents. Their continuous words of encouragement, immense endurance and
understanding deserve a special vote of appreciation.

Finally, I would like to express my sincere thanks to all faculty members of ECE Department,
lab technicians and friends, one and all, who have helped me to complete the project work
successfully.
Ms. D. NAGAVENI 174M1A0422
Mr. B. GIRI 174M1A0410
Ms. G. SWETHA 174M1A0430
Ms. G. RAMYA SRI 174M1A0431
CONTENTS
Chapter No. Title Page No.

Abstract i

List of Figures ii-iv

List of Tables v

List of Abbreviations vi
Chapter 1 INTRODUCTION 1-3

Chapter 2 LITERATURE SURVEY 4-6

Chapter 3 EXISTING METHOD 7-12

3.1 Conventional MAC unit design 7

3.2 Adder 8

3.3 Pipelined Accumulator 9

3.4 Wallace Multiplier 11-12

Chapter 4 PROPOSED METHOD 13-23

4.1 Proposed MAC Architecture 13

4.2 Modified 4:2 Compressor 15

4.3 Proposed Pipelining FCF Architecture 17

4.4 Modified FCF for Power Reduction 21

Chapter 5 IMPLEMENTATION 24-32

5.1 Software Overview 24

5.2 VHDL Code 26

5.3 Verilog Code 27



5.3.1 Syntax 27

5.3.2 Top-Level Modules 28

5.4 RTL Schematic 29

5.5 Modeling Concepts 30

5.5.1 Gate-Level Modelling 30

5.5.2 Dataflow Modelling 31

5.5.3 Behavioural Modelling 32

Chapter 6 RESULTS 33-43

6.1 Proposed 4:2 Compressor design and RTL schematic 33

6.2 Proposed 4:2 Compressor simulation results 34

6.3 Dadda multiplier design and RTL schematics 34

6.4 Dadda multiplier simulation results 35

6.5 Synthesis results of Proposed MAC-Power 35

6.6 Synthesis results of Proposed MAC –Area 36

6.7 Synthesis results of Proposed MAC -Delay 37

6.8 Simulation results of Proposed MAC 38

6.9 RTL Schematics of Proposed MAC 39

Chapter 7 CONCLUSION 44

Chapter 8 REFERENCES 45-46


ABSTRACT
Modern portable electronic gadgets depend on computation-intensive digital signal processing
(DSP) algorithms for their operation, which puts the onus on the design of efficient hardware
architectures. Multiply Accumulate (MAC) units form an integral part of many DSP modules and
consume considerable power, making them important candidates for optimization in efficient
VLSI technologies. This work proposes a modified 4:2 compressor adder module that achieves
lower power and area than its conventional counterpart. The proposed compressor-based adder
module is implemented in the partial product reduction tree of the MAC unit to reduce power
dissipation and area. The proposed MAC module is developed using HDL, and parameters
such as power, area, and delay are extracted. The efficacy of the proposed MAC module is
confirmed by implementing it in an FIR filter using the System Generator tool
of Xilinx. The proposed MAC unit achieves a 19% reduction in area and a 32% reduction in delay
when compared with the conventional MAC unit.
LIST OF FIGURES

Fig. No. Figure Name Page No.


1.1 Block diagram of 3-tap FIR filter using valid pipelining 2

3.1 Block diagram of 32 bit MAC unit 7

3.2 Block diagram of conventional MAC unit 8

3.3 Block diagram of a 32- bit Carry Save Adder 9

3.4 Conventional FCF PA 9

3.5 Schematic diagram of Conventional PA 10

3.6 Example of 32 bit Pipelined Accumulator 10

3.7 Wallace tree algorithm 11

3.8 Wallace tree algorithm using Compressor technique 12

4.1 Pipelined column addition structure with Dadda multiplier: (Top)
Conventional pipelining; (Bottom) Proposed FCF pipelining 13

4.2 Proposed MAC unit 14

4.3 Block diagram of Exact 4:2 Compressor 16

4.4 Algorithm implementation of DADDA multiplier 16

4.5 Algorithm implementation using Compressor technique 17

4.6 One bit Accumulator register 18

4.7 Schematic diagram of two stage 32 bit accumulator 19

4.8 Timing diagram of two-stage 32-bit accumulator and example
with proposed FCF-PA 20

4.9 Example of an undesired data transition in the two-stage 8-bit PA 21

4.10 Proposed FCF-PA and MFCF-PA for improvement of power efficiency 22

4.11 Block diagram of MFCF-PA 23

5.1 Project navigator 25

5.2 Sample RTL view 29

6.1 Proposed 4:2 Compressor design 33

6.2 RTL Schematics of Proposed 4:2 Compressor 33

6.3 Simulation results of Proposed 4:2 Compressor 34

6.4 Dadda multiplier design 34

6.5 Dadda multiplier RTL View 34

6.6 Dadda multiplier Simulation Results 35

6.7 Synthesis results-Power 35

6.8 Synthesis results-Area 36

6.9 Synthesis results-Delay 36

6.10 Simulation results-1 37

6.11 Simulation results-2 37

6.12 MAC unit RTL schematic View 38

6.13 Column addition stage-1 RTL schematic View 38

6.14 Proposed pipeline stage RTL schematic View 39

6.15 Column addition stage-2 RTL schematic View 39

6.16 Column addition stage RTL schematic detailed View 40

6.17 Compressor logic in column addition stage RTL schematic View 40

6.18 Proposed accumulator RTL schematic View 41

6.19 Accumulator consists MFCF-PA RTL schematic View 41

6.20 MFCF-PA used in first stage of accumulator RTL schematic View 42

6.21 MFCF-PA RTL schematic View 42

6.22 MFCF-PA logic RTL schematic internal View (NOT+NOR logic) 43

LIST OF TABLES

Table No. Title Page No.

6.10 Parameter analysis 43

LIST OF ABBREVIATIONS
Acronym Expansion

MAC Multiply Accumulate Unit

CLA Carry Look ahead Adder

CSA Carry Save Adder

DSP Digital Signal Processing

VLSI Very Large Scale Integration

FFT Fast Fourier Transform

CSLA Carry Select Adder

RCA Ripple Carry Adder

XOR Exclusive OR

PPR Partial Product Reduction

MBA Modified Booth’s Algorithm

CMOS Complementary Metal Oxide Semiconductor

PE Processing Element

FPGA Field Programmable Gate Array

CPLD Complex Programmable Logic Device

GPU Graphics Processing Unit

CPU Central Processing Unit

VHDL VHSIC Hardware Description Language

RTL Register Transfer Level

HDL Hardware Description Language

PA Pipelined Accumulator

FCF Feed Forward Cutset Free

MFCF Modified Feed Forward Cutset Free

Optimization of Feed Forward Cutset Free Pipelined Multiply Accumulate Unit for Machine
Learning Accelerator

CHAPTER 1

INTRODUCTION

1.1 OVERVIEW
In a machine learning accelerator, a large number of Multiply–Accumulate
(MAC) units are included for parallel computations, and timing-critical paths of the system
are often found in the unit. A multiplier typically consists of several computational parts
including a partial product generation, a column addition, and a final addition. An
accumulator consists of the carry propagation adder. Long critical paths through these stages
lead to the performance degradation of the overall system. To minimize this problem, various
methods have been studied. The Wallace and Dadda multipliers are well-known examples for
the achievement of column addition, and the Carry-Look ahead Adder (CLA) is often used to
reduce the critical path in the accumulator or the final addition stage of the multiplier.
Meanwhile, a MAC operation is performed in the machine learning algorithm to compute a
partial sum that is the accumulation of the input multiplied by the weight.
In a MAC unit, the multiply and accumulate operations are usually merged to
reduce the number of carry propagation steps from two to one. Such a structure, however,
still comprises a long critical path delay that is approximately equal to the critical path delay
of a multiplier. It is well known that pipelining is one of the most popular approaches for
increasing the operation clock frequency. Modern DSP algorithms used in portable
applications demand high performance VLSI data path systems with low area and power.
Multiply and accumulate units (MAC) are the most computation intensive component of
many DSP operations such as FFT, convolution, filters etc. This puts onus on design of
efficient MAC architectures to achieve optimization of DSP modules. A MAC unit consists of an
‘n’-bit multiplier, which multiplies a set of ‘n’-bit input operands repeatedly, an adder of
width ‘2n’ to add consecutive products, and an accumulator of width ‘2n’ to accumulate the
result in a register.
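The MAC structure just described can be illustrated with a small behavioral model. The following Python sketch is illustrative only (unsigned, modulo-2^(2n) arithmetic is assumed, and the class and method names are hypothetical); it is not the report's HDL design:

```python
# Behavioral sketch of an n-bit MAC unit: an n x n multiplier feeding a
# 2n-bit adder whose result is held in a 2n-bit accumulator register.
class MacUnit:
    def __init__(self, n=16):
        self.n = n
        self.acc = 0                      # 2n-bit accumulator register
        self.mask = (1 << (2 * n)) - 1    # wrap results at 2n bits

    def step(self, a, b):
        """One MAC cycle: acc <= acc + a*b (unsigned, modulo 2^(2n))."""
        operand_mask = (1 << self.n) - 1
        product = (a & operand_mask) * (b & operand_mask)   # n x n multiply
        self.acc = (self.acc + product) & self.mask         # 2n-bit add/accumulate
        return self.acc

mac = MacUnit(n=16)
for a, b in [(3, 4), (5, 6), (7, 8)]:
    result = mac.step(a, b)
print(result)  # 3*4 + 5*6 + 7*8 = 98
```

Each call to `step` models one clock cycle of the repeated multiply-and-accumulate operation.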
1.2 FEED FORWARD-CUTSET RULE FOR PIPELINING
It is well known that pipelining is one of the most effective ways to reduce the
critical path delay, thereby increasing the clock frequency. This reduction is achieved through
the insertion of flip-flops into the datapath. In addition to reducing critical path delays
through pipelining, it is also important to satisfy functional equality before and after

Dept. Of E.C.E Vemu Institute of Technology Page 1



pipelining. The point at which the flip-flops are inserted to ensure functional equality is
called the feed forward-cutset.

Cutset: A set of edges of a graph such that, if these edges are removed, the graph becomes
disjoint.

Feed forward-cutset: A cutset where the data move in the forward direction on all of the
cutset edges.

Fig. 1.1 Block diagram of 3-tap FIR filter using valid pipelining

Fig. 1.1 above demonstrates a case of valid pipelining. The two-stage pipelined FIR filter is
constructed by inserting two flip-flops along the feedforward-cutset.

Multipliers are the most time consuming, energy intensive module of a MAC
unit which are built using adders. Hence, adder optimization leads to efficient multipliers
which in turn results in optimization of MAC unit. Various adder topologies are designed to
achieve efficiency in terms of area and delay. Adders such as Ripple carry adder (RCA),
Carry select adder (CSLA), and Carry save adder (CSA) use the traditional addition algorithm to
generate sum and carry signals. For an adder of width ‘m’ bits, the RCA computes the carry out signal
by rippling the carry signal of each module from “carry in” to “carry out”, with a delay of O(m).
The CSLA and CSA compute the sum similarly to the RCA, but differ in the calculation of the carry
out signal, reducing the critical delay path to O(m(l+2)/(l+1)) and O(log(m)) respectively. The carry
look ahead adder (CLA) reduces the critical delay path to O(log(m)) by computing the carry out signal
independent of the (m-1)th carry bit.
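The O(m) ripple behavior of the RCA can be seen in a bit-level sketch. The Python model below is for illustration only (bit lists are LSB-first by assumption, and the function name is hypothetical); the carry of each position depends on the previous one, which is the linear critical path:

```python
def ripple_carry_add(a_bits, b_bits, cin=0):
    """Bit-level ripple-carry addition, LSB first: the carry of stage k
    depends on stage k-1, so the critical path grows linearly in m."""
    s, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s.append(a ^ b ^ carry)                 # sum bit of this stage
        carry = (a & b) | (carry & (a ^ b))     # carry rippled to next stage
    return s, carry

s, cout = ripple_carry_add([1, 1, 0, 1], [0, 1, 1, 0])  # 11 + 6
print(s, cout)  # [1, 0, 0, 0] 1  ->  0b10001 == 17
```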
Compressor based adders are alternate means for implementing multi-bit
adders. A traditional 4:2 compressor adder consists of 2 full adder modules and accepts 4


inputs of equal bit weight to produce two output signals. If tXOR and tCy denote the delays of an
XOR gate and of the carry generator logic, then the delay of a 4:2 compressor can be expressed as
(i*tXOR + j*tCy), where ‘i’ is the number of XOR gates and ‘j’ is the number of carry
generation logic stages. Various multiplier architectures such as array multipliers, the Booth
multiplier, the Wallace tree multiplier, the Dadda multiplier, etc. use adders for their partial product
reduction tree (PP) and PP accumulation. The array multiplier has a regular structure, but the
highest delay. Other multiplication algorithms such as the Booth, Wallace, and Dadda multipliers aim
to reduce the number of adder stages in the PP reduction tree to optimize area and power
consumption. Hence, various adder topologies are constituted in multiplier architectures to
achieve optimal performance in terms of area and power. Literature on enhanced MAC units
through optimized adder and multiplier architectures exists.
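The two-full-adder composition of the traditional 4:2 compressor can be modeled behaviorally. The Python sketch below (illustrative, with hypothetical function names) also checks the defining identity x1 + x2 + x3 + x4 + cin = sum + 2*(carry + cout) exhaustively:

```python
import itertools

def full_adder(a, b, c):
    """3:2 counter: three same-weight bits -> sum (weight 1), carry (weight 2)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)

def compressor_4_2(x1, x2, x3, x4, cin):
    """Classic 4:2 compressor built from two cascaded full adders.
    Takes four same-weight bits plus cin; emits sum (weight 1) and
    carry/cout (weight 2), with x1+x2+x3+x4+cin == sum + 2*(carry+cout)."""
    s1, cout = full_adder(x1, x2, x3)            # first full adder
    total_sum, carry = full_adder(s1, x4, cin)   # second full adder
    return total_sum, carry, cout

# exhaustive check of the arithmetic identity over all 32 input patterns
for bits in itertools.product((0, 1), repeat=5):
    s, c, co = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (c + co)
```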
Various multipliers such as the array multiplier, modified Booth multiplier,
Wallace tree multiplier, and Dadda multiplier have been used to realize MAC architectures. CSA
adders are used in the PP reduction tree of MAC unit multipliers. It has been ascertained that a
Wallace tree multiplier implemented using CSA adders achieves high speed and low power, and is
therefore an ideal candidate for an efficient MAC unit owing to the smaller number of stages in its
PP reduction tree. A modified Wallace tree multiplier that achieves lower delay has been proposed,
and the efficacy of compressor adders in the PP reduction tree has been explored. A counter-based
Wallace multiplier replaces traditional adder topologies with compressor adders to enhance the
efficacy of the MAC unit. High speed MAC unit designs using compressor adders have also been
explored: a 3:2 compressor-based parallel multiplier has been implemented in a MAC unit for
audio applications, 5:3 compressors have been proposed for PP reduction in word-length MAC
units, and a merged MAC unit built using a traditional 4:3 compressor adder eliminates the need
for a separate accumulation unit. In this brief, we propose a modified efficient 4:2 compressor to
realize a high speed MAC unit.

Dept. Of E.C.E Vemu Institute of Technology Page 3


Optimization of Feed Forward Cutset Free Pipelined Multiply Accumulate Unit for Machine
Learning Accelerator

CHAPTER 2
LITERATURE SURVEY
Jithendra Babu N., Sarma R., “A Novel Low Power Multiply–Accumulate
(MAC) Unit Design for Fixed Point Signed Numbers”, In: Dash S., Bhaskar M., Panigrahi
B., Das S. (eds) Artificial Intelligence and Evolutionary Computations in Engineering
Systems. Advances in Intelligent Systems and Computing. In the emerging technologies the
low power designs play a critical role of operations. Our proposed work is on the low power
MAC unit that is used to find the fixed point signed numbers. The proposed design is to
achieve high throughput and low power consumption. Our proposed work has various
building blocks like firstly, Wallace tree multiplier since a multiplier is one of the key parts
for the processing of digital signal processing systems and secondly an accumulation block.
Since the output from the multiplier and adder is to be efficient, we proposed a BCD block
that is used to convert the output into BCD number. The overall MAC is performed in the
cadence virtuoso 90 nm technology and performance analysis of each individual block is
examined using the cadence virtuoso before designing the overall MAC unit. Power, delay
and power-delay product are calculated using the Cadence Spectre tool.

Seo. Y and Kim. D, “A New VLSI Architecture of Parallel Multiplier–


Accumulator Based on Radix-2 Modified Booth Algorithm”, in IEEE Transactions on Very
Large-Scale Integration (VLSI) Systems. In this paper, we proposed a new architecture of
multiplier-and-accumulator (MAC) for high-speed arithmetic. By combining multiplication
with accumulation and devising a hybrid type of carry save adder (CSA), the performance
was improved. Since the accumulator that has the largest delay in MAC was merged into
CSA, the overall performance was elevated. The proposed CSA tree uses 1's-complement-
based radix-2 modified Booth's algorithm (MBA) and has the modified array for the sign
extension in order to increase the bit density of the operands. The CSA propagates the carries
to the least significant bits of the partial products and generates the least significant bits in
advance to decrease the number of the input bits of the final adder. Also, the proposed MAC
accumulates the intermediate results in the type of sum and carry bits instead of the output of
the final adder, which made it possible to optimize the pipeline scheme to improve the
performance. The proposed architecture was synthesized with 250, 180 and 130 nm, and 90


nm standard CMOS library. Based on the theoretical and experimental estimation, we


analysed the results such as the amount of hardware resources, delay, and pipelining scheme.
We used Sakurai's alpha power law for the delay modelling. The proposed MAC showed
superior properties to the standard design in many ways, with about twice the performance of
previous research at a similar clock frequency. We expect that the proposed MAC can
be adapted to various fields requiring high performance, such as signal processing.

Chang, Chip Hong, Zhang et al., “Ultra Low-Voltage Low-Power CMOS 4:2
and 5:2 Compressors for Fast Arithmetic Circuits” IEEE Transactions on Circuits and
Systems I: Regular Papers. This paper presents several architectures and designs of low-
power 4:2 and 5:2 compressors capable of operating at ultra-low supply voltages. These
compressor architectures are anatomized into their constituent modules, and different static
logic styles based on the same deep submicrometer CMOS process model are used to realize
them. Different configurations of each architecture, which include a number of novel 4:2 and
5:2 compressor designs, are prototyped and simulated to evaluate their performance in speed,
power dissipation and power-delay product. The newly developed circuits are based on
various configurations of the novel 5:2 compressor architecture with the new carry generator
circuit, or existing architectures configured with the proposed circuit for the exclusive OR
(XOR) and exclusive NOR (XNOR) [XOR-XNOR] module. The proposed new circuit for
the XOR-XNOR module eliminates the weak logic on the internal nodes of pass transistors
with a pair of feedback PMOS-NMOS transistors. Driving capability has been considered in
the design as well as in the simulation setup so that these 4:2 and 5:2 compressor cells can
operate reliably in any tree structured parallel multiplier at very low supply voltages. Two
new simulation environments are created to ensure that the performances reflect the realistic
circuit operation in the system to which these cells are integrated. Simulation results show
that the 4:2 compressor with the proposed XOR-XNOR module and the new fast 5:2
compressor architecture are able to function at supply voltage as low as 0.6 V, and outperform
many other architectures including the classical CMOS logic compressors and variants of
compressors constructed with various combinations of recently reported superior low-power
logic cells.


Singh. K. N., Tarunkumar. H, “A review on various multipliers designs in


VLSI”, Proc. Annual IEEE India Conference (INDICON). In this paper we are going to study
Array multiplier, Wallace multiplier, Bypassing multiplier, Modified Booth multiplier, Vedic
multiplier and Booth recoded Wallace tree multiplier, which have been proposed by different
researchers. When the study of the various multipliers has been performed, Array multiplier
is found to have the largest delay and large power consumption while Booth encoded Wallace
tree multiplier has the least delay though it also has a large area. We also realized that, with
proper optimization the performance of the multipliers can be increased significantly,
irrespective of the type. With the temporal tiling method, the optimized array multiplier delay and power
dissipation are found to improve by 50% and 30% respectively, while using the partially
guarded technique power consumption is reduced by 10-44% with 30-36% less area
overhead. The Booth recoded Wallace tree multiplier is found to be 67% faster than the Wallace
tree multiplier, 53% faster than the Vedic multiplier, 22% faster than the radix 8 booth
multipliers. We also study various optimization techniques for Wallace multiplier, bypassing
multiplier, modified booth multiplier and Vedic multiplier.

Patil. P. A., Kulkarni. C, “A Survey on Multiply Accumulate Unit”, Fourth


International Conference on Computing Communication Control and Automation
(ICCUBEA). In most Digital Signal Processing (DSP) applications, the
fundamental operations usually involve multiplications and accumulations. Multiplication is
the arithmetic operation on which a processor spends most of its time as well as hardware
resources, among all other arithmetic operations like addition and subtraction. To attain a
high-performance DSP application for real time signal processing applications, efficient
Multiply Accumulate Unit (MAC) is always a mainstay. In the last few years, the main focus
of MAC design is to boost its speed. This is because speed and throughput rate are always the
crucial parameters of DSP systems. In this paper, a survey is done for different kind of MAC
unit with different multipliers and adders where, multipliers are used to create partial
products while adders to accumulate these partial products. This study reviews various MAC
units designed until now with high speed and low power consumption.


CHAPTER 3
EXISTING METHOD
3.1 CONVENTIONAL MAC UNIT DESIGN

Fig. 3.1 Block diagram of 32-bit MAC unit

A MAC unit consists of three components: a multiplier, an adder, and an


accumulator. Words are obtained from memory locations and passed as inputs to the
multiplier. The block diagram of a 32-bit MAC unit is shown in Fig. 3.1


Fig. 3.2 Block diagram of conventional MAC unit

In the conventional MAC unit, a Wallace multiplier, a PA, and a CSA are used as the
multiplier, accumulator, and adder respectively, as shown in Fig. 3.2.

3.2 ADDER
The adder computes the sum of the product from the multiplier and the value
stored in the accumulator. The output of the adder is passed onto the accumulator. If inputs to
the multiplier are of bit size 16, then the adder should be of bit size 32, producing an output
of size 32+1. The carry save adder, carry select adder, ripple carry adder (RCA), and carry
look-ahead adder (CLA) are among the widely used adders in the design of digital logic processing
devices.

Propagation delay and critical delay are two important parameters to be


considered while using adders. In the conventional method, a carry save adder is used in the
design of the MAC unit. It works on the principle of preserving carries until the end. It is one
of the widely used circuits for implementing fast arithmetic computations. As the numbers
become large, the propagation delay incurred in the Ripple carry adder (RCA) and Carry Look

Ahead (CLA) adder also increases. Hence, the carry-save adder is much faster than
conventional adders. The block diagram of a 32-bit Carry Save Adder is shown in Fig. 3.3.

Fig. 3.3 Block diagram of a 32-bit Carry Save Adder
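The carry-preserving principle can be captured in a short behavioral model. In the Python sketch below (word-level integers stand in for bit vectors; the function name is hypothetical and the model is for illustration only), a carry-save stage produces a sum word and a carry word with no carry propagation between bit positions:

```python
def carry_save_add(a, b, c):
    """Carry-save addition of three words: a + b + c == sum_word + (carry_word << 1),
    with every bit position computed independently (no ripple)."""
    sum_word = a ^ b ^ c                          # per-bit sum (XOR of three inputs)
    carry_word = (a & b) | (b & c) | (a & c)      # per-bit carry, one weight higher
    return sum_word, carry_word

s, c = carry_save_add(11, 6, 5)
print(s, c, s + (c << 1))  # 8 7 22
```

Because the carries are preserved rather than propagated, the stage delay is independent of the word width, which is why the CSA is faster than the RCA and CLA for large operands.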


3.3 PIPELINED ACCUMULATOR (PA)

Fig. 3.4 Conventional FCF PA


Fig. 3.4 shows examples of the two-stage 32-bit pipelined accumulator (PA) that
is based on the ripple carry adder (RCA). A[31 : 0] represents data that move from the
outside to the input buffer register. A Reg[31 : 0] represents the data that are stored in the
input buffer. S[31 : 0] represents the data that are stored in the output buffer register as a
result of the accumulation. In the conventional PA structure, the flip-flops must be inserted
along the feed forward-cutset to ensure functional equality. Since the accumulator in Fig. 3.4
comprises two pipeline stages, the number of additional flip-flops for the pipelining is 33
(gray-coloured flip-flops). If the accumulator is pipelined to the n-stage, the number of

inserted flip-flops becomes 33(n−1), which confirms that the number of flip-flops for the
pipelining increases significantly as the number of pipeline stages is increased.

Fig. 3.5 Schematic diagram of conventional PA

In the conventional PA, the correct accumulation values of all the inputs up to
the corresponding clock cycle are produced in each clock cycle as shown in the timing
diagrams of Fig. 3.5 and Fig. 3.6. A two-cycle difference exists between the input and the
corresponding output due to the two-stage pipeline. In the conventional two-stage PA, the
accumulation output (S) is produced two clock cycles after the corresponding input is stored
in the input buffer. For example, in the conventional case, the generated carry from the lower
half and the corresponding inputs are fed into the upper half adder in the same clock cycle, as
shown in cycles 4 and 5 of Fig. 3.6.

Fig. 3.6 Example of 32-bit Pipelined Accumulator
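The latency behavior can be illustrated with a simplified register-transfer sketch in Python (the split between lower and upper adder halves and the carry register are omitted, and the function name and flush convention are assumptions, not the report's design):

```python
def simulate_two_stage_pa(inputs, width=32):
    """Cycle-level sketch of the pipelined accumulator: stage 1 captures
    the input in a buffer register, stage 2 adds the buffered value into
    the accumulation result on the next clock edge."""
    mask = (1 << width) - 1
    a_reg, acc = 0, 0
    outputs = []
    for a in inputs + [0]:                # one extra cycle to flush the buffer
        acc = (acc + a_reg) & mask        # stage 2: accumulate buffered input
        a_reg = a                         # stage 1: input buffer register
        outputs.append(acc)
    return outputs

print(simulate_two_stage_pa([1, 2, 3]))   # [0, 1, 3, 6]
```

Each input is visible in the accumulation result only after passing through both register stages, which is the pipeline delay the figure above illustrates.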


3.4 WALLACE MULTIPLIER


The Wallace tree multiplier is an efficient hardware circuit designed to achieve higher speeds of
operation, as shown in Fig. 3.7. It was designed by Chris Wallace in 1964. It is a variant of
the long multiplication method. The Wallace tree reduces the number of partial products and uses
carry-save addition for the summation of partial products. Here, the total delay incurred is
proportional to the logarithm of the length of the multiplier operand, which in turn results in
faster computations.

Fig. 3.7 Wallace tree algorithm


The Wallace tree method of multiplication has three steps and it is shown in Fig 3.7:

• Each bit of the input is multiplied with each bit of the other input.

• The number of partial products is reduced by half by using half and full adders.

• The wires in the two inputs are grouped together and then added.
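The three steps above can be sketched as a behavioral model of the reduction phase. The Python code below is an illustrative textbook-style formulation with hypothetical helper names, not the report's RTL; it compresses the columns of same-weight partial product bits with full and half adders until every column holds at most two bits:

```python
def wallace_reduce(columns):
    """columns[i] holds the bits of weight 2**i; compress with full adders
    (3 bits -> sum + carry) and half adders (2 bits -> sum + carry) until
    every column holds at most two bits."""
    while any(len(col) > 2 for col in columns):
        nxt = [[] for _ in range(len(columns) + 1)]
        for i, col in enumerate(columns):
            col = list(col)
            while len(col) >= 3:                          # full adder
                a, b, c = col.pop(), col.pop(), col.pop()
                nxt[i].append(a ^ b ^ c)
                nxt[i + 1].append((a & b) | (b & c) | (a & c))
            if len(col) == 2:                             # half adder
                a, b = col.pop(), col.pop()
                nxt[i].append(a ^ b)
                nxt[i + 1].append(a & b)
            nxt[i].extend(col)                            # lone bit passes through
        columns = nxt
    return columns

def column_value(columns):
    """Value represented by the columns (stands in for the final addition)."""
    return sum(bit << i for i, col in enumerate(columns) for bit in col)

# step 1: partial products of 13 * 11 (4-bit operands), one column per weight
x, y, n = 13, 11, 4
cols = [[] for _ in range(2 * n)]
for i in range(n):
    for j in range(n):
        cols[i + j].append((x >> i & 1) & (y >> j & 1))

# steps 2-3: reduce, then value the remaining two rows
reduced = wallace_reduce(cols)
assert all(len(c) <= 2 for c in reduced)
print(column_value(reduced))  # 143, i.e. 13 * 11: reduction preserves the value
```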

A Wallace tree with 15:4 compressors is made by a tree-like formation of many
15:4 compressors, each having four inputs that are multi-bit and producing two outputs that are
also multi-bit. Fig. 3.8, shown below, is a Wallace tree structure with 15:4 compressors that
has 16 partial products given as inputs and a carry and a sum as the outputs, which have the
same dimension as the input.


Fig. 3.8 Wallace tree algorithm using Compressor technique


CHAPTER 4
PROPOSED SYSTEM
4.1 PROPOSED MAC ARCHITECTURE
The MAC architecture was divided into two blocks one is multiplier and
another one is accumulator. The multiplier multiplies the two inputs and the results are stored
in the accumulator register.
A merged structure is used (pipelined Dadda with 4:2 compressor + MFCF-
PA). The column addition in the MAC operation is for the calculation of binary numbers in
each addition stage using the half-adders and/or full adders and then for the passing of the
results to the next addition stage. Since MAC computations are based on such additions, the
proposed pipelining method can also be applied to the machine learning-specific MAC
structure. In this section, the proposed pipelining method is applied to the MAC architecture
by using a unique characteristic of the Dadda multiplier. The Dadda multiplier performs the
column addition in a similar fashion to the Wallace multiplier, which is widely used, but it
has less area and a shorter critical path delay than the Wallace multiplier.

Fig. 4.1 Pipelined column addition structure with Dadda multiplier (Top) conventional
pipelining (Bottom)proposed FCF pipelining


Fig. 4.1 shows the pipelined column addition structures in the Dadda multiplier. The Dadda
multiplier performs the column addition to reduce the height of each stage. If a particular
column already satisfies the target height for the next column addition stage, then no
operation is performed during that stage. Using this property, the proposed pipelining method
can be applied to the MAC structure as well. Fig. 4.1(Top) is an example of pipelining where
the conventional method is used. All of the edges in the feedforward-cutset are subject to
pipelining. On the other hand, in the proposed FCF pipelining case [Fig. 4.1(Bottom)], if a
node in the column addition stage does not need to participate in the height reduction, it can
be excluded from the pipelining [the group in the dotted box of Fig. 4.1(Bottom)]. In other
words, in the conventional pipelining method, all the edges in the feedforward-cutset must be
pipelined to ensure functional equality regardless of a timing slack of each edge [Fig.
4.1(Top)]. However, in the FCF pipelining method, some edges in the cutset do not
necessarily have to be pipelined if the edges have enough timing slacks [Fig. 4.1(Bottom)].
As a result, a smaller number of flip-flops are required compared with the conventional
pipelining case. On the other hand, in the Wallace multiplier, as many partial products as
possible are involved in the calculation for each column addition stage. Because the partial
products do not have enough timing slack to be excluded from pipelining, the effectiveness of
the proposed FCF pipelining method is smaller in the Wallace multiplier case than in the
Dadda multiplier case.

Fig. 4.2 Proposed MAC unit


Fig. 4.2 shows the block diagrams of pipelined MAC architectures. The
proposed MAC architecture [Fig. 4.2] combines the FCF-MAC (MAC with the proposed
FCF pipelining) for the column addition and the MFCF-PA for the accumulation. Instead of
pipelining all of the final nodes in the column addition stage as in [Fig. 4.2], the proposed
FCF-MAC architecture is used to selectively choose the nodes for the pipelining. For the
proposed architecture, the merged multiply-accumulation style is adopted. The final adder is
placed in the last stage of the MAC operation. In general, the final adder is designed using the
CLA to achieve a short critical path delay. In contrast, the proposed design uses the MFCF-
PA style in the accumulation stage in consideration of the greater power and the area
efficiency of the MFCF-PA.
The design makes use of compressors in place of full adders, and the final
carry-propagate stage is replaced by a carry save adder. The Dadda multiplier basically
multiplies two unsigned integers. The proposed Dadda multiplier architecture comprises an
AND array for computing the partial products and a carry save adder in the final stage of
addition. In the proposed architecture, partial product reduction is accomplished by the use of
4:2 compressor structures, and the final stage of addition is performed by the carry save adder.
This multiplier architecture comprises a partial product generation stage, a partial product
reduction stage, and a final addition stage. The latency in the Dadda multiplier can be
reduced by decreasing the number of adders in the partial product reduction stage.
In the proposed architecture, a multi-bit compressor is used to reduce the
number of partial product addition stages. The combined factors of low
power, low transistor count, and minimum delay make the 4:2 compressor well suited for this purpose.
In this compressor, the outputs generated at each stage are used efficiently by
replacing the XOR blocks with multiplexer blocks. The select bits to the multiplexers are
available much ahead of the inputs, so the critical path delay is minimized. The various
adder structures in the conventional architecture are replaced by compressors.
4.2 MODIFIED 4:2 COMPRESSOR
However, to improve the regularity of the arrangement of cells in the multiplier,
modifications to the logic expressions of the standard 4:2 compressor are proposed, with logic
low at Cin and the output Cout neglected. The modified 4:2 exact compressor has inputs X1, X2, X3, X4
and outputs Sum (S) and Carry (C). Elimination of Cout in the 4:2 compressor generates an error for
X1X2X3X4 = “1111”.


Fig. 4.3 Block Diagram of Exact 4:2 Compressor


Conversely, the approximation logic, which produces logic high in Sum for
X1X2X3X4 = “1111”, reduces the MaxED (maximum error distance) to -1. The block diagram of the
modified 4:2 exact compressor is shown in Fig. 4.3 along with the corresponding logic:
S = X1 ^ X2 ^ X3 ^ X4
C = (X1 ^ X2 ^ X3 ^ X4)' & X4

However, in the multiplier implementation, we account for the Cout elimination in the modified
4:2 exact compressor with an error compensation bit E = X1 & X2 & X3 & X4.
Figs. 4.4 and 4.5 represent the implementation of the compressor technique in the Dadda multiplier.

Fig. 4.4 Algorithm implementation of DADDA multiplier


Fig. 4.5 Algorithm implementation using Compressor technique


Fig. 4.4 and Fig. 4.5 show the proposed Dadda multiplier partial product reduction. In Fig. 4.4,
the reduction part uses half adders, full adders, and 4:2 compressors; each partial product bit
is represented by a dot. The Dadda multiplier is constructed by considering all the bits in each
set of four rows at a time and compressing them in an appropriate manner. Thus, compressors form
the essential requirement of high-speed multipliers. The speed, area, and power consumption
of the multipliers are in direct proportion to the efficiency of the compressors. In the first
stage, 4 half adders, 4 full adders, and 10 compressors are utilized to reduce the partial
products into at most four rows. In the second (final) stage, 6 half adders, 2 full adders,
and 5 compressors are used to compute the two final rows of the partial products.

In the proposed architecture, the outputs are utilized efficiently by using
multiplexers at select stages in the circuit. Also, additional inverter stages are eliminated.
This in turn contributes to the reduction of delay, power consumption, and gate count (area).
The term accumulator (or register) refers to a flip-flop or latch that is used to store one
bit of data. In our proposed work, the accumulator register is used to store the output from the
multiplier.
4.3 PROPOSED PIPELINING FCF ARCHITECTURE
The accumulator register consists of D flip-flops and basic gates operated in
clock synchronization. The register cell has three inputs (D, reset, and clock) and one
output (Q). The flip-flop stores the input value when the clock is 1; if the
reset signal is 0, the flip-flop passes the stored value to the output through a tristate
buffer.
In our proposed work, the D flip-flop is implemented using a technique called
DDFF, a hybrid flip-flop that uses the sleepy stack technique for less power dissipation
and low leakage currents, which occur during the storing of data in the register block. The
block diagram is shown in Fig. 4.6.

Fig. 4.6 One-bit Accumulator register


Since the accumulator in Fig. 4.7 comprises two pipeline stages, the number of
additional flip-flops for the pipelining is 33 (gray-colored flip-flops). If the accumulator is
pipelined to the n-stage, the number of inserted flip-flops becomes 33(n−1), which confirms
that the number of flip-flops for the pipelining increases significantly as the number of
pipeline stages is increased. Fig. 4.7 shows the proposed FCF-PA. For the FCF-PA, only one
flip-flop is inserted for the two-stage pipelining.
Therefore, the number of additional flip-flops for the n-stage pipeline is n − 1 only. In the
conventional PA, the correct accumulation values of all the inputs up to the corresponding
clock cycle are produced in each clock cycle as shown in the timing diagram of Fig. 4.8.
A two-cycle difference exists between the input and the corresponding output due to the two-
stage pipeline. On the other hand, in the proposed architecture, only the final accumulation
result is valid, as shown in the timing diagram of Fig. 4.8. Fig. 4.8 shows examples of the ways
that the conventional PA and the proposed method (FCF-PA) work. In the conventional two-
stage PA, the accumulation output (S) is produced two clock cycles after the corresponding
input.
Fig. 4.8 shows examples of the two-stage 32-bit pipelined accumulator (PA) that is based on
the ripple carry adder (RCA). A[31 : 0] represents data that move from the outside to the
input buffer register. AReg[31 : 0] represents the data that are stored in the input buffer.


S[31 : 0] represents the data that are stored in the output buffer register as a result of the
accumulation. In the conventional PA structure [Fig. 3.5], the flip-flops must be inserted
along the feedforward-cutset to ensure functional equality.

Fig. 4.7 Schematic diagram of two stage 32-bit accumulator for Proposed FCF-PA

On the other hand, regarding the proposed structure, the output is generated one clock cycle
after the input arrives. Moreover, for the proposed scheme, the generated carry from the
lower half of the 32-bit adder is involved in the accumulation one clock cycle later than the
case of the conventional pipelining.
For example, in the conventional case, the generated carry from the lower half
and the corresponding inputs are fed into the upper half adder in the same clock cycle as
shown in the cycles 4 and 5 of Fig.4.8 (left). On the other hand, in the proposed FCF-PA, the
carry from the lower half is fed into the upper half one cycle later than the corresponding
input for the upper half, as depicted in the clock cycles 3-5 of Fig. 4.8 (right).


Fig. 4.8 (Left) Timing diagram of two stage 32-bit accumulator and (Right) Example
with propose FCF-PA
This characteristic makes the intermediate result that is stored in the output
buffer of the proposed accumulator different from the result of the conventional pipelining
case. In Fig. 4.8, which shows two-stage 32-bit pipelined-accumulation examples with the
conventional pipelining (left) and the proposed FCF-PA (right), the binary number “1” between
the two 16-bit hexadecimal numbers is a carry from the lower half. The proposed accumulator,
however, shows the same final output (cycle 5) as that of the conventional one.
In addition, regarding the two architectures, the number of cycles from the
initial input to the final output is the same. The characteristic of the proposed FCF pipelining
method can be summarized as follows: in the case where adders are used to process data in
an accumulator, the final accumulation result is the same even if binary inputs are fed to the
adders in arbitrary clock cycles, as long as each input is fed once and only once. In the machine
learning algorithm, only the final result of the weighted sum of the multiplication between the
input feature map and the filters is used for the subsequent operation, so the proposed
accumulator would produce the same results as the conventional accumulator.
Meanwhile, the CLA has mostly been used to reduce the critical path
delay of the accumulator. The carry prediction logic in the CLA, however, causes a
significant increase in the area and the power consumption. For the same critical path delay,
the FCF-PA can be implemented with less area and lower power consumption compared with
the accumulator that is based on the CLA.


4.4 MODIFIED FCF FOR POWER REDUCTION


Although the proposed FCF-PA can reduce the area and the power
consumption by replacing the CLA, there are certain input conditions in which the undesired
data transition in the output buffer occurs, thereby reducing the power efficiency when 2’s
complement numbers are used.
Fig. 4.9 shows an example of the undesired data transition. The inputs are 4-bit 2’s
complement binary numbers. AReg[7 : 4] is the sign extension of AReg[3], which is the sign
bit of AReg[3 : 0]. In the conventional pipelining [Fig. 4.9 (left)], the accumulation result (S) in
cycle 3 and the data stored in the input buffer (AReg) in cycle 2 are added and stored in the
output buffer (S) in cycle 4. In this case, the “1” in AReg[2] in cycle 2 and the “1” in S[2] in
cycle 3 are added, thereby generating a carry. The carry is transmitted to the upper half of the
S, and hence, S[7:4] remains as “0000” in cycle 4.

Fig. 4.9 Example of an undesired data transition in the two-stage 8-bit PAs with 4-bit
2's complement input numbers. Binary number “1” between the two 4-bit hexadecimal
numbers is a carry from the lower half.
On the other hand, in the FCF-PA [Fig. 4.9], AReg[2] and S[2] are added in
cycle 2, thereby generating a carry. In cycle 3, the generated carry from the lower half is
stored in the flip-flop. The carry is no longer propagated toward the upper half in this clock
cycle. In the next clock cycle (cycle 4), the carry that is stored in the flip-flop is transferred to

Dept. of E.C.E Vemu Institute of Technology Page 21


Optimization of Feed Forward Cutset Free Pipelined Multiply Accumulate Unit for Machine
Learning Accelerator

the carry input of the upper half. During the calculation, it can be observed that S[7 : 4]
changes to “1111” in cycle 3 and returns to “0000” in cycle 4. Although this undesired data
transition does not affect the accuracy of accumulation results, it reduces the power efficiency
of the FCF-PA. Fig. 4.10(left) shows the structure of the FCF-PA for the 2’s complement
numbers. The binary numbers in the diagram indicate the sign extension of AReg[m − 1] and
the carry bits for the undesired data transition case.

Fig. 4.10 Proposed FCF-PA (left); MFCF-PA for improvement of power efficiency (right)

To prevent the undesired data transitions, the sign-extended input to RCA[n − 1 : k]
and the carry-out of RCA[k − 1 : 0] must be modified to “0” when both AReg[m − 1] and the
carry-out from RCA[k − 1 : 0] are “1”; the undesired data transition can therefore be
prevented by detecting this condition. Since the critical path delay becomes too long if the
upper-half addition must wait for the detection of the lower-half carry-out condition,
the modified version of the FCF-PA (MFCF-PA) is proposed here, as shown in Fig. 4.10.
An additional flip-flop was added between the two RCAs to prevent the
formation of a long critical path. RCA[n − 1 : k] receives both AFix with the sign extension
and the CarryFix signals as modified input values. AFix generates “0” when AReg[m − 1]
and Carry are both “1.” Otherwise, AReg[m − 1] is buffered in AFix as it is. Similarly,
CarryFix generates “0” when AReg[m − 1] and Carry are both “1.” Otherwise, Carry is
buffered in CarryFix as it is. Although an additional NORI (NOR + INV) gate causes an
additional delay (150 ps at the SS corner, 10% supply voltage drop, 125 °C temperature in our
analysis) in the critical path, the overhead is negligible considering that the target clock period
ranges from 1.1 to 2.7 ns. In the event that the accumulator is pipelined to
multiple stages, the insertion of the additional logic into all of the pipeline stages may
increase the area overhead.
To reduce the overhead, the modified structure [Fig. 4.10(right)] is inserted
into only one pipeline stage. For the rest of the pipeline stages, only the FCF-PA [Fig.
4.10(left)] is used. Fig. 4.11 shows a block diagram of the MFCF-PA. A good power
efficiency is still shown, because the probability of the sign extension bit becoming “1” is
reduced.
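The reason the correction is safe can be verified directly: when the sign-extension bits are all “1” and a lower-half carry “1” arrives, they cancel modulo the upper-half width (“1111” + 1 = “0000” with the carry-out discarded). A small check, using our own helper names for the 8-bit example of Fig. 4.9:

```python
def fix_inputs(sign, carry):
    """MFCF-PA input correction (NOR + INV): AFix and CarryFix are both
    forced to 0 when the sign-extension bit and the lower-half carry
    are both 1, suppressing the transient all-ones pattern."""
    if sign and carry:
        return 0, 0
    return sign, carry

WIDTH = 4                        # upper half of the 8-bit example in Fig. 4.9
EXT = (1 << WIDTH) - 1           # sign-extension pattern "1111"
for s_high in range(1 << WIDTH):
    for sign in (0, 1):
        for carry in (0, 1):
            afix, cfix = fix_inputs(sign, carry)
            # The upper-half RCA result (mod 2^WIDTH) is unchanged by the fix.
            assert (s_high + sign * EXT + carry) % (1 << WIDTH) == \
                   (s_high + afix * EXT + cfix) % (1 << WIDTH)
```

The accumulation value is therefore preserved while the output buffer no longer toggles between “0000” and “1111”.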

Fig. 4.11 Block diagram of MFCF-PA.


CHAPTER 5
IMPLEMENTATION
5.1 SOFTWARE OVERVIEW

Xilinx was founded and incorporated in California in February 1984. In April
1990, the company reincorporated in Delaware. Xilinx offers a comprehensive multi-node
portfolio to address requirements across a wide set of applications. Whether you are designing
a state-of-the-art, high-performance networking application requiring the highest capacity,
bandwidth, and performance, or searching for a low-cost solution, an FPGA takes
software-defined technology to the next level.
In addition to its programmable platforms, Xilinx provides design
services, customer training, field engineering, and technical support. License registration
is required to use the Web Edition of Xilinx ISE, which is free and can be renewed an
unlimited number of times.

Xilinx ISE (Integrated Software Environment) controls all aspects of the
development flow. Project Navigator is a graphical interface for users to access software
tools and relevant files associated with the project. It is divided into four sub-windows:

1. Sources window (top left): hierarchically displays the files included in the
project

2. Processes window (middle left): displays available processes for the source
file currently selected

3. Transcript window (bottom): displays status messages, errors, and warnings

4. Workspace window (top right): contains multiple document windows (such
as HDL code, reports, schematics, and so on) for viewing and editing

The Xilinx ISE WebPACK is a complete FPGA/CPLD programmable logic design suite
providing:
1. Specification of programmable logic via schematic capture or Verilog/VHDL
2. Synthesis and place & route of the specified logic for various Xilinx FPGAs and CPLDs
3. Functional (behavioral) and timing (post-place & route) simulation
4. Download of configuration data into the target device via a communications cable


Xilinx currently claims that its FPGAs, due to their ability to be customized
for different workloads, accelerate processing by 40 times for machine-learning inference, 10
times for video and image processing, and 100 times for genomics, with respect to CPU- or
GPU-based frameworks.

Xilinx FPGAs provide system integration while optimizing for
performance per watt. The low-cost Spartan family of FPGAs is fully supported by this
release, as is the family of CPLDs, which means small design teams and educational
institutions have no overheads from the cost of development software.

The ISE software controls all aspects of the design flow. Through the Project
Navigator interface, you can access all of the design entry and design implementation tools.
You can also access the files and documents associated with your project.

Fig. 5.1 Project navigator


By default, the Project Navigator interface is divided into four panel sub-
windows, as seen in Figure 5.1. On the top left are the Start, Design, Files, and Libraries
panels, which include display and access to the source files in the project as well as access to
running processes for the currently selected source. The Start panel provides quick access to
opening projects as well as frequently accessed reference material, documentation, and tutorials.
At the bottom of the Project Navigator are the Console, Errors, and Warnings panels, which
display status messages, errors, and warnings. To the right is a multi-document interface
(MDI) window referred to as the Workspace. The Workspace enables you to view design
reports, text files, schematics, and simulation waveforms. Each window can be resized,
undocked from Project Navigator, moved to a new location within the main Project Navigator
window, tiled, layered, or closed. You can use the View > Panels menu commands to open
or close panels. You can use Layout > Load Default Layout to restore the default window
layout. These windows are discussed in more detail in the following sections.
5.2 VHDL CODE

There are two widely used HDLs: Verilog and VHDL. IEEE VHDL
(VHSIC Hardware Description Language) is a hardware description language used in
electronic design automation to describe digital and mixed-signal systems, for example
field-programmable gate arrays and integrated circuits.
VHDL can also be used as a general-purpose parallel programming language. The
basic rules in VHDL are free formatting and case insensitivity. It also uses identifiers,
which name objects and are composed of letters, digits, and underscores, and must start
with a letter.
Library and package: library IEEE; and use IEEE.std_logic_1164.all; These
statements are used to add types, operators, and functions.
Entity declaration:
entity eq1 is port (i0, i1: in std_logic; eq: out std_logic);
end eq1;
The basic format for an I/O port declaration is:
signal_name1, signal_name2, …: mode data_type;
Data types and operators:
std_logic type: defined in the std_logic_1164 package; it consists of nine values.


Logical operators: operators such as not, and, or, and xor are defined for the
std_logic and std_logic_vector data types.
Architecture body: the architecture body may include declarations. There are three
concurrent statements between begin and end. Two internal signals are declared here:
signal p0, p1: std_logic;

5.3 VERILOG CODE

A module is a block of Verilog code that implements certain functionality.


Modules can be embedded within other modules, and a higher-level module can
communicate with its lower-level modules using their input and output ports.
5.3.1 SYNTAX

A module should be enclosed within the module and endmodule keywords. The name of the

module should be given right after the module keyword, and an optional list of ports may be

declared as well.

module <name> ([port list]);

// Contents of the module

endmodule

// A module can have an empty port list

module name;

// Contents of the module

endmodule

All variable declarations, functions, tasks, dataflow statements, and lower-level
module instances must be defined within the module and endmodule keywords.
5.3.2 TOP-LEVEL MODULES
A top-level module is one that contains all other modules. A top-level module
is not instantiated within any other module. For example, design modules are usually
instantiated within top-level testbench modules so that simulation can be run by providing
input stimulus. But, the testbench is not instantiated within any other module because it is a
block that encapsulates everything else.

1. Design Top Level

The design code shown below has a top-level module called design. It contains
all other sub-modules required to make the design complete.

A sub-module can itself contain nested sub-modules, such as mod3 inside
mod1 and mod4 inside mod2.

// Design code

module mod3 ([port list]);

reg c;

// Design code

endmodule

module mod4 ([port list]);

wire a;

// Design code

endmodule

2. Testbench Top Level

The testbench module contains a stimulus to check the functionality of the
design and is primarily used for functional verification with simulation tools.

Hence the design is instantiated and named d0 inside the testbench module. The
testbench is the top-level module from a simulator perspective.


module testbench;

design d0 ([port_list_connections]);

endmodule

5.4 RTL SCHEMATIC

Register Transfer Level (RTL) is an abstraction for defining the digital


portions of a design. It is the principal abstraction used for defining electronic systems today
and often serves as the golden model in the design and verification flow. The RTL design is
usually captured using a hardware description language (HDL) such as Verilog or VHDL.
While these languages are capable of defining systems at other levels of abstraction, it is
generally the RTL semantics of these languages, and in particular the subset defined as the
synthesizable subset, that is used: the language constructs that can be reliably
fed into a logic synthesis tool, which in turn creates the gate-level abstraction of the design
that is used for all downstream implementation operations.

RTL is based on synchronous logic and contains three primary pieces: registers,
which hold state information; combinatorial logic, which defines the next-state inputs;
and clocks, which control when the state changes.

Fig. 5.2 Sample RTL view


5.5 MODELING CONCEPTS

Verilog HDL modelling language supports three kinds of modelling styles:


gate-level, dataflow, and behavioural. The gate-level and dataflow modelling are used to
model combinatorial circuits whereas the behavioural modelling is used for both
combinatorial and sequential circuits.
5.5.1 GATE-LEVEL MODELLING

Verilog HDL supports built-in primitive gate modelling. The gates supported
are multiple-input, multiple-output, tristate, and pull gates. The multiple-input gates
supported are and, nand, or, nor, xor, and xnor, which have two or more inputs
and only one output. The multiple-output gates supported are buf and not, which have one
or more outputs and only one input. The language also supports modelling of
tristate gates, which include bufif0, bufif1, notif0, and notif1.
These gates have one input, one control signal, and one output. The pull gates
supported are pullup and pulldown, with a single output (no input) only.
The basic syntax for each type of gate with zero delay is as follows:
and | nand | or | nor | xor | xnor [instance name] (out, in1, …, inN); // [] is optional and | is
selection buf | not [instance name] (out1, out2, …, outN, input);
bufif0 | bufif1 | notif0 | notif1 [instance name] (output, input, control);
pullup | pulldown [instance name] (output);
One can also have multiple instances of the same type of gate in one construct, separated by
commas, such as
and [inst1] (out11, in11, in12), [inst2] (out21, in21, in22, in23), [inst3] (out31, in31, in32,
in33); The language also allows delays to be expressed when instantiating gates. The delay
expressed is from input to output. The delays can be expressed in the form of rise, fall,
and turn-off delays; one, two, or all three types of delays can be expressed in a given
instance expression. The turn-off delay is applicable to gates whose output can be turned
OFF (e.g., notif1).
For example,
and #5 A1(out1, in1, in2); // the rise and fall delays are 5 units
and #(2,5) A2(out2, in1, in2); // the rise delay is 2 units and the fall delay is 5 units


notif1 #(2, 5, 4) A3(out3, in2, ctrl1); // the rise delay is 2, the fall delay is 5, and the turn-off
delay is 4 units
Gate-level modelling is useful when a circuit is simple combinational logic, for
example a multiplexer.
A multiplexer is a simple circuit that connects one of many inputs to an
output. In this part, you will create a simple 2-to-1 multiplexer and extend the design to
multiple bits.
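As a behavioural reference for this exercise (a Python model of the gate equation, with our own names; the actual design would be written in Verilog), the same select logic extends naturally to multiple bits:

```python
def mux2(a, b, sel, width=1):
    """2-to-1 multiplexer: out = b when sel is 1, else a, built from the
    gate equation out = (a & ~sel) | (b & sel) replicated across `width` bits."""
    mask = (1 << width) - 1
    sel_ext = mask if sel else 0      # replicate the select line across all bits
    return ((a & ~sel_ext) | (b & sel_ext)) & mask

# Single-bit and 4-bit uses of the same equation.
assert mux2(0, 1, 1) == 1
assert mux2(0b1010, 0b0101, 0, width=4) == 0b1010
```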

5.5.2 DATAFLOW MODELLING

Dataflow modelling style is mainly used to describe combinational circuits.


The basic mechanism used is the continuous assignment. In a continuous assignment, a
value is assigned to a data type called a net.
The syntax of a continuous assignment is
assign [delay] LHS_net = RHS_expression;
Where LHS_net is a destination net of one or more bit, and RHS_expression is
an expression consisting of various operators. The statement is evaluated any time any of
the source operand values changes, and the result is assigned to the destination net after the
delay. The gate-level modelling examples listed in Section 5.5.1 can be described in dataflow
modelling using continuous assignments.
For example,
assign out1 = in1 & in2; // perform the AND function on in1 and in2 and assign the result to out1
assign out2 = ~in1;
assign #2 z[0] = ~(ABAR & BBAR & EN); // perform the desired function and assign the
result after 2 units
The target in the continuous assignment expression can be one of the following:
1. A scalar net (e.g., 1st and 2nd examples above)
2. Vector net
3. Constant bit-select of a vector (e.g., 3rd example above)
4. Constant part-select of a vector
5. Concatenation of any of the above
Let us take another set of examples in which scalar and vector nets are declared and used
wire COUT, CIN; // scalar net declarations
wire [3:0] SUM, A, B; // vector net declarations


assign {COUT, SUM} = A + B + CIN; // the A and B vectors are added with CIN, and the result is
// assigned to a concatenation of a scalar net and a vector net
Note that multiple continuous assignment statements are not allowed on the same destination
net.

5.5.3 BEHAVIOURAL MODELLING

Behavioural modelling is used to describe complex circuits. It is primarily


used to model sequential circuits, but can also be used to model pure combinatorial circuits.
The mechanisms (statements) for modelling the behaviour of a design are
initial statements and always statements.
A module may contain an arbitrary number of initial or always statements and
may contain one or more procedural statements within them. They are executed concurrently
(i.e. to model parallelism such that the order in which statements appear in the model does
not matter) with respect to each other whereas the procedural statements are executed
sequentially (i.e. the order in which they appear does matter). Both initial and always
statements are executed at time=0 and then only always statements are executed during the
rest of the time.
The syntax is as follows:
initial [timing_control] procedural_statement;
always [timing_control] procedural_statement;
where a procedural_statement is one of:
procedural_assignment
conditional_statement
case_statement
loop_statement
wait_statement
The initial statement is not synthesizable and is normally used in testbenches.
The always statement is synthesizable, and the resulting circuit can be a combinational or
sequential circuit.
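As an illustration (not taken from the project code), the two statement types can be sketched as follows: a synthesizable always block describing a D flip-flop, and a non-synthesizable initial block providing testbench stimulus for it.

```verilog
// always statement: synthesizable D flip-flop with synchronous reset
module dff (input clk, rst, d, output reg q);
  always @(posedge clk)
    if (rst) q <= 1'b0;
    else     q <= d;
endmodule

// initial statement: non-synthesizable stimulus, as used in a testbench
module dff_tb;
  reg clk = 0, rst = 1, d = 0;
  wire q;

  dff uut (.clk(clk), .rst(rst), .d(d), .q(q));

  always #5 clk = ~clk;        // free-running clock, 10-time-unit period

  initial begin
    #12 rst = 0; d = 1;        // release reset and drive data
    #20 $finish;               // end the simulation
  end
endmodule
```

Note that the order of the always and initial blocks in dff_tb is irrelevant: both start at time 0 and run concurrently, while the statements inside the initial begin-end block run sequentially.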


CHAPTER 6
RESULTS

6.1 PROPOSED 4:2 COMPRESSOR DESIGN AND RTL SCHEMATICS

Fig. 6.1 Proposed 4:2 Compressor design

Fig. 6.2 RTL schematics of Proposed 4:2 Compressor
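For reference, a conventional (textbook) 4:2 compressor can be written in a few lines of Verilog using the standard XOR–MUX formulation. This is the baseline structure such compressors start from, not necessarily the modified design of Fig. 6.1.

```verilog
// Generic 4:2 compressor: compresses four partial-product bits plus cin
// into sum (weight 1) and carry/cout (weight 2). cout depends only on
// x1..x3, so it never ripples through cin.
module compressor_4_2 (
  input  x1, x2, x3, x4, cin,
  output sum, carry, cout
);
  wire t = x1 ^ x2 ^ x3 ^ x4;

  assign sum   = t ^ cin;
  assign cout  = (x1 ^ x2) ? x3 : x1;  // MUX selected by x1^x2
  assign carry = t ? cin : x4;         // MUX selected by the 4-input XOR
endmodule
```

The invariant x1 + x2 + x3 + x4 + cin = sum + 2·(carry + cout) holds for all 32 input combinations, which is what allows columns of partial products to be compressed without loss.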


6.2 PROPOSED 4:2 COMPRESSOR SIMULATION RESULTS

Fig. 6.3 Simulation results of Proposed 4:2 Compressor


6.3 DADDA MULTIPLIER DESIGN AND RTL SCHEMATICS

Fig. 6.4 Dadda multiplier design

Fig. 6.5 Dadda multiplier RTL schematic View


6.4 DADDA MULTIPLIER SIMULATION RESULTS

Fig. 6.6 Dadda Multiplier Simulation Results

6.5 SYNTHESIS RESULTS OF PROPOSED MAC -POWER

Fig. 6.7 Synthesis results-Power


6.6 SYNTHESIS RESULTS OF PROPOSED MAC -AREA

Fig. 6.8 Synthesis results-Area

6.7 SYNTHESIS RESULTS OF PROPOSED MAC -DELAY

Fig. 6.9 Synthesis results-Delay


6.8 SIMULATION RESULTS OF PROPOSED MAC

Fig. 6.10 Simulation results-1

Fig. 6.11 Simulation results-2


6.9 RTL SCHEMATICS OF PROPOSED MAC

Fig. 6.12 MAC unit RTL schematic View
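For orientation, the overall function computed by the MAC unit in Fig. 6.12 can be sketched behaviourally in a few lines. This is a minimal illustrative model only; the proposed design replaces the generic multiply and add with the Dadda-tree column addition and the FCF pipelined accumulator shown in the following schematics, and the 8-bit widths here are assumptions for the sketch.

```verilog
// Minimal behavioural MAC: acc <= acc + a*b on every clock edge.
// Illustrative sketch only, not the proposed structural design.
module mac8 (
  input              clk, rst,
  input      [7:0]   a, b,
  output reg [19:0]  acc    // 16-bit product plus guard bits for accumulation
);
  always @(posedge clk)
    if (rst) acc <= 20'd0;
    else     acc <= acc + a * b;   // multiply-accumulate step
endmodule
```

A synthesizer infers a multiplier, an adder, and a register from this description; the contribution of the proposed work lies in how those three elements are pipelined.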

Fig. 6.13 Column addition stage-1 RTL schematic View


Fig. 6.14 Proposed pipeline stage RTL schematic View

Fig. 6.15 Column addition stage-2 RTL schematic View


Fig. 6.16 Column addition stage RTL schematic detailed View

Fig. 6.17 Compressor logic in column addition stage RTL schematic View


Fig. 6.18 Proposed accumulator RTL schematic View

Fig. 6.19 Accumulator consisting of MFCF-PA RTL schematic View


Fig. 6.20 MFCF-PA used in first stage of accumulator RTL schematic View

Fig. 6.21 MFCF-PA RTL schematic View


Fig. 6.22 MFCF-PA logic RTL schematic internal View (NOT+NOR logic)

6.10 PARAMETER ANALYSIS

Table 6.1: Performance Analysis of Parameters for the proposed and existing methods

The key performance parameters of the proposed method are optimized with respect to Area,
Power, Delay and Energy.


CHAPTER 7
CONCLUSION

A high-speed MAC architecture using 4:2 compressors is proposed in this work. The 4:2
compressor was chosen as the optimal building block since modern datapath elements have fixed
widths in powers of two (2^n, with n = 3, 4, 5, ...). In the proposed scheme, the number of flip-flops
in a pipeline is reduced by relaxing the feedforward-cutset constraint, exploiting a
characteristic unique to machine learning algorithms. The proposed accumulator reduced the
area and the power consumption by 17% and 19%, respectively, compared with the
accumulator based on the conventional CLA adder. For the MAC architecture, the proposed
scheme exhibits an area utilization of 4%, a power consumption of 0.032 W, and a delay of
5.342 ns. From the obtained results, it is evident that the proposed MAC can operate at
higher clock frequencies than the conventional schemes. In addition, the proposed MAC reduces
the power requirement to a greater extent and is suitable for portable designs.


CHAPTER 8

REFERENCES
[1] Jithendra Babu N., Sarma R., “A Novel Low Power Multiply–Accumulate (MAC) Unit
Design for Fixed Point Signed Numbers”, In: Dash S., Bhaskar M., Panigrahi B., Das S. (eds)
Artificial Intelligence and Evolutionary Computations in Engineering Systems. Advances in
Intelligent Systems and Computing, 2016, vol 394. Springer, New Delhi
[2] Seo. Y and Kim. D, “A New VLSI Architecture of Parallel Multiplier Accumulator Based on
Radix-2 Modified Booth Algorithm”, in IEEE Transactions on Very Large-Scale Integration
(VLSI) Systems, 2010, vol. 18, no. 2, pp. 201-208.
[3] Milos D. Ercegovac, Tomás Lang , “Digital Arithmetic” , Elsevier, 2004, Pg: 59-63.
[4] Chang, Chip Hong, Zhang et al., “Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2
Compressors for Fast Arithmetic Circuits” IEEE Transactions on Circuits and Systems I: Regular
Papers, 2004, DOI: 10.1109/TCSI.2004.835683.
[5] Singh. K. N., Tarunkumar. H, “A review on various multipliers designs in VLSI”, Proc.
Annual IEEE India Conference (INDICON), 2015, New Delhi, pp. 1-4.
[6] Patil. P. A., Kulkarni. C, “A Survey on Multiply Accumulate Unit”, Fourth International
Conference on Computing Communication Control and Automation (ICCUBEA), 2018, Pune,
India, pp. 1-5.
[7] Sai Kumar. M, D. Kumar. A and Samundiswary. P, “Design and performance analysis of
Multiply-Accumulate (MAC) unit”, International Conference on Circuits, Power and Computing
Technologies (ICCPCT), Nagercoil, 2014, pp. 1084–1089.
[8] Nagaraju. N, Ramesh. S.M., “Implementation of high speed and area efficient MAC unit for
industrial applications”, Journal of Cluster Computing (Springer) 22, pp. 4511–4517, 2019.
https://doi.org/10.1007/s10586-018-2060-z.
[9] Kwon. O, Nowka. K, & Swartzlander. E.E, “A 16-Bit by 16-Bit MAC Design Using Fast 5:3
Compressor Cells”, The Journal of VLSI Signal Processing-Systems for Signal, Image, and
Video Technology, 2002, vol. 31, pp.77– 89, DOI: https://doi.org/10.1023/A:1015333103608


[10] Malleshwari, R. and E. Srinivas, “FPGA Implementation of Low Power and High Speed 64-
Bit Multiply Accumulate Unit for Wireless Applications”, International Journal of Science and
Research, 2016. DOI: 10.21275/v5i4.14041608.
[11] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1,
pp. 127–138, Jan. 2017.
[12] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Comput., vol. EC-
13, no. 1, pp. 14–17, Feb. 1964.
[13] L. Dadda, “Some schemes for parallel multipliers,” Alta Frequenza, vol. 34, no. 5, pp. 349–
356, Mar. 1965.
[14] P. F. Stelling and V. G. Oklobdzija, “Implementing multiply-accumulate operation in
multiplication time,” in Proc. 13th IEEE Symp. Comput. Arithmetic, Jul. 1997, pp. 99–106.
[15] T. T. Hoang, M. Sjalander, and P. Larsson-Edefors, “A high-speed, energy-efficient two-
cycle multiply-accumulate (MAC) architecture and its application to a double-throughput MAC
unit,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 12, pp. 3073–3081, Dec. 2010.
[16] W. J. Townsend, E. E. Swartzlander, and J. A. Abraham, “A comparison of Dadda and
Wallace multiplier delays,” Proc. SPIE, Adv. Signal Process. Algorithms, Archit., Implement.
XIII, vol. 5205, pp. 552–560, Dec. 2003.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep
convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[18] K. Manikantta Reddy, M. H. Vasantha, Y. B. N. Kumar, and D. Dwivedi, “Design and
analysis of multiplier using approximate 4-2 compressor,” AEU - Int. J. Electron. Commun.,
vol. 107, pp. 89–97, Jul. 2019.
[19] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based
models for speech recognition,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 577–585.
[20] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, “Design and analysis of approximate
compressors for multiplication,” IEEE Transactions on Computers, 2014, in press.
