VLSI Project Document
on
BACHELOR OF TECHNOLOGY
in
BONAFIDE CERTIFICATE
This is to certify that the project report entitled “OPTIMIZATION OF FEED FORWARD
CUTSET FREE PIPELINED MULTIPLY ACCUMULATE UNIT FOR MACHINE
LEARNING ACCELERATOR”, is being submitted by Ms. D. NAGAVENI (174M1A0422),
Mr. B. GIRI (174M1A0410), Ms. G. SWETHA (174M1A0430) and Ms. G. RAMYA SRI
(174M1A0431) in partial fulfillment of degree of BACHELOR OF TECHNOLOGY in
Date: Date:
DECLARATION
Place: P. KOTHAKOTA
Date: Ms. D. NAGAVENI 174M1A0422
Mr. B. GIRI 174M1A0410
Ms. G. SWETHA 174M1A0430
Ms. G. RAMYA SRI 174M1A0431
ACKNOWLEDGEMENT
Grateful thanks to Dr. K. Chandra Sekhar Naidu, Chairman, Vemu Institute of Technology,
for providing education in his esteemed institution.
I express my sincere thanks to Dr. Naveen Kilari, Principal, Vemu Institute of Technology for
his good administrative support during my course of study.
I express my deep sense of gratitude to Dr. S. Munirathnam, HOD, Dept. of ECE, Vemu Institute of
Technology, for his unconditional help during my study period.
I express my sincere thanks to Project Co-ordinator Dr. G. Elairaja, Professor, Dept. of ECE,
for his valuable encouragement to the project.
I would like to acknowledge Project guide Dr. S. Murali Mohan, Professor, Dept. of ECE, for
his supervision and valuable guidance with constant monitoring in completing the project
successfully.
This project work would not have been possible without the inspiration, moral and emotional
support of my parents. Their continuous words of encouragement, immense endurance and
understanding deserve a special vote of appreciation.
Finally, I would like to express my sincere thanks to all faculty members of the ECE Department,
lab technicians, and friends, one and all, who have helped me to complete the project work
successfully.
Ms. D. NAGAVENI 174M1A0422
Mr. B. GIRI 174M1A0410
Ms. G. SWETHA 174M1A0430
Ms. G. RAMYA SRI 174M1A0431
CONTENTS
Chapter No. Title Page No.
Abstract i
List of Tables v
List of Abbreviations vi
Chapter 1 INTRODUCTION 1-3
3.2 Adder 8
5.3.1 Syntax 27
Chapter 7 CONCLUSION 44
LIST OF FIGURES
Fig. No. Title Page No.
5.2 Sample RTL view 29
6.22 MFCF-PA logic RTL schematic internal View (NOT+NOR logic) 43
LIST OF TABLES
LIST OF ABBREVIATIONS
Acronym Expansion
XOR Exclusive OR
PE Processing Element
PA Pipelined Accumulator
Optimization of Feed Forward Cutset Free Pipelined Multiply Accumulate Unit for Machine
Learning Accelerator
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
In a machine learning accelerator, a large number of Multiply–Accumulate
(MAC) units are included for parallel computations, and timing-critical paths of the system
are often found in the unit. A multiplier typically consists of several computational parts
including a partial product generation, a column addition, and a final addition. An
accumulator consists of the carry propagation adder. Long critical paths through these stages
lead to the performance degradation of the overall system. To minimize this problem, various
methods have been studied. The Wallace and Dadda multipliers are well-known examples for
performing the column addition, and the Carry-Lookahead Adder (CLA) is often used to
reduce the critical path in the accumulator or the final addition stage of the multiplier.
Meanwhile, a MAC operation is performed in the machine learning algorithm to compute a
partial sum that is the accumulation of the input multiplied by the weight.
In a MAC unit, the multiply and accumulate operations are usually merged to
reduce the number of carry propagation steps from two to one. Such a structure, however,
still has a long critical path delay that is approximately equal to the critical path delay
of a multiplier. It is well known that pipelining is one of the most popular approaches for
increasing the operation clock frequency. Modern DSP algorithms used in portable
applications demand high performance VLSI data path systems with low area and power.
Multiply and accumulate units (MAC) are the most computation intensive component of
many DSP operations such as FFT, convolution, and filtering. This puts the onus on the design of
efficient MAC architectures to achieve optimization of DSP modules. A MAC unit consists of an
‘n’-bit multiplier, which repeatedly multiplies pairs of ‘n’-bit input operands, an adder of
width ‘2n’ to add consecutive products, and an accumulator of width ‘2n’ to accumulate the
result in a register.
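The unit described above can be sketched as a small behavioral model. Python is used here purely to illustrate the arithmetic of an n = 16 multiplier feeding a 2n-bit accumulator; the function names are ours and do not correspond to any hardware block:

```python
MASK = (1 << 32) - 1   # 2n-bit accumulator width for n = 16

def mac_step(acc, a, b):
    """One multiply-accumulate step: acc <- acc + a*b over a 2n-bit datapath."""
    product = (a * b) & MASK          # 16-bit x 16-bit product fits in 32 bits
    return (acc + product) & MASK     # 32-bit accumulation (wraps on overflow)

def mac(pairs):
    """Accumulate a stream of (input, weight) pairs, as in a weighted sum."""
    acc = 0
    for a, b in pairs:
        acc = mac_step(acc, a, b)
    return acc

print(mac([(3, 4), (5, 6)]))   # 42
```

This is exactly the partial-sum computation of the machine learning algorithm: each input is multiplied by its weight and added to the running accumulation.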
1.2 FEED FORWARD-CUTSET RULE FOR PIPELINING
It is well known that pipelining is one of the most effective ways to reduce the
critical path delay, thereby increasing the clock frequency. This reduction is achieved through
the insertion of flip-flops into the datapath. In addition to reducing critical path delays
through pipelining, it is also important to satisfy functional equality before and after
pipelining. The point at which the flip-flops are inserted to ensure functional equality is
called the feed forward-cutset.
Cutset: A set of edges of a graph such that, if these edges are removed, the graph becomes
disconnected.
Feed forward-cutset: A cutset where the data move in the forward direction on all of the
cutset edges.
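These definitions can be checked mechanically. The sketch below is illustrative only (the toy graph and node names are made up, not taken from Fig. 1.1): given a partition of the nodes, the crossing edges form a cutset, and the cutset is feed-forward when every crossing edge leaves the same side:

```python
def cut_edges(edges, left):
    """Edges crossing the partition (left, rest). Removing them disconnects
    the two sides, so they form a cutset."""
    return [(u, v) for (u, v) in edges if (u in left) != (v in left)]

def is_feedforward_cutset(edges, left):
    """Feed-forward cutset: every crossing edge goes from 'left' to the other
    side, i.e. data move in the forward direction on all cutset edges."""
    return all(u in left for (u, v) in cut_edges(edges, left))

# A toy dataflow graph: a -> b -> c and a -> c
edges = [("a", "b"), ("b", "c"), ("a", "c")]
print(is_feedforward_cutset(edges, {"a"}))        # True: both crossing edges leave 'a'
print(is_feedforward_cutset(edges, {"a", "c"}))   # False: edge b -> c enters the set
```

Pipeline flip-flops inserted on a feed-forward cutset delay every forward path by the same amount, which is why functional equality is preserved.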
Fig. 1.1 Block diagram of 3-tap FIR filter using valid pipelining
Multipliers, which are built using adders, are the most time-consuming, energy-intensive
modules of a MAC unit. Hence, adder optimization leads to efficient multipliers, which in turn
results in an optimized MAC unit. Various adder topologies have been designed to achieve
efficiency in terms of area and delay. Adders such as the ripple-carry adder (RCA),
carry-select adder (CSLA), and carry-save adder (CSA) use the traditional addition algorithm to
generate the sum and carry signals. For an adder of width ‘m’ bits, the RCA computes the
carry-out signal by rippling the carry of each module from “carry in” to “carry out”, with a
delay of O(m). The CSLA and CSA compute the sum similarly to the RCA, but differ in the
calculation of the carry-out signal, reducing the critical path delay to O(m(l+2)/(l+1)) and
O(log(m)), respectively. The carry-lookahead adder (CLA) also reduces the critical path delay
to O(log(m)), by computing the carry-out signal independently of the (m−1)th carry bit.
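The O(m) behavior of the RCA is easy to see in a bit-level software model (an illustration of the algorithm, not an HDL description); the carry computed at each position feeds the next, so the worst-case path passes through all m full adders:

```python
def ripple_carry_add(a, b, m, cin=0):
    """m-bit ripple-carry adder built from full adders.
    Returns (sum, carry_out). The carry ripples through all m bit positions,
    which is why the worst-case delay grows as O(m)."""
    s, carry = 0, cin
    for i in range(m):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s |= (ai ^ bi ^ carry) << i                  # full-adder sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))      # full-adder carry out
    return s, carry

print(ripple_carry_add(0b1011, 0b0110, 4))   # (1, 1): 11 + 6 = 17 = 0b10001
```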
Compressor-based adders are an alternative means of implementing multi-bit
adders. A traditional 4:2 compressor consists of two full-adder modules and accepts four
inputs of equal bit weight to produce two output signals. If tXOR and tCy denote the delays of
an XOR gate and of the carry-generation logic, then the delay of a 4:2 compressor can be
expressed as (i*tXOR + j*tCy), where ‘i’ is the number of XOR gate stages and ‘j’ is the number
of carry-generation logic stages. Various multiplier architectures such as array, Booth,
Wallace tree, and Dadda multipliers use adders for their partial product reduction tree (PP)
and PP accumulation. The array multiplier has a regular structure, but also the highest delay.
Other multiplication algorithms such as the Booth, Wallace, and Dadda multipliers aim to reduce
the number of adder stages in the PP reduction tree and thereby optimize area and power
consumption. Hence, various adder topologies are incorporated in multiplier architectures to
achieve optimal performance in terms of area and power. Literature on enhanced MAC units with
optimized adder and multiplier architectures exists.
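The two-full-adder construction described above can be modeled directly. The model below is a sketch of the standard 4:2 compressor with a carry-in and carry-out, and it checks the defining identity x1+x2+x3+x4+cin = sum + 2*(carry + cout) exhaustively:

```python
from itertools import product

def full_adder(a, b, c):
    """Standard full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))

def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two full adders.
    Satisfies: x1 + x2 + x3 + x4 + cin == sum + 2*carry + 2*cout."""
    s1, cout = full_adder(x1, x2, x3)       # cout feeds the next column
    sum_, carry = full_adder(s1, x4, cin)
    return sum_, carry, cout

# Exhaustive check of the compressor identity over all 32 input patterns
for bits in product((0, 1), repeat=5):
    s, c, co = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * c + 2 * co
```

Because cout depends only on x1, x2, x3 (not on cin), chains of 4:2 compressors do not ripple carries horizontally, which is what makes them attractive for column addition.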
Various multipliers, such as the array multiplier, modified Booth multiplier,
Wallace tree multiplier, and Dadda multiplier, have been used to realize MAC architectures. CSA
adders are used in the PP reduction tree of MAC unit multipliers. It has been ascertained that
a Wallace tree multiplier implemented with CSA adders achieves high speed and low power, and is
an ideal candidate for an efficient MAC unit; the Wallace tree multiplier is therefore a prime
candidate for enhancing MAC efficiency owing to the smaller number of stages in its PP
reduction tree. A modified Wallace tree multiplier achieving lower delay has been proposed, and
the efficacy of compressor adders in the PP reduction tree has been explored. A counter-based
Wallace multiplier replaces traditional adder topologies with compressor adders to enhance the
efficacy of the MAC unit. High-speed MAC unit designs using compressor adders have also been
studied: a 3:2 compressor-based parallel multiplier has been implemented in a MAC unit for
audio applications, and 5:3 compressors for PP reduction have been proposed to implement a
word-length MAC unit, as has a merged MAC unit that eliminates the separate accumulation unit
and is built using a traditional 4:3 compressor adder. In this brief, we propose a modified,
efficient 4:2 compressor to realize a high-speed MAC unit.
CHAPTER 2
LITERATURE SURVEY
Jithendra Babu N., Sarma R., “A Novel Low Power Multiply–Accumulate
(MAC) Unit Design for Fixed Point Signed Numbers”, In: Dash S., Bhaskar M., Panigrahi
B., Das S. (eds) Artificial Intelligence and Evolutionary Computations in Engineering
Systems. Advances in Intelligent Systems and Computing. In emerging technologies, low-power
designs play a critical role. The proposed work is a low-power MAC unit for fixed-point signed
numbers, designed to achieve high throughput and low power consumption. The proposed design has
several building blocks: first, a Wallace tree multiplier, since the multiplier is one of the
key parts of digital signal processing systems, and second, an accumulation block. To make the
output from the multiplier and adder efficient, a BCD block is proposed to convert the output
into a BCD number. The overall MAC is implemented in Cadence Virtuoso 90 nm technology, and the
performance of each individual block is examined using Cadence Virtuoso before the overall MAC
unit is designed. Power, delay, and power-delay product are calculated using the Cadence
Spectre tool.
Chang, Chip Hong, Zhang et al., “Ultra Low-Voltage Low-Power CMOS 4:2
and 5:2 Compressors for Fast Arithmetic Circuits” IEEE Transactions on Circuits and
Systems I: Regular Papers. This paper presents several architectures and designs of low-power
4:2 and 5:2 compressors capable of operating at ultra-low supply voltages. These compressor
architectures are anatomized into their constituent modules, and different static logic styles
based on the same deep-submicrometer CMOS process model are used to realize them. Different
configurations of each architecture, which include a number of novel 4:2 and
5:2 compressor designs, are prototyped and simulated to evaluate their performance in speed,
power dissipation and power-delay product. The newly developed circuits are based on
various configurations of the novel 5:2 compressor architecture with the new carry generator
circuit, or existing architectures configured with the proposed circuit for the exclusive OR
(XOR) and exclusive NOR (XNOR) [XOR-XNOR] module. The proposed new circuit for
the XOR-XNOR module eliminates the weak logic on the internal nodes of pass transistors
with a pair of feedback PMOS-NMOS transistors. Driving capability has been considered in
the design as well as in the simulation setup so that these 4:2 and 5:2 compressor cells can
operate reliably in any tree structured parallel multiplier at very low supply voltages. Two
new simulation environments are created to ensure that the performances reflect the realistic
circuit operation in the system to which these cells are integrated. Simulation results show
that the 4:2 compressor with the proposed XOR-XNOR module and the new fast 5:2
compressor architecture are able to function at supply voltage as low as 0.6 V, and outperform
many other architectures including the classical CMOS logic compressors and variants of
compressors constructed with various combinations of recently reported superior low-power
logic cells.
CHAPTER 3
EXISTING METHOD
3.1 CONVENTIONAL MAC UNIT DESIGN
In the conventional MAC unit, a Wallace multiplier, a pipelined accumulator (PA), and a
carry-save adder (CSA) are used as the multiplier, accumulator, and adder, respectively, as
shown in Fig. 3.2.
3.2 ADDER
The adder computes the sum of the product from the multiplier and the value
stored in the accumulator. The output of the adder is passed on to the accumulator. If the
inputs to the multiplier are of bit size 16, then the adder should be of bit size 32, producing
an output of size 32+1 bits. The carry-save adder, carry-select adder, ripple-carry adder
(RCA), and carry-lookahead adder (CLA) are among the widely used adders in the design of
digital logic processing devices.
As the operand width increases, the delay of the carry-lookahead (CLA) adder also increases.
Hence, the carry-save adder is much faster than conventional adders. The block diagram of a
4-bit carry-save adder is shown in Fig. 3.3. For conventional pipelining of the 32-bit
accumulator into n stages, the number of inserted flip-flops becomes 33(n−1), which confirms
that the number of flip-flops for the pipelining increases significantly as the number of
pipeline stages is increased.
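The 33(n−1) figure follows from inserting a (w+1)-bit register, w sum bits plus one carry bit, at each of the n−1 feed-forward cutsets of a w = 32-bit accumulator; a one-line sketch (the generalization to arbitrary widths is our reading of the text):

```python
def conventional_pipeline_ffs(width, stages):
    """Flip-flops needed for conventional feed-forward-cutset pipelining of a
    'width'-bit accumulator into 'stages' stages: (width + 1) flip-flops
    (sum bits plus one carry bit) per cutset, with stages - 1 cutsets."""
    return (width + 1) * (stages - 1)

print(conventional_pipeline_ffs(32, 2))   # 33
print(conventional_pipeline_ffs(32, 4))   # 99
```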
In the conventional PA, the correct accumulation values of all the inputs up to
the corresponding clock cycle are produced in each clock cycle, as shown in the timing diagrams
of Figs. 3.5 and 3.6. A two-cycle difference exists between the input and the corresponding
output due to the two-stage pipeline. In the conventional two-stage PA, the accumulation output
(S) is produced two clock cycles after the corresponding input is stored in the input buffer.
For example, in the conventional case, the generated carry from the lower half and the
corresponding inputs are fed into the upper-half adder in the same clock cycle, as shown in
cycles 4 and 5 of Fig. 3.6.
• Each bit of the input is multiplied with each bit of the other input.
• The number of partial products is reduced by half by using half and full adders.
• The wires in the two inputs are grouped together and then added.
A Wallace tree with 15:4 compressors is made by a tree-like formation of many
15:4 compressors, each having four multi-bit inputs and producing two multi-bit outputs. The
Wallace tree structure with 15:4 compressors shown in Fig. 3.4.2 below takes 16 partial
products as inputs and produces a carry and a sum as outputs, with the same dimensions as the
inputs.
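At the bit level, a 15:4 compressor simply counts its 15 single-bit inputs and encodes the count in 4 output bits (the maximum count, 15, fits exactly in 4 bits). A minimal software sketch:

```python
def compressor_15_4(bits):
    """15:4 compressor: the 4-bit output is the population count of the
    15 single-bit inputs."""
    assert len(bits) == 15
    count = sum(bits)
    return [(count >> i) & 1 for i in range(4)]   # LSB first

print(compressor_15_4([1] * 15))          # [1, 1, 1, 1]  (count = 15)
print(compressor_15_4([1, 0] * 7 + [1]))  # [0, 0, 0, 1]  (count = 8)
```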
CHAPTER 4
PROPOSED SYSTEM
4.1 PROPOSED MAC ARCHITECTURE
The MAC architecture is divided into two blocks: the multiplier and the
accumulator. The multiplier multiplies the two inputs, and the results are stored in the
accumulator register.
A merged structure is used (pipelined Dadda multiplier with 4:2 compressors + MFCF-PA). The
column addition in the MAC operation is for the calculation of binary numbers in
each addition stage using the half-adders and/or full adders and then for the passing of the
results to the next addition stage. Since MAC computations are based on such additions, the
proposed pipelining method can also be applied to the machine learning-specific MAC
structure. In this section, the proposed pipelining method is applied to the MAC architecture
by using the unique characteristic of Dadda multiplier. The Dadda multiplier performs the
column addition in a similar fashion to the Wallace multiplier which is widely used, and it
has less area and shorter critical path delay than the Wallace multiplier.
Fig. 4.1 Pipelined column addition structure with Dadda multiplier: (top) conventional
pipelining; (bottom) proposed FCF pipelining
Fig. 4.1 shows the pipelined column addition structures in the Dadda multiplier. The Dadda
multiplier performs the column addition to reduce the height of each stage. If a particular
column already satisfies the target height for the next column addition stage, then no
operation is performed during that stage. Using this property, the proposed pipelining method
can be applied to the MAC structure as well. Fig. 4.1(Top) is an example of pipelining where
the conventional method is used. All of the edges in the feedforward-cutset are subject to
pipelining. On the other hand, in the proposed FCF pipelining case [Fig. 4.1(Bottom)], if a
node in the column addition stage does not need to participate in the height reduction, it can
be excluded from the pipelining [the group in the dotted box of Fig. 4.1(Bottom)]. In other
words, in the conventional pipelining method, all the edges in the feedforward-cutset must be
pipelined to ensure functional equality regardless of a timing slack of each edge [Fig.
4.1(Top)]. However, in the FCF pipelining method, some edges in the cutset do not
necessarily have to be pipelined if the edges have enough timing slacks [Fig. 4.1(Bottom)].
As a result, a smaller number of flip-flops are required compared with the conventional
pipelining case. On the other hand, in the Wallace multiplier, as many partial products as
possible are involved in the calculation for each column addition stage. Because the partial
products do not have enough timing slack to be excluded from pipelining, the effectiveness of
the proposed FCF pipelining method is smaller in the Wallace multiplier case than in the
Dadda multiplier case.
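The target heights used by the Dadda column addition follow the well-known sequence d1 = 2, d(j+1) = floor(1.5*dj), i.e. 2, 3, 4, 6, 9, 13, 19, ...; a column already at or below the next target is left untouched, which is exactly the slack the FCF pipelining exploits. A quick sketch (illustrative, not part of the design files):

```python
def dadda_heights(max_height):
    """Dadda target heights d1 = 2, d(j+1) = floor(1.5 * dj), listed from the
    last reduction stage to the first, until max_height is reached or passed."""
    h = [2]
    while h[-1] < max_height:
        h.append(h[-1] * 3 // 2)
    return h

# A 16x16 multiplier has a maximum column height of 16, so the stage targets are:
print(dadda_heights(16))   # [2, 3, 4, 6, 9, 13, 19]
```

Each reduction stage only brings columns down to the next target height, so columns that already satisfy it perform no operation in that stage; those idle nodes are the ones the FCF method leaves unpipelined.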
Fig. 4.2 shows the block diagrams of pipelined MAC architectures. The
proposed MAC architecture [Fig. 4.2] combines the FCF-MAC (MAC with the proposed
FCF pipelining) for the column addition and the MFCF-PA for the accumulation. Instead of
pipelining all of the final nodes in the column addition stage as in [Fig. 4.2], the proposed
FCF-MAC architecture is used to selectively choose the nodes for the pipelining. For the
proposed architecture, the merged multiply-accumulation style is adopted. The final adder is
placed in the last stage of the MAC operation. In general, the final adder is designed using the
CLA to achieve a short critical path delay. In contrast, the proposed design uses the MFCF-
PA style in the accumulation stage in consideration of the greater power and the area
efficiency of the MFCF-PA.
The design makes use of compressors in place of full adders, and the final
carry-propagate stage is replaced by a carry-save adder. The Dadda multiplier basically
multiplies two unsigned integers. The proposed Dadda multiplier architecture comprises an AND
array for computing the partial products and a carry-save adder in the final stage of addition.
In the proposed architecture, partial product reduction is accomplished by the use of
4:2 compressor structures, and the final stage of addition is performed by the carry-save
adder. This multiplier architecture comprises a partial product generation stage, a partial
product reduction stage, and the final addition stage. The latency of the Dadda multiplier can
be reduced by decreasing the number of adders in the partial product reduction stage.
In the proposed architecture, a multi-bit compressor is used to reduce the
number of partial product addition stages. The combined factors of low power, low transistor
count, and minimum delay make the 4:2 compressor an attractive choice.
In this compressor, the outputs generated at each stage are used efficiently by
replacing the XOR blocks with multiplexer blocks. The select bits of the multiplexers are
available well ahead of the inputs, so the critical path delay is minimized. The various adder
structures in the conventional architecture are replaced by compressors.
4.2 MODIFIED 4:2 COMPRESSOR
However, to improve the regularity of the arrangement of cells in the multiplier,
modifications to the logic expressions of the standard 4:2 compressor are proposed, with Cin
tied to logic low and the output Cout neglected. The modified 4:2 exact compressor has inputs
X1, X2, X3, X4 and outputs Sum (S) and Carry (C). Eliminating Cout in the 4:2 compressor
generates an error for X1X2X3X4 = “1111”. In the multiplier implementation, however, we account
for the Cout elimination in the modified 4:2 exact compressor with an error compensation bit
E = X1 & X2 & X3 & X4.
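One way to read the description above is the following model: with Cin tied low and Cout dropped, the pair (S, C) can encode counts 0 through 3, and the compensation bit E (weight 4) covers the "1111" case. This encoding is our interpretation of the text and may differ in detail from the actual design:

```python
from itertools import product

def modified_compressor_4_2(x1, x2, x3, x4):
    """Modified 4:2 compressor: Cin tied to 0, Cout eliminated.
    For inputs '1111' the pair (S, C) alone cannot represent the count 4, so
    an error-compensation bit E = X1 & X2 & X3 & X4 restores the lost value.
    Interpretation: X1 + X2 + X3 + X4 == S + 2*C + 4*E."""
    e = x1 & x2 & x3 & x4
    count = x1 + x2 + x3 + x4
    if e:                  # the '1111' case: the value 4 is carried entirely by E
        return 0, 0, 1
    return count & 1, (count >> 1) & 1, 0

# Exhaustive check of the compensation identity over all 16 input patterns
for bits in product((0, 1), repeat=4):
    s, c, e = modified_compressor_4_2(*bits)
    assert sum(bits) == s + 2 * c + 4 * e
```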
Figs. 4.4 and 4.5 represent the implementation of the compressor technique in the Dadda
multiplier. If the reset signal is 0, then the flip-flop passes the stored value to the output
through a tristate buffer.
In the proposed work, the D flip-flop is implemented using a technique called
DDFF, a hybrid flip-flop that uses the sleepy-stack technique for less power dissipation and
lower leakage currents, which occur while data are stored in the register block. The block
diagram is shown in Fig. 4.6.
S[31 : 0] represents the data that are stored in the output buffer register as a result of the
accumulation. In the conventional PA structure [Fig. 3.5], the flip-flops must be inserted
along the feedforward-cutset to ensure functional equality.
Fig. 4.7 Schematic diagram of two stage 32-bit accumulator for Proposed FCF-PA
On the other hand, regarding the proposed structure, the output is generated one clock cycle
after the input arrives. Moreover, for the proposed scheme, the generated carry from the
lower half of the 32-bit adder is involved in the accumulation one clock cycle later than the
case of the conventional pipelining.
For example, in the conventional case, the generated carry from the lower half
and the corresponding inputs are fed into the upper half adder in the same clock cycle as
shown in the cycles 4 and 5 of Fig.4.8 (left). On the other hand, in the proposed FCF-PA, the
carry from the lower half is fed into the upper half one cycle later than the corresponding
input for the upper half, as depicted in the clock cycles 3-5 of Fig. 4.8 (right).
Fig. 4.8 (Left) Timing diagram of the two-stage 32-bit accumulator and (Right) example
with the proposed FCF-PA
This characteristic makes the intermediate result that is stored in the output
buffer of the proposed accumulator different from the result of the conventional pipelining
case [Fig. 3.5 shows two-stage 32-bit pipelined-accumulation examples with the conventional
pipelining (left) and the proposed FCF-PA (right); the binary number “1” between the two 16-bit
hexadecimal numbers is a carry from the lower half]. The proposed accumulator, however, shows
the same final output (cycle 5) as that of the conventional one.
In addition, regarding the two architectures, the number of cycles from the
initial input to the final output is the same. The characteristic of the proposed FCF pipelining
method can be summarized as follows: In the case where adders are used to process data in
an accumulator, the final accumulation result is the same even if binary inputs are fed to the
adders in arbitrary clock cycles, as long as each is fed once and only once. In the machine
learning algorithm, only the final result of the weighted sum of the multiplication between the
input feature map and the filters is used for the subsequent operation, so the proposed
accumulator would produce the same results as the conventional accumulator.
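This property, that the final result is unchanged as long as every lower-half carry reaches the upper half exactly once, whichever cycle it arrives in, can be checked with a small model. The 8-bit width, the split point k = 4, and the variable names below are illustrative choices, not taken from the design:

```python
def split_accumulate(inputs, width=8, k=4, delay_carry=False):
    """Accumulate 'width'-bit inputs with the adder split at bit k.
    With delay_carry=True the lower-half carry is injected into the upper half
    one clock cycle later (as in the FCF-PA); the final value is unchanged
    because each carry is still added exactly once."""
    mask_lo = (1 << k) - 1
    mask_hi = (1 << (width - k)) - 1
    lo = hi = pending = 0
    for x in inputs:
        lo_sum = lo + (x & mask_lo)
        carry, lo = lo_sum >> k, lo_sum & mask_lo
        if delay_carry:
            # FCF-PA style: the carry reaches the upper half via a flip-flop,
            # one cycle after the corresponding upper-half input
            hi = (hi + (x >> k) + pending) & mask_hi
            pending = carry
        else:
            # conventional pipelining: carry consumed in the same cycle
            hi = (hi + (x >> k) + carry) & mask_hi
    hi = (hi + pending) & mask_hi   # flush the last deferred carry
    return (hi << k) | lo

data = [0x0F, 0x0F, 0x21, 0x33]
print(hex(split_accumulate(data, delay_carry=False)))  # 0x72
print(hex(split_accumulate(data, delay_carry=True)))   # 0x72, same final result
```

The intermediate values of the two variants differ cycle by cycle, exactly as the text describes, but the final accumulation agrees with the plain modular sum.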
Meanwhile, the CLA adder has been mostly used to reduce the critical path
delay of the accumulator. The carry prediction logic in the CLA, however, causes a
significant increase in the area and the power consumption. For the same critical path delay,
the FCF-PA can be implemented with less area and lower power consumption compared with
the accumulator that is based on the CLA.
Fig 4.9 Example of an undesired data transition in the two-stage 8-bit PAs with 4-bit
2’s complement input numbers. Binary number “1” between the two 4-bit hexadecimal
numbers is a carry from the lower half.
On the other hand, in the FCF-PA [Fig. 4.9], AReg[2] and S[2] are added in
cycle 2, thereby generating a carry. In cycle 3, the generated carry from the lower half is
stored in the flip-flop. The carry is no longer propagated toward the upper half in this clock
cycle. In the next clock cycle (cycle 4), the carry that is stored in the flip-flop is transferred to
the carry input of the upper half. During the calculation, it can be observed that S[7 : 4]
changes to “1111” in cycle 3 and returns to “0000” in cycle 4. Although this undesired data
transition does not affect the accuracy of accumulation results, it reduces the power efficiency
of the FCF-PA. Fig. 4.10(left) shows the structure of the FCF-PA for the 2’s complement
numbers. The binary numbers in the diagram indicate the sign extension of AReg[m − 1] and
the carry bits for the undesired data transition case.
Fig. 4.10 Proposed FCF-PA (left) and MFCF-PA for improved power efficiency (right)
To prevent the undesired data transitions, the sign-extended input to RCA[n − 1 : k] and the
carry-out of RCA[k − 1 : 0] must be forced to “0” if both AReg[m − 1] and the carry-out from
RCA[k − 1 : 0] are “1”; the undesired data transition can therefore be prevented by detecting
this condition. Since the critical path becomes too long if the upper-half addition must wait
for the lower-half carry-out condition to be detected, a modified version of the FCF-PA
(MFCF-PA) is proposed here, as shown in Fig. 4.10.
An additional flip-flop was added between the two RCAs to prevent the
formation of a long critical path. RCA[n − 1 : k] receives both AFix with the sign extension
and the CarryFix signals as modified input values. AFix generates “0” when AReg[m − 1]
and Carry are both “1.” Otherwise, AReg[m − 1] is buffered in AFix as it is. Similarly,
CarryFix generates “0” when AReg[m − 1] and Carry are both “1.” Otherwise, Carry is
buffered in CarryFix as it is. Although an additional NORI(NOR + INV) gate causes an
additional delay (150 ps at SS corner, 10% supply voltage drop, 125 ◦C temperature in our
analysis) in the critical path, the overhead is negligible considering that target clock period
ranges from 1.1 to 2.7 ns (in Section V). In the event that the accumulator is pipelined to
multiple stages, the insertion of the additional logic into all of the pipeline stages may
increase the area overhead.
To reduce the overhead, the modified structure [Fig. 4.10(right)] is inserted
into only one pipeline stage. For the rest of the pipeline stages, only the FCF-PA [Fig.
4.10(left)] is used. Fig. 4.11 shows a block diagram of the MFCF-PA. Good power efficiency is
still achieved, because the probability of the sign-extension bit becoming “1” is reduced.
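The AFix/CarryFix gating can be written down directly from the description: both signals are forced to "0" when AReg[m − 1] and Carry are both "1", and passed through unchanged otherwise. The check below confirms the key fact that this substitution leaves the modular contribution to the upper half unchanged, since sign-extending a "1" over k bits contributes 2^k − 1, and adding the carry "1" then wraps to zero. The model is illustrative:

```python
def nori_fix(areg_msb, carry):
    """AFix = 0 and CarryFix = 0 when AReg[m-1] and Carry are both 1;
    otherwise each signal passes through unchanged (the NORI gating)."""
    both = areg_msb & carry
    return areg_msb & ~both & 1, carry & ~both & 1

# The upper half sees sign_extension(AFix) + CarryFix. Over the k upper bits a
# sign extension of 1 contributes 2**k - 1, so for every input pair the fixed
# modular contribution must match the unfixed one.
k = 4
for areg_msb in (0, 1):
    for carry in (0, 1):
        afix, cfix = nori_fix(areg_msb, carry)
        original = (areg_msb * ((1 << k) - 1) + carry) % (1 << k)
        fixed = (afix * ((1 << k) - 1) + cfix) % (1 << k)
        assert original == fixed
print("modular contribution preserved for all input pairs")
```

Only the "both 1" case is altered (1111 + 1 wraps to 0000), so the gating removes the undesired 1111-to-0000 transition without changing any accumulation result.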
CHAPTER 5
IMPLEMENTATION
5.1 SOFTWARE OVERVIEW
1. Sources window (top left): hierarchically displays the files included in the
project
2. Processes window (middle left): displays the processes available for the
currently selected source file
The Xilinx ISE WebPACK is a complete FPGA/CPLD programmable logic design suite providing:
1. Specification of programmable logic via schematic capture or Verilog/VHDL.
2. Synthesis and place & route of the specified logic for various Xilinx FPGAs and CPLDs.
3. Functional (behavioral) and timing (post-place & route) simulation.
4. Download of configuration data into the target device via a communications cable.
Xilinx currently claims that its FPGAs, due to their ability to be customized
for different workloads, accelerate processing by 40 times for machine-learning inference, 10
times for video and image processing, and 100 times for genomics, with respect to CPU-or
GPU-based frameworks.
Xilinx FPGAs provide system integration while optimizing performance per watt.
The low-cost Spartan family of FPGAs is fully supported by this release, as is the family of
CPLDs, which means that small designers and educational institutions face no overhead from the
cost of development software.
The ISE software controls all aspects of the design flow. Through the Project
Navigator interface, you can access all of the design entry and design implementation tools.
You can also access the files and documents associated with your project.
By default, the Project Navigator interface is divided into four panel sub-
windows, as seen in Figure 5.1. On the top left are the Start, Design, Files, and Libraries
panels, which include display and access to the source files in the project as well as access to
running processes for the currently selected source. The Start panel provides quick access to
opening projects as well as frequently accessed reference material, documentation, and tutorials.
At the bottom of the Project Navigator are the Console, Errors, and Warnings panels, which
display status messages, errors, and warnings. To the right is a multi-document interface
(MDI) window referred to as the Workspace. The Workspace enables you to view design
reports, text files, schematics, and simulation waveforms. Each window can be resized,
undocked from Project Navigator, moved to a new location within the main Project Navigator
window, tiled, layered, or closed. You can use the View > Panels menu commands to open
or close panels, and Layout > Load Default Layout to restore the default window
layout. These windows are discussed in more detail in the following sections.
5.2 VHDL CODE
Two HDL languages, Verilog and VHDL, are widely used. IEEE VHDL (VHSIC
Hardware Description Language) is a hardware description language used in electronic design
automation to describe digital and mixed-signal systems, such as field-programmable gate arrays
and integrated circuits. VHDL can also be used as a general-purpose parallel programming
language. The basic rules of VHDL are free formatting and case insensitivity. It also uses
identifiers, which name objects; an identifier is composed of letters, digits, and underscores
and must start with a letter.
Library and package: library IEEE; and use IEEE.std_logic_1164.all; These
statements are used to add types, operators, and functions.
Entity Declaration:
entity eq is
   port (i0, i1: in std_logic; eq: out std_logic);
end eq;
The basic format for an I/O port declaration is:
signal_name1, signal_name2, ...: mode data_type;
Data type and operators:
std_logic type: It is defined in the std_logic_1164 package and consists of nine values.
Logical operators: operators such as not, and, or, and xor are defined for the
std_logic_vector and std_logic data types.
Architecture body: The architecture body may include declarations. There are three
concurrent statements between begin and end, and two internal signals are declared here:
signal p0, p1: std_logic;
A Verilog module should be enclosed within the module and endmodule keywords. The name of the
module should be given right after the module keyword, and an optional list of ports may be
declared as well:
module name;
  // port and signal declarations, design code
endmodule
A testbench provides the input stimulus. The testbench is not instantiated within any other
module, because it is a block that encapsulates everything else.
The design code shown below has a top-level module called design. It contains
all other sub-modules required to make the design complete.
The sub-module can have a more nested sub-module, such as mod3 inside
mod1 and mod4 inside mod2.
module mod1;
  reg c;
  // Design code (mod3 is instantiated here)
endmodule

module mod2;
  wire a;
  // Design code (mod4 is instantiated here)
endmodule
Hence, the design is instantiated and named d0 inside the testbench module. The
testbench is the top-level module from the simulator's perspective:
module testbench;
  design d0 ([port_list_connections]);
endmodule
RTL is based on synchronous logic and contains three primary pieces: registers,
which hold state information; combinational logic, which defines the next-state inputs; and
clocks, which control when the state changes.
Verilog HDL supports built-in primitive gate modelling. The gates supported
are multiple-input, multiple-output, tristate, and pull gates. The multiple-input gates
supported are and, nand, or, nor, xor, and xnor, which have two or more inputs
and a single output. The multiple-output gates supported are buf and not, which have one or
more outputs and a single input. The language also supports modelling of
tristate gates, which include bufif0, bufif1, notif0, and notif1.
These gates have one input, one control signal, and one output. The pull gates
supported are pullup and pulldown with a single output (no input) only.
The basic syntax for each type of gate with zero delay is as follows:
and | nand | or | nor | xor | xnor [instance name] (out, in1, ..., inN); // [] is optional and
| is selection
buf | not [instance name] (out1, out2, ..., outN, input);
bufif0 | bufif1 | notif0 | notif1 [instance name] (output, input, control);
pullup | pulldown [instance name] (output);
One can also have multiple instances of the same type of gate in one construct, separated by
commas, such as:
and [inst1] (out11, in11, in12),
    [inst2] (out21, in21, in22, in23),
    [inst3] (out31, in31, in32, in33);
The language also allows delays to be expressed when instantiating gates. The
delay expressed is from input to output. The delays can be expressed in the form of rise, fall,
and turn-off delays; one, two, or all three types of delays can be expressed in a given
instance expression. The turn-off delay is applicable to gates whose output can be turned
off (e.g., notif1).
For example,
and #5 A1 (out1, in1, in2); // both the rise and fall delays are 5 units
and #(2,5) A2 (out2, in1, in2); // the rise delay is 2 units and the fall delay is 5 units
notif1 #(2,5,4) A3 (out3, in2, ctrl1); // the rise delay is 2, the fall delay is 5, and the
turn-off delay is 4 units
Gate-level modelling is useful when the circuit is simple combinational logic,
for example a multiplexer.
A multiplexer is a simple circuit that connects one of many inputs to an
output. In this part, you will create a simple 2-to-1 multiplexer and extend the design to
multiple bits.
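A gate-level sketch of such a 2-to-1 multiplexer, built only from the primitives described above (the instance and net names are illustrative):

```verilog
module mux2to1 (input a, input b, input sel, output out);
  wire sel_n, w1, w2;
  not n1 (sel_n, sel);   // invert the select line
  and a1 (w1, a, sel_n); // pass a when sel = 0
  and a2 (w2, b, sel);   // pass b when sel = 1
  or  o1 (out, w1, w2);  // combine the two paths
endmodule
```

Extending the design to multiple bits amounts to replicating this structure per bit, or driving a vector output from the same select line.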
assign {COUT, SUM} = A + B + CIN; // vectors A and B are added with CIN, and the result is
                                  // assigned to a concatenation of a scalar net and a vector net
Note that multiple continuous assignment statements are not allowed on the same destination
net.
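For instance, the assignment above could form the body of a small adder module (the 4-bit width and port names here are assumptions for illustration):

```verilog
module adder4 (input [3:0] A, input [3:0] B, input CIN,
               output [3:0] SUM, output COUT);
  // continuous assignment: the 5-bit result of A + B + CIN is split so the
  // carry-out lands in COUT and the low four bits land in SUM
  assign {COUT, SUM} = A + B + CIN;
endmodule
```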
CHAPTER 6
RESULTS
Fig. 6.17 Compressor logic in column addition stage RTL schematic View
Fig. 6.22 MFCF-PA logic RTL schematic internal View (NOT+NOR logic)
Table 6.1: Performance analysis of parameters for the proposed and existing methods
The key performance parameters of the proposed method are optimized with respect to area,
power, delay, and energy.
CHAPTER 7
CONCLUSION
A high-speed MAC architecture for an FIR filter is proposed in this brief using 4:2 compressors.
The 4:2 compressor was chosen as the optimal building block since modern datapath elements have
fixed widths in multiples of 2^n, where n = 3, 4, 5, etc. In the proposed scheme, the number of
flip-flops in a pipeline can be reduced by relaxing the feedforward-cutset constraint, exploiting
a unique characteristic of the machine learning algorithm. The proposed accumulator reduced
area and power consumption by 17% and 19%, respectively, compared with the conventional CLA
adder-based accumulator. In the case of the MAC architecture, the proposed scheme exhibits an
occupied area of 4%, a power consumption of 0.032 W, and a delay of 5.342 ns. From the obtained
results, it is evident that the proposed MAC can operate at higher clock frequencies than the
conventional schemes. In addition, the proposed MAC reduces the power requirement to a greater
extent and is suitable for portable designs.
CHAPTER 8
REFERENCES
[1] Jithendra Babu N., Sarma R., “A Novel Low Power Multiply–Accumulate (MAC) Unit
Design for Fixed Point Signed Numbers”, In: Dash S., Bhaskar M., Panigrahi B., Das S. (eds)
Artificial Intelligence and Evolutionary Computations in Engineering Systems. Advances in
Intelligent Systems and Computing, 2016, vol 394. Springer, New Delhi
[2] Seo. Y and Kim. D, “A New VLSI Architecture of Parallel Multiplier Accumulator Based on
Radix-2 Modified Booth Algorithm”, in IEEE Transactions on Very Large-Scale Integration
(VLSI) Systems, 2010, vol. 18, no. 2, pp. 201-208.
[3] Milos D. Ercegovac, Tomás Lang, “Digital Arithmetic”, Elsevier, 2004, pp. 59-63.
[4] Chang, Chip Hong, Zhang et al., “Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2
Compressors for Fast Arithmetic Circuits” IEEE Transactions on Circuits and Systems I: Regular
Papers, 2004, DOI: 10.1109/TCSI.2004.835683.
[5] Singh. K. N., Tarunkumar. H, “A review on various multipliers designs in VLSI”, Proc.
Annual IEEE India Conference (INDICON), 2015, New Delhi, pp. 1-4.
[6] Patil. P. A., Kulkarni. C, “A Survey on Multiply Accumulate Unit”, Fourth International
Conference on Computing Communication Control and Automation (ICCUBEA), 2018, Pune,
India, pp. 1-5.
[7] Sai Kumar. M, D. Kumar. A and Samundiswary. P, “Design and performance analysis of
Multiply-Accumulate (MAC) unit”, International Conference on Circuits, Power and Computing
Technologies [ICCPCT], 2014, Nagercoil, pp. 1084-1089.
[8] Nagaraju. N, Ramesh. S.M., “Implementation of high speed and area efficient MAC unit for
industrial applications”, Journal of Cluster Computing (Springer) 22, pp. 4511-4517, 2019.
https://doi.org/10.1007/s10586-018-2060-z.
[9] Kwon. O, Nowka. K, and Swartzlander. E.E, “A 16-Bit by 16-Bit MAC Design Using Fast 5:3
Compressor Cells”, The Journal of VLSI Signal Processing-Systems for Signal, Image, and
Video Technology, 2002, vol. 31, pp. 77-89, DOI: https://doi.org/10.1023/A:1015333103608.
[10] Malleshwari, R. and E. Srinivas, “FPGA Implementation of Low Power and High Speed 64-
Bit Multiply Accumulate Unit for Wireless Applications”, International Journal of Science and
Research, 2016. DOI: 10.21275/v5i4.14041608.
[11] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1,
pp. 127–138, Jan. 2017
[12] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Comput., vol. EC-
13, no. 1, pp. 14–17, Feb. 1964.
[13] L. Dadda, “Some schemes for parallel multipliers,” Alta Frequenza, vol. 34, no. 5, pp. 349-
356, Mar. 1965.