BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
O.NITHIN 218R1A04N9
P.PRAVEEN 218R1A04O0
P.KISHORE 218R1A04O1
P.PURNACHANDU 218R1A04O2
Ms.L. LAVANYA
Assistant Professor
ECE DEPARTMENT
(2024-2025)
CMR ENGINEERING COLLEGE
(Approved by AICTE, UGC AUTONOMOUS, Accredited by NBA, NAAC)
Kandlakoya (V), Medchal, Telangana.
CERTIFICATE
This is to certify that the industry-oriented mini-project work entitled "HIGH-PERFORMANCE VLSI IMPLEMENTATION OF 3-PARALLEL FIR FILTER WITH VEDIC MULTIPLIER" is being submitted by O.NITHIN bearing Roll No: 218R1A04N9, P.PRAVEEN bearing Roll No: 218R1A04O0, P.KISHORE bearing Roll No: 218R1A04O1, and P.PURNA CHANDU bearing Roll No: 218R1A04O2 in B.Tech IV-I semester, Electronics and Communication Engineering, and is a record of bonafide work carried out by them during the academic year 2024-25. The results embodied in this report have not been submitted to any other University for the award of any degree.
EXTERNAL EXAMINER
ACKNOWLEDGEMENTS
We sincerely thank the management of our college, CMR Engineering College, for providing the required facilities during our project work.
We thank Dr. A. S. Reddy for his timely suggestions, which helped us complete the project work successfully.
We thank Dr. SUMAN MISHRA, Head of the Department, ECE, for his consistent encouragement during the progress of this project.
We sincerely thank our internal project guide, Ms. L. LAVANYA, Assistant Professor of ECE, for her guidance and encouragement in carrying out this project work.
DECLARATION
We hereby declare that the mini-project entitled “HIGH-PERFORMANCE VLSI
IMPLEMENTATION OF 3-PARALLEL FIR FILTER WITH VEDIC MULTIPLIER”
is the work done by us on campus at CMR ENGINEERING COLLEGE, Kandlakoya, during the academic year 2024-2025, and is submitted as a mini-project in partial fulfilment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY in ELECTRONICS AND COMMUNICATION ENGINEERING from JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY, HYDERABAD.
O.NITHIN (218R1A04N9)
P.PRAVEEN (218R1A04O0)
P.KISHORE (218R1A04O1)
P.PURNA CHANDU (218R1A04O2)
LIST OF TABLES
8.2.1 AREA 59
8.2.3 DELAY 59
ABSTRACT
The most important criteria for the design and implementation of a DSP processor are area optimization and reduction in power consumption. The fundamental building block of a DSP processor is the Finite Impulse Response (FIR) filter. The FIR filter consists of three basic modules: adder blocks, flip-flops, and multiplier blocks. The performance of the FIR filter is largely influenced by the multiplier, which is the slowest of these blocks. In this work, the FIR filter is implemented using a Vedic multiplier, and the proposed 3-parallel FIR filters are compared for various parameters. An improvement has been obtained in terms of both area and delay. The low power consumption and reduced delay of the Vedic multiplier make it highly suitable for designing FIR filters for low-voltage, low-power VLSI applications. The adder and the multiplier are two of the most important components in the filter architecture, and recent research reports ways of reducing the hardware complexity of the parallel polyphase FIR filter structure. The performance of the multiplier and adder blocks dictates the computational speed and power dissipation of the entire filter; accordingly, different types of adders and multipliers are available in digital circuits.
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Digital filters play a vital role in Digital Signal Processing. Finite impulse response (FIR) filters are the most popular type of digital filter. Based on the requirements of the application, parallel FIR filters are implemented to modify the sample rate or the power consumption: parallelism can be used either to boost throughput or to minimize power consumption. In recent decades the FIR filter has been a focus of research. While parallel FIR filters have received a lot of attention in the literature, most of it focuses on minimizing the number of multipliers by employing fast FIR algorithms (FFAs). Aiming at complexity reduction, the conventional method implements FFAs by iterating small-sized filtering structures, reducing the number of computational units (adders and multipliers) by reducing the number of parallel sub-filter units involved.
Further, updated FFAs were developed for the implementation of linear-phase parallel FIR filters. In particular, these algorithms use the symmetry of the coefficients of odd-length FIR filters, halving the number of multipliers in the sub-filter units while increasing the number of adders in the pre/post-processing blocks. Other work uses the polyphase coefficient-symmetry property in the parallel FIR filter structure. The adder and the multiplier are two of the most important components in the filter architecture, and recent research reports ways of reducing the hardware complexity of the parallel polyphase FIR filter structure. The performance of the multiplier and adder blocks dictates the computational speed and power dissipation of the entire filter; accordingly, different types of adders and multipliers are available in digital circuits.
Other authors have investigated an alternative, technology-independent approach to designing area-efficient parallel polyphase FIR filters for DSP applications, since the FIR filter's performance relies critically on the multipliers and adders. Motivated by the above discussion, in this work we apply ripple-carry, carry-lookahead, and Brent-Kung adders to the referenced structure, followed by the use of different multipliers such as the Vedic multiplier.
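As a rough illustration of the block processing described above, the following Python sketch models a 3-parallel FIR filter: each iteration consumes three input samples and produces three output samples, mimicking three hardware output lanes working concurrently. This is a behavioral model for intuition only; the project's actual implementation is in Verilog HDL, and all function names here are our own.

```python
# Behavioral model of a 3-parallel (block) FIR filter: each loop
# iteration consumes 3 input samples and produces 3 output samples,
# mimicking three concurrent hardware output lanes.

def fir_direct(h, x):
    """Serial direct-form FIR: one output per input sample."""
    reg = [0] * len(h)                 # tap delay line
    y = []
    for s in x:
        reg = [s] + reg[:-1]
        y.append(sum(c * r for c, r in zip(h, reg)))
    return y

def fir_3parallel(h, x):
    """Block-processing FIR with block size L = 3 (len(x) % 3 == 0)."""
    state = [0] * (len(h) - 1)         # past samples, newest first
    y = []
    for n in range(0, len(x), 3):
        block = x[n:n + 3]
        hist = block[::-1] + state     # newest sample first
        # the three lanes below run concurrently in hardware,
        # within a single (slower) block clock cycle
        for lane in range(3):
            start = 2 - lane           # lane 0 -> oldest output y[n]
            y.append(sum(c * hist[start + k] for k, c in enumerate(h)))
        state = hist[:len(h) - 1]      # carry newest samples forward
    return y

h = [1, 2, 3, 4]
x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 9]
assert fir_3parallel(h, x) == fir_direct(h, x)
```

The parallel version produces the same numerical outputs as the serial filter; the benefit in hardware is that each block clock cycle can be three times slower for the same sample rate, or the same speed for three times the throughput.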
CHAPTER 2
LITERATURE SURVEY
2.1 Area-efficient 2-parallel FIR digital filter implementation on FPGA.
The parallel FIR filter is among the most widely used filters in Digital Signal Processing (DSP). This paper presents the design of an area-efficient 2-parallel FIR filter using VHDL and its implementation on an FPGA for an imaging system. It details the basic blocks of the area-efficient 2-parallel FIR digital filter, explains both the primary 2-parallel digital FIR filter and the area-efficient 2-parallel FIR filter, and discusses their simulation using Xilinx 14.2. It also presents the FPGA implementation of the primary and area-efficient 2-parallel filters on a Xilinx Spartan-3E Starter Board (XC3S500E) and the corresponding results. Since adders occupy less silicon area than multipliers, multipliers are replaced by adders to reduce the area and delay of the parallel FIR filter. Xilinx ISE is used for simulating the filter design.
2.2 Short-length FIR filters and their use in fast non-recursive filtering.
This paper provides the basic tools required for an efficient use of the recently proposed
fast FIR algorithms. These algorithms not only reduce arithmetic complexity but also
partially maintain the multiply-accumulate structure, thus resulting in efficient
implementations. A set of basic algorithms is derived, together with some rules for
combining them. Their efficiency is compared with that of classical schemes in the case of
three different criteria, corresponding to various types of implementation. It is shown that
this class of algorithms (which includes classical ones as special cases) makes it possible to
find the best tradeoff corresponding to any criterion.
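The fast FIR algorithm idea summarized above can be made concrete with a small Python model (our own illustrative code, not taken from the paper): a 2-parallel FFA computes the two output phases with three length-N/2 subfilters instead of four, using the identity H0X1 + H1X0 = (H0+H1)(X0+X1) - H0X0 - H1X1.

```python
# Sketch of the 2-parallel fast FIR algorithm (FFA): three
# half-length subfilters replace the four of the naive block filter.

def conv(a, b):
    """Plain linear convolution (the behaviour of an FIR subfilter)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def vadd(a, b):
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(n)]

def vsub(a, b):
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) - (b[i] if i < len(b) else 0)
            for i in range(n)]

def fir_2parallel(h, x):
    """2-parallel FFA FIR filter; h and x are assumed even-length."""
    h0, h1 = h[0::2], h[1::2]               # polyphase filter components
    x0, x1 = x[0::2], x[1::2]               # even/odd input samples
    c00 = conv(h0, x0)                      # subfilter 1
    c11 = conv(h1, x1)                      # subfilter 2
    cs = conv(vadd(h0, h1), vadd(x0, x1))   # subfilter 3
    y0 = vadd(c00, [0] + c11)               # even outputs: H0X0 + z^-1 H1X1
    y1 = vsub(vsub(cs, c00), c11)           # odd outputs via the FFA identity
    y = []
    for k in range(len(y0) + len(y1)):      # interleave the two phases
        half = y0 if k % 2 == 0 else y1
        i = k // 2
        y.append(half[i] if i < len(half) else 0)
    return y

h = [3, 1, 4, 1, 5, 9, 2, 6]
x = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5]
assert fir_2parallel(h, x) == conv(h, x)    # matches the direct FIR filter
```

The saving is exactly the "reduced arithmetic complexity while partially maintaining the multiply-accumulate structure" that the paper describes: one half-length subfilter is traded for a few pre/post-processing adders.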
2.3 Low-area/power parallel FIR digital filter implementation.
This paper presents a novel approach for implementing area-efficient parallel (block) finite
impulse response (FIR) filters that require less hardware than traditional block FIR filter
implementations. Parallel processing is a powerful technique because it can be used to
increase the throughput of a FIR filter or reduce the power consumption of a FIR filter.
However, a traditional block filter implementation causes a linear increase in the hardware
cost (area) by a factor of L, the block size. In many design situations, this large hardware
penalty cannot be tolerated. Therefore, it is important to design parallel FIR filter structures
that require less area than traditional block FIR filtering structures. In this paper, we
propose a method to design parallel FIR filter structures that require a less-than-linear
increase in the hardware cost. A novel adjacent coefficient sharing based sub-structure
sharing technique is introduced and used to reduce the hardware cost of parallel FIR filters.
A novel coefficient quantization technique, referred to as a scalable maximum absolute
difference (MAD) quantization process, is introduced and used to produce quantized filters
with good spectrum characteristics. By using a combination of fast FIR filtering
algorithms, a novel coefficient quantization process and area reduction techniques, we
show that parallel FIR filters can be implemented with up to a 45% reduction in hardware
compared to traditional parallel FIR filters.
2.4 Hardware-efficient fast FIR algorithms for symmetric parallel FIR filters.
The objective of this paper is to reduce the hardware complexity of higher-order FIR filters with symmetric coefficients. The aim is to design efficient fast FIR algorithms (FFAs) for the parallel FIR filter structure under the constraint that the number of filter taps is a multiple of 2. The L = 4 parallel implementation is discussed briefly. The parallel FIR filter structure based on the proposed FFA technique has been implemented with carry-save and ripple-carry adders for further optimization. The reduction in silicon-area complexity is achieved by replacing the bulky multiplier with adders, namely the ripple-carry and carry-save adders.
For example, for a 6-parallel 1024-tap filter, the proposed structure saves 14 multipliers at the expense of 10 adders, whereas for a 6-parallel 512-tap filter the proposed structure saves 108 multipliers at the expense of 10 adders. Overall, the proposed parallel FIR structures can lead to significant hardware savings over the existing FFA parallel FIR filters for symmetric coefficients, especially when the length of the filter is very large.
2.5 Low-complexity 3-parallel FIR filter structures for symmetric convolutions of odd length.
Based on fast FIR algorithms (FFAs), this paper proposes new 3-parallel finite impulse response (FIR) filter structures that are beneficial, in terms of hardware cost, for symmetric convolutions of odd length. The proposed 3-parallel FIR structures exploit the inherent symmetry of the odd-length coefficients according to the length of the filter, (N mod 3), halving the number of multipliers in the sub-filter section at the expense of additional adders in the preprocessing and postprocessing blocks. The overhead from these additional adders stays fixed and does not increase with the length of the FIR filter, whereas the number of multipliers saved grows with the filter length. For example, for an 81-tap filter the proposed structure saves 26 multipliers at the expense of 5 adders, whereas for a 591-tap filter it saves 196 multipliers at the expense of the same 5 adders. Overall, the proposed 3-parallel FIR structures can lead to significant hardware savings over the existing FFA parallel FIR filters for symmetric coefficients of odd length, especially when the filter is long.
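The multiplier saving from coefficient symmetry can be sketched in Python (an illustrative model with our own function names, not the paper's structure): for a linear-phase odd-length filter with h[k] = h[N-1-k], the two samples sharing a coefficient are added first, so only (N+1)/2 multiplications per output are needed instead of N.

```python
# Direct-form FIR exploiting coefficient symmetry h[k] == h[N-1-k],
# N odd: pre-add sample pairs, then multiply once per pair.

def fir_symmetric(h, x):
    N = len(h)
    assert N % 2 == 1 and h == h[::-1]
    pad = [0] * (N - 1) + list(x)      # zero initial state
    y, mults = [], 0
    for n in range(len(x)):
        win = pad[n:n + N][::-1]       # win[k] = x[n-k]
        acc = 0
        for k in range((N - 1) // 2):  # shared-coefficient pairs
            acc += h[k] * (win[k] + win[N - 1 - k])
            mults += 1
        acc += h[(N - 1) // 2] * win[(N - 1) // 2]  # middle tap
        mults += 1
        y.append(acc)
    return y, mults

h = [1, 2, 3, 2, 1]                    # symmetric, N = 5
x = [4, 0, 5, 1, 3, 2]
y, mults = fir_symmetric(h, x)
assert mults == 3 * len(x)             # (N+1)/2 = 3 multiplies per output
```

Each saved multiplier costs one extra adder (the pre-addition), which is the same trade the FFA papers above make at the sub-filter level.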
2.6 Exploiting coefficient symmetry in conventional polyphase FIR
filters.
The conventional polyphase architecture for a linear-phase finite impulse response (FIR) filter loses its coefficient-symmetry property due to the inefficient arrangement of the filter coefficients among its sub-filters. Existing polyphase structures can avail the benefits of the coefficient-symmetry property, but only at the cost of versatility and of a more complex sub-filter arrangement than the conventional polyphase structure. To address these issues, this paper first presents mathematical expressions for the inherent characteristics of the conventional polyphase structure. These expressions are then used to develop a generalized mathematical framework that exploits coefficient symmetry while retaining the direct use of the conventional FIR filter coefficients. Further, the transfer-function expressions for the proposed Type-1/transposed Type-1 polyphase structures using coefficient symmetry are derived. The proposed structures can halve the number of multiplier units required in polyphase FIR filters. Decimator design using the proposed Type-1 polyphase structure and interpolator design using the proposed transposed Type-1 polyphase structure are also demonstrated, along with the phase and magnitude characteristics of the proposed structures. Numerical examples reveal that all sub-filters of the proposed symmetric polyphase structure possess linear-phase characteristics.
2.7 Efficient FIR filter design using Booth multiplier for VLSI
applications.
The most important criteria for the design and implementation of a DSP processor are area optimization and reduction in power consumption. The fundamental building block of a DSP processor is the Finite Impulse Response (FIR) filter, which consists of three basic modules: adder blocks, flip-flops, and multiplier blocks. The performance of the FIR filter is largely influenced by the multiplier, which is the slowest of these blocks. In this paper, the FIR filter is implemented using two different multipliers, namely the array multiplier and the Booth multiplier, and both proposed FIR filters are compared for various parameters.
The proposed filters are designed using Verilog HDL and implemented using the Xilinx ISE 14.7 tools. An improvement has been obtained in terms of both area and delay. The low power consumption and reduced delay of the Booth multiplier make it highly suitable for designing FIR filters for low-voltage, low-power VLSI applications.
CHAPTER 3
Polyphase decomposition is a typical approach for realizing FIR digital filter structures: small parallel FIR filter blocks are created first, and larger block-sized structures are then built by cascading or iterating these small blocks. Using polyphase decomposition, the transfer function of the filter can be written as

H(z) = sum_{k=0}^{M-1} z^{-k} H_k(z^M),  where  H_k(z) = sum_n h(nM + k) z^{-n},

from which the traditional M-parallel FIR filter is derived; the parallel polyphase FFA-based odd-length FIR filter is derived in the same way.
FIG 3.1 3-Parallel Odd-Length FIR Filter
The polyphase FIR filter contains three basic building blocks: multipliers, adders, and delay elements. The multiplier block contributes the lion's share of the overall delay in the design, which motivates optimization of both the adder and the multiplier.
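The polyphase identity H(z) = sum_k z^{-k} H_k(z^M) underlying this structure can be checked with a short Python model (illustrative only; helper names are ours): each phase h[k::M] is upsampled by M, delayed by k samples, convolved with the input, and the branch outputs are summed, reproducing the direct convolution.

```python
# Numerical check of the polyphase decomposition for M = 3:
# summing the delayed, upsampled phase branches equals direct filtering.

def conv(a, b):
    """Linear convolution of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def vadd(a, b):
    """Element-wise sum, zero-padding the shorter list."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(n)]

def upsample(a, M):
    """Insert M-1 zeros between samples: A(z) -> A(z^M)."""
    out = []
    for v in a[:-1]:
        out += [v] + [0] * (M - 1)
    return out + [a[-1]]

M = 3
h = [2, 7, 1, 8, 2, 8, 1, 8, 2]         # 9 taps -> three length-3 phases
x = [3, 1, 4, 1, 5, 9, 2, 6]
y = []
for k in range(M):
    hk = h[k::M]                        # k-th polyphase component of h
    branch = conv(upsample(hk, M), x)   # H_k(z^M) X(z)
    y = vadd(y, [0] * k + branch)       # apply z^-k, then sum branches
assert y == conv(h, x)                  # matches the direct convolution
```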
The word “Vedic” is derived from the word “Veda” which means the store-house
of all knowledge. Vedic mathematics is mainly based on 16 Sutras (or aphorisms)
dealing with various branches of mathematics like arithmetic, algebra, geometry etc.
These Sutras along with their brief meanings are enlisted below alphabetically.
5. Gunakasamuchyah – The factors of the sum is equal to the sum of the factors.
6. Gunitasamuchyah – The product of the sum is equal to the sum of the product.
12. Shunyam Saamyasamuccaye – When the sum is the same that sum is zero.
These methods and ideas can be directly applied to trigonometry, plane and spherical geometry, conics, calculus (both differential and integral), and applied mathematics of various kinds. As mentioned earlier, all these Sutras were reconstructed from ancient Vedic texts early in the last century. Many Sub-sutras were also discovered at the same time, which are not discussed here. The beauty of Vedic mathematics lies in the fact that it reduces otherwise cumbersome-looking calculations in conventional mathematics to very simple ones. This is because the Vedic formulae are claimed to be based on the natural principles on which the human mind works. This is a very interesting field and presents some effective algorithms which can be applied to various branches of engineering such as computing and digital signal processing [1,4]. Multiplier architectures can generally be classified into three categories. The first is the serial multiplier, which minimizes hardware and chip area. The second is the parallel multiplier (array and tree), which carries out high-speed mathematical operations; its drawback is relatively large chip-area consumption. The third is the serial-parallel multiplier, which serves as a good trade-off between the time-consuming serial multiplier and the area-consuming parallel multipliers.
The arithmetic module is split into smaller sub-modules, namely the multiplier and arithmetic blocks, all implemented in Verilog HDL. The 2x2-bit multiplier is obtained by the "vertically and crosswise" algorithm based on the Urdhva Tiryakbhyam Sutra. The basic 2x2-bit multiplier is designed first in Verilog; 4x4 blocks are then built from 2x2 blocks, 8x8-bit multipliers from the 4-bit multiplier blocks, and finally the 16x16-bit multiplication is obtained with the final 16-bit multiplier.
3.3.1 Design of 2x2 vedic multiplier:
The figure illustrates the steps to multiply two 2-bit numbers. Converting the figure to a hardware equivalent, we have four AND gates, which generate the partial products, and two half adders, which add the partial products to produce the final result. The hardware details of the multiplier are given below.
FIG 3.3 Logic design of 2x2 multiplier
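The gate-level structure just described (four AND gates for the partial products, two half adders to combine them) can be modeled bit-by-bit in Python. This is an illustrative model of the logic only; the project's real implementation is in Verilog.

```python
# Bit-level model of the 2x2 Urdhva Tiryakbhyam multiplier:
# four AND gates form the partial products; two half adders combine them.

def half_adder(a, b):
    """Returns (sum, carry) for two input bits."""
    return a ^ b, a & b

def vedic2x2(a, b):
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    p0 = a0 & b0                           # vertical LSB product
    s1, c1 = half_adder(a1 & b0, a0 & b1)  # crosswise products
    s2, c2 = half_adder(a1 & b1, c1)       # vertical MSB product + carry
    return (c2 << 3) | (s2 << 2) | (s1 << 1) | p0

# exhaustive check over all 2-bit operands
assert all(vedic2x2(a, b) == a * b for a in range(4) for b in range(4))
```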
FIG 3.4 4X4 multiplier
Similar to the 4x4 multiplier design, four such 4x4 multipliers are needed to build an 8x8 multiplier. Here, 8-bit and 12-bit adders are designed first; by properly instantiating the modules and making the connections shown in the figure, the 8x8-bit multiplier is obtained. At this stage it is advisable to verify the RTL and check that the synthesized hardware matches the intended design; the Xilinx PlanAhead tool gives a good view of the hardware with its design-elaboration option. The addition-tree diagram shows the process for the 8x8 multiplier:
FIG 3.5 8x8 multiplier
The first step in the design of the 16x16 block is grouping each 16-bit input into its lower and upper bytes. These lower and upper byte pairs of the two inputs form the vertical and crosswise product terms. Each byte pair is handled by a separate 8x8 Vedic multiplier, producing the partial-product rows. These partial products are added in a 16-bit carry-lookahead adder to generate the final product bits. The schematic of a 16x16 block designed using 8x8 blocks is shown in the figure; the partial products represent the Urdhva vertical-and-crosswise manner of multiplication.
FIG 3.6 16x16 multiplier
Similarly, the first step in the design of the 32x32 block is grouping each 32-bit input into its lower and upper 16-bit halves. These lower and upper half pairs of the two inputs form the vertical and crosswise product terms. Each pair is handled by a separate 16x16 Vedic multiplier, producing the partial-product rows. These partial products are added in a 32-bit carry-lookahead adder to generate the final product bits. The schematic of a 32x32 block designed using 16x16 blocks is shown in the figure; the partial products again represent the Urdhva vertical-and-crosswise manner.
FIG 3.7 32x32 multiplier
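The hierarchy described in this section (2x2 blocks building 4x4, then 8x8, 16x16, and 32x32) follows one recursive pattern: split each operand into halves, form the vertical and crosswise partial products with half-size multipliers, and combine them with shifted additions. The following is a behavioral Python sketch of that recursion, using our own function names; the report implements the same structure in Verilog with carry-lookahead adders.

```python
# Recursive behavioral model of the n-bit Urdhva Tiryakbhyam multiplier
# built from four n/2-bit multipliers (n a power of two).

def vedic_mult(a, b, n):
    if n == 1:
        return a & b                   # base case: a single AND gate
    h = n // 2
    mask = (1 << h) - 1
    al, ah = a & mask, a >> h          # lower/upper halves of a
    bl, bh = b & mask, b >> h          # lower/upper halves of b
    ll = vedic_mult(al, bl, h)         # vertical: low x low
    lh = vedic_mult(al, bh, h)         # crosswise
    hl = vedic_mult(ah, bl, h)         # crosswise
    hh = vedic_mult(ah, bh, h)         # vertical: high x high
    # in hardware these additions are the carry-lookahead adder stages
    return ll + ((lh + hl) << h) + (hh << (2 * h))

assert vedic_mult(200, 155, 8) == 200 * 155
assert vedic_mult(65535, 65535, 16) == 65535 * 65535
```

Each doubling of the width instantiates four multipliers of the previous size plus the adder stage, which is exactly the 2x2 -> 4x4 -> 8x8 -> 16x16 -> 32x32 build-up described above.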
CHAPTER 4
INTRODUCTION TO VLSI
Digital systems are highly complex. At their most detailed level they may consist of millions of elements, i.e., transistors or logic gates. For many decades, logic schematics served as the lingua franca of logic design, but not anymore. Today, hardware complexity has grown to such a degree that a schematic of logic gates is almost useless: it shows only a web of connectivity, not the functionality of the design. Since the 1970s, computer, electrical, and electronics engineers have therefore moved toward hardware description languages (HDLs).
Digital circuit design has evolved rapidly over the last twenty-five years. The earliest digital circuits were designed with vacuum tubes and transistors. Integrated circuits were then invented, in which logic gates were placed on a single chip. The first IC chips were small-scale integration (SSI) chips with a small gate count. As technology became more sophisticated, designers were able to place circuits with hundreds of gates on a chip; these were called MSI chips. With the advent of LSI, designers could put thousands of gates on a single chip. At this point the design process was getting complicated, and designers felt the need to automate it.
With the advent of VLSI technology, designers could design single chip with more than
hundred thousand gates. Because of the complexity of these circuits computer aided
techniques became critical for verification and for designing these digital circuits.
One way to deal with the increasing complexity of electronic systems and shrinking time-to-market is to design at higher levels of abstraction. Traditional paper-and-pencil and capture-and-simulate methods have largely given way to the describe-and-synthesize approach.
For these reasons, hardware description languages play an important role in the describe-and-synthesize design methodology. They are used for the specification, simulation, and synthesis of an electronic system. This reduces design complexity and allows products to reach the market quickly.
The components of a digital system can be classified as being specific to an application or
as being standard circuits. Standard components are taken from a set that has been used in
other systems. MSI components are standard circuits and their use results in a significant
reduction in the total cost as compared to the cost of using SSI circuits. In contrast,
specific components are particular to the system being implemented and are not commonly
found among the standard components.
The implementation of specific circuits with LSI chips can be done by means of IC that
can be programmed to provide the required logic.
Typical design flow for designing VLSI circuits is shown in the tool flow diagram. This
design flow is typically used by designers who use HDLs. In any design, specification is
first. Specification describes the functionality, interface and overall architecture of the
digital circuit to be designed. At this point, architects need not think about how they will
implement their circuit. A behavioral description is then created to analyze the design in
terms of functionality, performances and other high level issues. The behavioral
description is manually converted to an RTL (Register Transfer Level) description in an
HDL. The designer has to describe the data flow that will implement the desired digital
circuit. From this point onward the design process is done with assistance of CAD tools.
Logic synthesis tools convert the RTL description to a gate level net list. A gate level net
list is a description of the circuit in terms of gates and connections between them. The gate
level net list is input to an automatic place and route tool, which creates a layout. The
layout is verified and then fabricated on a chip. Thus most digital design activity is
concentrated on manually optimizing the RTL description of the circuit. After the RTL description is frozen, CAD tools are available to assist the designer in the remaining steps. Designing at the RTL level has shrunk design cycle times from years to a few months.
As designs became larger and more complex, logic simulation assumed an important role in the design process. For a long time, programming languages such as FORTRAN, Pascal, and C were used to describe computer programs, which are sequential in nature. Similarly, in the digital design field, designers felt the need for a standard language to describe digital circuits. Thus HDLs came into existence. HDLs allow designers to model the concurrency of processes found in hardware elements. The most widely used HDLs are Verilog HDL and VHDL (VHSIC Hardware Description Language, where VHSIC stands for Very High Speed Integrated Circuit).
Verilog was started in 1984 by Gateway Design Automation Inc. as a proprietary hardware modeling language. It is rumored that the original language was designed by taking features from the most popular HDL of the time, called HiLo, as well as from traditional computer languages such as C. At that time Verilog was not standardized, and the language was modified in almost every revision that came out between 1984 and 1990.
The Verilog simulator was first used in 1985 and was extended substantially through 1987; the implementation was the Verilog simulator sold by Gateway. The first major extension was Verilog-XL, which added a few features and implemented the infamous "XL algorithm", a very efficient method for gate-level simulation. Later, in 1990, Cadence Design Systems, whose primary products at that time included a thin-film process simulator, decided to acquire Gateway Design Automation along with its other products. Cadence thus became the owner of the Verilog language and continued to market Verilog as both a language and a simulator. At the same time, Synopsys was marketing a top-down design methodology using Verilog; this was a powerful combination.
In 1990, Cadence organized the Open Verilog International (OVI), and in 1991 gave it the
documentation for the Verilog Hardware Description Language. This was the event which
"opened" the language.
Two things distinguish an HDL from a linear language like C:
Concurrency:
• The ability to do several things simultaneously, i.e., different code blocks can run concurrently.
Timing:
• The ability to represent the timing of hardware events.
An HDL can describe hardware at several levels of abstraction:
• It might describe the layout of the wires, resistors, and transistors on an integrated circuit (IC) chip, i.e., the switch level.
• It might describe the logic gates and flip-flops in a digital system, i.e., the gate level.
• An even higher level describes registers and the transfer of vectors of information between registers; this is called the Register Transfer Level (RTL).
• A powerful feature of the Verilog HDL is that the same language can be used for describing, testing, and debugging a system.
• Industrial support: simulation is very fast and synthesis is very efficient.
• Universal: the entire process is allowed in one design environment.
• Specification: WaveFormer, TestBencher, or Word can be used for drawing waveforms.
• CAD tools are used to convert the specification into a coding format.
Coding Styles:
• Behavioral modeling: the method of coding in which the design is described in terms of the inputs and outputs to be tested.
• Synthesis: synthesis is done by Altera and Xilinx tools, Synplify Pro, Leonardo Spectrum, Design Compiler, and FPGA Compiler.
• Simulation and synthesis are used for functional checking of the HDL code; if the check fails, the RTL description is revised.
FIG 4.1 Logical Verification and Testing
• Place & Route: the FPGA vendor's P&R tool is used for FPGAs; very costly P&R tools such as Apollo are required for ASICs.
• The process of converting a circuit description into a physical layout is called physical design; it determines the positions of the cells and the routing of the interconnections between them.
• In layout verification, the physical layout structure is verified first.
• Floor planning, automatic place and route, and the RTL description can be revisited for any modifications.
4.6.9 Implementation
4.7 MODULES
A Verilog module consists of distinct parts, as shown in the figure below. The keyword module begins a module definition. In a module definition, the module name, port list, port declarations, and optional parameters must come first. The port list and port declarations are present only if the module has ports through which it interacts with the external environment. Within a module there may be:
• variable declarations,
• dataflow statements,
• behavioral blocks,
• instantiations of lower-level modules,
• tasks or functions.
These components may appear in any order and at any place in the module definition. The endmodule statement must always come last. All components except module, the module name, and endmodule are optional and can be mixed and matched as per design needs. Verilog allows multiple module definitions in a single file, and the modules can be defined in any order.
FIG 4.2 Module Design
4.8 PORTS
Ports provide the interface between a module and its environment: the input/output pins of an integrated circuit chip are its ports. The environment cannot see the internals of the module, which is a great advantage for the designer: as long as the interface is not modified, the internals of the module may be changed without affecting the environment. "Terminals" is a synonym for ports.
The module declares all the ports appearing in its port list. The declaration of ports is explained in detail below.
Each port in the port list is labeled input, output, or inout according to the direction of the port signal; an inout is a bidirectional port. A port consists of two units: a primary unit inside the module and a secondary unit outside it, and the two units are connected. When modules are instantiated within other modules, there are rules governing the port connections; if any port-connection rule is violated, the Verilog simulator complains. Figure 5.6 shows the port-connection rules.
Inputs:
Outputs:
Inouts:
Signals and ports in a module can be connected in two ways; the two methods cannot be mixed in one instantiation:
• connection by ordered port list
• connection by port name
Connection by ordered list is the most intuitive method for beginners: the external signals must appear in exactly the same order as the ports in the port list of the module definition. In the example below, the external signals a, b, and out appear in exactly the same order as the ports a, b, and out of the module adder.
Example
Connection by port name
For larger designs, where a module may have, say, 30 ports, remembering the order of the ports in the module definition is impractical and error-prone. Verilog therefore provides the ability to connect external signals to ports by port name rather than by position. The syntax for instantiation with port names is shown below. The port connections can then be given in any order, as long as each port name in the module definition is correctly matched to its external signal.
CHAPTER 5
SOFTWARE TOOLS
Create a New Project
Create a new ISE project targeting the FPGA device on the Spartan-3 Startup Kit demo board. To create a new project:
1. Select File > New Project... The New Project Wizard appears.
2. Type "tutorial" in the Project Name field.
3. Enter or browse to a location (directory path) for the new project. A tutorial subdirectory is created automatically.
4. Verify that HDL is selected from the Top-Level Source Type list.
5. Click Next to move to the device properties page.
6. Fill in the properties in the table as shown below:
♦ Family: Spartan3
♦ Device: XC3S200
♦ Package: FT256
♦ Speed Grade: -4
Leave the default values in the remaining fields. When the table is complete, your project
properties will look like the following:
Start the Xilinx ISE 8.1i project navigator by double clicking the Xilinx ISE 8.1i icon on
your desktop.
Select a project location and type the name you would like to call your project counter:
Click next
Click next
Creating a Verilog Source:
Create the top-level Verilog source file for the project as follows:
2. Select Verilog Module as the source type in the New Source dialog box.
5. Click Next.
6. Declare the ports for the counter design by filling in the port information as shown
below:
The source file containing the counter module displays in the Workspace, and the counter
displays in the Sources tab, as shown below:
The next step in creating the new source is to add the behavioral description for the counter. Use a simple counter code example from the ISE Language Templates and customize it for the counter design.
1. Place the cursor on the line below the output [3:0] COUNT_OUT; statement.
2. Open the Language Templates by selecting Edit → Language Templates…
Note: You can tile the Language Templates and the counter file by selecting Window → Tile Vertically to make them both visible.
3. Using the “+” symbol, browse to the following code example: Verilog → Synthesis Constructs → Coding Examples → Counters → Binary → Up/Down Counters → Simple Counter.
4. With Simple Counter selected, select Edit → Use in File, or select the Use Template in File toolbar button. This step copies the template into the counter source file.
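The Simple Counter template describes a 4-bit up/down counter. Its per-clock behaviour can be sketched in Python (the function and names here are illustrative, not the template's exact code):

```python
def counter_step(count_out, direction, width=4):
    """One rising clock edge of the up/down counter.

    Counts up when direction is 1, down when 0; the result wraps to the
    4-bit width, as a reg [3:0] register would."""
    mask = (1 << width) - 1
    step = 1 if direction else -1
    return (count_out + step) & mask

state = 0
for _ in range(3):                 # three clocks with DIRECTION high
    state = counter_step(state, 1)
assert state == 3
state = counter_step(state, 0)     # one clock with DIRECTION low
assert state == 2
assert counter_step(0, 0) == 15    # counting down from 0000 wraps to 1111
```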
Design Simulation:
3. In the New Source Wizard, select Test Bench WaveForm as the source type, and type
counter_tbw in the File Name field.
4. Click Next.
5. The Associated Source page shows that you are associating the test bench waveform
with the source file counter. Click Next.
6. The Summary page shows that the source will be added to the project, and it displays the
source directory, type, and name. Click Finish.
7. You need to set the clock frequency, setup time and output delay times in the Initialize
Timing dialog box before the test bench waveform editing window opens. The
requirements for this design are the following:
♦ The counter must operate correctly with an input clock frequency = 25 MHz.
♦ The DIRECTION input will be valid 10 ns before the rising edge of CLOCK
♦ The output (COUNT_OUT) must be valid 10 ns after the rising edge of CLOCK.
The design requirements correspond with the values below. Fill in the fields in the Initialize Timing dialog box with the following information:
♦ Global Signals: GSR (FPGA). Note: When GSR (FPGA) is enabled, 100 ns is added to
the Offset value automatically.
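The Initialize Timing values follow directly from the 25 MHz requirement; a quick arithmetic check (times in ns):

```python
clock_freq_mhz = 25
clock_period_ns = 1000 / clock_freq_mhz     # 25 MHz -> 40 ns period
assert clock_period_ns == 40.0

input_setup_ns = 10    # DIRECTION valid 10 ns before the rising edge
output_valid_ns = 10   # COUNT_OUT valid 10 ns after the rising edge
gsr_offset_ns = 100    # added to the Offset automatically when GSR (FPGA) is on

assert clock_period_ns / 2 == 20.0          # high/low time at 50% duty cycle
```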
8. Click Finish to complete the timing initialization.
9. The blue shaded areas that precede the rising edge of the CLOCK correspond to the Input Setup Time in the Initialize Timing dialog box. Toggle the DIRECTION port to define the input stimulus for the counter design as follows:
♦ Click on the blue cell at approximately 300 ns to assert DIRECTION high so that the counter will count up.
♦ Click on the blue cell at approximately 900 ns to assert DIRECTION low so that the counter will count down.
CHAPTER 6
PROGRAMMING
6.1 Code
`timescale 1ns / 1ps
//////////////////////////////////////////////////////////////////////////////////
// Company:
// Engineer:
//
// Design Name:
// Project Name:
// Target Devices:
// Tool versions:
// Description:
//
// Dependencies:
//
// Revision:
// Additional Comments:
//
//////////////////////////////////////////////////////////////////////////////////
module FIR_FILTER(clk,rst,X,Y1,Y2,Y3);
input clk;
input rst;
input [31:0]X;
output [63:0]Y1,Y2,Y3;
wire [63:0]m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,y1,y2,y3,L0,L1,L2,L3,temp,temp1;
wire [31:0]x1,x2,x3,x4;
wire [31:0]b0,b1,b2; // filter coefficients
wire [31:0]i1,i2,i3,i4; // input sub-sequences
assign b0=32'd1;
assign b1=32'd2;
assign b2=32'd3;
assign i1=3*X;
assign i2=3*X+1;
assign i3=3*X+2;
assign Y1=3*y1;
assign Y2=3*y2+1;
assign Y3=3*y3+2;
//1-parallel
vedic_32x32 p11(i1,b0,m1);
vedic_32x32 p12(x2,b2,m2);
vedic_32x32 p13(x3,b1,m3);
add_64bit p1(L0,m1,m2);
add_64bit p2(y1,L0,m3);
//2-parallel
//assign i4=i1+i2;
add_32_bit z12(i4,i1,i2);
vedic_32x32 p21(i4,b0,m4); //e0
vedic_32x32 p31(i1,b2,m8);
vedic_32x32 p32(i2,b1,m9);
vedic_32x32 p33(i3,b0,m10);
assign temp=L1-m9-m1; //
assign temp1=L2-m3;
add_64bit p5(y2,temp,temp1);
add_64bit p6(L3,m8,m9);
add_64bit p7(y3,L3,m10);
endmodule
module vedic_32x32(a,b,c);
input [31:0]a;
input [31:0]b;
output [63:0]c;
wire [31:0]q0;
wire [31:0]q1;
wire [31:0]q2;
wire [31:0]q3;
wire [31:0]temp1;
wire [47:0]temp2;
wire [47:0]temp3;
wire [47:0]temp4;
wire [31:0]q4;
wire [47:0]q5;
wire [47:0]q6;
vedic_16x16 z1(a[15:0],b[15:0],q0[31:0]);
vedic_16x16 z2(a[31:16],b[15:0],q1[31:0]);
vedic_16x16 z3(a[15:0],b[31:16],q2[31:0]);
vedic_16x16 z4(a[31:16],b[31:16],q3[31:0]);
// stage 1 adders
assign temp1=q2[31:0];
add_32_bit z5(q4,q1[31:0],temp1);
assign temp2={32'b0,q0[31:16]};
assign temp3={q3[31:0],16'b0};
add_48_bit z6(q5,temp2,temp3);
assign temp4={16'b0,q4[31:0]};
//stage 2 adder
add_48_bit z7(q6,temp4,q5);
assign c[15:0]=q0[15:0];
assign c[63:16]=q6[47:0];
endmodule
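Every level of this hierarchy applies the same Urdhva-Tiryagbhyam split: with a = aH·2^(n/2) + aL and b = bH·2^(n/2) + bL, the product is a·b = q3·2^n + (q1+q2)·2^(n/2) + q0, which the RTL assembles as c = {q6, low half of q0}. A Python sketch of that recursion (using unbounded integers, so the fixed intermediate widths of the RTL are not modelled):

```python
def vedic(a, b, n):
    """Recursive Urdhva-Tiryagbhyam multiply, mirroring the RTL hierarchy.

    n is the operand width in bits; the 2x2 base case stands in for vedic_2x2."""
    if n == 2:
        return a * b
    h = n // 2
    lo = (1 << h) - 1
    a_l, a_h = a & lo, a >> h
    b_l, b_h = b & lo, b >> h
    q0 = vedic(a_l, b_l, h)       # low x low
    q1 = vedic(a_h, b_l, h)       # high x low
    q2 = vedic(a_l, b_h, h)       # low x high
    q3 = vedic(a_h, b_h, h)       # high x high
    q4 = q1 + q2                  # stage 1 adder (z5)
    q5 = (q0 >> h) + (q3 << h)    # stage 1 adder (z6): temp2 + temp3
    q6 = q4 + q5                  # stage 2 adder (z7)
    return (q6 << h) | (q0 & lo)  # c = {q6, q0[h-1:0]}

# the recursion agrees with ordinary multiplication
assert vedic(0xFFFF, 0xFFFF, 16) == 0xFFFF * 0xFFFF
```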
module vedic_16x16(a,b,c);
input [15:0]a;
input [15:0]b;
output [31:0]c;
wire [15:0]q0;
wire [15:0]q1;
wire [15:0]q2;
wire [15:0]q3;
wire [15:0]temp1;
wire [23:0]temp2;
wire [23:0]temp3;
wire [23:0]temp4;
wire [15:0]q4;
wire [23:0]q5;
wire [23:0]q6;
vedic_8x8 z1(a[7:0],b[7:0],q0[15:0]);
vedic_8x8 z2(a[15:8],b[7:0],q1[15:0]);
vedic_8x8 z3(a[7:0],b[15:8],q2[15:0]);
vedic_8x8 z4(a[15:8],b[15:8],q3[15:0]);
// stage 1 adders
assign temp1=q2[15:0];
add_16_bit z5(q4,q1[15:0],temp1);
assign temp2={16'b0,q0[15:8]};
assign temp3={q3[15:0],8'b0};
add_24_bit z6(q5,temp2,temp3);
assign temp4={8'b0,q4[15:0]};
//stage 2 adder
add_24_bit z7(q6,temp4,q5);
// final output assignment
assign c[7:0]=q0[7:0];
assign c[31:8]=q6[23:0];
endmodule
module vedic_8x8(a,b,c);
input [7:0]a;
input [7:0]b;
output [15:0]c;
wire [7:0]q0;
wire [7:0]q1;
wire [7:0]q2;
wire [7:0]q3;
wire [7:0]temp1;
wire [11:0]temp2;
wire [11:0]temp3;
wire [11:0]temp4;
wire [7:0]q4;
wire [11:0]q5;
wire [11:0]q6;
vedic_4x4 z1(a[3:0],b[3:0],q0[7:0]);
vedic_4x4 z2(a[7:4],b[3:0],q1[7:0]);
vedic_4x4 z3(a[3:0],b[7:4],q2[7:0]);
vedic_4x4 z4(a[7:4],b[7:4],q3[7:0]);
// stage 1 adders
assign temp1=q2[7:0];
add_8_bit z5(q4,q1[7:0],temp1);
assign temp2={8'b0,q0[7:4]};
assign temp3={q3[7:0],4'b0};
add_12_bit z6(q5,temp2,temp3);
assign temp4={4'b0,q4[7:0]};
// stage 2 adder
add_12_bit z7(q6,temp4,q5);
assign c[3:0]=q0[3:0];
assign c[15:4]=q6[11:0];
endmodule
module vedic_4x4(a,b,c);
input [3:0]a;
input [3:0]b;
output [7:0]c;
wire [3:0]q0;
wire [3:0]q1;
wire [3:0]q2;
wire [3:0]q3;
wire [3:0]temp1;
wire [5:0]temp2;
wire [5:0]temp3;
wire [5:0]temp4;
wire [3:0]q4;
wire [5:0]q5;
wire [5:0]q6;
// using 4 2x2 multipliers
vedic_2x2 z1(a[1:0],b[1:0],q0[3:0]);
vedic_2x2 z2(a[3:2],b[1:0],q1[3:0]);
vedic_2x2 z3(a[1:0],b[3:2],q2[3:0]);
vedic_2x2 z4(a[3:2],b[3:2],q3[3:0]);
// stage 1 adders
assign temp1=q2[3:0];
add_4_bit z5(q4,q1[3:0],temp1);
assign temp2={4'b0,q0[3:2]};
assign temp3={q3[3:0],2'b0};
add_6_bit z6(q5,temp2,temp3);
assign temp4={2'b0,q4[3:0]};
// stage 2 adder
add_6_bit z7(q6,temp4,q5);
assign c[1:0]=q0[1:0];
assign c[7:2]=q6[5:0];
endmodule
module vedic_2x2(a,b,c);
input [1:0]a;
input [1:0]b;
output [3:0]c;
wire [3:0]temp;
//stage 1
// four multiplication operations on bits according to Vedic logic, done using AND gates
assign c[0]=a[0]&b[0];
assign temp[0]=a[1]&b[0];
assign temp[1]=a[0]&b[1];
assign temp[2]=a[1]&b[1];
//stage two
half_adder z1(temp[0],temp[1],c[1],temp[3]);
half_adder z2(temp[2],temp[3],c[2],c[3]);
endmodule
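The gate-level network above can be cross-checked in software; a Python translation of the same AND gates and half adders, verified exhaustively against ordinary multiplication for every pair of 2-bit operands:

```python
def half_adder(x, y):
    return x ^ y, x & y          # (sum, carry)

def vedic_2x2(a, b):
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    c0 = a0 & b0                 # stage 1: four AND-gate partial products
    t0, t1, t2 = a1 & b0, a0 & b1, a1 & b1
    c1, t3 = half_adder(t0, t1)  # stage 2: combine with half adders
    c2, c3 = half_adder(t2, t3)
    return (c3 << 3) | (c2 << 2) | (c1 << 1) | c0

# exhaustive check over all 2-bit operands
for a in range(4):
    for b in range(4):
        assert vedic_2x2(a, b) == a * b
```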
module dff_pipeline( input clk , input rst , input [31:0]din , output reg [31:0]dout
);
always @(posedge clk) begin
if (rst) begin
dout <= 32'd0;
end else begin
dout <= din;
end
end
endmodule
module full_adder(x,y,c_in,s,c_out);
input x,y,c_in;
output s,c_out;
assign s = x^y^c_in;
assign c_out = (x&y) | (c_in&(x^y));
endmodule
module half_adder(x,y,s,c);
input x,y;
output s,c;
assign s=x^y;
assign c=x&y;
endmodule
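The two cells implement the standard sum and carry equations; a Python cross-check over their full truth tables:

```python
def half_adder(x, y):
    return x ^ y, x & y                        # (s, c)

def full_adder(x, y, c_in):
    s = x ^ y ^ c_in
    c_out = (x & y) | (c_in & (x ^ y))
    return s, c_out

# exhaustive truth-table check against integer addition
for x in (0, 1):
    for y in (0, 1):
        s, c = half_adder(x, y)
        assert 2 * c + s == x + y
        for c_in in (0, 1):
            s, c = full_adder(x, y, c_in)
            assert 2 * c + s == x + y + c_in
```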
module add_4_bit(answer,input1,input2);
parameter N=4;
output [N-1:0] answer;
input [N-1:0] input1,input2;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
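All of the add_N_bit modules instantiate the same ripple-carry chain: a half adder at bit 0 and full adders for the remaining bits, linked through a carry vector. A Python model of that generate loop (the carry out of the top bit is dropped, as in the RTL, so the sum is exact whenever it fits in N bits):

```python
def ripple_add(in1, in2, n):
    """Ripple-carry adder: half adder at bit 0, full adders for bits 1..n-1."""
    answer, carry = 0, 0
    for i in range(n):
        x = (in1 >> i) & 1
        y = (in2 >> i) & 1
        if i == 0:
            s = x ^ y                          # half_adder
            carry = x & y
        else:
            s = x ^ y ^ carry                  # full_adder
            carry = (x & y) | (carry & (x ^ y))
        answer |= s << i
    return answer                              # top carry dropped, as in the RTL

assert ripple_add(9, 6, 4) == 15
assert ripple_add(200, 55, 8) == 255
```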
module add_6_bit(answer,input1,input2);
parameter N=6;
output [N-1:0] answer;
input [N-1:0] input1,input2;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
module add_8_bit(answer,input1,input2);
parameter N=8;
output [N-1:0] answer;
input [N-1:0] input1,input2;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
module add_12_bit(answer,input1,input2);
parameter N=12;
output [N-1:0] answer;
input [N-1:0] input1,input2;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
module add_16_bit(answer,input1,input2);
parameter N=16;
output [N-1:0] answer;
input [N-1:0] input1,input2;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
module add_24_bit(answer,input1,input2);
parameter N=24;
output [N-1:0] answer;
input [N-1:0] input1,input2;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
module add_32_bit(answer,input1,input2);
parameter N=32;
input [N-1:0]input1,input2;
output [N-1:0] answer;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
module add_48_bit(answer,input1,input2);
parameter N=48;
output [N-1:0] answer;
input [N-1:0] input1,input2;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
module add_64bit(answer,input1,input2);
parameter N=64;
output [N-1:0] answer;
input [N-1:0] input1,input2;
wire [N-1:0] carry;
genvar i;
generate
for(i=0;i<N;i=i+1)
begin: generate_N_bit_Adder
if(i==0)
half_adder f(input1[0],input2[0],answer[0],carry[0]);
else
full_adder f(input1[i],input2[i],carry[i-1],answer[i],carry[i]);
end
endgenerate
endmodule
////////////////////////////////////////////////////////////////////////////////
// Company:
// Engineer:
//
// Target Device:
// Tool versions:
// Description:
//
//
// Dependencies:
//
// Revision:
// Additional Comments:
//
////////////////////////////////////////////////////////////////////////////////
module tb;
// Inputs
reg clk;
reg rst;
reg [31:0] X;
// Outputs
wire [63:0] Y1;
wire [63:0] Y2;
wire [63:0] Y3;
FIR_FILTER uut (
.clk(clk),
.rst(rst),
.X(X),
.Y1(Y1),
.Y2(Y2),
.Y3(Y3)
);
initial begin
// Initialize Inputs
clk = 0;
forever #5 clk=~clk;
end
initial begin
rst = 1;
#10;
rst=0;
end
initial begin
#5
X = 50;
#100;
X = 60;
#100;
X = 70;
#100;
X = 80;
#100;
X = 90;
#1000;
$finish;
end
endmodule
CHAPTER 7
IMPLEMENTATION
The figure below is the RTL schematic diagram of the 3-parallel FIR filter using the Vedic multiplier. Here clk, rst, and X are the inputs of the FIR filter, and Y1, Y2, and Y3 are the outputs.
FIG: 7.2 RTL Internal diagram of 3-Parallel FIR-Filter using Vedic Multiplier.
FIG: 7.4 RTL Internal diagram of Vedic 16x16 Multiplier.
FIG: 7.6 RTL Internal diagram of Vedic 2x2 Multiplier
CHAPTER 8
RESULTS
The design procedure for the multipliers begins with obtaining the input coefficients. For the Booth multiplier the coefficients are in signed representation, whereas for the Vedic multiplier they are in unsigned form. The program for the implementation is written in Verilog-HDL and simulated using the Xilinx ISE 14.7 simulator.
The diagram above shows the simulation result of the final output. Here we have given the input X = 345, and the outputs are Y1 = 3261, Y2 = 3262, and Y3 = 3254; this output is verified against the inputs. In this way the FIR filter using the Vedic multiplier multiplies 32-bit numbers. The inputs can also be given in binary form by selecting the binary format, in which case the output should also be viewed in binary form.
8.2 PARAMETERS
8.2.1 AREA
8.2.2 DELAY
8.2.3 POWER
Power consumption for the proposed 3-parallel FIR filter with Vedic multiplier for 32 bits.
8.3 APPLICATIONS
Biomedical Engineering
Speech processing and recognition
Digital communications
Signal Processing
Telecommunications
Radar Systems
Medical Imaging
Embedded Systems
Data Compression
8.4 ADVANTAGES
Low power
High speed
Digital Implementation
Ease of Implementation
CHAPTER 9
CONCLUSION AND FUTURE SCOPE
9.1 CONCLUSION
Conventionally, FIR filters, which have wide application in digital signal processing, were developed using traditional DSP algorithms. With the advancement of technology, FIR filters are now being developed using VLSI technology. This leads to an extensive decrease in the area occupied on chip and in the power consumed by the filter. The FIR filter consists of three blocks: the multiplier, the adder, and the delay block. Of the three, the multiplier is the slowest. The work presented in this report has achieved adequate results and has demonstrated the efficiency of high-level optimization techniques. In this work, the FIR filter has been designed using a Vedic multiplier. From this work, it is concluded that the chip area of the FIR filter designed using the Vedic multiplier is significantly reduced, and that too without any increase in power dissipation, thereby making the system faster.
9.2 FUTURE SCOPE
The future scope of the high-performance VLSI implementation of the 3-parallel FIR filter with the Vedic multiplier is promising, driven by increasing demands in digital signal processing across telecommunications, multimedia, and real-time applications. Continued advancements in Vedic multiplication techniques will enhance speed and efficiency, while the integration of machine learning can lead to adaptive filtering solutions. Emphasis on low-power design will be crucial for portable devices, alongside scalable and flexible architectures that allow dynamic adjustment to application needs. Hardware acceleration using GPUs and DSPs will further improve processing capability, enabling higher data throughput for technologies such as 5G.
9.3 REFERENCES