Thesis On FPGA
Belgaum, Karnataka
A Project Report On
IMPLEMENTATION OF DISTRIBUTED ARITHMETIC BASED FAST BLOCK
LMS ADAPTIVE FILTER FOR HIGH THROUGHPUT ON FPGA
A dissertation work submitted in partial fulfillment of the requirements for the
award of the degree of Master of Technology in Digital Electronics
Submitted by
Sagara T V
USN: 3BR10LDE14
2011-2012
CERTIFICATE
Certified that the project work entitled IMPLEMENTATION OF DISTRIBUTED
ARITHMETIC BASED FAST BLOCK LMS ADAPTIVE FILTER FOR HIGH
THROUGHPUT ON FPGA was carried out by Sagara T V, bearing USN: 3BR10LDE14,
a bona fide student of Ballari Institute of Technology and Management, in partial
fulfillment for the award of Master of Technology in Digital Electronics Engineering of
the Visvesvaraya Technological University, Belgaum during the year 2011-2012. It is
certified that all corrections/suggestions indicated for internal assessment have been
incorporated in the report deposited in the library. The project report has been approved
as it satisfies the academic requirements in respect of project work prescribed for the said
Degree.
Signature of the project guide
Prof. Premchand D R M.Tech.
Dr. U. Eranna
M.E, Ph.D
External viva
DECLARATION
Date:
Place: Bellary
Sagara.T.V
USN:3BR10LDE14
M.Tech (DE)
BITM, Bellary.
ACKNOWLEDGEMENT
The completion of my project would not have been possible without the kind support and
help of many individuals. I would like to extend my sincere thanks to all of them.
I am highly indebted to my project guide Prof. PREMCHAND D R for providing
valuable guidance and constant supervision, as well as for his support in completing the
project.
I express my gratitude and sincere thanks to our head of the department, Dr.V.C.PATIL
for his encouragement, moral support rendered and facilities provided towards the
successful completion of project.
I owe my deep sense of gratitude to our principal Dr.U.ERANNA for providing all the
facilities and congenial environment in the college.
My special thanks to all the staff members of Electronics and communication department
for their support and help during the project work.
I wish to thank my family for their blessings and for being a constant source of
inspiration and encouragement.
Sagara.T.V
3BR10LDE14
ABSTRACT
The proposed work deals with the design and implementation of a high-throughput
adaptive digital filter using the Fast Block Least Mean Squares (FBLMS)
algorithm. The filter structure is based on Distributed Arithmetic (DA). DA calculates
the inner product by shifting and accumulating partial products that are stored in a
look-up table. Hence the proposed adaptive digital filter is multiplierless. Thus a DA
based implementation of the adaptive filter is highly efficient in both computation and area.
Furthermore, the fundamental building blocks of the DA architecture map well to
the architecture of today's Field Programmable Gate Arrays (FPGA). As per the literature,
an FPGA implementation of a DA based adaptive filter occupies significantly smaller area,
about 45% less than that of the existing FBLMS algorithm based adaptive filter.
This report contains the work carried out in the last three months, including a detailed
study of references [1]-[10], leading to a thorough understanding of adaptive filters and their
algorithms, the FFT and its various algorithms, and Distributed Arithmetic.
Table of Contents
CHAPTER 1 PREAMBLE
1.1 Introduction
1.2 Motivation
1.3 Problem Statement
1.4 Objective of Project
1.5 Organization of the Report
CHAPTER 3 ADAPTIVE FILTER
3.1 Introduction
3.2 Adaptive Filtering Problem
3.3 Applications
3.4 Adaptive Algorithms
3.4.1 Wiener Filter
3.4.2 Method of Steepest Descent
3.4.3 Fast Block LMS Algorithm
3.4.4 RLS Algorithm
7.3.3 Implementation
7.3.4 Behavioral Simulation
7.3.5 Functional Simulation
7.3.6 Static Timing Analysis
7.3.7 Architectural Overview
BIBLIOGRAPHY
APPENDIX
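ABBREVIATIONS
ASIC    Application-Specific Integrated Circuit
CSD     Canonical Signed Digit
DA      Distributed Arithmetic
DFT     Discrete Fourier Transform
DSP     Digital Signal Processing
FFT     Fast Fourier Transform
FBLMS   Fast Block Least Mean Squares
FPGA    Field Programmable Gate Array
IFFT    Inverse Fast Fourier Transform
LMS     Least Mean Square
LUT     Look-Up Table
MSE     Mean Square Error
RAM     Random Access Memory
ROM     Read-Only Memory
RLS     Recursive Least Squares
SNR     Signal-to-Noise Ratio
HDL     Hardware Description Language
DDR     Double Data Rate
DCI     Digitally Controlled Impedance
DCM     Digital Clock Manager
CLB     Configurable Logic Block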
Implementation of Distributed arithmetic based fast block LMS adaptive filter for high throughput
CHAPTER 1
PREAMBLE
1.1
INTRODUCTION
Adaptive digital filters are widely used in the area of signal processing such as echo
main reason behind the popularity of FPGAs is the balance that they offer the
designer in terms of flexibility, cost, and time-to-market.
The concept of DA has already been applied to LMS based adaptive filters
[1],[3],[8],[10], but not to FBLMS algorithm based adaptive filters. This work proposes a
new hardware-efficient implementation of an FBLMS algorithm based adaptive filter using
DA. The proposed architecture reduces the area requirement of the original FBLMS by
using DA to reduce the number of hardware multipliers, resulting in an adaptive filter
with low power dissipation and high throughput.
1.2
MOTIVATION
Adaptive digital filters are widely used in the area of signal processing such as
1.3
PROBLEM STATEMENT
A BLMS based adaptive filter takes an input sequence, which is partitioned into
non-overlapping blocks of equal length by means of a serial-to-parallel converter, and the
blocks of data so produced are applied to an FIR filter, one block at a time. The
tap weights of the filter are updated after the collection of each block of data samples, so
that the adaptation of the filter proceeds on a block-by-block basis. This approach has low
throughput. The throughput of an FBLMS based adaptive filter is limited by the computational
complexity of the FFT (and IFFT) blocks. The main hardware complexity of the system is due to
hardware multipliers.
1.4
OBJECTIVE OF PROJECT
The objective of the project is to implement FBLMS Adaptive filter without
1.5
ORGANIZATION OF THE REPORT
CHAPTER 2
LITERATURE SURVEY
Paper[1] Discrete fourier transforms when the number of data samples is
prime.
Explanation : A block adaptive filter was derived which allowed fast implementation
while maintaining performance equivalent to that of the LMS adaptive filter. It was
pointed out that BLMS adaptive filters have an analysis advantage over LMS adaptive
filters when the inputs are correlated. Finally, BLMS adaptive filters have lower
computational complexity.
Explanation : The authors developed and evaluated an FFT algorithm which combines the
prime factor index map with the index permutation that converts a DFT to convolution,
and evaluates the convolution by distributed arithmetic. It was shown that the
conversion of a length-N DFT to two length-(N-1)/2 convolutions can be efficiently
used with distributed arithmetic. An indexing scheme using a stored table was developed
to give a very efficient implementation of both the prime factor index map and the index
permutation.
Paper [6] Multi memory block structure for implementing a digital adaptive
filter using distributed arithmetic.
Explanation : This adaptive algorithm, based on the LMS algorithm for adaptive filters
with a large number of taps, was derived from the distributed arithmetic technique. This
type of adaptive filter offers a great advantage in hardware simplicity. In this structure,
the total number of filter taps N is divided into M blocks, each with R taps. These M
blocks operate simultaneously and thus achieve a high-speed signal processing
capability.
Paper [7] Applications of distributed arithmetic to digital signal processing.
Biquadratic Digital
Explanation : In this paper, a new hardware adaptive filter structure for very high
throughput LMS adaptive filters was proposed and its implementation details were
presented. The DA concept involves implementation of MAC operations using a LUT.
The problems typically encountered in updating the LUT for an adaptive filter were
overcome by using an auxiliary LUT with special addressing. Several optimizations for
efficient implementation of large DA adaptive filters were presented. The proposed DA
adaptive filter system was implemented on an FPGA. The design trade-offs presented in
this paper indicate that the DA adaptive filter can yield a significantly higher throughput
than the traditional implementation employing up to four hardware multiply and
accumulate units. The cost of such a design is a marginal increase in memory
requirements. It is also demonstrated that the design can be easily reconfigured to match a
wide range of performance requirements and cost constraints.
CHAPTER 3
ADAPTIVE FILTER
3.1 Introduction
Adaptive filters learn the statistics of their operating environment and continually
adjust their parameters accordingly. In practice, signals of interest often become
contaminated by noise or other signals occupying the same band of frequency. When the
signal of interest and the noise reside in separate frequency bands, conventional linear
filters are able to extract the desired signal. However, when there is spectral overlap
between the signal and noise, or the statistics of the signal or interfering signals change
with time, fixed-coefficient filters are inappropriate. Figure 3.1 shows an example of a
wideband signal whose Fourier spectrum overlaps a narrowband interference signal.
This situation can occur frequently when there are various modulation
technologies operating in the same range of frequencies. In fact, in mobile radio systems
co-channel interference is often the limiting factor rather than thermal or other noise
sources. It may also be the result of intentional signal jamming, a scenario that regularly
arises in military operations when competing sides intentionally broadcast signals to
disrupt their enemies' communications.
Furthermore, if the statistics of the noise are not known a priori, or change over
time, the coefficients of the filter cannot be specified in advance. In these situations,
adaptive algorithms are needed in order to continuously update the filter coefficients.
3.3 Applications
Because of their ability to perform well in unknown environments and track
statistical time variations, adaptive filters have been employed in a wide range of fields.
However, there are essentially four basic classes of applications for adaptive filters. These
are: identification, inverse modeling, prediction, and interference cancellation, with the
Dept.of E&CE, BITM Bellary
main difference between them being the manner in which the desired response is
extracted.
Applications of adaptive filters are:
Channel Identification
Channel Equalization
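The mean square error (MSE), $J(\mathbf{w})$, is obtained by taking the expectations of both sides:
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{w}^H\mathbf{p} - \mathbf{p}^H\mathbf{w} + \mathbf{w}^H\mathbf{R}\mathbf{w} \tag{3.3}$$
Above, $\sigma_d^2$ is the variance of the desired response $d(n)$, $\mathbf{p}$ is the
cross-correlation vector and $\mathbf{R}$ is the correlation matrix of the tap inputs, both
defined below. Plotted against the tap weights, the MSE is a
bowl shaped surface with the minimum point being the optimal weights. This is referred
to as the error performance surface, whose gradient is given by
$$\nabla J(\mathbf{w}) = -2\mathbf{p} + 2\mathbf{R}\mathbf{w} \tag{3.4}$$
To determine the optimal Wiener filter for a given signal requires solving the
Wiener-Hopf equations. First, let the matrix $\mathbf{R}$ denote the M-by-M correlation matrix
of the tap-input vector $\mathbf{u}(n)$. That is,
$$\mathbf{R} = E\left[\mathbf{u}(n)\,\mathbf{u}^H(n)\right] \tag{3.5}$$
where the superscript H denotes the Hermitian transpose. In expanded form this is
$$\mathbf{R} = \begin{bmatrix} r(0) & r(1) & \cdots & r(M-1) \\ r^*(1) & r(0) & \cdots & r(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r^*(M-1) & r^*(M-2) & \cdots & r(0) \end{bmatrix}$$
Also, let $\mathbf{p}$ represent the cross-correlation vector between the tap inputs and the desired
response $d(n)$:
$$\mathbf{p} = E\left[\mathbf{u}(n)\,d^*(n)\right] \tag{3.6}$$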
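With $\mathbf{w}_o$ standing for the M-by-1 optimum tap-weight vector of the transversal filter,
the Wiener-Hopf equations read $\mathbf{R}\mathbf{w}_o = \mathbf{p}$. That is, the optimum weights
sit at the minimum of the error performance
surface. One can adaptively reach the minimum by updating the weights at each time step
by using the equation
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \tfrac{1}{2}\mu\left[-\nabla J(\mathbf{w}(n))\right] \tag{3.7}$$
where $\nabla J$ is the gradient of the error performance surface and the constant $\mu$
is the step size parameter. The step size parameter determines how
fast the algorithm converges to the optimal weights. A necessary and sufficient condition
for the convergence or stability of the steepest descent algorithm is for $\mu$
to satisfy
$$0 < \mu < \frac{2}{\lambda_{\max}} \tag{3.8}$$
where $\lambda_{\max}$ is the largest eigenvalue of the correlation matrix $\mathbf{R}$. The
method of steepest descent needs the correlation matrix and the cross-correlation vector to
evaluate the gradient at every step,
whereas the least-mean-square algorithm performs similarly using far fewer calculations.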
The LMS algorithm may be slow to converge, but
its simplicity has caused it to be the most widely implemented algorithm in practice. For an
N-tap filter, the number of operations is reduced to 2N multiplications and N additions
per coefficient update. This is suitable for real-time applications, and is the reason for the
popularity of the LMS algorithm.
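The LMS recursion described above can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: the function name `lms_identify`, the 4-tap "unknown system" `h`, the step size `mu = 0.05` and the noise-free system-identification setup are all assumptions chosen for the example.

```python
import random

def lms_identify(x, d, num_taps, mu):
    """Sample-by-sample LMS; returns the final weights and the error history."""
    w = [0.0] * num_taps
    taps = [0.0] * num_taps              # delay line, newest sample first
    errors = []
    for xn, dn in zip(x, d):
        taps = [xn] + taps[:-1]          # shift the new sample in
        y = sum(wi * ui for wi, ui in zip(w, taps))         # filter output
        e = dn - y                                          # error signal
        w = [wi + mu * e * ui for wi, ui in zip(w, taps)]   # weight update
        errors.append(e)
    return w, errors

random.seed(0)
h = [0.5, -0.3, 0.2, 0.1]                # hypothetical unknown system
x = [random.uniform(-1.0, 1.0) for _ in range(4000)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
     for n in range(len(x))]
w, errors = lms_identify(x, d, num_taps=len(h), mu=0.05)
print([round(wi, 4) for wi in w])
```

Each iteration costs one inner product for the output and one scaled vector addition for the update, which is where the roughly 2N multiplications per coefficient update come from.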
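Consider a BLMS based adaptive filter [4] that takes an input sequence $x(n)$,
which is partitioned into non-overlapping blocks of length $P$ each by means of a serial-to-
parallel converter, and the blocks of data so produced are applied to an FIR filter of
length $L$, one block at a time. The tap weights of the filter are updated after the collection
of each block of data samples, so that the adaptation of the filter proceeds on a block-by-block
basis rather than on a sample-by-sample basis as in the conventional LMS algorithm.
With the $j$-th block ($j \in \mathbb{Z}$) consisting of $x(jP + r)$, $r = 0, 1, \ldots, P-1$,
the filter coefficients are updated from block to block as
$$\mathbf{w}(j+1) = \mathbf{w}(j) + \mu \sum_{r=0}^{P-1} \mathbf{x}(jP+r)\,e(jP+r) \tag{3.13}$$
where $\mathbf{w}(j) = [w_0(j), w_1(j), \ldots, w_{L-1}(j)]^T$ is the tap-weight vector for the
j-th block, $\mathbf{x}(jP+r) = [x(jP+r), x(jP+r-1), \ldots, x(jP+r-L+1)]^T$ is the tap-input
vector, and $e(jP+r)$ is the output error, given by $e(jP+r) = d(jP+r) - y(jP+r)$.
The sequence $d(jP+r)$ is the desired response and $y(jP+r)$ is the filter output,
given as $y(jP+r) = \mathbf{w}^T(j)\,\mathbf{x}(jP+r)$.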
= 0,1, ......
-1,
, with the
can be obtained by the usual circular correlation technique, by employing M point FFT
and setting the last
represent the forgetting factor and regularization parameter, respectively. The forgetting
factor is a positive constant less than unity, which is roughly a measure of the memory of
the algorithm; the regularization parameter's value is determined by the
signal-to-noise ratio (SNR) of the signals.
The vector
represents the adaptive filters weight vector and the MbyM matrix
and added to the weight vector to update the weights. Once the
weights have been updated the inverse correlation matrix is recalculated, and the training
resumes with the new input values.
Where
= 1, 2, 3 , compute:
(3.14)
An adaptive filter trained with the RLS algorithm can converge up to an order of
magnitude faster than the LMS filter at the expense of increased computational
complexity.
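The trade-off stated above can be made concrete with a small sketch of the conventional exponentially weighted RLS recursion. It is written under assumptions: the names `rls_identify`, `lam` (forgetting factor) and `delta` (regularization parameter), their values, and the noise-free real-valued test system are illustrative and not taken from the report.

```python
import random

def rls_identify(x, d, num_taps, lam=0.99, delta=0.01):
    """Conventional RLS for real-valued data; returns the final weight vector."""
    M = num_taps
    w = [0.0] * M
    # Inverse correlation matrix, initialised to (1/delta) * I.
    P = [[(1.0 / delta if i == j else 0.0) for j in range(M)] for i in range(M)]
    taps = [0.0] * M
    for xn, dn in zip(x, d):
        taps = [xn] + taps[:-1]
        Pu = [sum(P[i][j] * taps[j] for j in range(M)) for i in range(M)]
        denom = lam + sum(taps[i] * Pu[i] for i in range(M))
        k = [v / denom for v in Pu]                            # gain vector
        e = dn - sum(wi * ui for wi, ui in zip(w, taps))       # a priori error
        w = [wi + ki * e for wi, ki in zip(w, k)]              # weight update
        # P <- (P - k u^T P) / lam; P is symmetric, so u^T P equals Pu transposed.
        P = [[(P[i][j] - k[i] * Pu[j]) / lam for j in range(M)] for i in range(M)]
    return w

random.seed(1)
h = [0.5, -0.3, 0.2, 0.1]                # hypothetical unknown system
x = [random.uniform(-1.0, 1.0) for _ in range(500)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
     for n in range(len(x))]
w_rls = rls_identify(x, d, num_taps=len(h))
print([round(wi, 4) for wi in w_rls])
```

The M-by-M matrix update inside the loop is the source of the extra cost: it needs on the order of M squared operations per sample, against the order of M needed by LMS.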
CHAPTER 4
FFT ALGORITHMS
4.1 Introduction
A Fast Fourier Transform (FFT) is an efficient algorithm used to calculate the
Discrete Fourier Transform (DFT) and its inverse. FFTs are widely used in application
areas like fast convolution, fast correlation, solving partial differential equations and
multiplication of large integers and complex numbers. FFT algorithms are based on the
divide-and-conquer approach. In this approach an N-point DFT is successively decomposed
into smaller DFTs. Because of this decomposition the number of computations is reduced.
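As a concrete instance of this decomposition, the sketch below splits an N-point DFT into two N/2-point DFTs recursively and checks the result against the direct definition. It assumes the radix-2 decimation-in-time form with N a power of two, chosen here only because it is the shortest to write down; the function names are illustrative.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])           # N/2-point DFT of even-indexed samples
    odd = fft_radix2(x[1::2])            # N/2-point DFT of odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]   # twiddle factor
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out

samples = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 4.0]
fast = fft_radix2(samples)
slow = dft(samples)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The direct form needs N squared complex multiplications, while the recursive form needs on the order of N log2 N.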
The other types of FFT algorithms are:
Prime-factor FFT algorithm
Cooley-Tukey FFT algorithm
Rader's FFT algorithm
Winograd's FFT algorithm
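The prime-factor algorithm re-expresses the DFT of length $N = N_1 N_2$, with $N_1$ and $N_2$
relatively prime, as a two-dimensional $N_1 \times N_2$ DFT. It
decomposes the one dimensional DFT into a multidimensional DFT using the index map
proposed by Good [4].
The index mapping suggested by Good and Thomas for $n$ is
$$n = (N_2 n_1 + N_1 n_2) \bmod N, \qquad 0 \le n_1 \le N_1 - 1,\quad 0 \le n_2 \le N_2 - 1 \tag{4.1}$$
and as index mapping for $k$ results
$$k = \left(N_2 \langle N_2^{-1} \rangle_{N_1} k_1 + N_1 \langle N_1^{-1} \rangle_{N_2} k_2\right) \bmod N \tag{4.2}$$
where $\langle N_2^{-1} \rangle_{N_1}$ denotes the multiplicative inverse of $N_2$ modulo $N_1$.
If we substitute the Good-Thomas index map in the equation for the DFT matrix it follows that
$$X[k_1, k_2] = \sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2} \left( \sum_{n_1=0}^{N_1-1} W_{N_1}^{n_1 k_1}\, x[n_1, n_2] \right)$$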
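The prime-factor FFT can then be calculated in the following four steps:
1. Organize the inputs of $x[n]$ according to (4.1).
2. Calculate the $N_2$ DFTs of length $N_1$.
3. Calculate the $N_1$ DFTs of length $N_2$.
4. Organize the outputs for k as in eqn. (4.2).
Since the prime-factor FFT does not require multiplications by twiddle factors, it is
generally considered to be one of the most efficient methods for calculating the DFT of a
sequence whose length factors into relatively prime terms.
Consider the length N = 14, and suppose we have $N_1$ = 7 and $N_2$ = 2. Then
$n = (2 n_1 + 7 n_2) \bmod 14$ for the input index and
$k = (8 k_1 + 7 k_2) \bmod 14$ for the output
index results.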
FFT (where
is a simple identity between the convolution of length N and the FFT of the same length,
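so if (N - 1) is easily factorisable this allows the convolution to be computed efficiently
via the FFT.
The DFT of the sequence $x[n]$
is given by
$$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{nk}, \qquad W_N = e^{-j2\pi/N}$$
The Rader algorithm to compute the DFT is defined only for prime length N,
since only then does a generator $g$ exist whose powers $g^m \bmod N$ run through all
nonzero indices. The zero frequency term $X[0] = \sum_n x[n]$ is computed separately;
by substituting $n$
with $g^n \bmod N$ and $k$ with $g^k \bmod N$, the remaining outputs become
$$X[g^k \bmod N] - x[0] = \sum_{n=0}^{N-2} x[g^n \bmod N]\,W_N^{g^{n+k} \bmod N}, \qquad k = 0, 1, \ldots, N-2$$
which is a cyclic correlation (equivalently, a cyclic convolution) of length N - 1.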
It is easily seen from the definition of the DFT that the transform of a length-N
real sequence is conjugate symmetric, so it is necessary
to compute only half of the transform, as the remaining half is redundant and need not be
calculated. Rader's algorithm provides a straightforward way to compute only half of the
conjugate symmetric outputs without calculating the others, which is not possible with
other algorithms like radix-2, Cooley-Tukey and Winograd.
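A minimal sketch of Rader's re-indexing for a prime length is given below, checked against the direct DFT for N = 7. The function name `rader_dft`, the choice of generator g = 3 and the test data are assumptions made for illustration. The output index is taken as a power of the inverse generator so that the sum is literally a cyclic convolution of length N - 1; here that convolution is evaluated directly rather than by FFT or distributed arithmetic.

```python
import cmath

def dft(x):
    """Direct evaluation of the DFT definition, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def rader_dft(x, g):
    """DFT of prime length N = len(x) via a length-(N-1) cyclic convolution.

    g must be a primitive root modulo N (for example g = 3 when N = 7).
    """
    N = len(x)
    g_inv = pow(g, N - 2, N)                     # inverse of g modulo prime N
    a = [x[pow(g, q, N)] for q in range(N - 1)]  # permuted input samples
    b = [cmath.exp(-2j * cmath.pi * pow(g_inv, q, N) / N) for q in range(N - 1)]
    X = [0j] * N
    X[0] = sum(x)                                # zero frequency term, separately
    for p in range(N - 1):
        conv = sum(a[q] * b[(p - q) % (N - 1)] for q in range(N - 1))
        X[pow(g_inv, p, N)] = x[0] + conv
    return X

data = [1.0, -2.0, 3.5, 0.0, 4.0, -1.5, 2.0]     # real input, N = 7
X_rader = rader_dft(data, g=3)
X_ref = dft(data)
print(max(abs(u - v) for u, v in zip(X_rader, X_ref)))
```

Because the input is real, `X_rader[7 - k]` is the complex conjugate of `X_rader[k]`, so only three of the six nonzero-frequency outputs actually need to be computed.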
CHAPTER 5
DISTRIBUTED ARITHMETIC
5.1 INTRODUCTION
Distributed arithmetic (DA) is an important technique for FPGA designs which can be used to
compute a sum of products [7]. This technique, first proposed by Croisier et al., is a
multiplierless architecture based on an efficient partition of the function into partial
terms using the 2's complement binary representation of the data. The partial terms can be
precomputed and stored in LUTs. The flexibility of this algorithm on FPGAs permits
everything from bit-serial implementations to pipelined or fully parallel versions of the
scheme, which can greatly improve the design performance. DA has been widely used in
DSP applications such as convolution, DFT, DCT and digital filters [7].
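A sum of products (inner product) of length N is defined as
$$y = \sum_{n=0}^{N-1} c[n]\,x[n] \tag{5.1}$$
where c[n] are fixed coefficients and x[n] are the input data words. The sum of products term can
be expanded as
$$y = \sum_{n=0}^{N-1} c[n]\,x[n] = c[0]x[0] + c[1]x[1] + \cdots + c[N-1]x[N-1] \tag{5.2}$$
If each x[n] is an unsigned B-bit word, it can be written as
$$x[n] = \sum_{b=0}^{B-1} x_b[n]\,2^b \tag{5.3}$$
where $x_b[n]$ denotes the b-th bit of x[n]. The product can be represented as
$$y = \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\,2^b \tag{5.4}$$
Interchanging the order of summation distributes the arithmetic over the bit positions:
$$y = \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} x_b[n]\,c[n] \tag{5.5}$$
For signed inputs in B-bit 2's complement representation, the most significant bit carries a
negative weight:
$$x[n] = -2^{B-1}\,x_{B-1}[n] + \sum_{b=0}^{B-2} x_b[n]\,2^b \tag{5.6}$$
so that the inner product becomes
$$y = -2^{B-1} \sum_{n=0}^{N-1} c[n]\,x_{B-1}[n] + \sum_{b=0}^{B-2} 2^b \sum_{n=0}^{N-1} x_b[n]\,c[n]$$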
The block diagram of DA is shown in Figure 5.1. The number of partial products in the
LUT depends on the number of inputs: if there are N inputs, the LUT has to store $2^N$ partial
products. The LSBs of all inputs together form the address used to fetch a partial product from
the LUT. Each time, the inputs are shifted right by one bit to form the new address, and the
partial product addressed by the new LSBs is fetched from the LUT. These partial products are
accumulated, each weighted by the power of two of its bit position, to get the sum of
products. This method of implementing DA is called serial DA. The number of clock
cycles required to compute a sum of products equals the number of bits in the input words.
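The serial DA scheme can be modelled in software as a sanity check before writing any HDL. The sketch below is a behavioural model under stated assumptions: integer coefficients, B-bit 2's complement inputs, and the hypothetical names `build_da_lut` and `da_inner_product`. It is not the report's hardware description.

```python
def build_da_lut(coeffs):
    """Precompute all 2**N partial sums; bit n of the address selects coeffs[n]."""
    N = len(coeffs)
    return [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
            for addr in range(1 << N)]

def da_inner_product(coeffs, samples, bits):
    """Bit-serial DA: one LUT read and one shift-accumulate per input bit."""
    lut = build_da_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]      # 2's complement bit patterns
    acc = 0
    for b in range(bits):                    # one "clock cycle" per bit
        addr = 0
        for n, word in enumerate(words):
            addr |= ((word >> b) & 1) << n   # bit b of every input forms the address
        partial = lut[addr] * (1 << b)       # weight by the bit position
        if b == bits - 1:
            acc -= partial                   # the sign bit has negative weight
        else:
            acc += partial
    return acc

coeffs = [3, -5, 7, 2]
samples = [10, -20, 33, -128]
y_da = da_inner_product(coeffs, samples, bits=8)
y_ref = sum(c * s for c, s in zip(coeffs, samples))
print(y_da, y_ref)
```

No multiplier is involved in the inner loop: the multiplication by `1 << b` is a shift, and the subtraction on the last cycle corresponds to switching the accumulator from add to subtract for the sign bit.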
CHAPTER 6
IMPLEMENTATION METHODOLOGY
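The prime-factor FFT of length $N = N_1 N_2$ is calculated in four steps:
1) Index transform of the input sequence according to (4.1).
2) Computation of $N_2$ DFTs of length $N_1$.
3) Computation of $N_1$ DFTs of length $N_2$.
4) Index transform of the output sequence according to (4.2).
Consider the length N = 14 with $N_1$ = 7 and $N_2$ = 2. Then $n = (2 n_1 + 7 n_2) \bmod 14$
for the input index and $k = (8 k_1 + 7 k_2) \bmod 14$ for the output
index results, and using these index transforms we can construct the signal flow graph as
shown in Figure 6.1.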
From Figure 6.1 we see that the first stage has 2 DFTs of 7 points each and the
second stage has 7 DFTs of length 2 each. Notably,
multiplication by twiddle factors between the stages is not required.
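The two-stage structure for N = 14 can be checked numerically with the sketch below. It assumes one common form of the Good-Thomas index maps, n = (N2*n1 + N1*n2) mod N for the input and a Chinese-remainder map for the output; the helper names `mod_inverse` and `pfa_dft` are hypothetical. The short DFTs are evaluated directly here, whereas a hardware design would replace them with dedicated modules.

```python
import cmath

def dft(x):
    """Direct evaluation of the DFT definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def mod_inverse(a, m):
    """Smallest positive integer t with (a * t) % m == 1."""
    return next(t for t in range(1, m + 1) if (a * t) % m == 1)

def pfa_dft(x, N1, N2):
    """Prime-factor DFT of length N1*N2 (N1, N2 coprime), no twiddle factors."""
    N = N1 * N2
    # Stage 1: N2 DFTs of length N1 on the re-indexed input.
    stage1 = [dft([x[(N2 * n1 + N1 * n2) % N] for n1 in range(N1)])
              for n2 in range(N2)]
    a = mod_inverse(N2, N1)
    b = mod_inverse(N1, N2)
    X = [0j] * N
    # Stage 2: N1 DFTs of length N2, then the output index map.
    for k1 in range(N1):
        col = dft([stage1[n2][k1] for n2 in range(N2)])
        for k2 in range(N2):
            X[(N2 * a * k1 + N1 * b * k2) % N] = col[k2]
    return X

x14 = [float((3 * n * n + 1) % 11) - 5.0 for n in range(14)]
X_pfa = pfa_dft(x14, 7, 2)
X_direct = dft(x14)
print(max(abs(u - v) for u, v in zip(X_pfa, X_direct)))
```

For N1 = 7 and N2 = 2 the output map evaluates to k = (8*k1 + 7*k2) mod 14, and the only arithmetic between the stages is data movement.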
For the 7-point DFT, if the data are real we need to calculate only half of the transform.
Also, as Rader showed, the zero frequency term must be calculated separately.
In matrix form, we write
Replacing
(6.4)
If the real and imaginary parts of the W matrix in (6.4) are separated, a simplification is possible.
Consider first the real part. Using the notation that k stands for cos(2πk/7), the real
part of (6.4) becomes
(6.5)
Using the notation of k for sin(2πk/7) gives, for the imaginary part of (6.4),
(6.6)
Equations (6.5) and (6.6) are cyclic convolution relations. Since in our problem we always
convolve with the same coefficients (in the case of the DFT, the twiddle factor matrix),
arithmetic efficiency can be improved by precalculating some of the intermediate results.
These are stored in a table in memory and simply addressed as needed. Using distributed
arithmetic this can be implemented efficiently. Here we present only the structure,
shown in Figure 6.2, best suited to the DFT calculated by cyclic convolution.
Initially R1 to R7 are cleared to zero and the Xi's are loaded into registers R1 to
R3 after addition.
Then R1 to R3 are all shifted by one bit, so that the last bit of each register is in R4. The
ROM output is added to R5.
Circular shifts of R4 produce further ROM outputs, which are added to R5 to R7. When the
first cycle is completed, the contents of all registers R1 to R7 except R4 are right shifted
by one bit and the second cycle starts.
At the B-th cycle the function of ADD_SUB changes from addition to subtraction, and after
this B-th cycle the contents of R5 to R7 give the final FFT coefficients. The zero
frequency component can be calculated separately by accumulating and
adding the Xi's as in Figure 6.2.
To calculate the output and error signals, the fast block LMS algorithm completes the
following steps:
1. Concatenates the current input signal block to the previous block.
2. Performs an FFT to transform the input signal blocks from the time domain to the
frequency domain.
3. Multiplies the input signal blocks by the filter coefficients vector W(n).
4. Performs an IFFT on the multiplication result.
5. Retrieves the last block from the result as the output signal vector y(n).
6. Calculates the error signal vector e(n) by comparing the desired signal vector d(n)
with y(n).
After calculating the output and error signals, the fast block LMS algorithm updates the
filter coefficients. The following shows the steps that this algorithm completes to update
the filter coefficients.
7. Inserts zeroes before the error signal vector e(n). This step ensures the error
signal vector has the same length as the concatenated input signal blocks.
8. Performs an FFT on the zero-padded error signal vector.
9. Multiplies the results by the complex conjugate of the FFT result of the input
signal blocks.
10. Performs an IFFT on the multiplication result.
11. Sets the values of the last block of the IFFT result to zeroes and then performs an
FFT on the IFFT result.
12. Multiplies the step size by the FFT result.
13. Adds the filter coefficients vector w(n) to the multiplication result. This step
updates the filter coefficients.
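The steps of the fast block LMS algorithm can be sketched as a software reference model. This is an illustration under assumptions, not the report's design: the block length equals the filter length M, the FFT size is 2M, and a plain recursive radix-2 FFT stands in for the DA based prime-factor FFT. The names `fblms`, `fft`, `ifft`, the 4-tap system `h` and `mu = 0.05` are hypothetical.

```python
import cmath
import random

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    sign = 1 if inverse else -1
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / N) * odd[k]
        out[k], out[k + N // 2] = even[k] + t, even[k] - t
    return out

def ifft(X):
    return [v / len(X) for v in fft(X, inverse=True)]

def fblms(x, d, M, mu):
    """Overlap-save fast block LMS with block length M and FFT size 2M."""
    W = [0j] * (2 * M)                   # frequency-domain weights
    prev = [0.0] * M                     # previous input block
    errors = []
    for start in range(0, len(x) - M + 1, M):
        block = x[start:start + M]
        U = fft(prev + block)                                        # steps 1-2
        y = [v.real for v in ifft([u * w for u, w in zip(U, W)])][M:]  # steps 3-5
        e = [dn - yn for dn, yn in zip(d[start:start + M], y)]       # step 6
        E = fft([0.0] * M + e)                                       # steps 7-8
        grad = ifft([u.conjugate() * ev for u, ev in zip(U, E)])     # steps 9-10
        G = fft(grad[:M] + [0j] * M)                                 # step 11
        W = [w + mu * g for w, g in zip(W, G)]                       # steps 12-13
        prev = block
        errors.extend(e)
    w_time = [v.real for v in ifft(W)][:M]   # time-domain taps, for inspection
    return w_time, errors

random.seed(2)
h = [0.5, -0.3, 0.2, 0.1]                # hypothetical unknown system
x = [random.uniform(-1.0, 1.0) for _ in range(4096)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
     for n in range(len(x))]
w_fb, err_fb = fblms(x, d, M=len(h), mu=0.05)
print([round(wi, 4) for wi in w_fb])
```

Each block update uses three forward FFTs and two inverse FFTs, which is why the cost of the FFT and IFFT modules dominates the throughput of the whole filter.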
The signal-flow graph representation of the DA based Fast Block LMS algorithm is shown in
Figure 6.3.
CHAPTER 7
7.1 VERILOG
Verilog is a Hardware Description Language; a textual format for describing
electronic circuits and systems. Applied to electronic design, Verilog is intended to be
used for verification through simulation, for timing analysis, for test analysis (testability
analysis and fault grading) and for logic synthesis.
The Verilog HDL is an IEEE standard - number 1364. The first version of the
IEEE standard for Verilog was published in 1995. A revised version was published in
2001; this is the version used by most Verilog users. The IEEE Verilog standard
document is known as the Language Reference Manual, or LRM. This is the complete
authoritative definition of the Verilog HDL.
A further revision of the Verilog standard was published in 2005, though it has
little extra compared to the 2001 standard. SystemVerilog is a huge set of extensions to
Verilog, and was first published as an IEEE standard in 2005.
IEEE Std 1364 also defines the Programming Language Interface, or PLI. This is
a collection of software routines which permit a bidirectional interface between Verilog
and other languages (usually C).
Note that VHDL is not an abbreviation for Verilog HDL - Verilog and VHDL are
two different HDLs. They have more similarities than differences, however.
Cadence Design Systems acquired Gateway in 1989, and with it the rights to the
language and the simulator. In 1990, Cadence put the language (but not the simulator)
into the public domain, with the intention that it should become a standard, nonproprietary language.
The Verilog HDL is now maintained by a non profit making organisation,
Accellera, which was formed from the merger of Open Verilog International (OVI) and
VHDL International. OVI had the task of taking the language through the IEEE
standardisation procedure.
In December 1995 Verilog HDL became IEEE Std. 1364-1995. A significantly
revised version was published in 2001: IEEE Std. 1364-2001. There was a further
revision in 2005 but this only added a few minor changes.
Accellera have also developed a new standard, SystemVerilog, which extends
Verilog. SystemVerilog became an IEEE standard (1800-2005) in 2005. There is also a
draft standard for analog and mixed-signal extensions to Verilog, Verilog-AMS.
At the highest level, Verilog contains stochastic functions (queues and random
probability distributions) to support performance modelling.
Verilog supports abstract behavioural modeling, so can be used to model the
functionality of a system at a high level of abstraction. This is useful at the system
analysis and partitioning stage.
Verilog supports Register Transfer Level descriptions, which are used for the
detailed design of digital circuits. Synthesis tools transform RTL descriptions to gate
level.
Verilog supports gate and switch level descriptions, used for the verification of
digital designs, including gate and switch level logic simulation, static and dynamic
timing analysis, testability analysis and fault grading.
Verilog can also be used to describe simulation environments; test vectors,
expected results, results comparison and analysis. With some tools, Verilog can be used
to control simulation e.g. setting breakpoints, taking checkpoints, restarting from time 0,
tracing waveforms. However, most of these functions are not included in the 1364
standard, but are proprietary to particular simulators. Most simulators have their own
command languages; with many tools this is based on Tcl, which is an industry-standard
tool language.
Design process
The diagram below shows a very simplified view of the electronic system design
process incorporating Verilog. The central portion of the diagram shows the parts of the
design process which will be impacted by Verilog.
System level
Verilog is not ideally suited for abstract system-level simulation, prior to the
hardware-software split. This is to some extent addressed by SystemVerilog. Unlike
VHDL, which has support for user-defined types and overloaded operators which allow
the designer to abstract his work into the domain of the problem, Verilog restricts the
designer to working with pre-defined system functions and tasks for stochastic simulation.
These can be used for modelling performance, throughput and queueing, but only in so far as
those built-in language features allow. Designers occasionally use the stochastic level of
abstraction for this phase of the design process.
Digital
Verilog is suitable for use today in the digital hardware design process, from
functional simulation, manual design and logic synthesis down to gate-level simulation.
Verilog tools provide an integrated design environment in this area.
Verilog is also suited for specialized implementation-level design verification
tools such as fault simulation, switch level simulation and worst case timing simulation.
Verilog can be used to simulate gate level fanout loading effects and routing delays
through the import of SDF files.
The RTL level of abstraction is used for functional simulation prior to synthesis.
The gate level of abstraction exists post-synthesis but this level of abstraction is not often
created by the designer, it is a level of abstraction adopted by the EDA tools (synthesis
and timing analysis, for example).
Analog
Because of Verilog's flexibility as a programming language, it has been stretched
to handle analog simulation in limited cases. There is a draft standard Verilog-AMS
that addresses analog and mixed signal simulation.
For today's large, complex designs, verification can be a real bottleneck. This provides
another motivation for SystemVerilog, which has features for expediting testbench
development.
RTL verification
The RTL Verilog is then simulated to validate the functionality against the
specification. RTL simulation is usually one or two orders of magnitude faster than gate
level simulation, and experience has shown that this speed-up is best exploited by doing
more simulation, not spending less time on simulation.
In practice it is common to spend 70-80% of the design cycle writing and
simulating Verilog at and above the register transfer level, and 20-30% of the time
synthesizing and verifying the gates.
Look-ahead Synthesis
Although some exploratory synthesis will be done early on in the design process,
to provide accurate speed and area data to aid in the evaluation of architectural decisions
and to check the engineer's understanding of how the Verilog will be synthesized, the
main synthesis production run is deferred until functional simulation is complete. It is
pointless to invest a lot of time and effort in synthesis until the functionality of the design
is validated.
Synthesizing Verilog
Synthesis is a broad term often used to describe very different tools. Synthesis can
include silicon compilers and function generators used by ASIC vendors to produce
regular RAM and ROM type structures. Synthesis in the context of this report refers to
generating random logic structures from Verilog descriptions. This is best suited to gate
arrays and programmable devices such as FPGAs.
Synthesis is not a panacea! It is vital to tackle High Level Design using Verilog
with realistic expectations of synthesis.
The definition of Verilog for simulation is cast in stone and enshrined in the
Language Reference Manual. Other tools which use Verilog, such as synthesis, will make
their own interpretation of the Verilog language. There is an IEEE standard for Verilog
synthesis (IEEE Std. 1364.1-2002) but no vendor adheres strictly to it.
steps and that will have the best chance of getting back a working prototype that functions
correctly.
The synthesis process checks the code syntax and analyzes the hierarchy of the design,
which ensures that the design is optimized for the architecture the designer has
selected. The resulting netlist(s) are saved to a Native Generic Circuit (NGC) file (for
Xilinx Synthesis Technology (XST)).
array. Those devices ranging from the XC3S200 to the XC3S2000 have two columns of
block RAM. The XC3S4000 and XC3S5000 devices have four RAM columns. Each
column is made up of several 18-Kbit RAM blocks; each block is associated with a
dedicated multiplier. The DCMs are positioned at the ends of the outer block RAM
columns. The Spartan-3 family features a rich network of traces and switches that
interconnect all five functional elements, transmitting signals among them. Each
functional element has an associated switch matrix that permits multiple connections to
the routing.
CHAPTER 8
8.1 Simulation results
The simulation results of both FFT and IFFT modules are individually presented
Y = [28.0000, -3.5000 + 7.2678i, -3.5000 + 2.7912i, -3.5000 + 0.7989i, -3.5000 - 0.7989i,
-3.5000 - 2.7912i, -3.5000 - 7.2678i].
In the implementation, the twiddle factors are scaled by 256 and stored in a LUT, so the
output obtained is also scaled, except for the zero-frequency term.
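The values listed for Y agree with the 7-point DFT of the sequence 1, 2, ..., 7, so the Python sketch below uses that sequence as an assumed test input (the input vector itself is not stated in the surviving text). It also illustrates the scaling remark: twiddle factors are multiplied by 256, rounded to integers and held in a table, while the zero-frequency term is formed as a plain sum and therefore stays unscaled. The function names and the rounding rule are illustrative assumptions, not the RTL of this project.

```python
import cmath
import math

SCALE = 256  # twiddle scaling factor stated in the text


def exact_dft(x):
    """Reference N-point DFT in floating point."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def twiddle_lut(n, scale=SCALE):
    """Integer twiddle table: entry p holds round(scale * exp(-j*2*pi*p/n))."""
    return [complex(round(scale * math.cos(2 * math.pi * p / n)),
                    round(-scale * math.sin(2 * math.pi * p / n)))
            for p in range(n)]


def scaled_dft(x, scale=SCALE):
    """DFT using the integer table; bins 1..N-1 come out multiplied by `scale`."""
    n = len(x)
    lut = twiddle_lut(n, scale)
    out = [complex(sum(x), 0)]  # zero-frequency term: plain sum, no twiddle, unscaled
    for k in range(1, n):
        out.append(sum(x[m] * lut[(m * k) % n] for m in range(n)))
    return out


x = [1, 2, 3, 4, 5, 6, 7]  # assumed input, consistent with the listed Y
Y = exact_dft(x)           # floating-point reference
Ys = scaled_dft(x)         # integer-twiddle result, scaled by 256 except bin 0
```

Dividing bins 1 to 6 of `Ys` by 256 recovers the reference values to within the rounding error of the table entries.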
8.2 Synthesis results
Synthesis of the DA-FBLMS adaptive filter was performed using Xilinx
8.3
CHAPTER 9
CONCLUSIONS AND FUTURE WORK
9.1 CONCLUSION
From the discussion of results, the following conclusions can be made:
1. The Fast Block LMS adaptive filter is designed and implemented for a filter length of
7 (i.e., a 14-point FFT).
2. Hardware resource utilization results confirm that the proposed adaptive digital filter
requires less hardware (about 40%) than the existing architecture, so the
proposed adaptive digital filter is hardware-efficient compared to the existing one.
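Conclusion 1 pairs a filter length of 7 with a 14-point FFT. In the usual overlap-save formulation of the fast block LMS filter, the FFT length is twice the filter length, so that the last N outputs of each circular convolution equal the linear convolution. The Python sketch below shows only that filtering step (the weight update is omitted), assuming the block length equals the filter length. The function names and test data are illustrative, and a plain DFT stands in for the distributed arithmetic hardware.

```python
import cmath

N = 7        # filter length (block length assumed equal to it)
M = 2 * N    # transform size: 14 points


def dft(x, inverse=False):
    """Plain M-point DFT or inverse DFT (no fast algorithm, for clarity)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def block_output(w, prev_block, cur_block):
    """Overlap-save: filter one block of N samples using 2N-point transforms."""
    W = dft(list(w) + [0.0] * N)                  # weights padded with N zeros
    X = dft(list(prev_block) + list(cur_block))   # previous block followed by current block
    y = dft([a * b for a, b in zip(X, W)], inverse=True)
    return [v.real for v in y[N:]]                # only the last N samples are valid


def direct_output(w, prev_block, cur_block):
    """Time-domain reference: y[n] = sum_i w[i] * x[n - i] over the current block."""
    x = list(prev_block) + list(cur_block)
    return [sum(w[i] * x[N + n - i] for i in range(N)) for n in range(N)]


w = [0.5, -0.25, 0.125, 0.0, 0.3, -0.1, 0.05]   # illustrative weights
prev_block = [1, 2, 3, 4, 5, 6, 7]               # illustrative input samples
cur_block = [7, 5, 3, 1, -1, -3, -5]

y_fft = block_output(w, prev_block, cur_block)
y_ref = direct_output(w, prev_block, cur_block)
```

The first N samples of the inverse transform are discarded because they are corrupted by circular wrap-around; this is why a 7-tap filter needs a 14-point transform.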
BIBLIOGRAPHY
[1] C. M. Rader, "Discrete Fourier transforms when the number of data samples is
prime," Proceedings of the IEEE, vol. 56, pp. 1107-1108, June 1968.
[2] A. Peled and B. Liu, "A new hardware realization of digital filters," IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol. 22, pp. 456-462,
December 1974.
[3] C. S. Burrus, "Index mappings for multidimensional formulation of the DFT and
convolution," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25,
pp. 239-242, June 1977.
[4] G. A. Clark, S. K. Mitra, and S. R. Parker, "Block implementation of adaptive
digital filters," IEEE Transactions on Circuits and Systems, vol. 28, pp. 584-592, 1981.
[5] S. Chu and C. S. Burrus, "A prime factor FFT algorithm using distributed arithmetic,"
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, April 1982.
[6] C. Wei and J. J. Lou, "Multi memory block structure for implementing a digital
adaptive filter using distributed arithmetic," IEEE Proceedings, Electronic Circuits and
Systems, vol. 133, February 1986.
[7] S. A. White, "Applications of distributed arithmetic to digital signal processing: A
tutorial review," IEEE ASSP Magazine, July 1989.
[8] D. J. Allred, W. Huang, V. Krishnan, H. Yoo, and D. V. Anderson, "An FPGA
implementation for a high throughput adaptive filter using distributed arithmetic," 12th
Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 324-325,
2004.
[9] Daniel J. Allred, Walter Huang, Venkatesh Krishnan, Heejong Yoo, and David V.
Anderson, "LMS adaptive filters using distributed arithmetic for high throughput," IEEE
Transactions on Circuits and Systems, vol. 52, pp. 1327-1337, July 2005.
[10] N. J. Sorawat Chivapreecha, Aungkana Jaruvarakul, and K. Dejhan, "Adaptive
equalization architecture using distributed arithmetic for partial response channels," IEEE
Tenth International Symposium on Consumer Electronics, 2006.
Appendix
Xilinx
Creating a new ISE project for the FPGA device on the Spartan-3 Kit
To create a new project:
1. Select File > New Project... The New Project Wizard appears.
2. Type tutorial in the Project Name field.
3. Enter or browse to a location (directory path) for the new project. A tutorial
subdirectory is created automatically.
4. Verify that HDL is selected from the Top-Level Source Type list.
5. Click Next to move to the device properties page.
6. Fill in the properties in the table as shown below:
Product Category: All
Family: Spartan3
Device: XC3S200
Package: FT256
Speed Grade: -4
Top-Level Source Type: HDL
Synthesis Tool: XST (VHDL/Verilog)
Simulator: ISE Simulator (VHDL/Verilog)
Preferred Language: Verilog (or VHDL)
Verify that Enable Enhanced Design Summary is selected.
Click Next to proceed to the Create New Source window in the New Project Wizard. At
the end of the next section, your new project will be complete.
Figure 1
Figure 2
Click Next, then Finish in the New Source Wizard - Summary dialog box to complete
the new source file template.
7. Click Next, then Next, then Finish.
The source file containing the entity/architecture pair displays in the Workspace, and the
counter displays in the Source tab, as shown below:
Figure 3
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

-- Uncomment the following library declaration if instantiating
-- any Xilinx primitive in this code.
--library UNISIM;
--use UNISIM.VComponents.all;

-- 4-bit up/down counter: counts up when DIRECTION = '1', down otherwise.
entity counter is
    Port ( CLOCK     : in  STD_LOGIC;
           DIRECTION : in  STD_LOGIC;
           COUNT_OUT : out STD_LOGIC_VECTOR (3 downto 0));
end counter;

architecture Behavioral of counter is
    -- Internal count register, initialized to zero at configuration.
    signal count_int : std_logic_vector(3 downto 0) := "0000";
begin
    -- Update the count on every rising edge of CLOCK.
    process (CLOCK)
    begin
        if CLOCK='1' and CLOCK'event then
            if DIRECTION='1' then
                count_int <= count_int + 1;
            else
                count_int <= count_int - 1;
            end if;
        end if;
    end process;

    -- Drive the output port from the internal register.
    COUNT_OUT <= count_int;
end Behavioral;
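As a cross-check for simulation, the behaviour of the counter above can be modelled in a few lines of Python. This is only a software sketch of the same up/down rule with 4-bit wrap-around; the function names are illustrative and are not part of the ISE project.

```python
def counter_step(count, direction, width=4):
    """Next count after one rising clock edge: +1 if direction is 1, else -1, modulo 2**width."""
    delta = 1 if direction == 1 else -1
    return (count + delta) % (1 << width)


def run(count, directions):
    """Apply one clock edge per entry in `directions`; return the count after each edge."""
    trace = []
    for d in directions:
        count = counter_step(count, d)
        trace.append(count)
    return trace


# Three edges counting up from 0, then one edge counting down.
trace = run(0, [1, 1, 1, 0])
```

The modulo operation mirrors the wrap-around of the 4-bit vector: counting up from 15 gives 0, and counting down from 0 gives 15.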
Figure 4
Note: You can also create a UCF file for your project by selecting Project > Create New
Source.
5. In the Timing Constraints dialog, enter the following in the Period, Pad to Setup, and
Clock to Pad fields:
Period: 40
Pad to Setup: 10
Clock to Pad: 10
6. Press Enter.
After the information has been entered, the dialog should look like the one shown
below.
Figure 5
Select Timing Constraints under Constraint Type in the Timing Constraints tab and
the newly created timing constraints are displayed as follows:
Figure 6
Save the timing constraints. If you are prompted to rerun the TRANSLATE or XST step,
click OK to continue.
9. Close the Constraints Editor.
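The constraint values entered above can be sanity-checked with simple arithmetic. The Python sketch below assumes the three values are in nanoseconds (the units are not stated in the steps) and uses made-up post-route delays purely for illustration; none of the achieved figures come from this project.

```python
def clock_frequency_mhz(period_ns):
    """Clock frequency implied by a period constraint given in nanoseconds."""
    return 1000.0 / period_ns


def slack_ns(required_ns, achieved_ns):
    """Slack of one constraint; zero or positive means the constraint is met."""
    return required_ns - achieved_ns


# Constraint values from the steps above (units assumed to be ns).
constraints = {"period": 40.0, "pad_to_setup": 10.0, "clock_to_pad": 10.0}

# Hypothetical post-route delays, for illustration only.
achieved = {"period": 6.2, "pad_to_setup": 3.1, "clock_to_pad": 8.4}

slacks = {name: slack_ns(constraints[name], achieved[name]) for name in constraints}
all_met = all(value >= 0 for value in slacks.values())
```

Under the nanosecond assumption, a period of 40 corresponds to a 25 MHz clock.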
Figure 7
Locate the Performance Summary table near the bottom of the Design Summary. Click the
All Constraints Met link in the Timing Constraints field to view the Timing Constraints
report. Verify that the design meets the specified timing requirements.
Figure 8
Figure 9
5. Select File > Save. You are prompted to select the bus delimiter type based on the
synthesis tool you are using. Select XST Default <> and click OK.
6. Close PACE.
Notice that the Implement Design processes have an orange question mark next to them,
indicating they are out-of-date with one or more of the design files. This is because the
UCF file has been modified.
Figure 10
7. In the Welcome dialog box, select Configure devices using Boundary-Scan (JTAG).
8. Verify that Automatically connect to a cable and identify Boundary-Scan chain is
selected.
9. Click Finish.
10. If you get a message saying that there are two devices found, click OK to continue.
The devices connected to the JTAG chain on the board will be detected and displayed in
the iMPACT window.
11. The Assign New Configuration File dialog box appears. To assign a configuration file
to the xc3s200 device in the JTAG chain, select the counter.bit file and click Open.
Figure 11
Figure 12
On the board, LEDs 0, 1, 2, and 3 are lit, indicating that the counter is running.
16. Close iMPACT without saving.
CLB Overview
For more details on the CLBs, refer to the Using Configurable Logic Blocks chapter in
UG331. The Configurable Logic Blocks (CLBs) constitute the main logic resource for
implementing synchronous as well as combinatorial circuits. Each CLB comprises four
interconnected slices, as shown in Figure 9. These slices are grouped in pairs. Each pair is
organized as a column with an independent carry chain. The nomenclature that the FPGA
Editor part of the Xilinx development software uses to designate slices is as
follows:
The letter X followed by a number identifies columns of slices. The X number counts
up in sequence from the left side of the die to the right. The letter Y followed by a
number identifies the position of each slice in a pair as well as indicating the CLB row.
The Y number counts slices starting from the bottom of the die according to the
sequence: 0, 1, 0, 1 (the first CLB row); 2, 3, 2, 3 (the second CLB row); etc. Figure 9
shows the CLB located in the lower left-hand corner of the die. Slices X0Y0 and X0Y1
make up the column-pair on the left whereas slices X1Y0 and X1Y1 make up the
column-pair on the right. For each CLB, the term left-hand (or SLICEM) indicates the
pair of slices labeled with an even X number, such as X0, and the term right-hand (or
SLICEL) designates the pair of slices with an odd X number, such as X1. The carry chain,
together with various dedicated arithmetic logic gates, supports fast and efficient
implementations of math operations. The carry chain enters the slice as CIN and exits as
COUT. Five multiplexers control the chain: CYINIT, CY0F, and CYMUXF in the lower portion
as well as CY0G and CYMUXG in the upper portion. The dedicated arithmetic logic
includes the exclusive-OR gates XORG and XORF (upper and lower portions of the slice,
respectively) as well as the AND gates GAND and FAND (upper and lower portions,
respectively).
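The slice naming rule described above can be expressed as a small Python helper. It is a sketch of the rule as stated in this overview only: the zero-based CLB row and column indices and the function name are assumptions made for illustration, not the numbering used by FPGA Editor.

```python
def slice_info(x, y):
    """Describe slice XxYy: which CLB it belongs to and which slice-pair it is in."""
    return {
        "name": f"X{x}Y{y}",
        "clb_column": x // 2,        # X0 and X1 belong to the first CLB column
        "clb_row": y // 2,           # Y0 and Y1 belong to the first CLB row
        "position_in_pair": y % 2,   # 0 or 1, counted from the bottom of the die
        "pair": "left-hand (SLICEM)" if x % 2 == 0 else "right-hand (SLICEL)",
    }


# The four slices of the CLB in the lower left-hand corner of the die.
corner_clb = [slice_info(x, y) for x in (0, 1) for y in (0, 1)]
```

For example, `slice_info(2, 3)` places slice X2Y3 in the left-hand pair of the CLB one column to the right and one row above the corner CLB.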
Each of the two LUTs (F and G) in a slice has four logic inputs (A1-A4) and a single
output (D). This permits any four-variable Boolean logic operation to be programmed
into them. Furthermore, wide function multiplexers can be used to effectively combine
LUTs within the same CLB or across different CLBs, making logic functions with still
more input variables possible. The LUTs in both the right-hand and left-hand slice-pairs
not only support the logic functions described above, but also can function as ROM that is
initialized with data at the time of configuration. The LUTs in the left-hand slice-pair
(even-numbered columns such as X0 in Figure 9) of each CLB support two additional
functions that the right-hand slice-pair (odd-numbered columns such as X1) does not. First,
it is possible to program the left-hand LUTs as distributed RAM. This type of memory
affords moderate amounts of data buffering anywhere along a data path. One left-hand
LUT stores 16 bits. Multiple left-hand LUTs can be combined in various ways to store
larger amounts of data. A dual port option combines two LUTs so that memory access is
possible from two independent data lines. A Distributed ROM option permits pre-loading
the memory with data during FPGA configuration.
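The idea that one four-input LUT is a 16-entry table, usable either as a logic function, as ROM, or as 16 bits of distributed RAM, can be modelled in Python as follows. This is an illustrative sketch: the packing of inputs A4..A1 into the table address (A4 as the most significant bit) and all names are assumptions, not a description of the device primitives.

```python
def lut4(init, a4, a3, a2, a1):
    """Evaluate a 4-input LUT whose 16 configuration bits are packed in `init`."""
    address = (a4 << 3) | (a3 << 2) | (a2 << 1) | a1
    return (init >> address) & 1


def init_from_function(func):
    """Build the 16-bit table for any Boolean function of four inputs."""
    init = 0
    for address in range(16):
        a4, a3, a2, a1 = (address >> 3) & 1, (address >> 2) & 1, (address >> 1) & 1, address & 1
        if func(a4, a3, a2, a1):
            init |= 1 << address
    return init


class Ram16x1:
    """Same 16-bit table, but writable: a model of one LUT used as distributed RAM."""

    def __init__(self, init=0):
        self.bits = init & 0xFFFF  # contents pre-loaded at configuration

    def read(self, address):
        return (self.bits >> (address & 0xF)) & 1

    def write(self, address, value):
        mask = 1 << (address & 0xF)
        self.bits = (self.bits | mask) if value else (self.bits & ~mask)


xor4 = init_from_function(lambda a4, a3, a2, a1: a4 ^ a3 ^ a2 ^ a1)
and4 = init_from_function(lambda a4, a3, a2, a1: a4 & a3 & a2 & a1)

ram = Ram16x1(init=0x0001)
ram.write(5, 1)
```

A table that is never written after configuration behaves as ROM, which is the form of storage that distributed arithmetic relies on for precomputed partial sums.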