
Visvesvaraya Technological University

Belgaum, Karnataka

A Project Report On
IMPLEMENTATION OF DISTRIBUTED ARITHMETIC BASED FAST BLOCK
LMS ADAPTIVE FILTER FOR HIGH THROUGHPUT ON FPGA
A dissertation work submitted in partial fulfillment of the requirements for the
award of the degree of Master of Technology in Digital Electronics

Submitted by

Sagara T V
USN: 3BR10LDE14

Under the guidance of


Prof. Premchand D R , M.Tech.,

Dept of ECE, BITM, Bellary.

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

BALLARI INSTITUTE OF TECHNOLOGY & MANAGEMENT


NBA Accredited Institution*
(Recognized by Govt. of Karnataka, approved by AICTE, New Delhi & Affiliated to
Visvesvaraya Technological University, Belgaum)

"JnanaGangotri" Campus, No.873/2, Bellary-Hospet Road, Allipur,


Bellary-583 104 (Karnataka) (India)
Ph: 08392 237100 / 237190, Fax: 08392 237197

2011-2012

BASAVARAJESWARI GROUP OF INSTITUTIONS

BALLARI INSTITUTE OF TECHNOLOGY & MANAGEMENT


NBA Accredited Institution*
(Recognized by Govt. of Karnataka, approved by AICTE, New Delhi & Affiliated to
Visvesvaraya Technological University, Belgaum)

"JnanaGangotri" Campus, No.873/2, Bellary-Hospet Road, Allipur,


Bellary-583 104 (Karnataka) (India)
Ph: 08392 237100 / 237190, Fax: 08392 - 237197

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

CERTIFICATE
Certified that the project work entitled IMPLEMENTATION OF DISTRIBUTED
ARITHMETIC BASED FAST BLOCK LMS ADAPTIVE FILTER FOR HIGH
THROUGHPUT ON FPGA, carried out by Sagara T V bearing USN: 3BR10LDE14,
a bonafide student of Ballari Institute of Technology and Management, is in partial
fulfillment for the award of Master of Technology in Digital Electronics of
the Visvesvaraya Technological University, Belgaum during the year 2011-2012. It is
certified that all corrections/suggestions indicated for internal assessment have been
incorporated in the report deposited in the library. The project report has been approved
as it satisfies the academic requirements in respect of the project work prescribed for the said
Degree.
Signature of the project guide
Prof. Premchand D R M.Tech.

Signature of the HOD


Dr. V.C.Patil M.Tech., Ph.D

Signature of the Principal

Dr. U. Eranna

M.E, Ph.D

External viva

Name of the Examiners


1.
2.

Signature with date

DECLARATION

I, Sagara T V, student of M.Tech in Digital Electronics, Ballari Institute of
Technology and Management, Bellary, hereby declare that the dissertation entitled
IMPLEMENTATION OF DISTRIBUTED ARITHMETIC BASED FAST BLOCK
LMS ADAPTIVE FILTER FOR HIGH THROUGHPUT ON FPGA embodies the
report of my project work carried out independently by me during the 4th semester of M.Tech in
Digital Electronics under the supervision and guidance of Prof. Premchand D R, Dept.
of Electronics and Communication Engineering, BITM, Bellary. This work has been
submitted in partial fulfillment for the award of the M.Tech degree.
I have not submitted this dissertation to any other university or institute for the award of
any other degree.

Date:
Place: Bellary

Sagara.T.V
USN:3BR10LDE14
M.Tech (DE)

BITM, Bellary.

ACKNOWLEDGEMENT
The completion of my project would not have been possible without the kind support and
help of many individuals. I would like to extend my sincere thanks to all of them.
I am highly indebted to my project guide Prof. PREMCHAND D R for providing
valuable guidance and constant supervision, as well as support in completing the
project.
I express my gratitude and sincere thanks to our head of the department, Dr. V.C. PATIL,
for his encouragement, moral support and the facilities provided towards the
successful completion of the project.
I owe my deep sense of gratitude to our principal, Dr. U. ERANNA, for providing all the
facilities and a congenial environment in the college.
My special thanks to all the staff members of Electronics and communication department
for their support and help during the project work.
I wish to thank my family for their blessings and for being a constant source of
inspiration and encouragement.

Sagara.T.V
3BR10LDE14

ABSTRACT
The proposed work deals with the design and implementation of a high-throughput
adaptive digital filter using the Fast Block Least Mean Squares (FBLMS) algorithm.
The filter structure is based on Distributed Arithmetic (DA). DA computes the inner
product by shifting and accumulating partial products that are pre-computed and stored
in a look-up table; hence the proposed adaptive digital filter is multiplierless. A DA
based implementation of the adaptive filter is therefore computationally and area efficient.
Furthermore, the fundamental building blocks of the DA architecture map well to
the architecture of today's Field Programmable Gate Arrays (FPGAs). As per the literature,
an FPGA implementation of a DA based adaptive filter occupies a significantly smaller area,
about 45% less than that of the existing FBLMS algorithm based adaptive filter.
This report contains the work carried out in the last three months, including a detailed
study of references [1]-[10], leading to a thorough understanding of adaptive filters and their
algorithms, the FFT and its various algorithms, and Distributed Arithmetic.

Table of Contents
                                                                     Page no.

CHAPTER 1  PREAMBLE
    1.1  Introduction                                                      1
    1.2  Motivation                                                        2
    1.3  Problem Statement                                                 2
    1.4  Objective of Project                                              3
    1.5  Organization of the Report                                        3

CHAPTER 2  LITERATURE SURVEY                                               4

CHAPTER 3  ADAPTIVE FILTER
    3.1  Introduction                                                      7
    3.2  Adaptive Filtering Problem                                        8
    3.3  Applications                                                      8
    3.4  Adaptive Algorithms                                               9
         3.4.1  Wiener Filter                                              9
         3.4.2  Method of Steepest Descent                                11
         3.4.3  Fast Block LMS Algorithm                                  12
         3.4.4  RLS Algorithm                                             14

CHAPTER 4  FFT ALGORITHMS
    4.1  Introduction of DFT and FFT                                      16
    4.2  Prime Factor FFT Algorithm                                       16
    4.3  Rader FFT Algorithm                                              18

CHAPTER 5  DISTRIBUTED ARITHMETIC
    5.1  Introduction                                                     19
    5.2  Technical Overview                                               19

CHAPTER 6  IMPLEMENTATION METHODOLOGY
    6.1  14 Point FFT using Prime Factor Algorithm                        23
    6.2  FFT using Distributed Arithmetic                                 25
    6.3  DA Based FBLMS                                                   26

CHAPTER 7  HARDWARE AND SOFTWARE REQUIREMENTS
    7.1  Verilog                                                          29
         7.1.1  A Brief History of Verilog                                29
         7.1.2  Levels of Abstraction                                     30
         7.1.3  Scope of Verilog                                          31
         7.1.4  Design Flow using Verilog                                 32
    7.2  Introduction to Xilinx                                           35
    7.3  The FPGA Design Flow                                             35
         7.3.1  Design Entry                                              36
         7.3.2  Synthesis                                                 36
         7.3.3  Implementation                                            37
         7.3.4  Behavioral Simulation                                     39
         7.3.5  Functional Simulation                                     40
         7.3.6  Static Timing Analysis                                    40
         7.3.7  Architectural Overview                                    40

CHAPTER 8  SIMULATION AND RESULTS
    8.1  Simulation Results                                               42
         8.1.1  Signed DA Convolution                                     42
         8.1.2  7 Point Rader FFT Algorithm                               43
         8.1.3  Simulation Results of 14 Point FFT                        44
         8.1.4  Simulation Results of IFFT                                46
         8.1.5  Fast Block LMS Adaptive Filter                            48
    8.2  Synthesis Results                                                49
         8.2.1  Synthesis Result of FFT Module                            49
         8.2.2  Synthesis Result of IFFT Module                           50
         8.2.3  Synthesis Result of FBLMS_TOP Module                      52
    8.3  Hardware Resource Utilization Summary                            53

CHAPTER 9  CONCLUSIONS AND FUTURE WORK
    9.1  Conclusion                                                       54
    9.2  Future Work                                                      54

BIBLIOGRAPHY
APPENDIX

List of Figures

Description of the Figures                                           Page no.

Figure 3.1   A strong narrowband interference N(f) in a wideband signal S(f)       7
Figure 3.2   Block diagram for the adaptive filter problem                         8
Figure 4.1   Good-Thomas shuffling                                                17
Figure 5.1   Serial distributed arithmetic architecture                           21
Figure 5.2   Parallel DA architecture                                             22
Figure 6.1   Mapping using Good-Thomas algorithm                                  24
Figure 6.2   Architecture for FFT using DA                                        26
Figure 6.3   Proposed DA based FBLMS after optimization                           28
Figure 7.1   Levels of abstraction                                                30
Figure 7.2   Design flow using Verilog                                            33
Figure 7.3   The FPGA design flow                                                 36
Figure 7.4   FPGA synthesis                                                       37
Figure 7.5   FPGA translate                                                       38
Figure 7.6   FPGA map                                                             38
Figure 7.7   FPGA place and route                                                 39
Figure 7.8   Internal structure of FPGA                                           41
Figure 8.1   Simulation of signed DA convolution                                  42
Figure 8.2   Simulation of 7 point Rader FFT                                      43
Figure 8.3   Simulation of 14 point Good-Thomas FFT (for real part)               45
Figure 8.4   Simulation of 14 point Good-Thomas FFT (for imaginary part)          46
Figure 8.5   Simulation of 14 point Good-Thomas IFFT                              47
Figure 8.6   Simulation of Fast Block LMS adaptive filter                         48
Figure 8.7   Simulation of Fast Block LMS adaptive filter                         48
Figure 8.8   Top module of 14-point FFT                                           49
Figure 8.9   RTL schematic of 14 point Good-Thomas FFT                            50
Figure 8.10  Top module of 14-point IFFT                                          51
Figure 8.11  RTL view of 14 point Good-Thomas IFFT                                51
Figure 8.12  Top module of FBLMS using DA                                         52
Figure 8.13  RTL schematic for FBLMS                                              52

ABBREVIATIONS

ASIC    APPLICATION SPECIFIC INTEGRATED CIRCUIT
CSD     CANONIC SIGNED DIGIT
DA      DISTRIBUTED ARITHMETIC
DFT     DISCRETE FOURIER TRANSFORM
DSP     DIGITAL SIGNAL PROCESSING
FFT     FAST FOURIER TRANSFORM
FBLMS   FAST BLOCK LEAST MEAN SQUARE
FPGA    FIELD PROGRAMMABLE GATE ARRAY
IFFT    INVERSE FAST FOURIER TRANSFORM
LMS     LEAST MEAN SQUARE
LUT     LOOKUP TABLE
MSE     MEAN SQUARE ERROR
RAM     RANDOM ACCESS MEMORY
ROM     READ ONLY MEMORY
RLS     RECURSIVE LEAST SQUARES
SNR     SIGNAL-TO-NOISE RATIO
HDL     HARDWARE DESCRIPTION LANGUAGE
DDR     DOUBLE DATA RATE
DCI     DIGITALLY CONTROLLED IMPEDANCE
DCM     DIGITAL CLOCK MANAGER
CLB     CONFIGURABLE LOGIC BLOCK


CHAPTER 1

PREAMBLE
1.1 INTRODUCTION
Adaptive digital filters are widely used in the area of signal processing, for example in echo
cancellation, noise cancellation, and channel equalization for communications and
networking systems. Hardware implementation imposes several performance requirements,
such as high speed, low power dissipation and good convergence characteristics.
The throughput of an FBLMS based adaptive filter is limited by the computational
complexity of its FFT (and IFFT) blocks. The throughput of the system can be enhanced by
implementing those FFT (and IFFT) blocks with reduced hardware complexity.
DA is one such efficient technique: by means of a bit-level rearrangement of the
multiply-accumulate terms, the FFT can be implemented without multipliers.
Since the main hardware complexity of the system is due to hardware multipliers,
introducing DA eliminates the need for those multipliers; the resulting system has
high throughput as well as low power dissipation.
The fast block least mean square (FBLMS) algorithm [9] is one of the fastest and most
computationally efficient adaptive algorithms, since both filtering and adaptation are
carried out in the frequency domain using FFT algorithms. Still, the throughput of
FBLMS algorithm based adaptive filters can be enhanced further by applying the concept
of Distributed Arithmetic (DA) [2]. Using a bit-level rearrangement of the
multiply-accumulate terms, DA hides the complex hardware multipliers and
therefore the desired system becomes multiplier-less. DA is a powerful technique for
reducing the size of a parallel hardware multiply-accumulate unit and is well suited to FPGA
designs.
Recently there has been a trend to implement DSP functions using FPGAs.
Application Specific Integrated Circuits (ASICs) are the traditional solution for high
performance applications, but they suffer from high development costs and a long
time-to-market. The main reason behind the popularity of the FPGA is the balance that
FPGAs provide the designer in terms of flexibility, cost, and time-to-market.
The concept of DA has already been applied to LMS based adaptive filters
[1], [3], [8], [10], but not to FBLMS algorithm based adaptive filters. This report proposes a
new hardware-efficient implementation of the FBLMS algorithm based adaptive filter using
DA. The proposed architecture reduces the area requirement of the original FBLMS by
reducing the number of hardware multipliers using DA, resulting in an adaptive filter with low
power dissipation and high throughput.

1.2 MOTIVATION
Adaptive digital filters are widely used in the area of signal processing, such as
echo cancellation, noise cancellation, and channel equalization for communications and
networking systems [2], [9]. Hardware implementation imposes various performance
requirements, such as high speed, low power dissipation and good convergence
characteristics. So the hardware used for the computation of the FFT and IFFT must be
efficient and fast. A multiplierless hardware implementation approach provides a solution to
this problem owing to its scope for lower hardware complexity and higher throughput of
computation. This can be achieved by using DA.

1.3 PROBLEM STATEMENT
A BLMS based adaptive filter takes an input sequence that is partitioned into
non-overlapping blocks of equal length by means of a serial-to-parallel converter, and the
blocks of data so produced are applied to an FIR filter, one block at a time. The
tap weights of the filter are updated after the collection of each block of data samples, so
that the adaptation of the filter proceeds block by block. This has low throughput.
The throughput of the FBLMS based adaptive filter is limited by the computational
complexity of its FFT (and IFFT) blocks, and the main hardware complexity of the system is
due to hardware multipliers.


1.4 OBJECTIVE OF PROJECT
The objective of the project is to implement an FBLMS adaptive filter without
multipliers using Distributed Arithmetic (DA), which computes the inner product
by shifting and accumulating partial products that are pre-computed and stored in a look-up
table, so that the desired adaptive digital filter is multiplierless. For a low-hardware
implementation, a serial DA structure is used, which processes the input bit vector serially;
for high throughput, a parallel or modified DA structure is used, which processes the input
bits in parallel. The FFT and IFFT blocks are implemented using DA based techniques.

1.5 ORGANIZATION OF THE REPORT


The report consists of nine chapters and a bibliography. The organization of the
report is as follows.

Chapter 1 gives the introduction to the DA based FBLMS, along with the
problem statement, objectives and methodology of the project.
Chapter 2 presents the literature survey carried out for the completion of the project.
Chapter 3 gives a brief explanation of the adaptive filtering problem and the different
types of algorithms used to solve it.
Chapter 4 gives a detailed explanation of FFT algorithms, along with their different types.
Chapter 5 gives a detailed explanation of Distributed Arithmetic.
Chapter 6 gives the details of the implementation methodology of the DA based FBLMS
adaptive filter.
Chapter 7 explains the hardware and software requirements; it contains an explanation of
Verilog, the FPGA and the Xilinx tools.
Chapter 8 presents the simulation and synthesis results of both the FFT and the IFFT,
as well as the simulation and synthesis results of each algorithm.
Chapter 9 gives the conclusion and the future scope of the built DA based FBLMS.


CHAPTER 2

LITERATURE SURVEY
Paper [1]: Discrete Fourier transforms when the number of data samples is prime.

Explanation: To compute the discrete Fourier transform of a long sequence of data
samples, one of the fast Fourier transform (FFT) algorithms is normally used. The limitation
common to all of those algorithms was that the number of data samples, N, must be
highly composite. In this paper the authors showed how FFT techniques can be applied to the
computation of a discrete Fourier transform when N is prime.
Paper [2]: A new hardware realization of digital filters.

Explanation: The authors proposed a new approach to the implementation problem of
digital filters, which offered significant savings in cost and power consumption over
existing realizations and permitted operation at higher speeds than those achievable by
existing realizations.
Paper [3]: Index mappings for multidimensional formulation of the DFT and convolution.

Explanation: The use of various index mappings that reorder the calculations has proved
very effective in developing efficient algorithms for the FFT and convolution. This paper
gave the general conditions for these linear maps to be unique and cyclic.
Paper [4]: Block implementation of adaptive digital filters.

Explanation: A block adaptive filter was derived which allowed fast implementation
while maintaining performance equivalent to that of the LMS adaptive filter. It was
pointed out that BLMS adaptive filters have an analysis advantage over LMS adaptive
filters when the inputs are correlated. Finally, BLMS adaptive filters have lower
computational complexity.

Paper [5]: A prime factor FFT algorithm using distributed arithmetic.

Explanation: The authors developed and evaluated an FFT algorithm which combines the
prime factor index map with the index permutation that converts a DFT into convolution,
and the evaluation of the convolution by distributed arithmetic. It was shown that the
conversion of a length-N DFT into two length-(N-1)/2 convolutions can be used efficiently
with distributed arithmetic. An indexing scheme using a stored table was developed
to give a very efficient implementation of both the prime factor index map and the index
permutation.
Paper [6]: Multi-memory block structure for implementing a digital adaptive
filter using distributed arithmetic.

Explanation: An adaptive algorithm, based on the LMS algorithm, for adaptive filters
with a large number of taps was derived from the distributed arithmetic technique. This
type of adaptive filter offers a great advantage in hardware simplicity. In this structure,
the total number of filter taps N is divided into M blocks, each with R taps. These M
blocks operate simultaneously and thus achieve a high-speed signal processing capability.
Paper [7]: Applications of distributed arithmetic to digital signal processing.

Explanation: This paper explains the application of DA to a biquadratic digital
filter (an example of the vector dot-product and vector-matrix-product mechanism) and
its application to transforms (the FFT and the DCT).

Paper [8]: An FPGA implementation for a high throughput adaptive filter using
distributed arithmetic.

Explanation: A novel multiplier-less implementation of an LMS-type adaptive filter
based on distributed arithmetic (DA) was presented in this paper. For the purpose of
illustration, a 32-tap DA-based adaptive LMS filter was successfully implemented on an
FPGA. It was demonstrated that the system described in the paper may be used for a
high-throughput implementation of adaptive filters at the cost of a marginal increase in the
memory requirements.
Paper [9] LMS adaptive filters using distributed arithmetic for high throughput.

Explanation : In this paper, a new hardware adaptive filter structure for very high
throughput LMS adaptive filters was proposed and its implementation details were
presented. The DA concept involves implementation of MAC operations using a LUT.
The problems typically encountered in updating the LUT for an adaptive filter were
overcome by using an auxiliary LUT with special addressing. Several optimizations for
efficient implementation of large DA adaptive filters were presented. The proposed DA
adaptive filter system was implemented on an FPGA. The design trade-offs presented in
this paper indicate that the DA adaptive filter can yield a significantly higher throughput
than the traditional implementation employing up to four hardware multiply and
accumulate units. The cost of such a design is a marginal increase in memory
requirements. It is also demonstrated that the design can be easily reconfigured to match a
wide range of performance requirements and cost constraints.


CHAPTER 3

ADAPTIVE FILTER
3.1 Introduction
Adaptive filters learn the statistics of their operating environment and continually
adjust their parameters accordingly. In practice, signals of interest often become
contaminated by noise or other signals occupying the same band of frequencies. When the
signal of interest and the noise reside in separate frequency bands, conventional linear
filters are able to extract the desired signal. However, when there is spectral overlap
between the signal and the noise, or when the signal or the interfering signal's statistics change
with time, fixed-coefficient filters are inappropriate. Figure 3.1 shows an example of a
wideband signal whose Fourier spectrum overlaps a narrowband interference signal.

Figure 3.1. A strong narrowband interference N(f) in a wideband signal S(f)

This situation can occur frequently when various modulation technologies operate
in the same range of frequencies. In fact, in mobile radio systems
co-channel interference is often the limiting factor, rather than thermal or other noise
sources. It may also be the result of intentional signal jamming, a scenario that regularly
arises in military operations when competing sides intentionally broadcast signals to
disrupt their enemies' communications.
Furthermore, if the statistics of the noise are not known a priori, or change over
time, the coefficients of the filter cannot be specified in advance. In these situations,
adaptive algorithms are needed in order to continuously update the filter coefficients.


3.2 Adaptive Filtering Problem


The goal of any filter is to extract useful information from noisy data. Whereas a
normal fixed filter is designed in advance with knowledge of the statistics of both the
signal and the unwanted noise, the adaptive filter continuously adjusts to a changing
environment through the use of recursive algorithms. This is useful when either the
statistics of the signals are not known beforehand or change with time.

Figure 3.2 Block diagram for the adaptive filter problem


The discrete adaptive filter (Figure 3.2) accepts an input u(n) and produces an
output y(n) by a convolution with the filter's weights w(k). A desired reference signal,
d(n), is compared to the output to obtain an estimation error e(n). This error signal is used
to incrementally adjust the filter's weights for the next time instant. Several algorithms
exist for the weight adjustment, such as the Least Mean Square (LMS) and the
Recursive Least Squares (RLS) algorithms. The choice of training algorithm depends
upon the needed convergence time and the computational complexity available, as well as the
statistics of the operating environment.

3.3 Applications
Because of their ability to perform well in unknown environments and to track
statistical time variations, adaptive filters have been employed in a wide range of fields.
However, there are essentially four basic classes of applications for adaptive filters:
identification, inverse modeling, prediction, and interference cancellation, the
main difference between them being the manner in which the desired response is
extracted.
Applications of adaptive filters are:

Channel Identification

Channel Equalization

Linear predictive coding

Adaptive Line Enhancement

3.4 Adaptive Algorithms


There are numerous methods for performing the weight update of an adaptive
filter. There is the Wiener filter, which is the optimum linear filter in terms of mean
squared error, and several algorithms that attempt to approximate it, such as the method
of steepest descent. There is also the least-mean-square algorithm, developed by Widrow and
Hoff originally for use in artificial neural networks. Finally, there are other techniques
such as the recursive-least-squares algorithm. The choice of algorithm is highly dependent
on the signals of interest and the operating environment, as well as the convergence time
required and the computation power available.
3.4.1 Wiener Filter
The Wiener filter, named after its inventor, was developed in 1949. It is the
optimum linear filter in the sense that its output is as close to the desired signal as
possible. Although not often implemented in practice due to its computational complexity,
the Wiener filter is studied as a frame of reference for the linear filtering of stochastic
signals to which other algorithms can be compared.
To formulate the Wiener filter and other adaptive algorithms, the mean squared
error (MSE) is used. If the tap-input vector of a filter with M taps is given as

u(n) = [u(n), u(n-1), ... , u(n-M+1)]^T                                    (3.1)

and the coefficient or weight vector is given as w = [w0, w1, ... , wM-1]^T, then the square
of the output error can be formulated as

e^2(n) = d^2(n) - 2 d(n) u^T(n) w + w^T u(n) u^T(n) w                      (3.2)

The mean square error (MSE), J, is obtained by taking the expectation of both sides:

J = E[e^2(n)] = σd^2 - 2 p^T w + w^T R w                                   (3.3)

Above, σd^2 is the variance of the desired output, p is the cross-correlation vector between
the tap inputs and the desired response, and R is the autocorrelation matrix of u(n). A plot of
the MSE against the weights is a non-negative bowl-shaped surface, with the minimum point
corresponding to the optimal weights. This is referred to as the error performance surface,
whose gradient is given by

∇J = -2p + 2Rw                                                             (3.4)

To determine the optimal Wiener filter for a given signal requires solving the
Wiener-Hopf equations. First, let R denote the M-by-M correlation matrix of u(n). That is,

R = E[u(n) u^H(n)]                                                         (3.5)

where the superscript H denotes the Hermitian transpose. Also, let p represent the
cross-correlation vector between the tap inputs and the desired response d(n):

p = E[u(n) d*(n)]                                                          (3.6)

Since the lags in the definition of p are either zero or negative, the Wiener-Hopf equation
may be written in compact matrix form as

R wo = p

where wo stands for the M-by-1 optimum tap-weight vector of the transversal filter. That
is, the optimum filter coefficients are

wo = R^(-1) p

This produces the optimum output in terms of the mean-square error; however, if
the signal statistics change with time, the Wiener-Hopf equation must be
recalculated. This would require calculating two matrices, inverting one of them and then
multiplying them together. Such a computation cannot feasibly be carried out in real time, so
other algorithms that approximate the Wiener filter must be used.
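
As a hedged illustration (not the thesis's implementation), the NumPy sketch below estimates
R and p from sample averages, as in (3.5) and (3.6), and solves the Wiener-Hopf equation for
the optimum weights; the 4-tap system and the signals are invented for the example.

```python
import numpy as np

# Minimal sketch: estimate R and p from data and solve R w_o = p.
rng = np.random.default_rng(0)
M = 4                                    # number of filter taps (illustrative)
w_true = np.array([0.6, -0.3, 0.2, 0.1]) # unknown system to be identified
u = rng.standard_normal(1000)            # input signal u(n)
d = np.convolve(u, w_true, mode="full")[:len(u)] + 0.01 * rng.standard_normal(len(u))

# rows of U are the delayed inputs u(n), u(n-1), ..., u(n-M+1)
U = np.stack([np.concatenate([np.zeros(k), u[:len(u) - k]]) for k in range(M)])
R = U @ U.T / len(u)                     # sample autocorrelation matrix, cf. (3.5)
p = U @ d / len(u)                       # sample cross-correlation vector, cf. (3.6)
w_o = np.linalg.solve(R, p)              # Wiener-Hopf solution R w_o = p
print(np.round(w_o, 3))                  # close to w_true
```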

3.4.2 Method of Steepest Descent


With the error-performance surface defined previously, one can use the method of
steepest descent to converge to the optimal filter weights for a given problem. Since the
gradient of a surface (or hypersurface) points in the direction of maximum increase, the
direction opposite the gradient (-∇J) points towards the minimum point of the surface.
One can adaptively reach the minimum by updating the weights at each time step using
the equation

w(n+1) = w(n) + μ [p - R w(n)]                                             (3.7)

where the constant μ is the step-size parameter. The step-size parameter determines how
fast the algorithm converges to the optimal weights. A necessary and sufficient condition
for the convergence or stability of the steepest-descent algorithm is for μ to satisfy

0 < μ < 2 / λmax                                                           (3.8)

where λmax is the largest eigenvalue of the correlation matrix R.

Although it is still less complex than solving the Wiener-Hopf equation directly, the
method of steepest descent is rarely used in practice because of the high computational
load: calculating the gradient at each time step involves computing R and p,
whereas the least-mean-square algorithm performs similarly using far fewer calculations.
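
A minimal sketch of the steepest-descent recursion (3.7), using a small invented R and p rather
than anything from the thesis, is given below; it converges to the same weights as the direct
Wiener-Hopf solution.

```python
import numpy as np

R = np.array([[1.0, 0.4], [0.4, 1.0]])     # illustrative autocorrelation matrix
p = np.array([0.7, 0.3])                   # illustrative cross-correlation vector
w_opt = np.linalg.solve(R, p)              # Wiener solution for reference

mu = 1.0 / np.max(np.linalg.eigvalsh(R))   # step size within 0 < mu < 2/lambda_max
w = np.zeros(2)
for _ in range(100):
    w = w + mu * (p - R @ w)               # steepest-descent update, eq. (3.7)
print(np.round(w, 4), np.round(w_opt, 4))  # the two agree closely
```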


3.4.3 Fast block LMS Based Adaptive filter


The least-mean-square (LMS) algorithm is similar to the method of
steepest descent in that it adapts the weights by iteratively approaching the MSE
minimum. Widrow and Hoff invented this technique in 1960 for use in training neural
networks. The key is that instead of calculating the gradient at every time step, the LMS
algorithm uses a rough approximation to the gradient. The error at the output of the filter
can be expressed as

e(n) = d(n) - w^T(n) u(n)                                                  (3.9)

which is simply the desired output minus the actual filter output. Using this definition for
the error, an approximation of the gradient is found by

∇J ≈ -2 e(n) u(n)                                                          (3.10)

Substituting this expression for the gradient into the weight-update equation from the
method of steepest descent gives

w(n+1) = w(n) + μ e(n) u(n)                                                (3.11)

which is the Widrow-Hoff LMS algorithm. As with the steepest-descent algorithm, it can
be shown to converge for values of μ less than the reciprocal of λmax; but λmax may be
time-varying, and to avoid computing it another criterion is often used:

0 < μ < 2 / (M · Smax)                                                     (3.12)

where M is the number of filter taps and Smax is the maximum value of the power spectral
density of the tap inputs u(n). The relatively good performance of the LMS algorithm, given
its simplicity, has made it the most widely implemented algorithm in practice. For an N-tap
filter, the number of operations has been reduced to 2N multiplications and N additions
per coefficient update. This is suitable for real-time applications, and is the reason for the
popularity of the LMS algorithm.
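
The following NumPy sketch illustrates the Widrow-Hoff update (3.9)-(3.11) on an invented
system-identification example; the 4-tap system, step size and signal lengths are assumptions,
not values taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.4, 0.25, 0.1])  # unknown FIR system (illustrative)
M, mu = len(w_true), 0.05                  # filter length and step size
x = rng.standard_normal(2000)
d = np.convolve(x, w_true, mode="full")[:len(x)]

w = np.zeros(M)
for n in range(M, len(x)):
    u = x[n - M + 1:n + 1][::-1]           # tap-input vector [u(n), ..., u(n-M+1)]
    e = d[n] - w @ u                       # output error, eq. (3.9)
    w = w + mu * e * u                     # LMS weight update, eq. (3.11)
print(np.round(w, 3))                      # approaches w_true
```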
Consider a BLMS based adaptive filter [4] that takes an input sequence u(n), which is
partitioned into non-overlapping blocks of length L each by means of a serial-to-parallel
converter; the blocks of data so produced are applied to an FIR filter of length M, one block
at a time. The tap weights of the filter are updated after the collection of each block of data
samples, so that the adaptation of the filter proceeds on a block-by-block basis rather than on
a sample-by-sample basis as in the conventional LMS algorithm. With the j-th block
(j ∈ Z) consisting of the samples at time indices n = jL, jL+1, ... , jL+L-1, the filter
coefficients are updated from block to block as

w(j+1) = w(j) + μ Σ (l = 0 to L-1) u(jL+l) e(jL+l)                          (3.13)

where w(j) is the tap-weight vector corresponding to the j-th block,
u(n) = [u(n), u(n-1), ... , u(n-M+1)]^T is the tap-input vector at time n,
and e(n) = d(n) - y(n) is the output error at time n. The sequence d(n) is the so-called
desired response available during the initial training period, and y(n) = w^T(j) u(n) is the
filter output at time n.

The parameter μ, popularly called the step-size parameter, is to be chosen sufficiently
small for convergence of the algorithm. For the l-th sample within the j-th block,
i.e. for l = 0, 1, ... , L-1, the filter output y(jL+l) is obtained by convolving the input data
sequence with the filter coefficient vector w(j), and thus can be realized efficiently by the
overlap-save method via an N-point FFT with N = M + L - 1, where the first M-1 points of
each input segment come from the previous sub-block and the corresponding outputs are
discarded. Similarly, the weight-update term in (3.13) can be obtained by the usual circular
correlation technique, by employing the same N-point FFT and setting the last L-1 output
terms to zero.


3.4.4 Recursive Least Squares Algorithm


The recursive least squares (RLS) algorithm is based on the well-known least-squares
method. The least-squares method is a mathematical procedure for finding the best-fitting
curve to a given set of data points; this is done by minimizing the sum of the squares of the
offsets of the points from the curve. The RLS algorithm recursively solves the least-squares
problem.

In the following equations, the constants λ and δ are parameters set by the user that
represent the forgetting factor and the regularization parameter respectively. The forgetting
factor is a positive constant less than unity, which is roughly a measure of the memory of
the algorithm; the regularization parameter's value is determined by the
signal-to-noise ratio (SNR) of the signals.

The vector w(n) represents the adaptive filter's weight vector and the M-by-M matrix P(n)
is referred to as the inverse correlation matrix. An intermediate vector is used to compute
the gain vector k(n). This gain vector is multiplied by the a priori estimation error ξ(n) and
added to the weight vector to update the weights. Once the weights have been updated, the
inverse correlation matrix is recalculated, and the training resumes with the new input values.

A summary of the RLS algorithm follows. Initialize the weight vector and the inverse
correlation matrix as

w(0) = 0,    P(0) = δ^(-1) I

where δ is a small positive constant. For each instant of time n = 1, 2, 3, ... , compute:

k(n) = P(n-1) u(n) / [λ + u^H(n) P(n-1) u(n)]
ξ(n) = d(n) - w^H(n-1) u(n)
w(n) = w(n-1) + k(n) ξ*(n)
P(n) = λ^(-1) [P(n-1) - k(n) u^H(n) P(n-1)]                                 (3.14)

An adaptive filter trained with the RLS algorithm can converge up to an order of
magnitude faster than the LMS filter, at the expense of increased computational
complexity.
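
A hedged NumPy sketch of the RLS recursion summarized above follows; the forgetting factor,
regularization constant and test signals are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = np.array([0.5, -0.4, 0.25, 0.1])  # unknown system (illustrative)
M, lam, delta = len(w_true), 0.99, 0.01    # taps, forgetting factor, regularization
x = rng.standard_normal(500)
d = np.convolve(x, w_true, mode="full")[:len(x)]

w = np.zeros(M)
P = np.eye(M) / delta                      # inverse correlation matrix P(0)
for n in range(M, len(x)):
    u = x[n - M + 1:n + 1][::-1]           # tap-input vector
    k = P @ u / (lam + u @ P @ u)          # gain vector k(n)
    xi = d[n] - w @ u                      # a priori estimation error
    w = w + k * xi                         # weight update
    P = (P - np.outer(k, u @ P)) / lam     # update of the inverse correlation matrix
print(np.round(w, 3))                      # approaches w_true
```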


CHAPTER 4

FFT ALGORITHMS
4.1 Introduction
A Fast Fourier Transform (FFT) is an efficient algorithm for calculating the
Discrete Fourier Transform (DFT) and its inverse. FFTs are widely used in application
areas such as fast convolution, fast correlation, the solution of partial differential equations,
and the multiplication of large integers and complex numbers. FFT algorithms are based on a
divide-and-conquer approach, in which an N-point DFT is successively decomposed
into smaller DFTs; because of this decomposition the number of computations is reduced.
Well-known FFT algorithms include:
Prime-factor FFT algorithm
Cooley-Tukey FFT algorithm
Rader's FFT algorithm
Winograd's FFT algorithm

4.2 Prime factor FFT Algorithm


The prime factor FFT algorithm, also known as the Good-Thomas algorithm, is a
true multidimensional FFT algorithm, i.e. there are no twiddle factors induced by the index
mapping as in the Cooley-Tukey FFT algorithm; the price to pay for a twiddle-factor-free
flow is that the factors must be co-prime. The Good-Thomas algorithm re-expresses the
discrete Fourier transform (DFT) of size N = N1 N2 as a two-dimensional N1-by-N2 DFT,
but only for the case where N1 and N2 are relatively prime. The algorithm presented here first
decomposes the one-dimensional DFT into a multidimensional DFT using the index map
proposed by Good [4].
The index mapping suggested by Good and Thomas for the input index n is given in (4.1),
and the corresponding index mapping for the output index k in (4.2). Substituting the
Good-Thomas index map into the DFT equation converts the one-dimensional DFT into a
true two-dimensional DFT without twiddle factors.

The prime-factor FFT can then be calculated in the following four steps:
1. Organize the input samples x(n) into a two-dimensional array according to the index
mapping for n as in (4.1).
2. Calculate the N1-point DFTs of the columns of the array.
3. Calculate the N2-point DFTs of the rows of the array.
4. Unscramble the outputs X(k) from the array using the index mapping for k as in (4.2).
Since the prime-factor FFT does not require multiplications by twiddle factors, it is
generally considered to be among the most efficient methods for calculating the DFT of a
sequence.
Consider the length N = 14 and take N1 = 7 and N2 = 2; the mappings for the input index n
and the output index k then follow, as illustrated by the shuffling in Figure 4.1.
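
As a hedged illustration, the short sketch below tabulates one standard form of the
Good-Thomas input and output index maps for N = 14 with N1 = 7 and N2 = 2; whether these
constants match the thesis's (4.1) and (4.2) exactly is an assumption (the modular inverse via
pow requires Python 3.8+).

```python
# One common form of the Good-Thomas index maps for N = N1*N2, gcd(N1, N2) = 1.
N, N1, N2 = 14, 7, 2
t2 = pow(N2, -1, N1)                           # N2^(-1) mod N1
t1 = pow(N1, -1, N2)                           # N1^(-1) mod N2
for n1 in range(N1):
    for n2 in range(N2):
        n = (N2 * n1 + N1 * n2) % N            # input index map (Ruritanian map)
        k = (N2 * t2 * n1 + N1 * t1 * n2) % N  # output index map (CRT map)
        print(f"n1={n1} n2={n2} -> n={n:2d} k={k:2d}")
```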

Figure 4.1 Good Thomas shuffling


4.3 Rader FFT Algorithm


Large prime lengths can be handled efficiently by the Rader algorithm, which computes the
DFT by using results from number theory. Rader showed that it is possible
to convert a length-N DFT (where N is prime) into a cyclic convolution of length N-1. There
is a simple identity between convolution and the FFT of the same length,
so if N-1 is easily factorisable this allows the convolution to be computed efficiently
via the FFT.

The DFT of the sequence x(n) is given by

X(k) = Σ (n = 0 to N-1) x(n) W^(nk),   where W = e^(-j2π/N)

The Rader algorithm for computing the DFT is defined only for prime lengths N. Since N is
prime, there is a primitive element, a generator g, that generates all the non-zero indices n
and k in the field of integers modulo N, i.e. n, k ∈ {1, 2, ... , N-1}. By substituting n with
g^q mod N and k with g^(-m) mod N in the DFT equation, the non-zero-frequency outputs are
obtained as a length-(N-1) cyclic convolution of the permuted input with the permuted twiddle
factors.

It is easily seen from the definition of the DFT that the transform of a real length-N
sequence has conjugate symmetry, i.e. X(N-k) = X*(k). This property makes it possible
to compute only half of the transform, as the remaining half is redundant and need not be
calculated. The Rader algorithm provides a straightforward way to compute only half of the
conjugate-symmetric outputs without calculating the others, which is not possible with
other algorithms such as radix-2 Cooley-Tukey and Winograd.
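
A hedged NumPy sketch of Rader's idea for N = 7 is shown below: the non-zero-frequency
outputs are obtained from one length-6 cyclic convolution (computed here, for brevity, with
NumPy FFTs). The generator g = 3 and the overall structure are standard choices, not
necessarily the thesis's exact formulation.

```python
import numpy as np

def rader_dft_7(x):
    """7-point DFT via a length-6 cyclic convolution (Rader's algorithm)."""
    p, g = 7, 3                                       # prime length and a primitive root
    x = np.asarray(x, dtype=complex)
    W = np.exp(-2j * np.pi / p)
    gq = [pow(g, q, p) for q in range(p - 1)]         # permuted input indices n = g^q
    gim = [pow(g, -m, p) for m in range(p - 1)]       # permuted output indices k = g^-m
    a = np.array([x[n] for n in gq])                  # permuted input samples
    b = np.array([W ** k for k in gim])               # permuted twiddle factors
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)) # length-6 cyclic convolution
    X = np.empty(p, dtype=complex)
    X[0] = x.sum()                                    # zero-frequency term, done separately
    for m, k in enumerate(gim):
        X[k] = x[0] + conv[m]
    return X

v = np.arange(7, dtype=float)
print(np.allclose(rader_dft_7(v), np.fft.fft(v)))     # True
```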


CHAPTER 5

DISTRIBUTED ARITHMETIC
5.1 INTRODUCTION
Distributed arithmetic (DA) is an important technique for FPGA-based designs that can be
used to compute sums of products [7]. This technique, first proposed by Croisier et al., is a
multiplierless architecture based on an efficient partitioning of the function into partial
terms using the two's complement binary representation of the data. The partial terms can be
pre-computed and stored in LUTs. The flexibility of this algorithm on FPGAs permits
everything from bit-serial implementations to pipelined or full-parallel versions of the
scheme, which can greatly improve the design performance. DA has been widely used in
DSP applications such as convolution, the DFT, the DCT and digital filters [7].

5.2 TECHNICAL OVERVIEW


Consider the following sum of products:

y = Σ (n = 0 to N-1) c[n] · x[n]                                           (5.1)

where the c[n] are fixed coefficients and the x[n] are the input data words. The
sum-of-products term can be expanded as

y = c[0]x[0] + c[1]x[1] + ... + c[N-1]x[N-1]                               (5.2)

The variable x[n] can be represented in unsigned binary form by

x[n] = Σ (b = 0 to B-1) xb[n] · 2^b,   with xb[n] ∈ {0, 1}                 (5.3)

where xb[n] denotes the b-th bit of x[n]. The product can then be represented as

y = Σ (n = 0 to N-1) c[n] · Σ (b = 0 to B-1) xb[n] · 2^b                   (5.4)

Expanding the summations yields

y = c[0] · (x0[0]·2^0 + x1[0]·2^1 + ... + xB-1[0]·2^(B-1))
  + c[1] · (x0[1]·2^0 + x1[1]·2^1 + ... + xB-1[1]·2^(B-1))
  ...
  + c[N-1] · (x0[N-1]·2^0 + x1[N-1]·2^1 + ... + xB-1[N-1]·2^(B-1))

Redistributing the terms of y gives

y = (c[0]·x0[0] + c[1]·x0[1] + ... + c[N-1]·x0[N-1]) · 2^0
  + (c[0]·x1[0] + c[1]·x1[1] + ... + c[N-1]·x1[N-1]) · 2^1
  ...
  + (c[0]·xB-1[0] + c[1]·xB-1[1] + ... + c[N-1]·xB-1[N-1]) · 2^(B-1)

or, in more compact form,

y = Σ (b = 0 to B-1) 2^b · Σ (n = 0 to N-1) xb[n] · c[n]                   (5.5)

The inner summation can be mapped to a look-up table (LUT). The coefficients c[n] are
known and the xb[n] values are either 1 or 0, so each inner sum is just a combination of the
c[n] values, for which a table can be constructed. Since there are N inputs, the number of
partial products in the LUT is 2^N.

DA can be extended to signed-number multiplication as well. The numbers must be
represented in signed two's complement form, and a minor modification needs to be
introduced when working with signed two's complement numbers: in two's complement the
MSB is used to determine the sign of the number, which gives the following B-bit
representation,

x[n] = -2^(B-1) · xB-1[n] + Σ (b = 0 to B-2) xb[n] · 2^b                   (5.6)

Then the output is defined by

y = -2^(B-1) · Σ (n = 0 to N-1) c[n] · xB-1[n] + Σ (b = 0 to B-2) 2^b · Σ (n = 0 to N-1) xb[n] · c[n]

The block diagram for DA is shown in Figure 5.1. The number of partial products in
the LUT depends on the number of inputs: with N inputs, the LUT has to store 2^N partial
products. The LSBs of all inputs are taken as the address to fetch a partial product from the
LUT. Each cycle the inputs are shifted right to form a new address, and the partial product
addressed by those LSB bits is read from the LUT. These partial products are shifted and
accumulated to obtain the sum of products. This method of implementing DA is called serial
DA. The number of clock cycles required to compute one sum of products depends on the
number of bits in the input.

Figure 5.1 Serial distributed arithmetic architecture


An FIR filter can also be implemented as shown in Figure 5.1. The LUT can be
implemented in a ROM; it contains pre-computed partial products that are simply
combinations of the filter coefficients. Shift registers hold the inputs of the FIR filter and
generate the address of the partial product stored in the ROM. These partial products are
shifted and accumulated to obtain the sum of products.
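
The following plain-Python sketch mirrors the signed bit-serial DA of (5.5) and (5.6): a
2^N-entry LUT of coefficient sums is addressed by one bit of every input per clock cycle, and
the MSB cycle is subtracted to handle two's complement. The coefficients, word length and
inputs are illustrative assumptions, not values from the thesis.

```python
def da_inner_product(c, x, B):
    """Bit-serial signed DA for y = sum_n c[n]*x[n]; x holds B-bit signed integers."""
    N = len(c)
    # LUT of all 2^N partial sums of the coefficients, indexed by the address bits
    lut = [sum(c[n] for n in range(N) if addr & (1 << n)) for addr in range(2 ** N)]
    bits = [xi & ((1 << B) - 1) for xi in x]              # two's complement bit patterns
    acc = 0
    for b in range(B):                                    # one clock cycle per bit
        addr = sum(((xi >> b) & 1) << n for n, xi in enumerate(bits))
        term = lut[addr]
        acc += (-term if b == B - 1 else term) << b       # subtract on the sign-bit cycle
    return acc

c = [3, -2, 5, 1]                                         # illustrative coefficients
x = [7, -3, 2, -8]                                        # illustrative 8-bit inputs
print(da_inner_product(c, x, B=8), sum(ci * xi for ci, xi in zip(c, x)))  # 29 29
```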


Figure 5.2 Parallel DA architecture

DA can also be implemented with a parallel architecture. The parallel architecture requires
more LUTs, the number of which depends on the number of bits in the input. The advantage
of parallel DA is its speed of operation; it is used for high-speed computation of sums of
products, but its implementation requires more hardware. Figure 5.2 shows the parallel DA
architecture: since the input has B bits, there are B LUTs followed by a shifter and an adder
tree.


CHAPTER 6
IMPLEMENTATION METHODOLOGY

6.1 14 point FFT using Prime factor algorithm


The algorithm presented here first decomposes the one-dimensional DFT into a
multidimensional DFT using the index map proposed by Good. Next, a method based on the
index permutation proposed by Rader is used to convert the short DFTs into convolutions.
This method changes a prime-length-N DFT of real data into two convolutions of length
(N-1)/2; one convolution is cyclic and the other is cyclic or skew-cyclic. The index mapping
suggested by Good and Thomas for n is given in (6.1), and the index mapping for k in (6.2).
Substituting the Good-Thomas index map into the DFT equation yields a true
two-dimensional DFT.

Steps for the Good-Thomas FFT Algorithm

An N-point DFT can be computed according to the following steps:
1) Index transform of the input sequence according to (6.1).
2) Computation of the DFTs of length N1 using the Rader algorithm.
3) Computation of the DFTs of length N2.
4) Index transform of the output sequence according to (6.2).


Consider the length N = 14 and take N1 = 7 and N2 = 2; the mappings for the input index n
and the output index k then follow, and using these index transforms we can construct the
signal flow graph shown in Figure 6.1.
From Figure 6.1 we see that the first stage has 2 DFTs, each of 7 points, and the
second stage has 7 DFTs, each of length 2. An interesting point here is that
multiplication by twiddle factors between the stages is not required.
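
As a hedged end-to-end check of the decomposition above, the NumPy sketch below computes
a 14-point DFT as a 7-by-2 two-dimensional DFT with no twiddle factors between the stages,
using one standard pair of Good-Thomas index maps; the inner short DFTs are done here with
NumPy rather than with the Rader/DA hardware structure of the thesis.

```python
import numpy as np

def pfa_fft_14(x):
    """14-point DFT via the Good-Thomas prime factor decomposition (N1=7, N2=2)."""
    N, N1, N2 = 14, 7, 2
    x = np.asarray(x, dtype=complex)
    # input index map: place x[(N2*n1 + N1*n2) mod N] at array position (n1, n2)
    X2d = np.empty((N1, N2), dtype=complex)
    for n1 in range(N1):
        for n2 in range(N2):
            X2d[n1, n2] = x[(N2 * n1 + N1 * n2) % N]
    X2d = np.fft.fft(X2d, axis=0)              # first stage: 2 DFTs of length 7
    X2d = np.fft.fft(X2d, axis=1)              # second stage: 7 DFTs of length 2
    # output index map (CRT): k agrees with k1 mod N1 and with k2 mod N2
    X = np.empty(N, dtype=complex)
    for k1 in range(N1):
        for k2 in range(N2):
            k = (k1 * N2 * pow(N2, -1, N1) + k2 * N1 * pow(N1, -1, N2)) % N
            X[k] = X2d[k1, k2]
    return X

v = np.arange(14, dtype=float)
print(np.allclose(pfa_fft_14(v), np.fft.fft(v)))   # True
```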

Figure 6.1 Mapping using Good-Thomas algorithm


Now consider N1 = 7. If the data are real we need to calculate only half of the transform;
also, as Rader showed, the zero-frequency term must be calculated separately. Writing the
7-point DFT in matrix form and applying the Rader permutation gives equation (6.4).
If the real and imaginary parts of the W matrix in (6.4) are separated, a simplification is
possible. Considering first the real part, with ck standing for cos(2πk/7), the real part of
(6.4) becomes (6.5). Using the notation sk for sin(2πk/7) gives (6.6) for the imaginary part
of (6.4).
Equations (6.5) and (6.6) are cyclic convolution relations. Since in our problem we always
convolve with the same coefficients (in the case of the DFT, the twiddle-factor matrix),
arithmetic efficiency can be improved by pre-calculating some of the intermediate results.
These are stored in a table in memory and simply addressed as needed; using distributed
arithmetic this can be implemented efficiently. Here we present only the structure,
Figure 6.2, best suited to a DFT calculated by cyclic convolution.

6.2 FFT using Distributed Arithmetic

Initially R1 to R7 are cleared to zero and the Xi inputs are loaded into registers R1 to
R3 after addition.
Then R1 to R3 are shifted by one bit, so that the last bit of each register is held in R4.
The ROM output is added to R5.
Circular shifts of R4 produce ROM outputs that are added to R5 to R7. When the
first cycle is completed, the contents of R1 to R7, except R4, are right-shifted by one
bit and the second cycle starts.
At the B-th cycle the ADD_SUB function changes from addition to subtraction, and after
this B-th cycle the contents of R5 to R7 give the final FFT coefficients; the zero-frequency
component is calculated separately by accumulating and adding the Xi inputs, as shown in
Figure 6.2.

Figure 6.2. Architecture for FFT using DA

6.3 DA Based FBLMS Algorithm


The fast block LMS algorithm performs the following steps to calculate the output
and error signals.
1. Concatenates the current input signal block to the previous blocks.
2. Performs an FFT to transform the concatenated input signal blocks from the time domain
to the frequency domain.
3. Multiplies the transformed input signal blocks by the filter coefficients vector W(n).
4. Performs an inverse FFT (IFFT) on the multiplication result.
5. Retrieves the last block of the result as the output signal vector y(n).
6. Calculates the error signal vector e(n) by comparing the desired signal vector d(n)
with y(n).

After calculating the output and error signals, the fast block LMS algorithm updates the
filter coefficients. The following steps are completed to update the filter coefficients.
7. Inserts zeroes before the error signal vector e(n). This step ensures the error
signal vector has the same length as the concatenated input signal blocks.
8. Performs an FFT on the zero-padded error signal blocks.
9. Multiplies the result by the complex conjugate of the FFT of the input signal blocks.
10. Performs an IFFT on the multiplication result.
11. Sets the values of the last block of the IFFT result to zero and then performs an
FFT on the constrained result.
12. Multiplies the step size by the FFT result.
13. Adds the filter coefficients vector W(n) to the multiplication result. This step
updates the filter coefficients.
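
A hedged NumPy sketch of the thirteen steps above follows, assuming the block length equals
the filter length M (so the concatenated segment is 2M samples long); the step size, filter
length and test signals are illustrative assumptions, and NumPy FFTs stand in for the DA based
FFT/IFFT hardware of the proposed design.

```python
import numpy as np

def fblms(x, d, M, mu):
    """Fast block LMS (overlap-save, frequency-domain weights), block length = M."""
    x, d = np.asarray(x, float), np.asarray(d, float)
    W = np.zeros(2 * M, dtype=complex)                      # frequency-domain weights
    y, e, x_old = np.zeros_like(d), np.zeros_like(d), np.zeros(M)
    for j in range(len(x) // M):
        x_new = x[j * M:(j + 1) * M]
        X = np.fft.fft(np.concatenate([x_old, x_new]))      # steps 1-2: concatenate, FFT
        y_blk = np.fft.ifft(X * W).real[M:]                 # steps 3-5: filter, IFFT, last block
        e_blk = d[j * M:(j + 1) * M] - y_blk                # step 6: error signal
        E = np.fft.fft(np.concatenate([np.zeros(M), e_blk]))  # steps 7-8: zero-pad, FFT
        grad = np.fft.ifft(np.conj(X) * E).real             # steps 9-10: correlate, IFFT
        grad[M:] = 0.0                                      # step 11: gradient constraint
        W = W + mu * np.fft.fft(grad)                       # steps 12-13: FFT and update
        y[j * M:(j + 1) * M], e[j * M:(j + 1) * M] = y_blk, e_blk
        x_old = x_new
    return y, e, W

rng = np.random.default_rng(3)
h = np.array([0.5, -0.4, 0.25, 0.1])                        # unknown 4-tap system
x = rng.standard_normal(4096)
d = np.convolve(x, h, mode="full")[:len(x)]
_, _, W = fblms(x, d, M=4, mu=0.05)
print(np.round(np.fft.ifft(W).real[:4], 3))                 # approaches h
```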


Signal-flow graph representation of the DA based Fast Block LMS algorithm is shown in
Figure.6.3

Figure. 6.3. Proposed DA based FBLMS after optimization


CHAPTER 7

HARDWARE AND SOFTWARE REQUIREMENTS


This chapter gives detailed information about the architecture of the Xilinx
Spartan-3 FPGA and about how the modules are implemented in Verilog.

7.1 VERILOG
Verilog is a Hardware Description Language; a textual format for describing
electronic circuits and systems. Applied to electronic design, Verilog is intended to be
used for verification through simulation, for timing analysis, for test analysis (testability
analysis and fault grading) and for logic synthesis.
The Verilog HDL is an IEEE standard - number 1364. The first version of the
IEEE standard for Verilog was published in 1995. A revised version was published in
2001; this is the version used by most Verilog users. The IEEE Verilog standard
document is known as the Language Reference Manual, or LRM. This is the complete
authoritative definition of the Verilog HDL.
A further revision of the Verilog standard was published in 2005, though it has
little extra compared to the 2001 standard. SystemVerilog is a huge set of extensions to
Verilog, and was first published as an IEEE standard in 2005. See the appropriate
Knowhow section for more details about SystemVerilog.
IEEE Std 1364 also defines the Programming Language Interface, or PLI. This is
a collection of software routines which permit a bidirectional interface between Verilog
and other languages (usually C).
Note that VHDL is not an abbreviation for Verilog HDL - Verilog and VHDL are
two different HDLs. They have more similarities than differences, however.

7.1.1 A Brief History of Verilog


The history of the Verilog HDL goes back to the 1980s, when a company called
Gateway Design Automation developed a logic simulator, Verilog-XL, and with it a
hardware description language.


Cadence Design Systems acquired Gateway in 1989, and with it the rights to the
language and the simulator. In 1990, Cadence put the language (but not the simulator)
into the public domain, with the intention that it should become a standard, non-proprietary
language.
The Verilog HDL is now maintained by a non-profit organisation, Accellera, which was
formed from the merger of Open Verilog International (OVI) and VHDL International. OVI
had the task of taking the language through the IEEE standardisation procedure.
In December 1995 Verilog HDL became IEEE Std. 1364-1995. A significantly
revised version was published in 2001: IEEE Std. 1364-2001. There was a further
revision in 2005, but this only added a few minor changes.
Accellera have also developed a new standard, SystemVerilog, which extends
Verilog. SystemVerilog became an IEEE standard (1800-2005) in 2005. For more details,
see the SystemVerilog section of KnowHow. There is also a draft standard for analog and
mixed-signal extensions to Verilog, Verilog-AMS.

7.1.2 Levels of Abstraction


Verilog descriptions can span multiple levels of abstraction i.e. levels of detail,
and can be used for different purposes at various stages in the design process.

Figure 7.1 Levels of abstraction



At the highest level, Verilog contains stochastic functions (queues and random
probability distributions) to support performance modelling.
Verilog supports abstract behavioural modeling, so can be used to model the
functionality of a system at a high level of abstraction. This is useful at the system
analysis and partitioning stage.
Verilog supports Register Transfer Level descriptions, which are used for the
detailed design of digital circuits. Synthesis tools transform RTL descriptions to gate
level.
Verilog supports gate and switch level descriptions, used for the verification of
digital designs, including gate and switch level logic simulation, static and dynamic
timing analysis, testability analysis and fault grading.
Verilog can also be used to describe simulation environments; test vectors,
expected results, results comparison and analysis. With some tools, Verilog can be used
to control simulation e.g. setting breakpoints, taking checkpoints, restarting from time 0,
tracing waveforms. However, most of these functions are not included in the 1364
standard, but are proprietary to particular simulators. Most simulators have their own
command languages; with many tools this is based on Tcl, which is an industry-standard
tool language.

7.1.3 Scope of Verilog


Verilog can be used at different levels of abstraction as we have already seen. But
how useful are these different levels of abstraction when it comes to using Verilog?

Design process
The diagram below shows a very simplified view of the electronic system design
process incorporating Verilog. The central portion of the diagram shows the parts of the
design process which will be impacted by Verilog.

System level
Verilog is not ideally suited for abstract system-level simulation, prior to the
hardware-software split. This is to some extent addressed by SystemVerilog. Unlike
VHDL, which supports user-defined types and overloaded operators that allow designers to
abstract their work into the domain of the problem, Verilog restricts the designer to working
with pre-defined system functions and tasks for stochastic simulation; it can be used for
modelling performance, throughput and queueing, but only in so far as those built-in language
features allow. Designers occasionally use the stochastic level of abstraction for this phase
of the design process.
Digital
Verilog is suitable for use today in the digital hardware design process, from
functional simulation, manual design and logic synthesis down to gate-level simulation.
Verilog tools provide an integrated design environment in this area.
Verilog is also suited for specialized implementation-level design verification
tools such as fault simulation, switch level simulation and worst case timing simulation.
Verilog can be used to simulate gate level fanout loading effects and routing delays
through the import of SDF files.
The RTL level of abstraction is used for functional simulation prior to synthesis.
The gate level of abstraction exists post-synthesis but this level of abstraction is not often
created by the designer, it is a level of abstraction adopted by the EDA tools (synthesis
and timing analysis, for example).

Analog
Because of Verilog's flexibility as a programming language, it has been stretched
to handle analog simulation in limited cases. There is a draft standard Verilog-AMS
that addresses analog and mixed signal simulation.

7.1.4 Design Flow using Verilog


The diagram below summarises the high-level design flow for an ASIC (i.e. gate
array, standard cell) or FPGA. In a practical design situation, each step described in the
following sections may be split into several smaller steps, and parts of the design flow
will be iterated as errors are uncovered.


Figure 7.2 Design flow using Verilog


System-level Verification
As a first step, Verilog may be used to model and simulate aspects of the complete
system containing one or more ASICs or FPGAs. This may be a fully functional
description of the system allowing the specification to be validated prior to commencing
detailed design. Alternatively, this may be a partial description that abstracts certain
properties of the system, such as a performance model to detect system performance
bottle-necks. Verilog is not ideally suited to system-level modelling. This is one
motivation for SystemVerilog, which enhances Verilog in this area.

RTL design and testbench creation


Once the overall system architecture and partitioning is stable, the detailed design
of each ASIC or FPGA can commence. This starts by capturing the design in Verilog at
the register transfer level, and capturing a set of test cases in Verilog. These two tasks are
complementary, and are sometimes performed by different design teams in isolation to
ensure that the specification is correctly interpreted. The RTL Verilog should be
synthesizable if automatic logic synthesis is to be used. Test case generation is a major
task that requires a disciplined approach and much engineering ingenuity: the quality of
the final ASIC or FPGA depends on the coverage of these test cases.

For today's large, complex designs, verification can be a real bottleneck. This provides
another motivation for SystemVerilog - it has features for expediting testbench
development. See the SystemVerilog section of Knowhow for more details.

RTL verification
The RTL Verilog is then simulated to validate the functionality against the
specification. RTL simulation is usually one or two orders of magnitude faster than gate
level simulation, and experience has shown that this speed-up is best exploited by doing
more simulation, not spending less time on simulation.
In practice it is common to spend 70-80% of the design cycle writing and
simulating Verilog at and above the register transfer level, and 20-30% of the time
synthesizing and verifying the gates.

Look-ahead Synthesis
Although some exploratory synthesis will be done early on in the design process,
to provide accurate speed and area data to aid in the evaluation of architectural decisions
and to check the engineer's understanding of how the Verilog will be synthesized, the
main synthesis production run is deferred until functional simulation is complete. It is
pointless to invest a lot of time and effort in synthesis until the functionality of the design
is validated.

Synthesizing Verilog
Synthesis is a broad term often used to describe very different tools. Synthesis can
include silicon compilers and function generators used by ASIC vendors to produce
regular RAM and ROM type structures. Synthesis in the context of this tutorial refers to
generating random logic structures from Verilog descriptions. This is best suited to gate arrays and programmable devices such as FPGAs.
Synthesis is not a panacea! It is vital to tackle High Level Design using Verilog
with realistic expectations of synthesis.
The definition of Verilog for simulation is cast in stone and enshrined in the
Language Reference Manual. Other tools which use Verilog, such as synthesis, will make
their own interpretation of the Verilog language. There is an IEEE standard for Verilog
synthesis (IEEE Std. 1364.1-2002) but no vendor adheres strictly to it.

It is not sufficient that the Verilog is functionally correct; it must be written in such a way that it directs the synthesis tool to generate good hardware, and moreover, the
Verilog must be matched to the idiosyncrasies of the particular synthesis tool being used.
We shall tackle some of these idiosyncrasies in this Verilog tutorial.
There are currently three kinds of synthesis:
behavioral synthesis
high-level synthesis
RTL synthesis
There is some overlap between these three synthesis domains. We will concentrate
on RTL synthesis, which is by far the most common. The essence of RTL code is that
operations described in Verilog are tied to particular clock cycles. The synthesized netlist
exhibits the same clock-by-clock cycle behavior, allowing the RTL testbench to be easily
re-used for gate-level simulation.

7.2 Introduction to Xilinx


Xilinx, Inc. is the world's largest supplier of programmable logic devices, the inventor of the field programmable gate array (FPGA) and the first semiconductor company with a fabless manufacturing model. The programmable logic device market has been led by Xilinx since the late 1990s. Over the years, Xilinx has fueled an aggressive expansion to India, Asia and Europe, regions that Xilinx representatives have described as high-growth areas for the business. Xilinx's sales rose from $560 million in 1996 to almost $2 billion by 2007. The relatively new President and CEO, Moshe Gavrielov, an EDA and ASIC industry veteran appointed in early 2008, aims to bolster the company's revenue substantially during his tenure by providing more complete solutions that align FPGAs with software, IP cores, boards and kits to address focused target applications. The company aims to use this approach to capture greater market share from application-specific integrated circuits (ASICs) and application-specific standard products (ASSPs).

7.3 The FPGA design flow


A simplified version of the FPGA design flow is given in Figure 7.3. Following this entire process when designing a device guarantees that no steps are overlooked and gives the best chance of getting back a working prototype that functions correctly.

Figure 7.3: The FPGA Design Flow


7.3.1 Design Entry
There are different techniques for design entry: schematic based, Hardware Description Language (HDL) based, or a combination of both. The choice of method depends on the design and the designer. If the designer wants to deal more closely with the hardware, then schematic entry is the better choice. When the design is complex, or the designer thinks about the design in an algorithmic way, then HDL is the better choice. Language-based entry is faster to capture, but may lag in performance and density.
HDLs represent a level of abstraction that can isolate the designer from the details of the hardware implementation. Schematic-based entry gives designers much more visibility into the hardware, and it is the better choice for those who are hardware oriented.
7.3.2 Synthesis
Synthesis is the process that translates the VHDL code into a device netlist format, i.e. a complete circuit built from logical elements (gates, flip-flops, etc.) for the design. If the design contains more than one sub-design (for example, to implement a processor one needs a CPU as one design element, a RAM as another, and so on), then the synthesis process generates a netlist for each design element.

The synthesis process checks the code syntax and analyzes the hierarchy of the design, which ensures that the design is optimized for the device architecture the designer has selected. The resulting netlist(s) is saved to a Native Generic Circuit (NGC) file (for Xilinx Synthesis Technology (XST)).

Figure 7.4: FPGA Synthesis
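To make this step concrete, the fragment below is a minimal sketch of the kind of synthesizable RTL description that the synthesis tool turns into a netlist; it is not taken from the project sources, and the entity and signal names are illustrative only. The addition typically maps to LUT and carry-chain logic, while the clocked assignment infers a register.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Registered adder: a small synthesizable example of an RTL description
-- that synthesis converts into a netlist of gates and flip-flops.
entity reg_adder is
    Port ( clk : in  STD_LOGIC;
           a   : in  UNSIGNED(7 downto 0);
           b   : in  UNSIGNED(7 downto 0);
           sum : out UNSIGNED(8 downto 0));
end reg_adder;

architecture Behavioral of reg_adder is
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- The adder becomes combinational logic; the assignment under
            -- rising_edge infers a 9-bit register on the output.
            sum <= resize(a, 9) + resize(b, 9);
        end if;
    end process;
end Behavioral;

Running XST on such a file produces the corresponding NGC netlist, which is then passed on to the implementation steps described next.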


7.3.3 Implementation
Implementation is the process of translating the synthesized netlist into a form that can be placed and routed on a specific target device. This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route
The Translate process combines all the input netlists and constraints into a logic design file. This information is saved as a Native Generic Database (NGD) file, which can be produced using the NGDBuild program. Defining constraints means assigning the ports in the design to the physical elements (e.g. pins, switches, buttons) of the targeted device and specifying the timing requirements of the design. This information is stored in a file named the User Constraints File (UCF). Tools used to create or modify the UCF include PACE and the Constraints Editor.

Figure 7.5: FPGA Translate


The Map process divides the whole circuit into sub-blocks that can fit into the FPGA logic blocks. That is, the Map process fits the logic defined by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLBs) and Input/Output Blocks (IOBs)) and generates an NCD (Native Circuit Description) file, which physically represents the design mapped to the components of the FPGA. The MAP program is used for this purpose.

Figure 7.6: FPGA Map


The PAR (Place and Route) program is used for this process. It places the sub-blocks from the Map process into logic blocks according to the constraints and then connects the logic blocks. For example, if a sub-block is placed in a logic block very near an I/O pin, it may save routing time but violate some other constraint; the trade-off between all the constraints is taken into account by the place and route process. The place and route tool takes the mapped NCD file as input and produces a completely routed NCD file as output, which contains the routing information.

Figure 7.7: FPGA place and route


Finally, the design must be loaded onto the FPGA, which requires converting it into a format the FPGA can accept. The BITGEN program performs this conversion: the routed NCD file is given to BITGEN to generate a bit stream (a BIT file) that can be used to configure the target FPGA device. Configuration is done through a cable, and the selection of the cable depends on the design.
For a programmable device, you simply program the device and immediately have your prototypes. You then have the responsibility to place these prototypes in your system and determine that the entire system actually works correctly. If you have followed the procedure up to this point, chances are very good that your system will perform correctly with only minor problems. These problems can often be worked around by modifying the system or changing the system software; they need to be tested and documented so that they can be fixed in the next revision of the chip. System integration and system testing are necessary at this point to ensure that all parts of the system work correctly together. Verification can be done at different stages of the process.

7.3.4 Behavioral Simulation (RTL Simulation)


This is the first of the simulation steps encountered throughout the hierarchy of the design flow. This simulation is performed before the synthesis process to verify the RTL (behavioral) code and to confirm that the design is functioning as intended. Behavioral simulation can be performed on either VHDL or Verilog designs. In this process, signals and variables are observed, procedures and functions are traced, and breakpoints are set. This is a very fast simulation and so allows the designer to change the HDL code within a short time if the required functionality is not met. Since the design is not yet synthesized to the gate level, timing and resource usage properties are still unknown.

7.3.5 Functional simulation (Post Translate Simulation)


Functional simulation gives information about the logic operation of the circuit. The designer can verify the functionality of the design using this process after the Translate process. If the functionality is not as expected, then the designer has to make changes in the code and follow the design flow steps again.

7.3.6 Static Timing Analysis


This can be done after the Map or PAR processes. The post-Map timing report lists signal path delays of the design derived from the design logic. The post-Place and Route timing report incorporates timing delay information to provide a comprehensive timing summary of the design.

7.3.7 Architectural overview


The Spartan-3 family architecture consists of five fundamental programmable
functional elements:
Configurable Logic Blocks (CLBs) contain RAM-based Look-Up Tables (LUTs) to
implement logic and storage elements that can be used as flip-flops or latches. CLBs can
be programmed to perform a wide variety of logical functions as well as to store data.
Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the internal logic of the device. Each IOB supports bidirectional data flow plus 3-state operation and supports 26 different signal standards, including eight high-performance differential standards. Double Data-Rate (DDR) registers are included. The Digitally Controlled Impedance (DCI) feature provides automatic on-chip terminations, simplifying board designs.
Block RAM provides data storage in the form of 18-Kbit dual-port blocks.
Multiplier blocks accept two 18-bit binary numbers as inputs and calculate the product.
Digital Clock Manager (DCM) blocks provide self-calibrating, fully digital solutions for
distributing, delaying, multiplying, dividing, and phase shifting clock signals.
These elements are organized as shown in Figure 7.8. A ring of IOBs surrounds a
regular array of CLBs. The XC3S50 has a single column of block RAM embedded in the
array. Those devices ranging from the XC3S200 to the XC3S2000 have two columns of
block RAM. The XC3S4000 and XC3S5000 devices have four RAM columns. Each
column is made up of several 18-Kbit RAM blocks; each block is associated with a
dedicated multiplier. The DCMs are positioned at the ends of the outer block RAM
columns. The Spartan-3 family features a rich network of traces and switches that
interconnect all five functional elements, transmitting signals among them. Each
functional element has an associated switch matrix that permits multiple connections to
the routing.

Figure 7.8: Internal structure of FPGA

CHAPTER 8

SIMULATION AND RESULTS


The coding for the DA based FBLMS adaptive filter was done using VHDL as the hardware description language. The code was simulated using the ModelSim software. Synthesis, place and route were performed using Xilinx ISE. The simulation and synthesis results are shown below.

8.1 Simulation results

The simulation results of the FFT and IFFT modules, along with the DA algorithm, are presented individually here.

8.1.1 Signed DA convolution


The inner product y = c1*x1 + c2*x2 + c3*x3 is computed using DA, where c1, c2 and c3 are constants that can take negative values. The partial products are stored in a LUT that is addressed by the input bit combination of x1, x2 and x3.
Example: for c1 = -2, c2 = 3, c3 = 1 and inputs x1 = 1, x2 = 3, x3 = 7, the inner product is
y = (-2)(1) + (3)(3) + (1)(7) = 14.
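To make the LUT contents concrete: with the address formed from the bits (x1 x2 x3), the eight stored partial products are 0, 1, 3, 4, -2, -1, 1 and 2. Processing the inputs bit-serially from the LSB, the partial products selected in this example are 2 (weight 1), 4 (weight 2) and 1 (weight 4), while the sign-bit cycle selects 0 (weight 8, subtracted), giving 2 + 8 + 4 - 0 = 14. A minimal VHDL sketch of such a bit-serial signed DA unit is shown below; it assumes 4-bit two's-complement inputs, and the entity and signal names are illustrative rather than taken from the project source.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Bit-serial signed DA unit for y = c1*x1 + c2*x2 + c3*x3
-- with fixed c1 = -2, c2 = 3, c3 = 1 (illustrative sketch only).
entity da_inner_product is
    Port ( clk   : in  STD_LOGIC;
           start : in  STD_LOGIC;                    -- loads x1, x2, x3
           x1    : in  SIGNED(3 downto 0);
           x2    : in  SIGNED(3 downto 0);
           x3    : in  SIGNED(3 downto 0);
           y     : out SIGNED(7 downto 0);
           done  : out STD_LOGIC);
end da_inner_product;

architecture Behavioral of da_inner_product is
    -- Partial products for every bit pattern of (x1, x2, x3).
    type lut_t is array (0 to 7) of integer range -4 to 7;
    constant DA_LUT : lut_t := (0, 1, 3, 4, -2, -1, 1, 2);
    signal r1, r2, r3 : SIGNED(3 downto 0) := (others => '0');
    signal acc        : SIGNED(7 downto 0) := (others => '0');
    signal bit_cnt    : integer range 0 to 4 := 4;
begin
    process (clk)
        variable addr : integer range 0 to 7;
        variable pp   : SIGNED(7 downto 0);
    begin
        if rising_edge(clk) then
            if start = '1' then                      -- load the shift registers
                r1 <= x1;  r2 <= x2;  r3 <= x3;
                acc <= (others => '0');
                bit_cnt <= 0;
            elsif bit_cnt < 4 then
                -- The LUT address is formed from the current LSB of each input
                addr := 0;
                if r1(0) = '1' then addr := addr + 4; end if;
                if r2(0) = '1' then addr := addr + 2; end if;
                if r3(0) = '1' then addr := addr + 1; end if;
                pp := shift_left(to_signed(DA_LUT(addr), 8), bit_cnt);
                if bit_cnt = 3 then
                    acc <= acc - pp;                 -- sign-bit weight is subtracted
                else
                    acc <= acc + pp;                 -- lower bit weights are added
                end if;
                -- Shift the next bit of each input into position (zero fill is
                -- harmless because only four bits are ever consumed)
                r1 <= '0' & r1(3 downto 1);
                r2 <= '0' & r2(3 downto 1);
                r3 <= '0' & r3(3 downto 1);
                bit_cnt <= bit_cnt + 1;
            end if;
        end if;
    end process;

    y    <= acc;
    done <= '1' when bit_cnt = 4 else '0';
end Behavioral;

Loading x1 = 1, x2 = 3 and x3 = 7 and clocking the unit for four bit-cycles leaves y = 14 with done asserted, which is the value shown in the simulation of Figure 8.1.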

Figure 8.1. Simulation of signed DA convolution

8.1.2 7 point Rader FFT algorithm


Example: for input X = [1 2 3 4 5 6 7], the 7-point Rader FFT gives
Y = [28.0000, -3.5000 + 7.2678i, -3.5000 + 2.7912i, -3.5000 + 0.7989i, -3.5000 - 0.7989i, -3.5000 - 2.7912i, -3.5000 - 7.2678i].

Since, in the implementation, the twiddle factors are scaled by 256 and stored in the LUT, the output obtained is also a scaled output, except for the zero-frequency term.

Figure 8.2. Simulation of 7 point Rader FFT

8.1.3 Simulation results of 14 point FFT


The simulation results of the FFT are shown in Figures 8.3 and 8.4. 'clk' is the clock signal, 'ld' is the load signal of the FFT, 'rst' is the reset signal, and 'x_in0' to 'x_in13' are the inputs of the FFT that are to be transformed into real and imaginary values. When rst = 0 and ld = 0 and the inputs are applied, the transformed values are obtained.
Example: for input X = [0 1 2 3 4 5 6 7 8 9 10 11 12 13], the 14-point prime factor FFT gives
Y = [91.0000, -7.0000 + 30.6690i, -7.0000 + 14.5356i, -7.0000 + 8.7777i, -7.0000 + 5.5823i, -7.0000 + 3.3710i, -7.0000 + 1.5977i, -7.0000, -7.0000 - 1.5977i, -7.0000 - 3.3710i, -7.0000 - 5.5823i, -7.0000 - 8.7777i, -7.0000 - 14.5356i, -7.0000 - 30.6690i]

Figure 8.3 Simulation of 14 point Good-Thomas FFT (for Real part)

Figure 8.4. Simulation of 14 point Good-Thomas FFT (for Imaginary part)

8.1.4 Simulation results of IFFT


The simulation result of the IFFT is shown in Figure 8.5. 'clk' is the clock signal, 'ld' is the load signal of the IFFT, 'rst' is the reset signal, and 'x_in0' to 'x_in13' are the inputs of the IFFT that are to be inverse-transformed from their real and imaginary values.

Example: for an input whose 14 real parts are [14 14 14 14 14 14 14 14 14 14 14 14 14 14] and whose 14 imaginary parts are all zero, the 14-point Good-Thomas IFFT gives Y = [14 0 0 0 0 0 0 0 0 0 0 0 0 0].

Figure 8.5. Simulation of 14 point Good-Thomas IFFT

8.1.5 Fast Block LMS Adaptive filter


The simulation result of the DA based FBLMS adaptive filter is shown in Figure 8.6. 'clk' is the clock signal, 'ld' is the load signal, 'rst' is the reset signal, 'err' is the error signal, 'x_in' is the input signal, 'd_in' is the desired signal, and 'y_out' is the output of the system.

Figure 8.6. Simulation of Fast Block LMS adaptive filter


Figure 8.7 shows the simulation of the DA based Fast Block LMS adaptive filter. Here x_in is the input signal, which is corrupted by noise, and the desired signal d_in is a pure sine wave. The adaptive filter removes the noise and produces an output y_out that is error free.

Figure 8.7. Simulation of DA based Fast Block LMS adaptive filter

8.2 Synthesis results

The synthesis of the DA-FBLMS adaptive filter was performed using the Xilinx ISE software and the results are shown below.

8.2.1 Synthesis results of FFT module


Figure 8.8 shows the top module of the FFT. It takes fourteen 4-bit inputs and produces fourteen 14-bit real and fourteen 14-bit imaginary outputs as the transformed data. Figure 8.9 shows the RTL schematic of the FFT.

Figure 8.8 Top module of 14-point FFT

Figure 8.9. RTL Schematic of 14 point Good-Thomas FFT

8.2.2 Synthesis results of IFFT module


Figure 8.10 shows the synthesis result of the IFFT. It takes fourteen 14-bit real and fourteen 14-bit imaginary inputs and produces fourteen 24-bit outputs. Figure 8.11 shows the RTL view of the 14-point Good-Thomas IFFT.

Figure 8.10 Top module of 14-point of IFFT

Figure 8.11 RTL View of 14 point Good-Thomas IFFT

8.2.3 Synthesis results of FBLMS_TOP module


Figure 8.12 shows the synthesis result of the FBLMS filter. It takes the input Xin, compares the filter output with the desired signal Din, and gives the error and Yout signals. Figure 8.13 shows the RTL schematic of the FBLMS filter.

Figure 8.12 Top module of FBLMS Using DA

Figure 8.13 RTL Schematic for FBLMS

8.3 Hardware Resource Utilization Summary


CHAPTER 9
CONCLUSIONS AND FUTURE WORK
9.1 CONCLUSION
From the discussion of results, the following conclusions can be made:

1. The Fast Block LMS adaptive filter was designed and implemented for a filter length of 7 (i.e. a 14-point FFT).
2. The hardware resource utilization results confirm that the proposed adaptive digital filter requires less hardware (about 40% less) than the existing architecture, so the proposed adaptive digital filter is hardware efficient compared to the existing one.

9.2 FUTURE WORK


The proposed design is efficient for FFT implementations with a small number of points. If the design needs a larger number of points, such as 1024, more LUTs must be implemented using DA, which results in higher hardware utilization and becomes a drawback. To overcome this issue, the FFT can be implemented using different algorithms, such as the CORDIC algorithm.


BIBLIOGRAPHY
[1] C. M. Rader, "Discrete Fourier transforms when the number of data samples is prime," Proceedings of the IEEE, vol. 56, pp. 1107-1108, June 1968.
[2] A. Peled and B. Liu, "A new hardware realization of digital filters," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22, pp. 456-462, December 1974.
[3] C. S. Burrus, "Index mappings for multidimensional formulation of the DFT and convolution," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, pp. 239-242, June 1977.
[4] G. A. Clark, S. K. Mitra, and S. R. Parker, "Block implementation of adaptive digital filters," IEEE Transactions on Circuits and Systems, vol. 28, pp. 584-592, 1981.
[5] S. Chu and C. S. Burrus, "A prime factor FFT algorithm using distributed arithmetic," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, April 1982.
[6] C. Wei and J. J. Lou, "Multi memory block structure for implementing a digital adaptive filter using distributed arithmetic," IEE Proceedings, Electronic Circuits and Systems, vol. 133, February 1986.
[7] S. A. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE ASSP Magazine, July 1989.
[8] D. J. Allred, W. Huang, V. Krishnan, H. Yoo, and D. V. Anderson, "An FPGA implementation for a high throughput adaptive filter using distributed arithmetic," 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 324-325, 2004.
[9] D. J. Allred, W. Huang, V. Krishnan, H. Yoo, and D. V. Anderson, "LMS adaptive filters using distributed arithmetic for high throughput," IEEE Transactions on Circuits and Systems, vol. 52, pp. 1327-1337, July 2005.
[10] N. J. Sorawat Chivapreecha, Aungkana Jaruvarakul, and K. Dejhan, "Adaptive equalization architecture using distributed arithmetic for partial response channels," IEEE Tenth International Symposium on Consumer Electronics, 2006.


Appendix

Xilinx
Creating a new ISE project for the FPGA device on the Spartan-3 Kit
To create a new project:
1. Select File > New Project... The New Project Wizard appears.
2. Type tutorial in the Project Name field.
3. Enter or browse to a location (directory path) for the new project. A tutorial
subdirectory is created automatically.
4. Verify that HDL is selected from the Top-Level Source Type list.
5. Click Next to move to the device properties page.
6. Fill in the properties in the table as shown below:
Product Category: All
Family: Spartan3
Device: XC3S200
Package: FT256
Speed Grade: -4
Top-Level Source Type: HDL
Synthesis Tool: XST (VHDL/Verilog)
Simulator: ISE Simulator (VHDL/Verilog)
Preferred Language: Verilog (or VHDL)
Verify that Enable Enhanced Design Summary is selected.
Click Next to proceed to the Create New Source window in the New Project Wizard. At
the end of the next section, your new project will be complete.


Figure 1

Creating a VHDL Source


Create a VHDL source file for the project as follows:
1. Click the New Source button in the New Project Wizard.
2. Select VHDL Module as the source type.
3. Type in the file name counter.
4. Verify that the Add to project checkbox is selected.
5. Click Next.
6. Declare the ports for the counter design by filling in the port information as shown
below:


Figure 2
Click Next, then Finish in the New Source Wizard - Summary dialog box to complete
the new source file template.
7. Click Next, then Next, then Finish.
The source file containing the entity/architecture pair displays in the Workspace, and the
counter displays in the Source tab, as shown below:

Figure 3


Final Editing of the VHDL Source


1. Add the following signal declaration to handle the feedback of the counter output
below the architecture declaration and above the first begin statement:
signal count_int : std_logic_vector(3 downto 0) := "0000";
2. Customize the source file for the counter design by replacing the port and signal name
placeholders with the actual ones as follows:
replace all occurrences of <clock> with CLOCK
replace all occurrences of <count_direction> with DIRECTION
replace all occurrences of <count> with count_int
3. Add the following line below the end process; statement:
COUNT_OUT <= count_int;
4. Save the file by selecting File > Save.
When you are finished, the counter source file will look like the following:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
-- Uncomment the following library declaration if instantiating
-- any Xilinx primitive in this code.
--library UNISIM;
--use UNISIM.VComponents.all;
entity counter is
Port ( CLOCK : in STD_LOGIC;
DIRECTION : in STD_LOGIC;
COUNT_OUT : out STD_LOGIC_VECTOR (3 downto 0));
end counter;
architecture Behavioral of counter is
signal count_int : std_logic_vector(3 downto 0) := "0000";
begin
process (CLOCK)
begin
if CLOCK='1' and CLOCK'event then
if DIRECTION='1' then
count_int <= count_int + 1;
else
count_int <= count_int - 1;
end if;
end if;
end process;
COUNT_OUT <= count_int;
end Behavioral;
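For behavioral simulation of this counter (see Section 7.3.4), a simple stimulus-only testbench such as the sketch below can be used. It is not part of the original tutorial files; the testbench entity name and the stimulus timing are illustrative assumptions.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Testbench sketch for the counter above; the DUT ports match the source file.
entity counter_tb is
end counter_tb;

architecture sim of counter_tb is
    signal CLOCK     : STD_LOGIC := '0';
    signal DIRECTION : STD_LOGIC := '1';
    signal COUNT_OUT : STD_LOGIC_VECTOR(3 downto 0);
begin
    -- Instantiate the design under test
    dut : entity work.counter
        port map (CLOCK => CLOCK, DIRECTION => DIRECTION, COUNT_OUT => COUNT_OUT);

    -- Free-running clock with the 40 ns period used in the timing constraints
    CLOCK <= not CLOCK after 20 ns;

    -- Count up for ten clock cycles, then count down
    stimulus : process
    begin
        DIRECTION <= '1';
        wait for 400 ns;
        DIRECTION <= '0';
        wait for 400 ns;
        wait;   -- suspend; stop the simulation from the simulator
    end process;
end sim;

Simulating the counter together with this testbench should show COUNT_OUT incrementing once per clock period while DIRECTION is high and decrementing after DIRECTION goes low.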


Checking the Syntax of the New Counter Module


When the source files are complete, check the syntax of the design to find errors and
typos.
1. Verify that Implementation is selected from the drop-down list in the Sources window.
2. Select the counter design source in the Sources window to display the related processes in the Processes window.
3. Click the + next to the Synthesize-XST process to expand the process group.
4. Double-click the Check Syntax process.
Note: You must correct any errors found in your source files. You can check for errors in the Console tab of the Transcript window. If you continue without valid syntax, you will not be able to simulate or synthesize your design.
5. Close the HDL file.

Entering Timing Constraints


To constrain the design do the following:
1. Select Implementation from the drop-down list in the Sources window.
2. Select the counter HDL source file.
3. Click the + sign next to the User Constraints processes group, and double-click the
Create Timing Constraints process.
ISE runs the Synthesis and Translate steps and automatically creates a User
Constraints File (UCF). You will be prompted with the following message:

Figure 4

Click Yes to add the UCF file to your project.


The counter.ucf file is added to your project and is visible in the Sources window.
The Xilinx Constraints Editor opens automatically.

Note: You can also create a UCF file for your project by selecting Project > Create New Source.
5. In the Timing Constraints dialog, enter the following in the Period, Pad to Setup, and Clock to Pad fields:
Period: 40
Pad to Setup: 10
Clock to Pad: 10
6. Press Enter.
After the information has been entered, the dialog should look like what is shown below:

Figure 5

Select Timing Constraints under Constraint Type in the Timing Constraints tab and
the newly created timing constraints are displayed as follows:


Figure 6
Save the timing constraints. If you are prompted to rerun the TRANSLATE or XST step,
click OK to continue.
9. Close the Constraints Editor.

Implementing the Design


1. Select the counter source file in the Sources window.
2. Open the Design Summary by double-clicking the View Design Summary process in
the Processes tab.
3. Double-click the Implement Design process in the Processes tab.
4. Notice that after Implementation is complete, the Implementation processes have a
green check mark next to them indicating that they completed successfully without Errors
or Warnings.


Figure 7

Locate the Performance Summary table near the bottom of the Design Summary. Click the All Constraints Met link in the Timing Constraints field to view the Timing Constraints report.

Verify that the design meets the specified timing requirements.
Figure 8

Assigning Pin Location Constraints


Specify the pin locations for the ports of the design so that they are connected correctly
on the Spartan-3 Startup Kit demo board.
To constrain the design ports to package pins, do the following:
1. Verify that counter is selected in the Sources window.
2. Double-click the Floorplan Area/IO/Logic - Post Synthesis process found in the
User Constraints process group. The Xilinx Pinout and Area Constraints Editor
(PACE) opens.
3. Select the Package View tab.
4. In the Design Object List window, enter a pin location for each pin in the Loc column
using the following information:
CLOCK input port connects to FPGA pin T9 (GCK0 signal on board)
COUNT_OUT<0> output port connects to FPGA pin K12 (LD0 signal on board)
COUNT_OUT<1> output port connects to FPGA pin P14 (LD1 signal on board)
COUNT_OUT<2> output port connects to FPGA pin L12 (LD2 signal on board)
COUNT_OUT<3> output port connects to FPGA pin N14 (LD3 signal on board)
DIRECTION input port connects to FPGA pin K13 (SW7 signal on board)
Notice that the assigned pin locations are shown in blue:

Figure 9

5. Select File > Save. You are prompted to select the bus delimiter type based on the
synthesis tool you are using. Select XST Default <> and click OK.
6. Close PACE.
Notice that the Implement Design processes have an orange question mark next to them,
indicating they are out-of-date with one or more of the design files. This is because the
UCF file has been modified.

Download Design to the Spartan-3 Demo Board


This is the last step in the design verification process. This section provides simple
instructions for downloading the counter design to the Spartan-3 Starter Kit demo board.
1. Connect the 5V DC power cable to the power input on the demo board (J4).
2. Connect the download cable between the PC and demo board (J7).
3. Select Implementation from the drop-down list in the Sources window.
4. Select counter in the Sources window.
5. In the Process window, double-click the Configure Target Device process.
6. The Xilinx WebTalk Dialog box may open during this process. Click Decline.
iMPACT opens and the Configure Devices dialog box is displayed.

Figure 10
7. In the Welcome dialog box, select Configure devices using Boundary-Scan (JTAG).
8. Verify that Automatically connect to a cable and identify Boundary-Scan chain is
selected.
9. Click Finish.
10. If you get a message saying that there are two devices found, click OK to continue.
The devices connected to the JTAG chain on the board will be detected and displayed in
the iMPACT window.
11. The Assign New Configuration File dialog box appears. To assign a configuration file
to the xc3s200 device in the JTAG chain, select the counter.bit file and click Open.

Figure 11

12. If you get a Warning message, click OK.


13. Select Bypass to skip any remaining devices.
14. Right-click on the xc3s200 device image, and select Program... The Programming
Properties dialog box opens.
15. Click OK to program the device. When programming is complete, the Program Succeeded message is displayed.


Figure 12

On the board, LEDs 0, 1, 2, and 3 are lit, indicating that the counter is running.
16. Close iMPACT without saving.

CLB Overview
For more details on the CLBs, refer to the Using Configurable Logic Blocks chapter in
UG331. The Configurable Logic Blocks (CLBs) constitute the main logic resource for
implementing synchronous as well as combinatorial circuits. Each CLB comprises four
interconnected slices, as shown in Figure 9. These slices are grouped in pairs. Each pair is
organized as a column with an independent carry chain. The nomenclature that the FPGA
Editor part of the Xilinx development software uses to designate slices is as
follows:
The letter X followed by a number identifies columns of slices. The X number counts
up in sequence from the left side of the die to the right. The letter Y followed by a
number identifies the position of each slice in a pair as well as indicating the CLB row.
The Y number counts slices starting from the bottom of the die according to the
sequence: 0, 1, 0, 1 (the first CLB row); 2, 3, 2, 3 (the second CLB row); etc. Figure 9
shows the CLB located in the lower left-hand corner of the die. Slices X0Y0 and X0Y1
make up the column-pair on the left whereas slices X1Y0 and X1Y1 make up the
column-pair on the right. For each CLB, the term left-hand (or SLICEM) indicates the
pair of slices labeled with an even X number, such as X0, and the term right-hand (or
SLICEL) designates the pair of slices with an odd X number. The carry chain, together with various dedicated arithmetic logic gates, supports fast and efficient implementations
of math operations. The carry chain enters the slice as CIN and exits as COUT. Five
multiplexers control the chain: CYINIT, CY0F, and CYMUXF in the lower portion as
well as CY0G and CYMUXG in the upper portion. The dedicated arithmetic logic
includes the exclusive-OR gates XORG and XORF (upper and lower portions of the slice,
respectively) as well as the AND gates GAND and FAND (upper and lower portions, respectively).

Internal structure for CLB

Each of the two LUTs (F and G) in a slice has four logic inputs (A1-A4) and a single output (D). This permits any four-variable Boolean logic operation to be programmed
into them. Furthermore, wide function multiplexers can be used to effectively combine
LUTs within the same CLB or across different CLBs, making logic functions with still
more input variables possible. The LUTs in both the right-hand and left-hand slice-pairs
not only support the logic functions described above, but also can function as ROM that is
initialized with data at the time of configuration. The LUTs in the left-hand slice-pair
(even-numbered columns such as X0 in Figure 9) of each CLB support two additional
functions that the right-hand slice-pair (odd-numbered columns such as X1) do not. First,
it is possible to program the left-hand LUTs as distributed RAM. This type of memory
affords moderate amounts of data buffering anywhere along a data path. One left-hand
LUT stores 16 bits. Multiple left-hand LUTs can be combined in various ways to store
larger amounts of data. A dual port option combines two LUTs so that memory access is
possible from two independent data lines. A Distributed ROM option permits pre-loading
the memory with data during FPGA configuration.
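As an illustrative aside (not from the original text), the sketch below shows the RTL coding style from which XST typically infers this kind of LUT-based distributed RAM: an array with a synchronous write and an asynchronous read. The entity and signal names are assumptions made for this example.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- 16x4 single-port RAM with asynchronous read; this pattern is normally
-- mapped onto the LUTs of the left-hand (SLICEM) slices as distributed RAM.
entity dist_ram_16x4 is
    Port ( clk  : in  STD_LOGIC;
           we   : in  STD_LOGIC;
           addr : in  STD_LOGIC_VECTOR(3 downto 0);
           din  : in  STD_LOGIC_VECTOR(3 downto 0);
           dout : out STD_LOGIC_VECTOR(3 downto 0));
end dist_ram_16x4;

architecture Behavioral of dist_ram_16x4 is
    type ram_t is array (0 to 15) of STD_LOGIC_VECTOR(3 downto 0);
    signal ram : ram_t := (others => (others => '0'));
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                ram(to_integer(unsigned(addr))) <= din;   -- synchronous write
            end if;
        end if;
    end process;

    dout <= ram(to_integer(unsigned(addr)));              -- asynchronous read
end Behavioral;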