Memory Based Hardware Efficient Implementation of FIR Filters
ISSN 1828-6003 July 2013
K. G. Shanthi, N. Nagarajan
Abstract – Finite impulse response (FIR) digital filters are key components in many digital
signal processing (DSP) systems because of their linear phase, stability, fewer finite-precision
errors and regular structure. Real-time realization of FIR filters with low hardware
requirements and low latency has become critical with the increasing developments in very large
scale integration (VLSI) technology. The objective of this paper is to explore the current trends in
the development of algorithms and architectures for memory-based realization of FIR filters, which
are mainly concerned with reducing the overall area-delay-power complexity. The purpose of this
study is to compare these architectures based on ROM size, delay and throughput. The results
presented here should assist researchers in the field of digital signal processing in selecting the
best architecture for an application based on its requirements. New algorithms and architectures
need to be developed to design area-delay-power-efficient FIR filters for various demanding DSP
applications. Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Finite Impulse Response Filter, Field Programmable Gate Arrays (FPGA), Application
Specific Integrated Circuit (ASIC), Distributed Arithmetic (DA), Lookup Table (LUT)
Manuscript received and revised June 2013, accepted July 2013 Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved
Application specific integrated circuits (ASIC) and DSP chips.
The implementation on ASICs is not preferred due to high development costs and time-to-market
factors. The sequential-execution architecture of programmable DSP processors prevents them from
achieving the desired performance. In this context, the FPGA platform provides a very attractive
solution that balances high flexibility with the option to reconfigure, time-to-market, cost and
performance [3].
This paper is organized as follows: In Section 2, a brief overview of the conversion-based
multiplier-less FIR filters is presented. Section 3 explores the algorithmic aspects and
architectural approach of memory based FIR filters and gives an in-depth review of FIR filters
based on DA. Finally, the Conclusion is presented in Section 4.

II. Conversion-Based Multiplier-Less Implementation of FIR Filters
In this approach the coefficients are transformed to other numeric representations so that the
multiplications are implemented with adders/subtractors and shifters. A coefficient in "n-bit"
signed-digit representation can be written as:
$$C = \sum_{i=0}^{n-1} b_i 2^i \qquad (2)$$
where $b_i$ is taken from the set {-1, 0, 1}.
The representation that has the minimum number of non-zero digits and no consecutive non-zero
digits is known as the canonic signed-digit (CSD) representation [2]. Since in shift-and-add
multiplication the non-zero digits represent additions (or subtractions), CSD needs significantly
fewer adders than binary representations. Multipliers [4] in a filter whose coefficients are
expressed in canonic signed digit code are realized with wired shifters, adders and subtractors.
Common subexpression elimination (CSE) is a numerical transformation of the constant
multiplications that can lead to efficient hardware implementations in terms of area, power and
speed [5]-[8]. Subexpression elimination can only be performed on constant multiplications that
operate on a common variable. It is the process of examining the shift-and-add implementations of
constant multiplications and finding the redundant operations.
Once the redundancies are found, these operations can be performed once and shared among the
constant multiplications, so that the number of adders and shifters needed for the implementation
is minimized. CSE techniques attempt to minimize the number of additions in the multiplier block
by reusing terms. These terms can be canonic signed digit (CSD) [5], minimal signed digit (MSD),
or all signed digit (ASD) [7].
In "Multiplierless FIR Filter Design Algorithms", Malcolm D. Macleod and Andrew G. Dempster
introduced a new CSE algorithm, which searches a bounded number of minimal signed digit (MSD)
representations [8]. Douglas L. Maskell, Jussipekka Leiwo and Jagdish C. Patra [9] reduced both
the coefficient word length and the number of non-zero bits in the filter coefficients so that the
adder step can be minimized, which reduced the hardware complexity of linear phase FIR digital
filters.

III. Algorithms and Architectures for Memory Based FIR Filters
The memory based approach involves the use of memories (RAMs, ROMs) or Look-Up Tables (LUTs) that
store pre-computed values, which can be read out in place of a multiplication operation. With the
advancements in VLSI technology, semiconductor memory has become cheaper, faster and more
efficient in terms of power dissipation.
Memory-based FIR filters are consequently gaining substantial popularity in the DSP environment.
These filters offer high throughput and reduced latency, since the memory-access time is usually
much shorter than the multiplication time. They have much lower dynamic power consumption because
obtaining the output product/inner-product values by memory read operations involves minimal
switching activity. There are two types of memory based FIR filters. One of the techniques is the
direct memory-based implementation of FIR filters [10], while the other is based on distributed
arithmetic (DA).

III.1. Direct-Memory-Based FIR Filters
In the direct-memory-based implementations [10], the multiplications of input values with the
fixed coefficients can be replaced by a ROM or look-up table (LUT) which contains the pre-computed
product values for all possible values of the input samples. Let X be an input word to be
multiplied with a W-bit fixed coefficient C. If X is assumed to be an unsigned binary number of
word length N, there are $2^N$ possible values of X, and hence there are $2^N$ possible values of
the product Y = C*X. Therefore a direct memory based implementation of the multiplication requires
a memory unit of $2^N$ words to be used as a LUT consisting of the pre-computed product values
corresponding to all possible values of X, as shown in Fig. 1. The product $C*X_i$ is stored at
the memory location whose address is the same as the binary value of $X_i$ for
$0 \le X_i \le 2^N - 1$, such that if the N-bit binary value of $X_i$ is used as the address for
the memory unit, then the corresponding product value is read out from the memory. However, the
size of the ROM increases exponentially with the input word length.
[Fig. 1 shows an N-bit input X addressing a ROM with $2^N$ words that outputs the (N+W)-bit product Y = C*X.]
Fig. 1. Structure of Direct-memory-based multiplier
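The direct-memory-based multiplier of Fig. 1 can be illustrated with a minimal software sketch. It assumes an unsigned N-bit input and an integer coefficient; the function names `build_product_lut` and `lut_multiply` and the values N = 4, C = 13 are hypothetical and chosen only for illustration.

```python
def build_product_lut(coeff: int, n_bits: int) -> list[int]:
    """Precompute coeff * x for every unsigned n_bits-wide input x (2**n_bits words)."""
    return [coeff * x for x in range(1 << n_bits)]


def lut_multiply(lut: list[int], x: int) -> int:
    """Multiply by the fixed coefficient with a single memory read: x is the address."""
    return lut[x]


# Illustrative values (assumptions, not taken from the paper).
N_BITS = 4            # input word length N
C = 13                # fixed coefficient, W = 4 bits
W_BITS = C.bit_length()

lut = build_product_lut(C, N_BITS)
product = lut_multiply(lut, 11)   # same value as 13 * 11
```

The table holds 2^N words of at most N+W bits each, so every extra input bit doubles the memory, which is the exponential growth noted above.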
Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved International Review on Computers and Software, Vol. 8, N. 7
A direct implementation of equation (1) requires N multiplications, where N represents the tap
length. Each of the multipliers, which multiply the input values with the fixed coefficients, can
be replaced by a ROM or LUT, where each of the LUTs contains the pre-computed product values for
all possible values of the input samples.
A systolic system consists of a set of interconnected cells, each capable of performing some
simple operation [2], [11].
Systolic designs are very efficient for hardware implementation of computation-intensive DSP
applications because of features like simplicity, regularity and modularity of structure.
They also produce a high throughput rate by using pipelining or parallel processing or both. The
systolic array for an FIR filter of order N is shown in Fig. 2. It consists of N processing
elements (PEs), where each PE performs one MAC operation during a cycle period. Several algorithms
and architectures have been suggested for systolization of FIR filters [12], [13].

of the LUT output to obtain the desired result. DA-based computation is well suited for FPGA
realization, because the LUT as well as the shift-add operations can be efficiently mapped to the
LUT-based FPGA logic structures.
DA is a bit-serial operation that implements a series of fixed-point MAC operations in a fixed
number of steps, regardless of the number of terms to be calculated. DA is often preferred since
it eliminates the need for hardware multipliers and is capable of implementing large filters with
very high throughput. Croisier et al. proposed the DA algorithm for digital filter implementations
in 1973 [23]. The first detailed discussion of DA was given by Abraham Peled and Bede Liu in 1974
at the Arden House Workshop on Digital Signal Processing [24]. S. A. White [25] discussed an
organization to form the inner product of a pair of data vectors, gave a criterion for minimizing
the ROM size, and made modifications to increase the speed by employing techniques such as bit
pairing or partitioning the input words into the most significant half and the least significant
half, thereby introducing parallelism in the computation.
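The bit-serial character of DA described above can be sketched in a few lines. This is a simplified model under stated assumptions: the input samples are unsigned integers of `word_bits` bits, the bits are processed LSB first, and the names `build_da_lut` and `da_inner_product` are hypothetical. It is not the architecture of any cited reference.

```python
def build_da_lut(coeffs: list[int]) -> list[int]:
    """LUT with 2**N words: word `addr` holds the sum of the coefficients
    whose index k has bit k set in `addr`."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs: list[int], samples: list[int], word_bits: int) -> int:
    """Inner product sum(c_k * x_k) using only LUT reads, shifts and adds.

    Assumes unsigned samples smaller than 2**word_bits (an assumption of this sketch).
    """
    lut = build_da_lut(coeffs)
    acc = 0
    for j in range(word_bits):                 # one step per input bit, LSB first
        addr = 0
        for k, x in enumerate(samples):        # bit j of every sample forms the address
            addr |= ((x >> j) & 1) << k
        acc += lut[addr] << j                  # weight the LUT word by 2**j
    return acc


coeffs = [3, -1, 4, 2]      # illustrative 4-tap coefficients
samples = [5, 9, 0, 15]     # illustrative 4-bit unsigned samples
result = da_inner_product(coeffs, samples, word_bits=4)
```

The loop runs `word_bits` times whatever the number of taps, which matches the fixed number of steps mentioned in the text; the cost of more taps appears in the LUT size instead.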
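$$x_i = \frac{1}{2}\left[ -\left(x_{i0} - \bar{x}_{i0}\right) + \sum_{j=1}^{B-1} \left(x_{ij} - \bar{x}_{ij}\right) 2^{-j} - 2^{-(B-1)} \right] \qquad (9)$$
Define $d_{ij}$:
$$d_{ij} = \begin{cases} x_{ij} - \bar{x}_{ij}, & j \neq 0 \\ -\left(x_{i0} - \bar{x}_{i0}\right), & j = 0 \end{cases} \qquad (10)$$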
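The offset-binary form in (9) and (10) can be checked numerically. The sketch below assumes B-bit two's complement fractions, x = -x_0 + sum_{j>=1} x_j 2^{-j}, with bit 0 as the sign bit, so that each d_j takes only the values -1 or +1; the helper names are hypothetical.

```python
from fractions import Fraction


def twos_complement_bits(value: int, word_bits: int) -> list[int]:
    """Bits x_0 (sign bit) .. x_{B-1} of a signed integer in B-bit two's complement."""
    raw = value & ((1 << word_bits) - 1)
    return [(raw >> (word_bits - 1 - j)) & 1 for j in range(word_bits)]


def obc_value(bits: list[int]) -> Fraction:
    """Evaluate x = (1/2) * [ d_0 + sum_{j>=1} d_j 2^-j - 2^-(B-1) ]."""
    b = len(bits)
    d0 = 1 - 2 * bits[0]                       # -(x_0 - not x_0)
    d = [2 * bit - 1 for bit in bits[1:]]      # x_j - not x_j for j >= 1
    total = Fraction(d0)
    for j, dj in enumerate(d, start=1):
        total += Fraction(dj, 1 << j)
    total -= Fraction(1, 1 << (b - 1))
    return total / 2


B = 4
# Every 4-bit value v represents the fraction v / 2**(B-1).
matches = all(
    obc_value(twos_complement_bits(v, B)) == Fraction(v, 1 << (B - 1))
    for v in range(-(1 << (B - 1)), 1 << (B - 1))
)
```

Because every d_j is either -1 or +1, the stored sums come in pairs that differ only in sign, which is the symmetry that allows the ROM to be halved.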
TABLE II
REDUCED SIZE ROM (2^(N-1)) WITH DA-OBC CODING
FOR 4-TAP (N = 4) FIR FILTER
b2  b1  b0  |  Contents of ROM
0   0   0   |  -(C3 + C2 + C1 + C0)/2
0   0   1   |  -(C3 + C2 + C1 - C0)/2
0   1   0   |  -(C3 + C2 - C1 + C0)/2
0   1   1   |  -(C3 + C2 - C1 - C0)/2
1   0   0   |  -(C3 - C2 + C1 + C0)/2
1   0   1   |  -(C3 - C2 + C1 - C0)/2
1   1   0   |  -(C3 - C2 - C1 + C0)/2
1   1   1   |  -(C3 - C2 - C1 - C0)/2
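The ROM contents of Table II can be generated programmatically. The sketch below is an illustration that assumes the sign convention read off the table (an address bit of 1 negates the corresponding coefficient, and the last coefficient always enters with a plus sign); the function name `obc_rom` and the coefficient values are hypothetical.

```python
from fractions import Fraction


def obc_rom(coeffs: list[int]) -> list[Fraction]:
    """ROM words for DA-OBC, coeffs = [C0, C1, ..., C_{N-1}].

    Word at address (b_{N-2} ... b1 b0) is
    -(C_{N-1} +/- C_{N-2} ... +/- C0) / 2, with '-' where the address bit is 1.
    """
    n = len(coeffs)
    rom = []
    for addr in range(1 << (n - 1)):           # 2**(N-1) words instead of 2**N
        total = coeffs[n - 1]
        for k in range(n - 1):
            sign = -1 if (addr >> k) & 1 else 1
            total += sign * coeffs[k]
        rom.append(Fraction(-total, 2))
    return rom


# Illustrative coefficients C0..C3 (assumed values, not from the paper).
rom = obc_rom([1, 2, 4, 8])
```

With these values, address 000 holds -(8 + 4 + 2 + 1)/2 and address 111 holds -(8 - 4 - 2 - 1)/2, following the first and last rows of the table.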
Fig. 5. Block diagram of the LUT-less DA-OBC (DA-MOBC)
for a 4-tap FIR filter
TABLE III
REDUCED SIZE ROM (2^(N-2)) WITH DA-MOBC CODING
FOR 4-TAP (N = 4) FIR FILTER
b2  b1  b0  |  Contents of ROM
0   0   0   |  -(C2 + C1 + C0)/2
0   0   1   |  -(C2 + C1 - C0)/2
0   1   0   |  -(C2 - C1 + C0)/2
0   1   1   |  -(C2 - C1 - C0)/2
TABLE IV
COMPARISON OF VARIOUS ARCHITECTURES FOR A 4-TAP FILTER (N = 4). THE SHIFT REGISTER AND THE ADDER/SHIFTER UNITS ARE NOT
CONSIDERED SINCE THEY ARE COMMON TO ALL STRUCTURES. BC REPRESENTS THE COEFFICIENT WORD LENGTH.
Logic Functions  |  LUT-based DA (conventional DA)  |  DA-OBC        |  DA-MOBC              |  LUT-less Architecture of Yoo & Anderson  |  On-Line DA-LUT Architecture
ROM Size         |  2^N x BC                        |  2^(N-1) x BC  |  (2^(N-2) to 2) x BC  |  0                                        |  0
XOR gates        |  0                               |  N             |  N-1                  |  0                                        |  0
2x1 MUX          |  0                               |  BC            |  BC                   |  N x BC                                   |  0
Adders           |  0                               |  0             |  0                    |  N-1 x BC                                 |  N-1 CLA's
Tristate Buffer  |  0                               |  0             |  0                    |  0                                        |  N
Adder/Sub        |  0                               |  0             |  N x BC               |  0                                        |  0
implementation of FIR filter by systolic decomposition of distributed arithmetic based
inner-product computation [34].
A linear array consisting of a number of processing elements (PEs) and an output cell is shown in
Fig. 10. Each PE consists of a ROM of $2^M$ words. Each PE reads the content of its ROM at the
location specified by the input bit vector during a cycle period. The value read from the ROM is
then added to the input available to the PE from its left. During every cycle period, the sum is
then transferred as output to its right, as shown in Figs. 11. Each output cell contains a shift
register and an adder. It shifts the content of its register left by one position and then adds
the available input to the recently shifted content in its register during every cycle period.
For high-throughput implementation of FIR filters, a two-dimensional systolic array is used, as
shown in Figs. 12.
FPGA realizations of FIR filters for high-speed and medium-speed operation using modified
distributed arithmetic architectures were suggested by Jiafeng Xie et al., which made use of
pipelined registers and a pipelined shift-adder tree [35].
A new hardware architecture using conjugate distributed arithmetic (CDA) for high-throughput
hardware implementations of LMS adaptive filters is presented, where all possible combination sums
of the input signal samples are stored in the LUT and updated at the arrival of every sample using
an efficient update procedure [36], [38].
Fig. 10. Linear 1-D systolic array for DA-based implementation of FIR filter
Figs. 12. (a) 2-D systolic array for FIR filter; (b) function of PE; and (c) function of Shift Adder (SA) cell
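The behaviour of the linear array (PEs with $2^M$-word ROMs feeding a shift-and-add output cell) can be modelled functionally as follows. This is a sketch under several assumptions that are not stated in the source: the samples are unsigned, the bits are fed MSB first so that a left shift in the output cell gives the correct weights, and pipeline latency between PEs is ignored. All names are hypothetical.

```python
def build_pe_rom(coeff_group: list[int]) -> list[int]:
    """ROM of one PE: 2**M words, one partial sum per M-bit address."""
    m = len(coeff_group)
    return [
        sum(c for k, c in enumerate(coeff_group) if (addr >> k) & 1)
        for addr in range(1 << m)
    ]


def systolic_da_output(coeffs: list[int], samples: list[int],
                       word_bits: int, m: int) -> int:
    """Functional model of a 1-D array of N/M PEs plus one output cell."""
    assert len(coeffs) == len(samples) and len(coeffs) % m == 0
    roms = [build_pe_rom(coeffs[i:i + m]) for i in range(0, len(coeffs), m)]
    register = 0                                   # output-cell shift register
    for j in reversed(range(word_bits)):           # MSB first (assumption)
        partial = 0                                # input to the leftmost PE
        for p, rom in enumerate(roms):
            group = samples[p * m:(p + 1) * m]
            addr = sum(((x >> j) & 1) << k for k, x in enumerate(group))
            partial += rom[addr]                   # add ROM word, pass to the right
        register = (register << 1) + partial       # shift left by one, then add
    return register


coeffs = [3, -1, 4, 2]      # illustrative coefficients
samples = [5, 9, 0, 15]     # illustrative 4-bit unsigned samples
y = systolic_da_output(coeffs, samples, word_bits=4, m=2)
```

Splitting N taps into groups of M replaces one ROM of $2^N$ words by N/M ROMs of $2^M$ words each, at the cost of the extra additions along the array.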
Authors’ information
K. G. Shanthi (Corresponding author) completed her B.E. in 1996 at Madras University, Chennai,
and obtained her M.E. in 2005 from the Government College of Technology, Coimbatore. Her major in
the PG course was VLSI Design. Her fields of interest include the design of FPGA-based VLSI
architectures and VLSI signal processing. She is currently working as Associate Professor at
R.M.K Engineering College, Chennai, and is pursuing her research in the field of VLSI Design.
Address: Associate Professor, Department of Electronics & Communication Engg, R.M.K Engineering
College, Chennai, Tamilnadu, India. Pin code: 601 206.
E-mail: [email protected]