International Review on Computers and Software (I.RE.CO.S.), Vol. 8, N. 7
ISSN 1828-6003, July 2013

Memory Based Hardware Efficient Implementation of FIR Filters

K. G. Shanthi, N. Nagarajan

Abstract – Finite impulse response (FIR) digital filters are key components of many digital
signal processing (DSP) systems because of their linear phase, stability, fewer finite-precision
errors and regular structure. Real-time realization of FIR filters with lower hardware
requirements and lower latency has become critical with the continuing development of very
large scale integration (VLSI) technology. The objective of this paper is to explore the current
trends in the development of algorithms and architectures for memory-based realization of FIR
filters, which are mainly concerned with reducing the overall area-delay-power complexity. The
purpose of this study is to compare these architectures based on ROM size, delay and throughput.
The results presented here should assist researchers in the field of digital signal processing in
selecting the best architecture for an application based on its requirements. New algorithms and
architectures still need to be developed to design area-delay-power-efficient FIR filters for
various demanding DSP applications. Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved.

Keywords: Finite Impulse Response Filter, Field Programmable Gate Arrays (FPGA), Application
Specific Integrated Circuit (ASIC), Distributed Arithmetic (DA), Lookup Table (LUT)

Manuscript received and revised June 2013, accepted July 2013.

Nomenclature

y[n]   FIR filter output
N      Order of the filter
Ci     Constant coefficients
Xi     Input data
B      Input word length

I. Introduction

Digital signal processing (DSP) plays a vital role in the significant advancements of digital technology currently taking place around the world. Digital communication, speech and image data compression, speech recognition, spectral estimation and analysis, adaptive filtering, wired and wireless communication, multimedia systems, biomedical instrumentation, satellite and aerospace control, and remote sensing are the major areas where DSP has had a major impact [1].

The increased daily use of digital technology has led to the development of improved algorithms and architectures for designing DSP systems with lower power dissipation, higher speed and lower area complexity. Several architectural solutions have been proposed to minimize the arithmetic complexity of the algorithms in order to reduce the overall area-delay-power complexity [2]. The finite impulse response (FIR) filter is a basic tool in many DSP applications.

Digital filters are used to modify signal characteristics in the time or frequency domain and appear in many DSP systems to perform signal preconditioning, anti-aliasing, band selection, interpolation, low-pass filtering, etc. [1].

Traditionally, design methods were mainly focused on multiplier-based architectures to implement the Multiply-and-Accumulate (MAC) blocks that constitute the central piece of FIR filters and several other DSP functions. These multipliers consume most of the resources of the system and also account for most of the computation time. The number of multiply-and-accumulate operations required per filter output increases with the filter order, so real-time implementation of these filters is a challenging task.

A discrete-time linear finite impulse response (FIR) filter generates the output y[n] as a sum of delayed and scaled input samples x[n]. An N-tap FIR digital filter is represented as:

$$y[n] = \sum_{i=0}^{N-1} c[i]\, x[n-i] \tag{1}$$

where y[n] is the FIR filter output, c[i] represents the filter coefficients, x[n-i] is the input data and n is the time index starting from 0. A direct implementation of Eq. (1) requires N Multiply-and-Accumulate blocks, which is expensive in terms of area and speed.

To resolve this problem, many multiplier-less architectures have been proposed in recent years. They are broadly classified into two basic categories according to how they manipulate the filter coefficients for the multiply operation. The first type of multiplier-less technique is the conversion-based approach and the second type is the memory-based implementation approach.

For the past decade, there has been a growing trend to implement DSP functions in Field Programmable Gate Arrays (FPGAs) rather than on
application-specific integrated circuits (ASICs) and DSP chips.

Implementation on ASICs is not preferred because of high development costs and time-to-market factors. The sequential-execution architecture of programmable DSP processors prevents them from achieving the desired performance. In this context, the FPGA platform provides an attractive solution that balances high flexibility, including the option to reconfigure, with time-to-market, cost and performance [3].

This paper is organized as follows. Section II gives a brief overview of conversion-based multiplier-less FIR filters. Section III explores the algorithmic aspects and architectural approaches of memory-based FIR filters and gives an in-depth review of FIR filters based on DA. Finally, the conclusion is presented in Section IV.

II. Conversion-Based Multiplier-Less Implementation of FIR Filters

In this approach the coefficients are transformed to other numeric representations so that the multiplications are implemented with adders/subtractors and shifters. A coefficient in "n-bit" signed-digit representation can be written as:

$$C = \sum_{i=0}^{n-1} b_i\, 2^{i} \tag{2}$$

where $b_i$ is taken from the set {-1, 0, 1}.

The representation that has the minimum number of non-zero digits and no consecutive non-zero digits is known as the canonic signed-digit (CSD) representation [2]. Since non-zero digits represent additions (or subtractions) in shift-and-add multiplication, CSD requires significantly fewer adders than binary representations. Multipliers [4] in a filter whose coefficients are expressed in canonic signed-digit code are realized with wired shifters, adders and subtractors.

Common subexpression elimination (CSE) is a numerical transformation of the constant multiplications that can lead to efficient hardware implementations in terms of area, power and speed [5]-[8]. Subexpression elimination can only be performed on constant multiplications that operate on a common variable. It is the process of examining the shift-and-add implementations of constant multiplications and finding the redundant operations. Once the redundancies are found, these operations can be performed once and shared among the constant multiplications, so that the number of adders and shifters needed for implementation is minimized. CSE techniques attempt to minimize the number of additions in the multiplier block by reusing terms. These terms can be canonic signed digit (CSD) [5], minimal signed digit (MSD), or all signed digit (ASD) [7].

Malcolm D. Macleod and Andrew G. Dempster, in "Multiplierless FIR Filter Design Algorithms", introduced a new CSE algorithm that searches a bounded number of minimal signed digit (MSD) representations [8]. Douglas L. Maskell, Jussipekka Leiwo and Jagdish C. Patra [9] reduced both the coefficient word length and the number of non-zero bits in the filter coefficients so that the number of adder steps is minimized, which reduced the hardware complexity of linear-phase FIR digital filters.

III. Algorithms and Architectures for Memory Based FIR Filters

The memory-based approach involves the use of memories (RAMs, ROMs) or Look-Up Tables (LUTs) that store pre-computed values which can be read out in place of a multiplication operation. With the advancements in VLSI technology, semiconductor memory has become cheaper, faster and more efficient in terms of power dissipation. Memory-based FIR filters are consequently gaining substantial popularity in the DSP environment.

These filters achieve high throughput and reduced latency, since the memory-access time is usually much shorter than the multiplication time. They have much lower dynamic power consumption because of the minimal switching activity involved in obtaining the output product or inner-product values by memory read operations. There are two types of memory-based FIR filters. One is the direct memory-based implementation of FIR filters [10], while the other is based on distributed arithmetic (DA).

III.1. Direct-Memory-Based FIR Filters

In direct-memory-based implementations [10], the multiplications of input values with the fixed coefficients are replaced by a ROM or look-up table (LUT) which contains the pre-computed product values for all possible values of the input samples. Let X be an input word to be multiplied by a W-bit fixed coefficient C. If X is assumed to be an unsigned binary number of word length N, there are $2^N$ possible values of X, and hence $2^N$ possible values of the product Y = C*X. Therefore, direct memory-based implementation of the multiplication requires a memory unit of $2^N$ words to be used as a LUT holding the pre-computed product values corresponding to all possible values of X, as shown in Fig. 1. The product C*Xi is stored at the memory location whose address is the binary value of Xi, for $0 \le X_i \le 2^N - 1$, so that when the N-bit binary value of Xi is used as the address for the memory unit, the corresponding product value is read out from the memory. However, the size of the ROM increases exponentially with the input word length.

[Figure: input X (N bits) addresses a ROM with 2^N words, which outputs Y = C*X (N+W bits)]
Fig. 1. Structure of direct-memory-based multiplier
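
The table look-up described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of [10]; the helper names `build_product_lut` and `lut_multiply`, the coefficient value 13 and the 4-bit input width are assumptions chosen only for the example.

```python
def build_product_lut(c, n_bits):
    """Precompute C*X for every unsigned n_bits-wide input X (2**n_bits words)."""
    return [c * x for x in range(1 << n_bits)]


def lut_multiply(lut, x):
    """'Multiply' by reading the word stored at address X; no multiplier is used."""
    return lut[x]


# Hypothetical example values: coefficient C = 13, 4-bit unsigned inputs.
lut = build_product_lut(c=13, n_bits=4)
products = [lut_multiply(lut, x) for x in (0, 5, 15)]
```

The list plays the role of the ROM in Fig. 1: its length doubles with every extra input bit, which is the exponential growth noted above.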


A direct implementation of Eq. (1) requires N multiplications, where N is the tap length. Each multiplier, which multiplies input values by a fixed coefficient, can be replaced by a ROM or LUT containing the pre-computed product values for all possible values of the input samples.

A systolic system consists of a set of interconnected cells, each capable of performing some simple operation [2], [11]. Systolic designs are very efficient for hardware implementation of computation-intensive DSP applications because of features such as simplicity, regularity and modularity of structure. They also achieve a high throughput rate by using pipelining, parallel processing or both. The systolic array for an FIR filter of order N is shown in Fig. 2. It consists of N processing elements (PEs), where each PE performs one MAC operation during a cycle period. Several algorithms and architectures have been suggested for systolization of FIR filters [12], [13].

Fig. 2. Structure of a linear systolic array for an N-tap FIR filter

The average computation time and the latency of direct-memory-based implementation are high for large transform lengths. Several novel algorithms have therefore been proposed in the last few years to decompose the sinusoidal transforms into multiple circular convolutions or convolution-like structures of smaller convolution length [14]–[18]. These decompositions have improved throughput while substantially reducing hardware and computational latency. A concurrent recursive algorithm is derived for the computation of FIR filters and is further mapped to a two-dimensional systolic structure for reduced-latency direct-ROM-based realization of large-order filters [19].

A new approach to LUT design, referred to as the odd-multiple-storage (OMS) scheme, is presented in [20], where only the odd multiples of the fixed coefficient need to be stored, so that the memory size is halved at the cost of some increase in combinational circuit complexity. With the antisymmetric product coding (APC) approach, the LUT size can also be halved by recoding the product words as antisymmetric pairs [21]. Two new approaches are suggested for designing the LUT for LUT-multiplier-based implementation, where the memory size is reduced to nearly half of that of the conventional approach [22].

III.2. FIR Filters Based on Distributed Arithmetic (DA)

The main operations required for DA-based computation of an inner product are a sequence of look-up table accesses followed by shift-accumulation operations on the LUT output to obtain the desired result. DA-based computation is well suited for FPGA realization, because the LUT as well as the shift-add operations can be efficiently mapped to the LUT-based FPGA logic structures.

DA is a bit-serial operation that implements a series of fixed-point MAC operations in a fixed number of steps, regardless of the number of terms to be calculated. DA is often preferred since it eliminates the need for hardware multipliers and is capable of implementing large filters with very high throughput. Croisier et al. proposed the DA algorithm for digital filter implementations in 1973 [23]. The first detailed discussion of DA was given by Abraham Peled and Bede Liu in 1974 at the Arden House Workshop on Digital Signal Processing [24]. S. A. White [25] discussed an organization to form the inner product of a pair of data vectors, gave a criterion for minimizing the ROM size, and made modifications to increase the speed by employing techniques such as bit pairing or partitioning the input words into the most significant half and the least significant half, thereby introducing parallelism in the computation.

III.2.1. Conventional DA approach

Consider the inner product of two N-point vectors C and X given by:

$$y[n] = \sum_{i=0}^{N-1} c_i\, x_i \tag{3}$$

where $c_i$ represents the constant coefficients and $x_i$ is the input data, which may change from time to time. Let each input sample be coded as a B-bit 2's complement binary number such that $|x_i| < 1$. The input sample is given by:

$$x_i = -x_{i0} + \sum_{j=1}^{B-1} x_{ij}\, 2^{-j} \tag{4}$$

where $x_{ij} \in \{0, 1\}$, $x_{i0}$ is the sign bit and $x_{i,B-1}$ is the least significant bit (LSB). Substituting (4) in (3), the output can be expressed as:

$$y[n] = \sum_{i=0}^{N-1} c_i \left[ -x_{i0} + \sum_{j=1}^{B-1} x_{ij}\, 2^{-j} \right] \tag{5}$$

$$y[n] = -\left( \sum_{i=0}^{N-1} c_i\, x_{i0} \right) + \sum_{j=1}^{B-1} \left( \sum_{i=0}^{N-1} c_i\, x_{ij} \right) 2^{-j} \tag{6}$$

For a given set of $c_i$ (i = 0, 1, 2, …, N − 1), the terms in the brackets may take one of $2^N$ possible values that can be precomputed and stored in a LUT. Any of the $2^N$ possible combination sums of the $c_i$ can be read out from the ROM using the N-bit sequence {$x_{ij}$ for 0 ≤ i ≤ N − 1} as address bits. These intermediate results are accumulated in B clock cycles to produce one filter output y[n].


Fig. 3. LUT-based DA implementation of a 4-tap (N = 4) FIR filter

The original LUT-based DA implementation of a 4-tap (N = 4) FIR filter consists of three units: the shift register unit, the DA base unit, and the adder/shifter unit. The LUT contains all 16 possible combination sums of the filter weights C0, C1, C2, C3. The bank of shift registers in Fig. 3 stores four consecutive input samples (x[n-i], i = 0, 1, 2, 3). The concatenation of the rightmost bits of the shift registers becomes the address of the LUT. The shift registers are shifted right at every clock cycle. The corresponding LUT entries are shifted and accumulated B consecutive times to generate the output y[n]. The sign bits {$x_{i0}$} are the last bits to arrive. The clock period in which all the sign bits arrive simultaneously is called the "sign-bit time". During the sign-bit time the control signal S = 1; otherwise S = 0.

The time complexity of FIR filters based on distributed arithmetic is independent of the transform size or the number of filter taps and depends only on the word length, whereas the time complexity of direct-memory-based FIR filters is independent of the word length but increases linearly with the transform size.

III.2.2. Distributed Arithmetic with Offset Binary Coding

The memory requirement ($2^N$ words) of the DA-based implementation of an FIR filter increases exponentially with the filter order N. With the use of offset binary coding (OBC), the memory size can be reduced by half to $2^{N-1}$ words [2], [25]. In offset binary coding the input bits are interpreted as -1 for 0 and +1 for 1. Let the input sample $x_i$ be written as:

$$x_i = \frac{1}{2}\left[ x_i - (-x_i) \right] \tag{7}$$

In 2's-complement notation the negative of Eq. (4) is written as:

$$-x_i = -\bar{x}_{i0} + \sum_{j=1}^{B-1} \bar{x}_{ij}\, 2^{-j} + 2^{-(B-1)} \tag{8}$$

where the overscore indicates the complement of a bit. From Eqs. (4) and (8), Eq. (7) can be rewritten as:

$$x_i = \frac{1}{2}\left[ -\left(x_{i0} - \bar{x}_{i0}\right) + \sum_{j=1}^{B-1} \left(x_{ij} - \bar{x}_{ij}\right) 2^{-j} - 2^{-(B-1)} \right] \tag{9}$$

Define $d_{ij}$:

$$d_{ij} = \begin{cases} x_{ij} - \bar{x}_{ij}, & j \neq 0 \\ -\left(x_{i0} - \bar{x}_{i0}\right), & j = 0 \end{cases} \tag{10}$$

where $d_{ij} \in \{-1, 1\}$. Eq. (9) can be rewritten as:

$$x_i = \frac{1}{2}\left[ \sum_{j=0}^{B-1} d_{ij}\, 2^{-j} - 2^{-(B-1)} \right] \tag{11}$$

Using Eq. (11) in Eq. (3):

$$y[n] = \sum_{i=0}^{N-1} \frac{1}{2}\, c_i \left[ \sum_{j=0}^{B-1} d_{ij}\, 2^{-j} - 2^{-(B-1)} \right] \tag{12}$$

$$y[n] = \sum_{j=0}^{B-1} \left( \sum_{i=0}^{N-1} \frac{1}{2}\, c_i\, d_{ij} \right) 2^{-j} - \left( \sum_{i=0}^{N-1} \frac{1}{2}\, c_i \right) 2^{-(B-1)} \tag{13}$$

$$y[n] = \sum_{j=0}^{B-1} D_j\, 2^{-j} + D_{\text{initial}}\, 2^{-(B-1)} \tag{14}$$

where $D_j = \sum_{i=0}^{N-1} \frac{1}{2}\, c_i\, d_{ij}$ and $D_{\text{initial}} = -\frac{1}{2} \sum_{i=0}^{N-1} c_i$.

The OBC scheme is characterized by Eq. (14). Table I shows the content of the ROM for N = 4. From Table I, notice that the upper-half and the lower-half ROM values are mirrored with the sign reversed. Therefore it is possible to reduce the ROM size by a factor of 2, as shown in Table II. Fig. 4 shows a typical architecture for DA-OBC based implementation of a 4-tap (N = 4) FIR filter. The XOR gates are used for address decoding; the MUX with the constant $D_{\text{initial}}$ provides the initial value to the shift accumulator. In Fig. 4, two control signals S1 and S2 are required, where S1 is 1 when j = 0 and 0 otherwise, and S2 is 1 when j = B-1 and 0 otherwise.

TABLE I
CONTENT OF THE ROM WITH DA-OBC
b3 b2 b1 b0   Contents of ROM
0  0  0  0    -(C3 + C2 + C1 + C0)/2
0  0  0  1    -(C3 + C2 + C1 - C0)/2
0  0  1  0    -(C3 + C2 - C1 + C0)/2
0  0  1  1    -(C3 + C2 - C1 - C0)/2
0  1  0  0    -(C3 - C2 + C1 + C0)/2
0  1  0  1    -(C3 - C2 + C1 - C0)/2
0  1  1  0    -(C3 - C2 - C1 + C0)/2
0  1  1  1    -(C3 - C2 - C1 - C0)/2
1  0  0  0     (C3 - C2 - C1 - C0)/2
1  0  0  1     (C3 - C2 - C1 + C0)/2
1  0  1  0     (C3 - C2 + C1 - C0)/2
1  0  1  1     (C3 - C2 + C1 + C0)/2
1  1  0  0     (C3 + C2 - C1 - C0)/2
1  1  0  1     (C3 + C2 - C1 + C0)/2
1  1  1  0     (C3 + C2 + C1 - C0)/2
1  1  1  1     (C3 + C2 + C1 + C0)/2
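
The offset-binary formulation can be checked numerically. The following sketch builds the full OBC ROM and compares the OBC sum against a directly computed inner product for every bit pattern of a small case. The names `obc_rom`, `da_obc` and `value`, the 2-tap coefficients and B = 3 are assumptions made for the example, and the offset term is taken with the word length B, as in Eq. (4).

```python
from fractions import Fraction
from itertools import product


def obc_rom(c):
    """Full 2**N-word OBC ROM; address bit i set means d_i = +1, cleared means d_i = -1."""
    N = len(c)
    rom = []
    for a in range(1 << N):
        d = [1 if (a >> i) & 1 else -1 for i in range(N)]
        rom.append(sum(Fraction(ci) * di for ci, di in zip(c, d)) / 2)
    return rom


def value(b):
    """Two's-complement fraction -b[0] + sum_j b[j] * 2^-j."""
    return -b[0] + sum(Fraction(bj, 1 << j) for j, bj in enumerate(b) if j > 0)


def da_obc(c, bits, B):
    """OBC inner product; bits[i][j] is bit j of sample i, with j = 0 the sign bit."""
    rom = obc_rom(c)
    N = len(c)
    # Offset term: D_initial * 2^-(B-1), with D_initial = -(1/2) * sum(c).
    y = -sum(Fraction(ci) for ci in c) / 2 * Fraction(1, 1 << (B - 1))
    for j in range(B):
        # For the sign column d = -(x - ~x), so the address bits are complemented.
        col = [bits[i][j] ^ (1 if j == 0 else 0) for i in range(N)]
        addr = sum(b << i for i, b in enumerate(col))
        y += rom[addr] * Fraction(1, 1 << j)
    return y


# Exhaustive check for a hypothetical 2-tap filter with 3-bit samples.
c = [3, -5]
B = 3
ok = all(
    da_obc(c, [p[:B], p[B:]], B)
    == sum(ci * value(b) for ci, b in zip(c, [p[:B], p[B:]]))
    for p in product((0, 1), repeat=2 * B)
)
```

With this convention the word at address a is the negative of the word at the bit-complemented address, which is the mirror symmetry visible in Table I.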


TABLE II
REDUCED SIZE ROM (2^(N-1) WORDS) WITH DA-OBC CODING FOR 4-TAP (N = 4) FIR FILTER
b2 b1 b0   Contents of ROM
0  0  0    -(C3 + C2 + C1 + C0)/2
0  0  1    -(C3 + C2 + C1 - C0)/2
0  1  0    -(C3 + C2 - C1 + C0)/2
0  1  1    -(C3 + C2 - C1 - C0)/2
1  0  0    -(C3 - C2 + C1 + C0)/2
1  0  1    -(C3 - C2 + C1 - C0)/2
1  1  0    -(C3 - C2 - C1 + C0)/2
1  1  1    -(C3 - C2 - C1 - C0)/2
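
One way to read the half-size ROM is sketched below: the most significant address bit selects whether the remaining bits are complemented and the word is negated. This is an assumption-laden sketch, not the exact circuit of Fig. 4; the names `reduced_obc_rom` and `read_obc` and the coefficient values are hypothetical, and the MSB is taken to be the bit associated with C3, as in the tables.

```python
from fractions import Fraction


def reduced_obc_rom(c):
    """Half-size ROM: words for the addresses whose MSB (the bit of c[N-1]) is 0."""
    N = len(c)
    words = []
    for a in range(1 << (N - 1)):
        # Low address bits choose +/- for c[0..N-2]; c[N-1] always enters with -1.
        d = [1 if (a >> i) & 1 else -1 for i in range(N - 1)] + [-1]
        words.append(Fraction(sum(ci * di for ci, di in zip(c, d)), 2))
    return words


def read_obc(words, addr, N):
    """Recover the full-ROM word for an N-bit address from the half-size ROM."""
    mask = (1 << (N - 1)) - 1
    msb = (addr >> (N - 1)) & 1
    low = addr & mask
    if msb:
        # XOR the remaining address bits with the MSB and negate the word read.
        return -words[low ^ mask]
    return words[low]


# Hypothetical coefficients C0..C3.
c = [1, 2, 3, 4]
words = reduced_obc_rom(c)
full = [
    Fraction(sum(ci * (1 if (a >> i) & 1 else -1) for i, ci in enumerate(c)), 2)
    for a in range(16)
]
recovered = [read_obc(words, a, 4) for a in range(16)]
```

Here `recovered` reproduces all 16 full-ROM words from only 8 stored words, at the cost of the XOR stage and a conditional negation.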
Fig. 4. DA-OBC based implementation of a 4-tap (N = 4) FIR filter

III.2.3. Distributed Arithmetic with Modified Offset Binary Coding (DA-MOBC)

The DA-MOBC can reduce the LUT size from $2^{N-2}$ to as low as 2 by exploiting the observation that if a single term inside the LUT can be relocated outside the LUT, then the lower half of the LUT is a mirrored version of the upper half with only the signs reversed [26]. From Table II, it can be observed that the ROM values, except for the C3 term, are mirrored along the line between the 4th and the 5th rows. Apart from the C3 term, the LUT in Table II has only $2^{N-2}$ possible values depending on the input values. Table III illustrates the new ROM table. The LUT size reduction is achieved with the overhead of control circuits such as XOR gates, multiplexers (MUX) and full adders (FA). While the increase in the number of XOR gates is proportional to the input vector length B, the complexity of the other control circuits (MUX, FA) increases in proportion to the coefficient word length, as shown in Fig. 5.

TABLE III
REDUCED SIZE ROM (2^(N-2) WORDS) WITH DA-MOBC CODING FOR 4-TAP (N = 4) FIR FILTER
b2 b1 b0   Contents of ROM
0  0  0    -(C2 + C1 + C0)/2
0  0  1    -(C2 + C1 - C0)/2
0  1  0    -(C2 - C1 + C0)/2
0  1  1    -(C2 - C1 - C0)/2

Fig. 5. Block diagram of the LUT-less DA-OBC (DA-MOBC) for a 4-tap FIR filter

III.2.4. Distributed Arithmetic Based LUT-Less Architecture Proposed by Yoo and Anderson

A recursive LUT reduction applied to the original DA decreases the LUT size by half at every iteration, and eventually a LUT-less DA architecture can be achieved [27]. From Fig. 3, it can be observed that the lower half of the LUT (locations whose addresses have a 1 in the MSB) equals the sum of the upper half of the LUT (locations whose addresses have a 0 in the MSB) and the C3 term. Thus, the LUT size can be reduced by a factor of 2 with an additional 2x1 MUX and a full adder. After several iterations of the LUT reduction, the final LUT-less DA architecture for a 4-tap FIR filter is achieved, as shown in Fig. 6.

Fig. 6. LUT-less architecture for a 4-tap FIR filter proposed by Yoo and Anderson

III.2.5. On-Line DA-LUT Architecture for FIR Filters Proposed by Eshtawie and Othman

The tri-state buffer and the carry look-ahead adder (CLA) are the basic digital logic units used to construct the on-line DA-LUT architecture [28], as shown in Fig. 7. A filter coefficient passes to the CLA only if its buffer enable signal is 1. Only the needed location contents are calculated, whereas in the conventional DA technique the contents of locations that may never be used when processing the input signal are also computed.

Fig. 7. LUT-less architecture for a 4-tap FIR filter with tri-state buffers and CLA adders


TABLE IV
COMPARISON OF VARIOUS ARCHITECTURES FOR A 4-TAP FILTER (N = 4). THE SHIFT REGISTER AND THE ADDER/SHIFTER UNITS ARE NOT CONSIDERED SINCE THEY ARE COMMON TO ALL STRUCTURES. BC REPRESENTS THE COEFFICIENT WORD LENGTH.

Logic functions     LUT-based DA        DA-OBC         DA-MOBC               LUT-less architecture   On-line DA-LUT
                    (conventional DA)                                        of Yoo & Anderson       architecture
ROM size            2^N x BC            2^(N-1) x BC   (2^(N-2) to 2) x BC   0                       0
XOR gates           0                   N              N-1                   0                       0
2x1 MUX             0                   BC             BC                    N x BC                  0
Adders              0                   0              0                     (N-1) x BC              N-1 CLAs
Tri-state buffers   0                   0              0                     0                       N
Adder/Sub           0                   0              N x BC                0                       0

In the conventional DA technique, a location's content is fetched and added to the partial sum even if it is zero, whereas in the on-line LUT no addition takes place when the calculated content is zero. Hence the execution time for obtaining the filter output is shortened.
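
The idea of computing each word only when it is needed, and skipping zero words, can be sketched as follows. This is an illustrative software analogue under stated assumptions, not the circuit of [28]: the function name `online_da`, the use of B-bit two's-complement integers instead of fractions, and the example values are all hypothetical.

```python
def online_da(c, x, B):
    """Inner product of c with B-bit two's-complement integers x, without a stored LUT.

    Returns (result, accumulations), where accumulations counts the additions
    into the partial sum that were actually performed.
    """
    acc, accumulations = 0, 0
    for j in range(B):                                   # j = 0 is the LSB here
        column = [(xi >> j) & 1 for xi in x]             # buffer enables for this cycle
        # Only coefficients whose enable bit is 1 reach the adder.
        word = sum(ci for ci, en in zip(c, column) if en)
        if word == 0:
            continue                                     # nothing is added to the partial sum
        weight = -(1 << j) if j == B - 1 else (1 << j)   # the sign-bit column is subtracted
        acc += word * weight
        accumulations += 1
    return acc, accumulations


# Hypothetical 4-tap example with 4-bit inputs.
c = [3, -1, 2, 5]
x = [2, -4, 6, 0]
result, accumulations = online_da(c, x, B=4)
direct = sum(ci * xi for ci, xi in zip(c, x))
```

In this example the least significant bit column is all zeros, so only three of the four cycles update the partial sum.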

III.2.6. Memory Partitioning and Multiple Memory Bank Algorithms

The main drawback of the DA-based FIR filter is that, as the filter size increases, the memory size required by the implementation grows exponentially. Memory access time can become a bottleneck for the speed of the entire system when the ROM size is very large. A large LUT can be avoided by partitioning the circuit into smaller LUTs and combining their outputs with adders.

Several memory-partitioning and multiple-memory-bank approaches, along with flexible multi-bit data access mechanisms, have been presented for FIR filtering and inner-product computation in order to reduce the memory size of DA-based filters [10], [25], [29]-[32].

The N-tap filter is divided into m smaller filters, each having k input lines, such that N = m × k; it is assumed that N is not prime. The total number of clock cycles required for this implementation is B + log2(m); the additional second term is the number of clock cycles required by an adder tree to calculate the sum of the outputs from the m LUTs. The decrease in throughput with this implementation is very small compared with the large LUT required for a high-order filter. Hence Eq. (6) is rewritten as:

$$y[n] = -\sum_{z=0}^{m-1} \left( \sum_{i=zk}^{(z+1)k-1} c_i\, x_{i0} \right) + \sum_{j=1}^{B-1} \left[ \sum_{z=0}^{m-1} \left( \sum_{i=zk}^{(z+1)k-1} c_i\, x_{ij} \right) \right] 2^{-j} \tag{15}$$

For example, a 32-tap DA FIR filter would require a large LUT with $2^{32}$ entries. This problem can be overcome by breaking up the LUT into 8 smaller LUT units, each having 4 input lines. Hence a single large LUT with $2^{32}$ memory elements is replaced by 8 LUTs, each having only $2^4 = 16$ memory elements. Fig. 8 shows the implementation of a 4-tap FIR filter based on Eq. (15) for m = 2 and k = 2.

Fig. 8. Implementation of a 4-tap FIR filter using memory partitioning with m = k = 2

TABLE VI
COMPARISON OF VARIOUS REQUIREMENTS WITH AND WITHOUT MEMORY PARTITIONING
Variant                                                 No. of address bits   Memory size                   Clock cycles required
Without memory partitioning (full LUT implementation)   N                     2^N                           B
With memory partitioning (ROM decomposition)            N/m (or k)            m x 2^(N/m) (or m x 2^k)      B + log2(m)

[Figure: bar chart comparing LUT size and clock cycles for the full LUT and the partitioned LUT]
Fig. 9. Comparison of a 4-tap FIR filter (N = 4) with and without memory partitioning with m = k = 2 and input word length B = 8

III.2.7. Systolic Architectures for DA-Based Implementation of FIR Filters

Systolic architectures can result in cost-effective, high-performance systems by exploiting a high level of concurrency using pipelining, parallel processing or both [11]. Novel one- and two-dimensional systolic structures were designed for computation of circular convolution using distributed arithmetic (DA), which resulted in less memory and lower area-delay complexity compared with other DA-based structures for circular convolution [33].

One- and two-dimensional fully pipelined computing structures are presented for area-delay-power-efficient
structures are presented for area-delay-power-efficient


implementation of FIR filters by systolic decomposition of distributed-arithmetic-based inner-product computation [34].

A linear array consisting of a number of processing elements (PEs) and an output cell is shown in Fig. 10. Each PE consists of a ROM of $2^M$ words. During a cycle period, each PE reads the content of its ROM at the location specified by the input bit vector. The value read from the ROM is then added to the input available to the PE from its left. During every cycle period, the sum is transferred as output to its right, as shown in Figs. 11. Each output cell contains a shift register and an adder. During every cycle period, it shifts the content of its register left by one position and then adds the available input to the shifted content of its register. For high-throughput implementation of FIR filters, a two-dimensional systolic array is used, as shown in Figs. 12.

FPGA realizations of FIR filters for high-speed and medium-speed operation using modified distributed arithmetic architectures were suggested by Jiafeng Xie et al., which made use of pipelined registers and a pipelined shift-adder tree [35].

Fig. 10. Linear 1-D systolic array for DA-based implementation of FIR filter

Figs. 11. (a) Function of PE, (b) Function of output cell of 1-D systolic array

Figs. 12. (a) 2-D systolic array for FIR filter; (b) function of PE; and (c) function of Shift Adder (SA) cell

III.2.8. DA Based Architectures for Adaptive FIR Filtering

Adaptive filtering DSP algorithms are employed in several hand-held mobile devices for applications such as echo cancellation, signal de-noising and channel equalization. A new hardware adaptive filter architecture for very high throughput LMS adaptive filters using distributed arithmetic (DA) has been suggested; building adaptive DA filters requires recalculating the contents of the LUTs for each adaptation. By using an auxiliary LUT with special addressing, the efficiency and throughput of DA adaptive filters can be of the same order as those of fixed DA filters [36], [37].

A new hardware architecture using conjugate distributed arithmetic (CDA) for high-throughput hardware implementation of LMS adaptive filters is presented, where all possible combination sums of the input signal samples are stored in the LUT and updated at the arrival of every sample using an efficient update procedure [36], [38].


IV. Conclusion

This paper has presented recent significant research on reducing the overall area-delay-power complexity of memory-based realizations of FIR filters. It has also surveyed in detail the memory-based implementation of FIR filters using distributed arithmetic, stating its merits over direct memory-based implementation.
The main goal of this review is to help researchers in the field of digital signal processing understand the available methods and adopt them in various application environments.
Many algorithms and architectures have been suggested in the literature to reduce the area and time complexities of memory-based implementation of FIR filters, but more efficient algorithms and architectures still need to be developed to design flexible, area-delay-power-efficient memory-based FIR filters that meet the growing requirements of DSP applications.

References

[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications. NJ: Prentice-Hall, 1996.
[2] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[3] G. R. Goslin, “A Guide to Using Field Programmable Gate Arrays (FPGAs) for Application-Specific Digital Signal Processing Performance,” XILINX, 1995.
[4] M. Yamada and A. Nishihara, “High-Speed FIR Digital Filter with CSD Coefficients Implemented on FPGA,” in Proc. IEEE Design Automation Conference, 2001, pp. 7–8.
[5] R. I. Hartley, “Subexpression sharing in filters using canonic signed-digit multipliers,” IEEE Trans. Circuits Syst. II, vol. 43, no. 10, pp. 677–688, Oct. 1996.
[6] M. Potkonjak, M. B. Srivastava, and A. Chandrakasan, “Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination,” IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 15, no. 2, pp. 151–165, Feb. 1996.
[7] A. G. Dempster and M. D. Macleod, “Generation of signed-digit representations for integer multiplication,” IEEE Signal Process. Lett., vol. 11, no. 8, pp. 663–665, Aug. 2004.
[8] M. D. Macleod and A. G. Dempster, “Multiplierless FIR filter design algorithms,” IEEE Signal Processing Letters, vol. 12, no. 3, pp. 186–189, Mar. 2005.
[9] Douglas L. Maskell, Jussipekka Leiwo, and Jagdish C. Patra, “The Design of Multiplierless FIR Filters with a Minimum Adder Step and Reduced Hardware complexity,” in Proc. 2006 IEEE International Symposium on Circuits and Systems, p. 4, May 2006.
[10] H.-R. Lee, C.-W. Jen, and C.-M. Liu, “On the design automation of the memory-based VLSI architectures for FIR filters,” IEEE Trans. Consumer Electronics, vol. 39, no. 3, pp. 619–629, Aug. 1993.
[11] H. T. Kung, “Why systolic architectures?,” IEEE Computer, vol. 15, no. 1, pp. 37–45, Jan. 1982.
[12] R. Wyrzykowski and S. Ovramenko, “Flexible systolic architecture for VLSI FIR filters,” Proc. Inst. Elect. Eng. - Comput. Digit. Techniques, vol. 139, no. 2, pp. 170–172, Mar. 1992.
[13] B. K. Mohanty and P. K. Meher, “Cost-effective novel flexible cell-level systolic architecture for high throughput implementation of 2-D FIR filters,” Proc. Inst. Elect. Eng. - Comput. Digit. Techniques, vol. 143, no. 5, pp. 436–439, Nov. 1996.
[14] D. F. Chiper, “A new systolic array algorithm for memory-based VLSI array implementation of DCT,” in Proc. Second IEEE Symp. on Computers and Communications, pp. 297–301, July 1997.
[15] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, “Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST,” IEEE Trans. Circuits Syst. I: Regular Papers, vol. 52, no. 6, pp. 1125–1137, June 2005.
[16] C. Cheng and K. K. Parhi, “A novel systolic array structure for DCT,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 52, no. 7, pp. 366–369, July 2005.
[17] P. K. Meher, J. C. Patra, and M. N. S. Swamy, “New systolic algorithm and array architecture for prime-length discrete sine transform,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 54, no. 3, pp. 262–266, Mar. 2007.
[18] P. K. Meher and M. N. S. Swamy, “High-throughput memory-based architecture for DHT using a new convolutional formulation,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 54, no. 7, pp. 606–610, July 2007.
[19] P. K. Meher, “Low-latency hardware-efficient memory-based design for large-order FIR digital filters,” Sixth International Conference on Information, Communications and Signal Processing (ICICS 2007), Dec. 2007.
[20] P. K. Meher, “New approach to LUT implementation and accumulation for memory-based multiplication,” in Proc. 2009 IEEE Int. Symp. Circuits Syst., ISCAS’09, May 2009, pp. 453–456.
[21] P. K. Meher, “New look-up-table optimizations for memory-based multiplication,” in Proc. Int. Symp. Integr. Circuits (ISIC’09), Dec. 2009.
[22] P. K. Meher, “New approach to lookup table design and memory based realization of FIR digital filter,” IEEE Transactions on Circuits and Systems-I, vol. 57, no. 3, March 2010.
[23] A. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, “Digital filter for PCM encoded signals,” U.S. Patent 3 777 130, Dec. 4, 1973.
[24] A. Peled and B. Liu, “A new hardware realization of digital filters,” IEEE Trans. Acoustic, Speech, Signal Process., vol. 22, no. 6, pp. 456–462, Dec. 1974.
[25] S. A. White, “Applications of the distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Mag., vol. 6, no. 3, pp. 5–19, Jul. 1989.
[26] P. Choi, S.-C. Shin, and J.-G. Chung, “Efficient ROM size reduction for distributed arithmetic,” in Proc. IEEE Int. Symp. Circuits System (ISCAS), May 2000, vol. 2, pp. 61–64.
[27] H. Yoo and D. V. Anderson, “Hardware-efficient distributed arithmetic architecture for high-order digital filters,” in Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing (ICASSP), Mar. 2005, vol. 5, pp. v/125–v/128.
[28] Mohamed A. Eshtawie and Masuri Othman, “On-Line DA-LUT Architecture for High-Speed High-Order Digital FIR Filters,” in the tenth IEEE international conference on communication systems, Nov. 2006, Singapore.
[29] C.-F. Chen, “Implementing FIR filters with distributed arithmetic,” IEEE Trans. Acoustic., Speech, Signal Process., vol. 33, no. 5, pp. 1318–1321, Oct. 1985.
[30] K. Nourji and N. Demassieux, “Optimal VLSI architecture for distributed arithmetic-based algorithms,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Apr. 1994, pp. II/509–II/512.
[31] S.-S. Jeng, H.-C. Lin, and S.-M. Chang, “FPGA implementation of FIR filter using M-bit parallel distributed arithmetic,” in Proc. 2006 IEEE Int. Symp. Circuits Systems (ISCAS), May 2006, p. 4.
[32] M. Mehendale, S. D. Sherlekar, and G. Venkatesh, “Area-delay trade-off in distributed arithmetic based implementation of FIR filters,” in Proc. 10th Int. Conf. VLSI Design, Jan. 1997, pp. 124–129.
[33] P. K. Meher, “Hardware-efficient systolization of DA-based calculation of finite digital convolution,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 8, pp. 707–711, Aug. 2006.
[34] P. K. Meher, S. Chandrasekaran, and A. Amira, “FPGA realization of FIR filters by efficient and flexible systolization using distributed arithmetic,” IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3009–3017, July 2008.


[35] Jiafeng Xie, Jianjun He, and Guanzheng Tan, “FPGA realization of FIR filters for high-speed and medium-speed by using modified distributed arithmetic architectures,” Microelectronics Journal, vol. 41, April 2010, pp. 365–370.
[36] S. Haykin, Adaptive Filter Theory, Prentice Hall, Upper Saddle
River, NJ, 2002.
[37] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V.
Anderson, “LMS adaptive filters using distributed arithmetic for
high throughput,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol.
52, no. 7, pp. 1327–1337, July 2005.
[38] Walter Huang, Venkatesh Krishnan, and David V. Anderson,
“Conjugate Distributed Arithmetic Adaptive FIR Filters and their
Hardware Implementation,” MWSCAS '06, pp. 295–299, Circuits
and Systems, Volume: 2, 2006.

Authors’ information
K. G. Shanthi (Corresponding author)
completed her B.E. in 1996 at Madras
University, Chennai, and obtained her M.E. in
2005 from the Government College of
Technology, Coimbatore. Her major in the PG course
was VLSI Design. Her fields of interest include
the design of FPGA-based VLSI architectures and VLSI
signal processing. She is currently working as
Associate Professor at R.M.K Engineering College, Chennai, and is
pursuing her research in the field of VLSI Design.
Address: Associate Professor, Department of Electronics &
Communication Engg, R.M.K Engineering College, Chennai,
Tamilnadu, India. Pin code: 601 206.
E-mail: [email protected]

Nagarajan N. received his B.Tech and M.E. degrees in Electronics
Engineering at M.I.T Chennai. He received his PhD in the faculty of I.C.E.
from Anna University, Chennai. He is currently working as Principal,
C.I.E.T, Coimbatore. His specializations include optical, wireless
ad hoc, and sensor networks.
