FPGA Implementation of High Speed FIR Filters Using Add and Shift Method
I. INTRODUCTION
Authorized licensed use limited to: Arizona State University. Downloaded on July 02,2024 at 17:47:27 UTC from IEEE Xplore. Restrictions apply.
1-4244-9707-X/06/$20.00 ©2006 IEEE
The coefficients in most DSP applications for the multiply accumulate operation are constants. The partial
only on the outputs of the input shift registers. The AND
2. Input sequence is fed into the shift register at the input sample rate. The serial output is presented to the RAM based shift registers (registers are not shown in the figure for simplicity) at the bit clock rate, which is n+1 times the sample rate (n is the number of bits in a data input sample). The RAM based shift register stores the data in a particular address. The outputs of the registered LUTs are added and loaded to the scaling accumulator from LSB to MSB, and the result, which is the filter output, is accumulated over time. For an n bit input, n+1 clock cycles are needed for a symmetrical filter to generate the output.

In the conventional MAC method with a limited number of MAC engines, the system sample rate decreases as the filter length is increased. This is not the case with serial DA architectures, since the filter sample rate is decoupled from the filter length. As the filter length is increased, the throughput is maintained but more logic resources are consumed.

Though the serial DA architecture is efficient by construction, its performance is limited by the fact that the next input sample can be processed only after every bit of the current input sample is processed. Each bit of the current input sample takes one clock cycle to process. Therefore, if the input bitwidth is 12, then a new input can be sampled every 12 clock cycles.

Figure 2. A serial DA FIR filter block diagram (inputs x0[i] to x7[i], two LUTs, scaling accumulator). LUT contents:

Address   Data
0000      0
0001      C0
0010      C0+C1
…         …
1111      C0+C1+C2+C3

The performance of the circuit can be improved by modifying the architecture to a parallel architecture which processes the data bits in groups. Figure 3 shows the block diagram of a 2 bit parallel DA FIR filter. The tradeoff here is performance for area, since increasing the number of bits sampled has a significant effect on resource utilization on the FPGA. For instance, doubling the number of bits sampled doubles the throughput and halves the number of clock cycles. This change doubles the number of LUTs as well as the size of the scaling accumulator. The number of bits being processed can be increased to its maximum size, which is the input length n. This gives the maximum throughput to the filter. For a fully parallel implementation of the DA filter (PDA), the number of LUTs required would be enormous. In this work we show an alternative to the PDA method for implementing high speed FIR filters that consumes significantly less area and power.

Figure 3. A 2 bit parallel DA FIR filter block diagram (inputs x0[i] to x7[i] and x0[i+1] to x7[i+1])

A popular technique for implementing the transposed form of FIR filters is the use of a multiplier block, instead of using multipliers for each constant, as shown in Figure 4. The multiplications with the set of constants {hk} are replaced by an optimized set of additions and shift operations, involving computation sharing. Further optimization can be done by factorizing the expression and finding common subexpressions. The performance of this filter architecture is limited by the latency of the biggest adder and is the same as that of the PDA.

Figure 4. Replacing constant multiplication by multiplier block

The main contribution in this paper is the development of a novel algorithm for optimizing the multiplier block for FIR filters, using a modified algorithm for common subexpression elimination. The goal of the algorithm is to produce a filter that can provide the maximum sample rate with the least amount of hardware. Our algorithm takes into account the specific features of FPGA slices to reduce the total number of occupied slices. The reduced number of slices also leads to a reduction in the total power on the FPGA. We compare our results with the industry standard Xilinx Coregen™, where we compare the total area and power consumption.

The rest of the paper is organized as follows: Section 2 presents some related work. In Section 3, we describe our filter architecture. In Section 4, we present our optimization algorithm for reducing the total area of the design. In Section 5, we describe our experimental setup and present our results. Finally we conclude the paper in Section 6.

II. RELATED WORK

Multiplications with constants have to be performed in many signal processing and communication applications such as FIR filters, audio, video and image processing. Since implementing a general purpose multiplier is expensive on an FPGA, and since we do not really need such a multiplier when one of the operands is a constant, there has been a lot of work
on deriving efficient structures for constant multiplications [8-
digital FIR filters [12]. based on both Distributed Arithmetic
the embedded DSP slices on the FPGA devices. In this work, we primarily compare our technique with the Coregen implementation of Distributed Arithmetic, since that is also a multiplierless technique. We show that our designs are much more area efficient than the DA based approach for fully parallel filters. We also compare our method with MAC based implementations, where we achieve significantly higher performance.

Though there has been a lot of work on optimizing constant multiplications using adders and employing redundancy elimination [15-19], these techniques have not been effectively used for FIR filter design. The closest work to implementing filters with adders is [20], where FIR filters are implemented using the Add and Shift method. Canonical Signed Digit (CSD) encoding is used for the coefficients to minimize the number of additions. The paper discusses how high speed implementations can be achieved by registering each adder, due to which the critical path becomes equal to the delay of the adder. Registering an adder output comes at no extra cost on an FPGA because of the presence of a D flip flop at the output of each LUT. In comparison with [20], we extensively use common subexpression elimination for reducing the number of adders and therefore area. Furthermore, our designs can run with sample rates as high as 252 Msps (Million samples per second), whereas the designs in [20] can run only at 78.6 Msps.

In comparison with the other algorithms for common subexpression elimination [15, 16, 18, 19, 21], our method takes into account the structure of the FPGA slices (Figure 5) and the cost of both adders and registers when performing the optimization. Furthermore, we provide comprehensive evidence of the benefits of our technique through experimental results, where we compare our results with those produced by industry standard tools.

Figure 5. Registered adder at no additional cost (panels (a) and (b))

Performing subexpression elimination can sometimes increase the number of registers substantially, and the overall area could possibly increase. Consider the two expressions F1 and F2, which could be part of the multiplier block.

F1 = A + B + C + D
F2 = A + B + C + E

Figure 6 shows the original unoptimized expression trees. Both expressions have a minimum critical path of two addition cycles. These expressions require a total of six registered adders for the fastest implementation, and no extra registers are required. From the expressions we can see that the computation A + B + C is common to both expressions. If we extract this subexpression, we get the structure shown in Figure 7. Since both D and E need to wait for two addition cycles to be added to (A + B + C), we need to use two registers each for D and E, such that new values for A, B, C, D and E can be read in at each clock cycle. Assuming that an adder and a register of the same bitwidth have the same cost, the structure shown in Figure 7 occupies more area than the one shown in Figure 6. A more careful subexpression elimination algorithm would only extract the common subexpression A + B (or A + C or B + C). The number of adders is decreased by one from the original, and no additional registers are added. This is illustrated in Figure 8. The algorithm for performing this kind of optimization is described in the next section.
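
The F1/F2 example above can be checked with a small cost model. The following Python sketch is illustrative only and is not the authors' tool: the `Add` class and the `area` function are hypothetical names, and the tree shapes are assumptions read from the text, since the figures themselves are not reproduced here. It assumes that every adder is registered (one cycle of latency), that all inputs arrive in cycle 0, that an operand arriving early waits in a chain of registers, and that a signal feeding several adders can share one register chain.

```python
class Add:
    """A registered two-input adder; reuse the same object to share it."""

    def __init__(self, left, right):
        self.left = left
        self.right = right


def area(outputs):
    """Return (registered adders, alignment registers) for the given outputs."""
    ready = {}   # adder -> clock cycle in which its registered sum is valid
    delay = {}   # signal -> longest register chain needed on that signal

    def visit(node):
        if isinstance(node, str):      # primary input, valid in cycle 0
            return 0
        if node not in ready:
            t_left, t_right = visit(node.left), visit(node.right)
            start = max(t_left, t_right)
            for child, t in ((node.left, t_left), (node.right, t_right)):
                # the earlier operand waits in registers for the later one
                delay[child] = max(delay.get(child, 0), start - t)
            ready[node] = start + 1    # the adder output is registered
        return ready[node]

    for out in outputs:
        visit(out)
    return len(ready), sum(delay.values())


# Assumed shape of Figure 6: no sharing, two balanced trees
fig6 = [Add(Add("A", "B"), Add("C", "D")), Add(Add("A", "B"), Add("C", "E"))]

# Assumed shape of Figure 7: A + B + C extracted and shared
d1 = Add("A", "B")
d2 = Add(d1, "C")
fig7 = [Add(d2, "D"), Add(d2, "E")]

# Assumed shape of Figure 8: only A + B extracted and shared
s = Add("A", "B")
fig8 = [Add(s, Add("C", "D")), Add(s, Add("C", "E"))]

for name, outs in (("Figure 6", fig6), ("Figure 7", fig7), ("Figure 8", fig8)):
    adders, registers = area(outs)
    print(name, adders, registers, adders + registers)
```

Under these assumptions the model gives 6 adders and 0 registers for the unoptimized trees, 4 adders and 5 registers when A + B + C is extracted (one register for C, two each for D and E), and 5 adders and 0 registers when only A + B is extracted, which agrees with the counts quoted in the text.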
Figure 8. Extracting common subexpression (A+B)

IV. OPTIMIZATION ALGORITHM

The goal of our optimization is to reduce the area of the multiplier block by reducing the number of adders and any additional registers required for the fastest implementation of the FIR filter. We first give a brief overview of the common subexpression elimination methods. A detailed description can be found in [22]. We then present the modified optimization algorithm to be used for our work.

A. Overview of common subexpression elimination
We use a polynomial transformation of constant multiplications. Given a representation for the constant C, and the variable X, the multiplication C*X can be represented as a summation of terms denoting the decomposition of the multiplication into shifts and additions as

$$C * X = \sum_{i} \pm X L^{i} \qquad \text{(V)}$$

The terms can be either positive or negative when the constants are represented using signed digit representations such as the Canonical Signed Digit (CSD) representation. The exponent of L represents the magnitude of the left shift and the i's represent the digit positions of the non-zero digits of the constants. For example the multiplication $7*X = (1\,0\,0\,\bar{1})_{CSD}*X = X{<<}3 - X = XL^{3} - X$, using the polynomial transformation.

We use the divisors to represent all possible common subexpressions. Divisors are obtained from an expression by looking at every pair of terms in the expression and dividing the terms by the minimum exponent of L. For example in the expression $F = XL^{2} + XL^{3} + XL^{5}$, consider the pair of terms $(+XL^{2} + XL^{3})$. The minimum exponent of L in the two terms is $L^{2}$. Dividing by $L^{2}$, we get the divisor $(X + XL)$. From the other two pairs of terms $(XL^{2} + XL^{5})$ and $(XL^{3} + XL^{5})$, we get the divisors $(X + XL^{3})$ and $(X + XL^{2})$ respectively.

These divisors are significant, because every common subexpression in the set of expressions can be detected by performing intersections among the set of divisors.

Figure 9. Calculating registers required for fastest evaluation

value. To calculate the value of the divisor, we assume that the cost of a registered adder and a register is the same. We calculate the value of a divisor as the number of additions saved by extracting it minus the number of registers that have to be added. After selecting the best divisor, we rewrite the expressions using it. We then generate new divisors from the new terms that have been generated due to rewriting, and add them to the dynamic list of divisors. The iteration stops when there is no valuable divisor remaining in the set of divisors.

Consider the expressions shown in Figure 6. We need six registered adders and no additional registers for the fastest evaluation of F1 and F2. Now consider the selection of the divisor d1 = (A + B). This divisor saves one addition and does not increase the number of registers. Divisors (A + C) and (B + C) also have the same value, but (A + B) is selected randomly. The expressions are now rewritten as:

d1 = (A + B)
F1 = d1 + C + D
F2 = d1 + C + E

ReduceArea( {Pi} )
{
    {Pi} = Set of expressions in polynomial form;
    {D} = Set of divisors = ∅;

    // Step 1: Creating divisors and calculating minimum number of registers required
    for each expression Pi in {Pi}
    {
        {Dnew} = FindDivisors(Pi);
        Update frequency statistics of divisors in {D};
        {D} = {D} ∪ {Dnew};
        Pi->MinRegisters = Calculate minimum registers required for fastest evaluation of Pi;
    }

    // Step 2: Iterative selection and elimination of best divisor
    while(1)
    {
After rewriting the expressions and forming new divisors, the divisor d2 = (d1 + C) is considered. This divisor saves one adder, but introduces five additional registers, as can be seen in Figure 7. Therefore this divisor has a value of -4. No other
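
The divisor values quoted in this section follow from the definition of value as the number of additions saved minus the number of registers added. As a worked restatement (the function name value is used here only for illustration):

```latex
\begin{aligned}
\mathrm{value}(d) &= (\text{additions saved}) - (\text{registers added}) \\
\mathrm{value}(d_1) &= 1 - 0 = 1, \qquad d_1 = A + B \\
\mathrm{value}(d_2) &= 1 - 5 = -4, \qquad d_2 = d_1 + C
\end{aligned}
```

Since the value of d2 is negative, extracting it would increase the area under the equal-cost assumption, so it does not count as a valuable divisor in the sense of the stopping rule.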

V. EXPERIMENTS

The goal of our experiments was to compare the number of resources consumed by our add and shift method with that produced by the cores generated by the commercial
method. Table 1b, shows the same numbers for the filters implemented using Xilinx Coregen, using the Parallel

[Chart: Reduction in Resources. Vertical axis: % reduction; series: SLICEs, LUTs, FFs; horizontal axis values: 6, 10, 13, 20, 28, 41, 61, 119, 152]

[Chart: Add/Shift versus Coregen]
uses the DSP blocks available on Virtex IV devices. The number of DSP blocks is equal to the number of taps of the filter. The results show that we achieve higher performance as the filter size increases. This is mainly because the critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders. Another limitation of the MAC method is that Xilinx Coregen™ is limited to an input width of 17 bits due to the embedded DSP block input limitation, while our add and shift method can accept inputs of any width.

VI. CONCLUSION

In this paper we presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low area, low power and high speed implementations of FIR filters. We validated our techniques on Virtex II™ devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques. In the future, we would like to modify our algorithm to make use of the limited number of embedded multipliers available on the FPGA devices.

VII. REFERENCES

[1] K. D. Underwood and K. S. Hemmert, "Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance," presented at International Symposium on Field-Programmable Custom Computing Machines, California, USA, 2004.
[2] L. Zhuo and V. K. Prasanna, "Sparse Matrix-Vector Multiplication on FPGAs," presented at International Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, 2005.
[3] Y. Meng, A. P. Brown, R. A. Iltis, T. Sherwood, H. Lee, and R. Kastner, "MP Core: Algorithm and Design Techniques for Efficient Channel Estimation in Wireless Applications," presented at Design Automation Conference (DAC), Anaheim, CA, 2005.
[4] B. L. Hutchings and B. E. Nelson, "Gigaop DSP on FPGA," presented at IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), 2001.
[5] A. Alsolaim, J. Becker, M. Glesner, and J. Starzyk, "Architecture and Application of a Dynamically Reconfigurable Hardware Array for Future Mobile Communication Systems," presented at International Symposium on Field Programmable Custom Computing Machines (FCCM), 2000.
[6] S. J. Melnikoff, S. F. Quigley, and M. J. Russell, "Implementing a Simple Continuous Speech Recognition System on an FPGA," presented at International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002.
[7] T. Yokota, M. Nagafuchi, Y. Mekada, T. Yoshinaga, K. Ootsu, and T. Baba, "A Scalable FPGA-based Custom Computing Machine for Medical Image Processing," presented at International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002.
[8] K. Chapman, "Constant Coefficient Multipliers for the XC4000E," Xilinx Technical Report, 1996.
[9] K. Wiatr and E. Jamro, "Constant coefficient multiplication in FPGA structures," presented at 26th Euromicro Conference, 2000.
[10] M. J. Wirthlin and B. McMurtrey, "Efficient Constant Coefficient Multiplication Using Advanced FPGA Architectures," presented at International Conference on Field Programmable Logic and Applications (FPL), 2001.
[11] M. J. Wirthlin, "Constant Coefficient Multiplication Using Look-Up Tables," Journal of VLSI Signal Processing, vol. 36, pp. 7-15, 2004.
[12] "Distributed Arithmetic FIR Filter v9.0," Xilinx Product Specification, 2004.
[13] T. Sasao, Y. Iguchi, and T. Suzuki, "On LUT Cascade Realizations of FIR Filters," presented at Euromicro Conference on Digital System Design (DSD), 2005.
[14] G. R. Goslin, "A Guide to Using Field Programmable Gate Arrays (FPGAs) for Application-Specific Digital Signal Processing Performance," Xilinx Application Note, San Jose, 1995.
[15] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 1996.
[16] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, pp. 677-688, 1996.
[17] H. T. Nguyen and A. Chatterjee, "Number-splitting with shift-and-add decomposition for power and hardware optimization in linear DSP synthesis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, pp. 419-424, 2000.
[18] H.-J. Kang, H. Kim, and I.-C. Park, "FIR filter synthesis algorithms for minimizing the delay and the number of adders," presented at IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2000.
[19] A. Hosangadi, F. Fallah, and R. Kastner, "Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two Term Common Subexpressions," presented at Asia South Pacific Design Automation Conference, Shanghai, 2005.
[20] M. Yamada and A. Nishihara, "High-speed FIR digital filter with CSD coefficients implemented on FPGA," presented at Asia and South Pacific Design Automation Conference (ASP-DAC), 2001.
[21] H. Safiri, M. Ahmadi, G. A. Jullien, and W. C. Miller, "A new algorithm for the elimination of common subexpressions in hardware implementation of digital filters by using genetic programming," presented at IEEE International Conference on Application-Specific Systems, Architectures, and Processors, 2000.
[22] A. Hosangadi, F. Fallah, and R. Kastner, "Reducing Hardware complexity by iteratively eliminating two term common subexpressions," presented at Asia South Pacific Design Automation Conference (ASP-DAC), 2005.