Efficient FIR Filter Architectures Suitable For FPGA Implementation
Manuscript received March 26, 1993. Portions of this work were presented
at ISCAS '93 in Chicago, Illinois. This research is partially supported by
the Kansas Technology Enterprise Corporation through the Center for
Excellence in Computer-Aided Systems Engineering and by the University of
Kansas General Research allocation 3626-20-0038. This paper was
recommended by Associate Editor Y. C. Lim.
The author is with the Telecommunications & Information Sciences
Laboratory, Department of Electrical Engineering & Computer Science,
University of Kansas, Lawrence, KS 66045-2228 USA.
IEEE Log Number 9403069.

I. INTRODUCTION

Considerable attention has been placed on the implementation of
signal processing algorithms in VLSI, ranging from full custom VLSI
to general purpose digital signal processors. A variety of approaches
to custom implementation of FIR filters have been pursued [1]-[5],
[7]-[9]. In order to attain high performance, parallel implementation
strategies such as systolic methods have been applied. Word-parallel,
bit-parallel processing techniques appear to scale well with
improvements in implementation technology and increasing demands for
higher performance.

Advances in field programmable gate array (FPGA) technology
have enabled FPGA's to be used in a variety of applications. FPGA's
prove especially useful in data path designs, where the regular
structure of the array can be utilized effectively. The
programmability of FPGA's adds flexibility not available in custom
approaches, while retaining relatively high system clock rates. The
disadvantages of FPGA's are primarily related to the limited number
of logic operations that can be implemented on a particular device,
the constraints on the inputs and outputs to the atomic logic units,
and the limited signal routing options that are available for connecting
logical operators on the array. Many current FPGA architectures are
implemented using memory technologies, and hence the advances in
that area will be reflected in improved FPGA density and speed.

This paper presents new parallel FIR filter building blocks suited
for implementing filters where each of the coefficient values is a sum
or difference of two power-of-two terms. These architectures allow
high sampling rate FIR filters of substantial length to be implemented
on current generation FPGA's.

In binary arithmetic, multiplication by a power-of-two is simply
a shift operation. Implementation of systems with multiplications
may be simplified by using only a limited number of power-of-two
terms, so that only a small number of shift and add operations are
required. These simplifications are, however, achieved at the expense
of a deterioration in the frequency response characteristics, the extent
of which depends on the number of power-of-two terms used in
approximating each coefficient value, the architecture of the filter,
and the optimization technique used to derive the discrete space
coefficient values. It was demonstrated in [6] that an FIR filter with
-60 dB of frequency response ripple magnitude can be realized using
two power-of-two terms for each coefficient value, given that the
filter is in cascade form and the coefficient values are derived using
mixed integer linear programming.

If the coefficient value is an integer power-of-two, or a sum of
two powers-of-two, the multipliers in a filter tap can be replaced by
shifters, as depicted in Fig. 1. Since the coefficients will be fixed
for this class of filter, the coefficient values can be realized by
appropriately routing the inputs to the full adders in the filter structure.
That is, moving the adder inputs $k$ places to the left achieves the same
effect as would a coefficient value of $2^k$.

Fig. 1. FIR filter tap arithmetic unit, coefficients with two powers-of-two.

II. ARCHITECTURES

The block diagrams of the FIR filter architectures discussed in this
work are illustrated in Fig. 2. The structure shown in Fig. 2(a) can be
applied to FPGA architectures since the use of global communication
can be tolerated in such systems, although more pipelining can be
used if needed. A structure appropriate for linear phase filters is
shown in Fig. 2(b). In order to attain high sampling rates using
conventional FPGA's, bit-level parallelism is exploited. Fig. 3 shows
the overall filter architecture, including the filter taps and the final
adder stage. The adder is required to resolve the carries
that are generated and propagated through the pipeline.

The structure of the filter tap of Fig. 2(a) is shown in Fig. 4. The
two adders, which are necessary for coefficients that are a sum of two
signed powers-of-two, are implemented as two rows of full adders,
whose inputs are configured with the appropriate shift for the given
coefficients. The sign of the coefficients is controlled by inverters.
The sum and carry signals from the full adders are pipelined using a
carry-save addition (CSA) technique in order to increase the sampling
rate and alleviate potential routing delays in the target implementation
technology. The input data bus passes through the bit-slice array to
provide short interconnection distances to the first row of full adders.
This bus may be optionally pipelined depending on the particular
FPGA implementation technology, among other factors. The bits of
the input are shifted before summation, as represented by the dotted
lines.

The linear phase filter tap of Fig. 2(b) is depicted in Fig. 5. This
architecture is similar to the previous case, with the addition of the
upper set of adders and registers which implement the delay and
sum operations on the input data stream. In this case, the delayed
and global data bits are summed prior to shifting, as represented by
the dotted lines, due to logic unit input/output restrictions.
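The tap arithmetic described in this section can be mimicked in software as a rough behavioral sketch. The Python below is not the paper's hardware description: the names `Term`, `Coeff`, `tap_product`, `fir_direct`, and `fir_linear_phase` are hypothetical, the coefficients are assumed to be integers built from two signed powers-of-two, and the carry-save pipelining is not modeled. It only illustrates that a multiplier can be replaced by two shifts and an add, and that a symmetric (linear phase) filter can sum the delayed and current data before shifting.

```python
from typing import List, Sequence, Tuple

# One signed power-of-two term: (sign, shift) stands for sign * 2**shift,
# with sign in {-1, 0, +1}. These type names are illustrative assumptions.
Term = Tuple[int, int]
Coeff = Tuple[Term, Term]  # a coefficient is the sum of two such terms


def tap_product(x: int, coeff: Coeff) -> int:
    """Multiply x by a two-term coefficient using shifts and adds only."""
    acc = 0
    for sign, shift in coeff:
        shifted = x << shift  # models routing the input `shift` places left
        if sign > 0:
            acc += shifted
        elif sign < 0:
            acc -= shifted  # sign handling (inverters in the hardware tap)
    return acc


def fir_direct(samples: Sequence[int], coeffs: Sequence[Coeff]) -> List[int]:
    """Direct FIR: y[n] = sum_k h[k] * x[n-k], with zero initial state."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += tap_product(samples[n - k], c)
        out.append(acc)
    return out


def fir_linear_phase(samples: Sequence[int], half: Sequence[Coeff]) -> List[int]:
    """Symmetric FIR of length 2*len(half): pre-add the pair, then shift."""
    taps = 2 * len(half)
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(half):
            newer = samples[n - k] if n - k >= 0 else 0
            m = n - (taps - 1 - k)  # index of the mirrored (delayed) sample
            older = samples[m] if m >= 0 else 0
            acc += tap_product(newer + older, c)  # one shift-add per pair
        out.append(acc)
    return out


if __name__ == "__main__":
    c7 = ((1, 3), (-1, 0))  # 2**3 - 2**0
    c6 = ((1, 2), (1, 1))   # 2**2 + 2**1
    x = [1, 2, 3, 4, 5]
    print(fir_direct(x, [c7, c6, c6, c7]))
    print(fir_linear_phase(x, [c7, c6]))
```

For a symmetric coefficient set, `fir_linear_phase` needs half as many shift-add taps as `fir_direct` while producing the same output sequence, which is the motivation for summing before shifting.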
Authorized licensed use limited to: SHENZHEN UNIVERSITY. Downloaded on October 01,2021 at 08:52:00 UTC from IEEE Xplore. Restrictions apply.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 41, NO. 7, JULY 1994
While the ripple carry structure does limit performance, most recent FPGA
architectures support high speed carry logic which minimizes the
problem.

III. FPGA IMPLEMENTATION

An FIR filter tap as shown in Fig. 4 can be implemented in two
array columns of Xilinx XC3100-series FPGA's. Because of the high
degree of spatial and temporal locality, most signal routing delays are
not critical, as they are with typical high performance FPGA designs.
Each of the bit slices for the tap requires two combinational logic
blocks (CLB's) in the array for implementation. The extensive local
routing capability of typical FPGA's can be used for the majority of
signals within and between taps. Fig. 6 illustrates the local routing
required between CLB's, where column "1" maps to the first set
of full adders for a given tap, and column "2" maps to the second
set. The globally routed input data signals are distributed using the
horizontal and vertical nets running the length and width of the chip
between the rows and columns of CLB's.

The primary concern is with routing of the shift lines. In most
realizations, the accumulation path will have a wider word width
than the input data from the shifter, in order to account for overflow
and round-off problems. For example, if the input data is $B_d$ bits
wide, the accumulation path will most likely be $B_a \ge 2B_d$ bits wide.
This implies that the input data path will use fewer routing lines in
each FPGA column than will the accumulation path. By exploiting
these extra, unallocated resources, the low delay vertical routing lines
of the FPGA can be used more effectively. The extra resources allow
the number of vertical routing lines to be minimized, as illustrated
in Fig. 4, where the additional data path leads to lower congestion in
the routing channel between the columns. A tap with $B_d$ input data
path bits and $B_a$ accumulation path bits can thus be implemented
using $2B_a$ logic blocks. The final adder required by the filter can be
implemented on the FPGA or using an additional chip. Larger filters
can be realized by cascading several FPGA's.

Typical filter characteristics have been implemented on a Xilinx
XC3195 FPGA using this architecture. The XC3195 has an array of
22 by 22 (484) CLB's. For example, an eleven tap lowpass FIR filter
with the passband cut-off at $0.1 f_s$, the stopband beginning at $0.15 f_s$,
and -18 dB stopband rejection was designed. An input data word size
of 10 bits was used; the 22 rows provide sufficient intermediate word
width protection against overflow. All of the columns of the array
were required for the eleven taps. The final accumulation stage was
not performed on the array. The maximum sampling rate for this
design was 30 MHz. The delay is highly dependent on the input data
routing, and so higher sampling rates may be attainable for other
filter responses (with careful routing).

Linear phase FIR filter taps can be implemented in three array
columns of Xilinx XC4000-series FPGA's, as depicted in Fig. 5.
Because the XC4000 series supports dedicated carry logic, the ripple
carry chain can be used to implement the adder for the input and
delayed data. A 19-tap linear phase filter can be supported on an
XC4020 component, which has 900 CLB's. Based on the Xilinx
timing analyzer, sampling rates on the order of 15-20 MHz can be
obtained.

IV. CONCLUSION

A new parallel FIR digital filter structure which allows efficient
FPGA implementation of filters whose coefficient values are sums or
differences of power-of-two terms was presented. Digital FIR filters
with over one hundred taps based on this architecture should be
possible by the end of the decade if current technological trends
continue. Examples based on Xilinx XC3100 and XC4000 FPGA's
were given, although other programmable logic devices such as
the AT&T ORCA components will also support this architecture.
Automatic programming, from filter specifications to FPGA program,
is straightforward.

REFERENCES

[1] D. E. Borth, I. A. Gerson, J. R. Haug, and C. D. Thompson, "A flexible
adaptive FIR filter VLSI IC," IEEE J. Select. Areas Commun., vol.
SAC-6, no. 3, pp. 494-503, Apr. 1988.
[2] J. B. Evans, Y. C. Lim, and B. Liu, "A high speed programmable digital
FIR filter," IEEE Int. Conf. Acoust., Speech, Sig. Proc., Apr. 1990.
[3] J. Gallia et al., "High-performance BiCMOS 100k-gate array," IEEE J.
Solid State Circuits, vol. SC-25, no. 1, pp. 142-149, Feb. 1990.
[4] M. Hatamian and S. Rao, "A 100 MHz 40-tap programmable FIR filter
chip," in IEEE Int. Symp. Circuits Syst., pp. 3053-3056, May 1990.
[5] R. Jain, P. Yang, and T. Yoshino, "Firgen: A computer-aided design
system for high performance FIR filter integrated circuits," IEEE Trans.
Sig. Proc., vol. 39, no. 7, pp. 1655-1668, July 1991.
[6] Y. C. Lim and B. Liu, "Design of cascade form FIR filters with
discrete valued coefficients," IEEE Trans. Acoust., Speech, Sig. Proc.,
vol. ASSP-36, pp. 1735-1739, Nov. 1988.
[7] S. Powell and P. Chau, "Reduced complexity programmable FIR filters,"
in IEEE Int. Symp. Circuits Syst., pp. 561-564, May 1992.
[8] P. Yang, T. Yoshino, R. Jain, and W. Gass, "A functional silicon
compiler for high speed FIR digital filters," IEEE Int. Conf. Acoust.,
Speech, Sig. Proc., pp. 1329-1332, Apr. 1990.
[9] T. Yoshino, R. Jain, et al., "A 100-MHz 64-tap FIR digital filter in
0.8 µm BiCMOS gate array," IEEE J. Solid State Circuits, vol. 25, no.
6, pp. 1494-1501, Dec. 1990.

Corrections to "Finite-Precision Analysis
of the Pipelined ADPCM Coder"$^1$

Manuscript received February 2, 1993; revised October 15, 1993. This paper
was recommended by Associate Editor G. S. Moschytz. Research for this
paper was supported by the Army Research Office by contract number
DAAL-906-0063.
Naresh R. Shanbhag was with the Department of Electrical Engineering,
University of Minnesota. He is now with AT&T Bell Laboratories at Murray
Hill, NJ 07974, USA.
Keshab K. Parhi is with the Department of Electrical Engineering,
University of Minnesota, 200 Union Street S. E., Minneapolis, MN 55455, USA.
IEEE Log Number 9400250.
$^1$ IEEE Trans. Circuits Syst., vol. 41, no. 5, May 1994.

In the above paper,$^1$ the last page was omitted from the May issue.
The missing text is as follows.

where $\beta(n) = \delta_5(n) - \delta_1(n - M_1) + \delta_3(n) + \delta_4(n)$. Now, we
multiply (A.8) by $S'(n_1)$ and employ (3.4) to get

$$
\begin{aligned}
e_b(n) S'(n_1) ={}& e_q(n) S(n_1) + e_q(n) r(n_1) - W^T(n - M)\, r(n_1) S(n_1) \\
& - S(n_1) S^T(n_1)\, \delta W(n - M) + S(n_1) \beta(n).
\end{aligned}
\tag{A.9}
$$

Employing (A.9), we rewrite (3.3) as follows

$$
\begin{aligned}
W(n) + \delta W(n) ={}& W(n - M_2) + \delta W(n - M_2) + \mu LA\, e_b(n + M_1) S'(n_1 + M_1) + \delta_2(n) \\
={}& W(n - M_2) + \delta W(n - M_2) \\
& + \mu LA \big[ e_q(n + M_1) S(n_1 + M_1) + e_q(n + M_1) r(n_1 + M_1) \\
& \quad - W^T(n - M_2)\, r(n_1 + M_1) S(n_1 + M_1) \\
& \quad - S(n_1 + M_1) S^T(n_1 + M_1)\, \delta W(n - M_2) \\
& \quad + S(n_1 + M_1) \beta(n + M_1) \big] + \delta_2(n).
\end{aligned}
\tag{A.10}
$$

After simplification, we get

$$
\begin{aligned}
\delta W(n) &= \delta W(n - M_2) - \mu LA\, S(n_1 + M_1) S^T(n_1 + M_1)\, \delta W(n - M_2) + b(n) \\
&= \big[ I - \mu LA\, S(n_1 + M_1) S^T(n_1 + M_1) \big]\, \delta W(n - M_2) + b(n) \\
&= F(n)\, \delta W(n - M_2) + b(n)
\end{aligned}
\tag{A.11}
$$

where

$$
\begin{aligned}
b(n) ={}& \mu LA \big[ e_q(n + M_1) r(n_1 + M_1) - W^T(n - M_2)\, r(n_1 + M_1) S(n_1 + M_1) \\
& \quad + S(n_1 + M_1) \beta(n + M_1) \big] + \delta_2(n),
\end{aligned}
\tag{A.12}
$$

and $F(n) = [I - \mu LA\, S(n_1 + M_1) S^T(n_1 + M_1)]$.

Except for the definition of $F(n)$ and $b(n)$, (A.11) is identical in
form to the corresponding equation arising in the analysis of LMS in
[4]. Hence, we employ the following result from [4, Appendix A]

$$
\mathrm{tr}\big[ E[\delta W(n)\, \delta W^T(n)]\, R \big] = \frac{\mathrm{tr}\big[ E[b(n) b^T(n)] \big]}{2 \mu LA - \mu^2 LA^2\, \mathrm{tr}[R]}.
\tag{A.13}
$$

Taking the trace of (A.14) we get

$$
\begin{aligned}
\mathrm{tr}\big[ E[b(n) b^T(n)] \big] &= \mu^2 LA^2 \big[ (J_{\min} + \sigma_q^2)\, \sigma_r^2 N + \mathrm{tr}(R)\, \sigma_r^2 |W_{opt}|^2 + \mathrm{tr}(R)\, \sigma_\beta^2 \big] + N \sigma_{\delta_2}^2 \\
&= \mu^2 LA^2 \big[ (|W_{opt}|^2 \sigma_r^2 + \sigma_\beta^2)\, \mathrm{tr}(R) + N \sigma_r^2 (J_{\min} + \sigma_q^2) \big] + N \sigma_{\delta_2}^2.
\end{aligned}
\tag{A.15}
$$

Substituting (A.15) into (A.13) and the result, along with (A.6), into
(A.3), we obtain (3.11).

REFERENCES

[1] N. R. Shanbhag and K. K. Parhi, "Relaxed look-ahead pipelined LMS
adaptive filters and their application to ADPCM coder," IEEE Trans.
on Circ. and Syst.-II, pp. 753-766, Dec. 1993.
[2] N. R. Shanbhag and K. K. Parhi, "A pipelined adaptive lattice filter
architecture," IEEE Trans. on Sig. Proc., vol. 41, pp. 1925-1939, May
1993.
[3] K. K. Parhi and D. G. Messerschmitt, "Pipeline interleaving and
parallelism in recursive digital filters-Part I: Pipelining using scattered
look-ahead and decomposition," IEEE Trans. on Acoust., Speech and
Signal Proc., vol. 37, pp. 1099-1117, July 1989.
[4] C. Caraiscos and B. Liu, "A roundoff error analysis of the LMS adaptive
algorithm," IEEE Trans. Acoust., Speech, and Sig. Proc., vol. 32, pp.
34-41, Feb. 1984.
[5] N. R. Shanbhag and K. K. Parhi, "Roundoff error analysis of the
pipelined ADPCM coder," Proc. IEEE Intl. Symp. on Circ. and Syst.,
Chicago, IL, pp. 886-889, May 1993.
[6] C. Leiserson and J. Saxe, "Optimizing synchronous systems," J. of VLSI
and Comput. Syst., vol. 1, pp. 41-67, 1983.
[7] B. Widrow et al., "Stationary and non-stationary learning characteristics
of the LMS adaptive filter," Proc. IEEE, vol. 64, pp. 1151-1162, Aug.
1976.
[8] B. Widrow et al., "Adaptive noise cancelling: principles and
applications," Proc. IEEE, vol. 63, pp. 1692-1716, Dec. 1975.