FPGA Implementation of High Speed FIR Filters Using Add and Shift Method

Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
University of California, Santa Barbara, CA 93106
E-mail: [email protected], [email protected], [email protected]

Abstract-We present a method for implementing high speed Finite Impulse Response (FIR) filters using only registered adders and hardwired shifts. We make extensive use of a modified common subexpression elimination algorithm to reduce the number of adders. We target our optimizations to Xilinx Virtex II devices, where we compare our implementations with those produced by Xilinx Coregen™ using Distributed Arithmetic. We observe up to 50% reduction in the number of slices and up to 75% reduction in the number of LUTs for fully parallel implementations. We also observe up to 50% reduction in the total dynamic power consumption of the filters. Our designs perform significantly faster than the MAC filters, which use embedded multipliers.

I. INTRODUCTION

FPGAs are being increasingly used for a variety of computationally intensive applications, mainly in the realm of Digital Signal Processing (DSP) and communications [1-7]. Due to rapid advances in technology, the current generation of FPGAs contains a very high number of Configurable Logic Blocks (CLBs), and FPGAs are becoming more feasible for implementing a wide range of applications. The high non-recurring engineering (NRE) costs and long development time for ASICs are making FPGAs more attractive for application specific DSP solutions. DSP functions such as FIR filters and transforms are used in a number of applications such as communication and multimedia. These functions are major determinants of the performance and power consumption of the whole system. Therefore it is important to have good tools for optimizing these functions.

Equation (I) represents the output of an L tap FIR filter, which is the convolution of the latest L input samples. L is the number of coefficients h[k] of the filter, and x[n] represents the input time series.

$$ y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k] \qquad \text{(I)} $$

The conventional tapped delay line realization of this inner product is shown in Figure 1. This implementation translates to L multiplications and L-1 additions per sample to compute the result. This can be implemented using a single Multiply Accumulate (MAC) engine, but it would require L MAC cycles before the next input sample can be processed. Using a parallel implementation with L MACs can speed up the performance L times. A general purpose multiplier occupies a large area on FPGAs. Since all the multiplications are with constants, the full flexibility of a general purpose multiplier is not required, and the area can be vastly reduced using techniques developed for constant multiplication. Though most of the current generation FPGAs such as Virtex II™ have embedded multipliers to handle these multiplications, the number of these multipliers is typically limited. Furthermore, the size of these multipliers is limited to only 18 bits, which limits the precision of the computations for high speed requirements. The ideal implementation would involve a sharing of the Configurable Logic Blocks (CLBs) and these multipliers. In this paper, we present a technique that is better than conventional techniques for implementation on the CLBs.

Figure 1. A MAC FIR filter block diagram

An alternative to the above approach is Distributed Arithmetic (DA), which is a well known method to save resources. Using the DA method, the filter can be implemented either in bit serial or fully parallel mode to trade bandwidth for area utilization. Assuming the coefficients c[n] are known constants, equation (I) can be rewritten as follows:

$$ y = \sum_{n=0}^{N-1} c[n]\, x[n] \qquad \text{(II)} $$

Variable x[n] can be represented by:

$$ x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\} \qquad \text{(III)} $$

where x_b[n] is the bth bit of x[n] and B is the input width. Finally, the inner product can be rewritten as follows:

$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\,2^{B-1} + x_{B-2}[0]\,2^{B-2} + \dots + x_0[0]\,2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\,2^{B-1} + x_{B-2}[1]\,2^{B-2} + \dots + x_0[1]\,2^0\bigr) \\
  &\quad + \dots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\,2^{B-1} + x_{B-2}[N-1]\,2^{B-2} + \dots + x_0[N-1]\,2^0\bigr) \\
  &= \bigl(c[0]\,x_{B-1}[0] + c[1]\,x_{B-1}[1] + \dots + c[N-1]\,x_{B-1}[N-1]\bigr)\,2^{B-1} \\
  &\quad + \bigl(c[0]\,x_{B-2}[0] + c[1]\,x_{B-2}[1] + \dots + c[N-1]\,x_{B-2}[N-1]\bigr)\,2^{B-2} \\
  &\quad + \dots \\
  &\quad + \bigl(c[0]\,x_0[0] + c[1]\,x_0[1] + \dots + c[N-1]\,x_0[N-1]\bigr)\,2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n] \qquad \text{(IV)}
\end{aligned}
$$

where n = 0, 1, …, N-1 and b = 0, 1, …, B-1.
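As a quick numerical check of equation (IV), the following Python sketch evaluates the inner product one bit plane at a time from a precomputed table of coefficient sums, and compares the result with the direct sum of equation (II). The function names, the example coefficients and the 4 bit unsigned samples are illustrative assumptions, not values from the paper.

```python
def build_da_lut(coeffs):
    """Table indexed by an N-bit address: entry = sum of the coefficients
    whose address bit is 1 (the inner sum of equation (IV) for one bit plane)."""
    n = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, samples, width):
    """Evaluate sum(c[n] * x[n]) bit plane by bit plane, LSB first.

    Assumes unsigned samples that fit in `width` bits, as in equation (III).
    """
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(width):
        # Gather bit b of every sample into one table address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> b) & 1) << i
        # Weight the looked-up partial sum by 2^b and accumulate.
        acc += lut[addr] << b
    return acc


coeffs = [3, -5, 7, 2]    # hypothetical constant coefficients c[n]
samples = [9, 4, 15, 6]   # hypothetical 4-bit unsigned inputs x[n]

direct = sum(c * x for c, x in zip(coeffs, samples))
assert da_inner_product(coeffs, samples, width=4) == direct
```

Each pass of the outer loop stands for one bit plane, so a B bit input needs B table lookups, and the table itself has 2^N entries for N coefficients.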

Authorized licensed use limited to: Arizona State University. Downloaded on July 02,2024 at 17:47:27 UTC from IEEE Xplore. Restrictions apply.
1-4244-9707-X/06/$20.00 ©2006 IEEE
The coefficients in most DSP applications for the multiply accumulate operation are constants. The partial products are obtained by multiplying the coefficients c_i by one bit of the data x_i at a time, which is an AND operation. These partial products are then added, and the result depends only on the outputs of the input shift registers. The AND functions and adders can therefore be replaced by Look Up Tables (LUTs) that give the partial product. This is shown in Figure 2. The input sequence is fed into the shift register at the input sample rate. The serial output is presented to the RAM based shift registers (the registers are not shown in the figure for simplicity) at the bit clock rate, which is n+1 times the sample rate (n is the number of bits in a data input sample). The RAM based shift register stores the data in a particular address. The outputs of the registered LUTs are added and loaded into the scaling accumulator from LSB to MSB, and the result, which is the filter output, is accumulated over time. For an n bit input, n+1 clock cycles are needed for a symmetrical filter to generate the output.

In the conventional MAC method with a limited number of MAC engines, the system sample rate decreases as the filter length is increased. This is not the case with serial DA architectures, since the filter sample rate is decoupled from the filter length. As the filter length is increased, the throughput is maintained but more logic resources are consumed.

Though the serial DA architecture is efficient by construction, its performance is limited by the fact that the next input sample can be processed only after every bit of the current input sample has been processed. Each bit of the current input sample takes one clock cycle to process.

Figure 2. A serial DA FIR filter block diagram (each 4-input LUT maps an address to a sum of coefficients: address 0000 holds 0, 0001 holds C0, and so on up to 1111, which holds C0+C1+C2+C3)

Therefore, if the input bitwidth is 12, then a new input can be sampled every 12 clock cycles. The performance of the circuit can be improved by modifying the architecture to a parallel architecture which processes the data bits in groups. Figure 3 shows the block diagram of a 2 bit parallel DA FIR filter. The tradeoff here is performance for area, since increasing the number of bits sampled has a significant effect on resource utilization on the FPGA. For instance, doubling the number of bits sampled doubles the throughput and halves the number of clock cycles. This change doubles the number of LUTs as well as the size of the scaling accumulator. The number of bits being processed can be increased to its maximum size, which is the input length n. This gives the maximum throughput to the filter. For a fully parallel implementation of the DA filter (PDA), the number of LUTs required would be enormous. In this work we show an alternative to the PDA method for implementing high speed FIR filters that consumes significantly less area and power.

Figure 3. A 2 bit parallel DA FIR filter block diagram

A popular technique for implementing the transposed form of FIR filters is the use of a multiplier block, instead of using multipliers for each constant, as shown in Figure 4. The multiplications with the set of constants {h_k} are replaced by an optimized set of additions and shift operations, involving computation sharing. Further optimization can be done by factorizing the expression and finding common subexpressions. The performance of this filter architecture is limited by the latency of the biggest adder and is the same as that of the PDA.

Figure 4. Replacing constant multiplication by multiplier block

The main contribution of this paper is the development of a novel algorithm for optimizing the multiplier block for FIR filters, using a modified algorithm for common subexpression elimination. The goal of the algorithm is to produce a filter that can provide the maximum sample rate with the least amount of hardware. Our algorithm takes into account the specific features of FPGA slices to reduce the total number of occupied slices. The reduced number of slices also leads to a reduction in the total power on the FPGA. We compare the total area and power consumption of our results with those of the industry standard Xilinx Coregen™.

The rest of the paper is organized as follows: Section 2 presents some related work. In Section 3, we describe our filter architecture. In Section 4, we present our optimization algorithm for reducing the total area of the design. In Section 5, we describe our experimental setup and present our results. Finally we conclude the paper in Section 6.

II. RELATED WORK

Multiplications with constants have to be performed in many signal processing and communication applications such as FIR filters, audio, video and image processing. Since implementing a general purpose multiplier is expensive on an FPGA, and since such a multiplier is not really needed when one of the operands is a constant, there has been a lot of work
on deriving efficient structures for constant multiplications [8-13]. All these techniques are based on computing constant multiplications using table lookups and additions. The method of Distributed Arithmetic [12, 14], which is the most popular method for implementing multiplierless FIR filters, is also based on table lookup. The Xilinx™ CORE Generator has a highly parameterizable, optimized filter core for implementing digital FIR filters [12], based on both Distributed Arithmetic and MAC (Multiply Accumulate) architectures. It generates synthesized cores that target a wide range of Xilinx devices. The MAC based implementations make use of the embedded DSP slices on the FPGA devices. In this work, we primarily compare our technique with the Coregen implementation of Distributed Arithmetic, since that is also a multiplierless technique. We show that our designs are much more area efficient than the DA based approach for fully parallel filters. We also compare our method with MAC based implementations, where we achieve significantly higher performance.

Though there has been a lot of work on optimizing constant multiplications using adders and employing redundancy elimination [15-19], these techniques have not been effectively used for FIR filter design. The closest work to implementing filters with adders is [20], where FIR filters are implemented using the Add and Shift method. Canonical Signed Digit (CSD) encoding is used for the coefficients to minimize the number of additions. The paper discusses how high speed implementations can be achieved by registering each adder, due to which the critical path becomes equal to the delay of the adder. Registering an adder output comes at no extra cost on an FPGA because of the presence of a D flip flop at the output of each LUT. In comparison with [20], we extensively use common subexpression elimination for reducing the number of adders and therefore area. Furthermore, our designs can run with sample rates as high as 252 Msps (Million samples per second), whereas the designs in [20] can run only at 78.6 Msps.

In comparison with the other algorithms for common subexpression elimination [15, 16, 18, 19, 21], our method takes into account the structure of the FPGA slices (Figure 5) and both the cost of adders and the cost of registers when performing the optimization. Furthermore, we provide comprehensive evidence of the benefits of our technique through experimental results, where we compare our results with those produced by industry standard tools.

III. FILTER ARCHITECTURE

We base our filter architecture on the transposed form of the FIR filter as shown in Figure 1. The filter can be divided into two main parts, the multiplier block and the delay block, as illustrated in Figure 4. In the multiplier block, the current input variable x[n] is multiplied by all the coefficients of the filter to produce the y_i outputs. These y_i outputs are then delayed and added in the delay block to produce the filter output y[n].

We perform all our optimizations in the multiplier block. The constant multiplications are decomposed into registered additions and hardwired shifts. The additions are performed using two input adders, which are arranged in the fastest tree structure. We use registered adders, so that the performance of the filter is only limited by the slowest adder. We use common subexpression elimination extensively to reduce the number of adders, which leads to a reduction in the area. To synchronize all the intermediate values in the computation, we insert registers in the dataflow wherever necessary.

Figure 5. Registered adder at no additional cost

Performing subexpression elimination can sometimes increase the number of registers substantially, and the overall area could possibly increase. Consider the two expressions F1 and F2, which could be part of the multiplier block.

F1 = A + B + C + D
F2 = A + B + C + E

Figure 6 shows the original unoptimized expression trees. Both the expressions have a minimum critical path of two addition cycles. These expressions require a total of six registered adders for the fastest implementation, and no extra registers are required. From the expressions we can see that the computation A + B + C is common to both the expressions. If we extract this subexpression, we get the structure shown in Figure 7. Since both D and E need to wait for two addition cycles to be added to (A + B + C), we need to use two registers each for D and E, such that new values for A, B, C, D and E can be read in at each clock cycle. Assuming that an adder and a register with the same bitwidth have the same cost, the structure shown in Figure 7 occupies more area than the one shown in Figure 6. A more careful subexpression elimination algorithm would only extract the common subexpression A + B (or A + C or B + C). The number of adders is decreased by one from the original, and no additional registers are added. This is illustrated in Figure 8. The algorithm for performing this kind of optimization is described in the next section.

Figure 6. Unoptimized expression trees

Figure 7. Extracting common expression (A + B + C)
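The area comparison between the structures of Figures 6, 7 and 8 can be reproduced with a small Python sketch. It assumes, as the text does, that every adder is registered (one cycle of latency) and that a register costs the same as an adder. The function name `schedule_cost`, the signal names and the way each tree is written down are hypothetical, and registers are counted per operand without sharing delay chains between consumers.

```python
def schedule_cost(adders):
    """Count registered adders and alignment registers in an adder network.

    `adders` maps an output signal name to its two operand names. Any operand
    that is not itself an adder output is a primary input available at cycle 0.
    Returns (number_of_adders, number_of_alignment_registers).
    """
    ready = {}

    def ready_time(signal):
        if signal not in adders:
            return 0  # primary input
        if signal not in ready:
            left, right = adders[signal]
            # A registered adder delivers its sum one cycle after its
            # later operand arrives.
            ready[signal] = max(ready_time(left), ready_time(right)) + 1
        return ready[signal]

    registers = 0
    for name, (left, right) in adders.items():
        fire = ready_time(name) - 1  # cycle in which both operands must be present
        # An operand that is ready early must be delayed until `fire`.
        registers += (fire - ready_time(left)) + (fire - ready_time(right))
    return len(adders), registers


# Figure 6: F1 and F2 each built as an independent two-level tree.
fig6 = {
    "ab1": ("A", "B"), "cd": ("C", "D"), "F1": ("ab1", "cd"),
    "ab2": ("A", "B"), "ce": ("C", "E"), "F2": ("ab2", "ce"),
}
# Figure 7: the shared subexpression (A + B + C) is extracted.
fig7 = {
    "ab": ("A", "B"), "abc": ("ab", "C"),
    "F1": ("abc", "D"), "F2": ("abc", "E"),
}
# Figure 8: only (A + B) is shared.
fig8 = {
    "d1": ("A", "B"), "cd": ("C", "D"), "ce": ("C", "E"),
    "F1": ("d1", "cd"), "F2": ("d1", "ce"),
}

for label, net in (("Figure 6", fig6), ("Figure 7", fig7), ("Figure 8", fig8)):
    n_adders, n_regs = schedule_cost(net)
    print(label, "adders:", n_adders, "registers:", n_regs,
          "total:", n_adders + n_regs)
```

Under these assumptions the Figure 7 structure needs one register for C and two each for D and E, so its total cost exceeds that of Figure 6, while the Figure 8 structure saves one adder without adding registers.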

Figure 8. Extracting common subexpression (A+B)

IV. OPTIMIZATION ALGORITHM

The goal of our optimization is to reduce the area of the multiplier block by reducing the number of adders and any additional registers required for the fastest implementation of the FIR filter. We first give a brief overview of the common subexpression elimination methods. A detailed description can be found in [22]. We then present the modified optimization algorithm used in our work.

A. Overview of common subexpression elimination

We use a polynomial transformation of constant multiplications. Given a representation for the constant C and the variable X, the multiplication C*X can be represented as a summation of terms denoting the decomposition of the multiplication into shifts and additions as

$$ C \cdot X = \sum_i \pm X L^i \qquad \text{(V)} $$

The terms can be either positive or negative when the constants are represented using signed digit representations such as the Canonical Signed Digit (CSD) representation. The exponent of L represents the magnitude of the left shift, and the i's represent the digit positions of the non-zero digits of the constants. For example, the multiplication 7*X = (100-1)_CSD * X = X<<3 - X = XL^3 - X, using the polynomial transformation.

We use divisors to represent all possible common subexpressions. Divisors are obtained from an expression by looking at every pair of terms in the expression and dividing the terms by the minimum exponent of L. For example, in the expression F = XL^2 + XL^3 + XL^5, consider the pair of terms (+XL^2 + XL^3). The minimum exponent of L in the two terms is L^2. Dividing by L^2, we get the divisor (X + XL). From the other two pairs of terms (XL^2 + XL^5) and (XL^3 + XL^5), we get the divisors (X + XL^3) and (X + XL^2) respectively. These divisors are significant, because every common subexpression in the set of expressions can be detected by performing intersections among the set of divisors.

B. Optimization algorithm

We first calculate the minimum number of registers required for our design. We calculate this by arranging the original expressions in the fastest possible tree structure, and then inserting registers. For example, for the six term expression F = A + B + C + D + E + F, we have the fastest tree structure with three addition steps, and we require one register to synchronize the intermediate values, such that new values for A, B, C, D, E, F can be read in every clock cycle. This is illustrated in Figure 9.

Figure 9. Calculating registers required for fastest evaluation

Next, we generate all the divisors for the set of expressions describing the multiplier block. We then use an iterative algorithm, where we extract the divisor that has the greatest value. To calculate the value of a divisor, we assume that the cost of a registered adder and a register is the same. We calculate the value of a divisor as the number of additions saved by extracting it minus the number of registers that have to be added. After selecting the best divisor, we rewrite the expressions using it. We then generate new divisors from the new terms that have been generated due to rewriting, and add them to the dynamic list of divisors. The iteration stops when there is no valuable divisor remaining in the set of divisors.

Consider the expressions shown in Figure 6. We need six registered adders and no additional registers for the fastest evaluation of F1 and F2. Now consider the selection of the divisor d1 = (A+B). This divisor saves one addition and does not increase the number of registers. Divisors (A + C) and (B + C) also have the same value, but (A+B) is selected at random. The expressions are now rewritten as:

d1 = (A + B)
F1 = d1 + C + D
F2 = d1 + C + E

    ReduceArea( {Pi} )
    {
        {Pi} = Set of expressions in polynomial form;
        {D} = Set of divisors = ∅;

        // Step 1: Creating divisors and calculating minimum
        // number of registers required
        for each expression Pi in {Pi}
        {
            {Dnew} = FindDivisors(Pi);
            Update frequency statistics of divisors in {D};
            {D} = {D} ∪ {Dnew};
            Pi->MinRegisters = Calculate minimum registers required
                               for fastest evaluation of Pi;
        }

        // Step 2: Iterative selection and elimination of best divisor
        while(1)
        {
            Find d = Divisor in {D} with greatest Value;
            // Value = Num Additions reduced - Num Registers Added;
            if( d == NULL ) break;
            Rewrite affected expressions in {Pi} using d;
            Remove divisors in {D} that have become invalid;
            Update frequency statistics of affected divisors;
            {Dnew} = Set of new divisors from new terms added
                     by division;
            {D} = {D} ∪ {Dnew};
        }
    }

Figure 10. Optimization algorithm to reduce area
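The polynomial transformation and the FindDivisors step of Figure 10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `csd_terms` and `find_divisors` are hypothetical, constants are assumed to be non-negative integers, each term X·L^i is stored as a (sign, shift) pair, and signs are carried through unchanged rather than normalised.

```python
from itertools import combinations


def csd_terms(constant):
    """Return [(sign, shift), ...] with constant == sum(sign * 2**shift).

    Uses the standard non-adjacent form recoding, which yields the
    Canonical Signed Digit representation. Assumes constant >= 0.
    """
    terms = []
    shift = 0
    while constant != 0:
        if constant & 1:
            # Pick +1 when the low bits are ...01 and -1 when they are ...11,
            # so that the next digit is guaranteed to be zero.
            digit = 2 - (constant & 3)
            terms.append((digit, shift))
            constant -= digit
        constant >>= 1
        shift += 1
    return terms


def find_divisors(terms):
    """Generate one divisor per pair of terms of a single expression.

    Each pair is divided by its smaller power of L, so the lower term of
    every divisor has shift 0.
    """
    ordered = sorted(terms, key=lambda t: t[1])
    divisors = []
    for (s1, e1), (s2, e2) in combinations(ordered, 2):
        low = min(e1, e2)
        divisors.append(((s1, e1 - low), (s2, e2 - low)))
    return divisors


# 7*X = X<<3 - X, i.e. the terms -X and +X*L^3.
print(csd_terms(7))

# F = X*L^2 + X*L^3 + X*L^5 gives the divisors
# (X + X*L), (X + X*L^3) and (X + X*L^2).
print(find_divisors([(1, 2), (1, 3), (1, 5)]))
```

Intersecting the divisor lists of different coefficients (or of different term pairs of the same coefficient) is then what exposes a shared two-term subexpression.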

After rewriting the expressions and forming new divisors, the divisor d2 = (d1 + C) is considered. This divisor saves one adder, but introduces five additional registers, as can be seen in Figure 7. Therefore this divisor has a value of -4. No other valuable divisors can be found and the iteration stops. We end up with the expressions shown in Figure 8.

V. EXPERIMENTS

The goal of our experiments was to compare the number of resources consumed by our add and shift method with that consumed by the cores generated by the commercial Coregen™ tool, based on Distributed Arithmetic. Besides the resources, we also compared the power consumption of the two implementations, and measured the performance. For our experiments, we considered 9 FIR filters of various sizes (6, 10, 13, 20, 28, 41, 61, 119 and 151 tap filters). We targeted the Xilinx Virtex II device for our experiments. The constants were normalized to 17 digits of precision and the input samples were assumed to be 12 bits wide. For the add and shift method, we decomposed all the constant multiplications into additions and shifts and optimized the expressions using the algorithm explained in Section IV.B. We used the Xilinx Integrated Software Environment (ISE) for performing synthesis and implementation of the designs. All the designs were synthesized for maximum performance.

Table 1a shows the resources utilized for the various filters and the performance in terms of Million samples per second (Msps) for the filters implemented using the add and shift method. Table 1b shows the same numbers for the filters implemented using Xilinx Coregen, using the Parallel Distributed Arithmetic (PDA) method.

Table 1a. Filter synthesis using the add and shift method

Filter (# taps)   Slices    LUTs     FFs   Performance (Msps)
6                    264     213     509   251
10                   474     406     916   222
13                   386     334     749   252
20                   856     705    1650   250
28                  1294    1145    2508   227
41                  2154    1719    4161   223
61                  3264    2591    6303   192
119                 6009    4821   11551   203
151                 7579    6098   14611   180

Figure 11 plots the reduction in the number of resources, in terms of the number of Slices, Look Up Tables (LUTs) and Flip Flops (FFs). From the results, we can observe an average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs. Though our algorithm does not optimize for performance, the synthesis produces better performance in most of the cases, and for the 13 and 20 tap filters, we observe about 26% improvement in performance.

Table 1b. Filter synthesis using Coregen (PDA method)

Filter (# taps)   Slices    LUTs     FFs   Performance (Msps)
6                    524     774    1012   245
10                   781    1103    1480   222
13                   929    1311    1775   199
20                  1191    1631    2288   199
28                  1774    2544    3381   199
41                  2475    3642    4748   222
61                  3528    5335    6812   199
119                 6484    9754   12539   205
151                 8274   12525   15988   199

Figure 11. Reduction in resources (percentage reduction in slices, LUTs and FFs versus number of taps)

Figure 12 compares the power consumption of our add/shift method with that of Coregen™. From the results we can observe up to 50% reduction in dynamic power consumption. We did not include the quiescent power in our calculation, since that value is the same for both methods. The power consumption was obtained by applying the same test stimulus to both designs and measuring the power using the XPower tool provided with the Xilinx ISE software.

Figure 12. Power consumption (dynamic power in mW versus filter size in taps, for Add/Shift and Coregen)

Comparison with MAC filters using embedded multipliers

Coregen™ can produce FIR filters based on the Multiply Accumulate (MAC) method, which makes use of the embedded multipliers and DSP blocks. We implemented the FIR filters using the MAC method to compare the resource usage and performance with our add and shift method. Due to tool limitations we had to run these experiments on a Virtex IV device. We present the synthesis results in terms of the number of slices on the Virtex IV device and the performance in Msps in Table 2.

Table 2. Comparing with MAC filter on Virtex IV

                  Add Shift Method    MAC filter
Filter (# taps)   Slices    Msps      Slices    Msps
6                    264     296         219     262
10                   475     296         418     253
13                   387     296         462     253
20                   851     271         790     251
28                  1303     305         886     251
41                  2178     296        1660     243
61                  3284     247        1947     242
119                 6025     294        3581     241
151                 7623     294        7631     215

From the table, it can be seen that the MAC filter uses fewer slices than the add-shift method, but it also
uses the DSP blocks available on Virtex IV devices. The number of DSP blocks is equal to the number of taps of the filter. The results show that we achieve higher performance as the filter size increases. This is mainly because the critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders. Another limitation of the MAC method is that Xilinx Coregen™ is limited to an input width of 17 bits due to the embedded DSP block input limitation, while our add and shift method can accept inputs of any width.

VI. CONCLUSION

In this paper we presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low area, low power and high speed implementations of FIR filters. We validated our techniques on Virtex II™ devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques. In future, we would like to modify our algorithm to make use of the limited number of embedded multipliers available on the FPGA devices.

VII. REFERENCES

[1] K. D. Underwood and K. S. Hemmert, "Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance," presented at International Symposium on Field-Programmable Custom Computing Machines, California, USA, 2004.
[2] L. Zhuo and V. K. Prasanna, "Sparse Matrix-Vector Multiplication on FPGAs," presented at International Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, 2005.
[3] Y. Meng, A. P. Brown, R. A. Iltis, T. Sherwood, H. Lee, and R. Kastner, "MP Core: Algorithm and Design Techniques for Efficient Channel Estimation in Wireless Applications," presented at Design Automation Conference (DAC), Anaheim, CA, 2005.
[4] B. L. Hutchings and B. E. Nelson, "Gigaop DSP on FPGA," presented at Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on, 2001.
[5] A. Alsolaim, J. Becker, M. Glesner, and J. Starzyk, "Architecture and Application of a Dynamically Reconfigurable Hardware Array for Future Mobile Communication Systems," presented at International Symposium on Field Programmable Custom Computing Machines (FCCM), 2000.
[6] S. J. Melnikoff, S. F. Quigley, and M. J. Russell, "Implementing a Simple Continuous Speech Recognition System on an FPGA," presented at International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002.
[7] T. Yokota, M. Nagafuchi, Y. Mekada, T. Yoshinaga, K. Ootsu, and T. Baba, "A Scalable FPGA-based Custom Computing Machine for Medical Image Processing," presented at International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002.
[8] K. Chapman, "Constant Coefficient Multipliers for the XC4000E," Xilinx Technical Report 1996.
[9] K. Wiatr and E. Jamro, "Constant coefficient multiplication in FPGA structures," presented at Euromicro Conference, 2000. Proceedings of the 26th, 2000.
[10] M. J. Wirthlin and B. McMurtrey, "Efficient Constant Coefficient Multiplication Using Advanced FPGA Architectures," presented at International Conference on Field Programmable Logic and Applications (FPL), 2001.
[11] M. J. Wirthlin, "Constant Coefficient Multiplication Using Look-Up Tables," Journal of VLSI Signal Processing, vol. 36, pp. 7-15, 2004.
[12] "Distributed Arithmetic FIR Filter v9.0," Xilinx Product Specification 2004.
[13] T. Sasao, Y. Iguchi, and T. Suzuki, "On LUT Cascade Realizations of FIR Filters," presented at Euromicro Conference on Digital System Design (DSD), 2005.
[14] G. R. Goslin, "A Guide to Using Field Programmable Gate Arrays (FPGAs) for Application-Specific Digital Signal Processing Performance," Xilinx Application Note, San Jose 1995.
[15] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 1996.
[16] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on [see also Circuits and Systems II: Express Briefs, IEEE Transactions on], vol. 43, pp. 677-688, 1996.
[17] H. T. Nguyen and A. Chatterjee, "Number-splitting with shift-and-add decomposition for power and hardware optimization in linear DSP synthesis," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 8, pp. 419-424, 2000.
[18] H.-J. Kang, H. Kim, and I.-C. Park, "FIR filter synthesis algorithms for minimizing the delay and the number of adders," presented at Computer Aided Design, 2000. ICCAD-2000. IEEE/ACM International Conference on, 2000.
[19] A. Hosangadi, F. Fallah, and R. Kastner, "Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two Term Common Subexpressions," presented at Asia South Pacific Design Automation Conference, Shanghai, 2005.
[20] M. Yamada and A. Nishihara, "High-speed FIR digital filter with CSD coefficients implemented on FPGA," presented at Design Automation Conference, 2001. Proceedings of the ASP-DAC 2001. Asia and South Pacific, 2001.
[21] H. Safiri, M. Ahmadi, G. A. Jullien, and W. C. Miller, "A new algorithm for the elimination of common subexpressions in hardware implementation of digital filters by using genetic programming," presented at Application-Specific Systems, Architectures, and Processors, 2000. Proceedings. IEEE International Conference on, 2000.
[22] A. Hosangadi, F. Fallah, and R. Kastner, "Reducing Hardware complexity by iteratively eliminating two term common subexpressions," presented at Asia South Pacific Design Automation Conference (ASP-DAC), 2005.
