
362 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 2, FEBRUARY 2014

Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay

Pramod Kumar Meher, Senior Member, IEEE, and Sang Yoon Park, Member, IEEE

Abstract— In this paper, we present an efficient architecture for the implementation of a delayed least mean square (DLMS) adaptive filter. To achieve a lower adaptation delay and an area-delay-power-efficient implementation, we use a novel partial product generator and propose a strategy for optimized balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of the existing systolic structures, on average, for filter lengths N = 8, 16, and 32. We propose an efficient fixed-point implementation scheme of the proposed architecture and derive the expression for the steady-state error. We show that the steady-state mean squared error obtained from the analytical result matches the simulation result. Moreover, we propose a bit-level pruning of the architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning, without noticeable degradation of the steady-state-error performance.

Index Terms— Adaptive filters, circuit optimization, fixed-point arithmetic, least mean square (LMS) algorithms.

Manuscript received September 9, 2012; revised December 5, 2012; accepted January 8, 2013. Date of publication February 4, 2013; date of current version January 17, 2014. (Corresponding author: S. Y. Park.) The authors are with the Institute for Infocomm Research, 138632 Singapore (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2013.2239321

1063-8210 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

THE LEAST MEAN SQUARE (LMS) adaptive filter is the most popular and most widely used adaptive filter, not only because of its simplicity but also because of its satisfactory convergence performance [1], [2]. The direct-form LMS adaptive filter involves a long critical path due to the inner-product computation required to obtain the filter output. The critical path must be reduced by pipelined implementation when it exceeds the desired sample period. Since the conventional LMS algorithm does not support pipelined implementation because of its recursive behavior, it is modified to a form called the delayed LMS (DLMS) algorithm [3]–[5], which allows pipelined implementation of the filter.

A lot of work has been done to implement the DLMS algorithm in systolic architectures to increase the maximum usable frequency [3], [6], [7], but these involve an adaptation delay of ∼N cycles for filter length N, which is quite high for large-order filters. Since the convergence performance degrades considerably for a large adaptation delay, Visvanathan et al. [8] have proposed a modified systolic architecture to reduce the adaptation delay. A transpose-form LMS adaptive filter is suggested in [9], where the filter output at any instant depends on the delayed versions of the weights, and the number of delays in the weights varies from 1 to N. Van and Feng [10] have proposed a systolic architecture, where they use relatively large processing elements (PEs) to achieve a lower adaptation delay with a critical path of one MAC operation. Ting et al. [11] have proposed a fine-grained pipelined design that limits the critical path to the maximum of one addition time, which supports a high sampling frequency, but involves a large area overhead for pipelining and higher power consumption than [10], due to its large number of pipeline latches. Further effort has been made by Meher and Maheshwari [12] to reduce the number of adaptation delays. Meher and Park [13], [14] have proposed a 2-bit multiplication cell and used it with an efficient adder tree for pipelined inner-product computation to minimize the critical path and silicon area without increasing the number of adaptation delays.

The existing work on the DLMS adaptive filter does not discuss fixed-point implementation issues, e.g., the location of the radix point, the choice of word length, and quantization at various stages of computation, although these directly affect the convergence performance, particularly due to the recursive behavior of the LMS algorithm. Therefore, fixed-point implementation issues are given adequate emphasis in this paper. Besides, we present here the optimization of our previously reported design [13], [14] to reduce the number of pipeline delays along with the area, sampling period, and energy consumption. The proposed design is found to be more efficient in terms of the power-delay product (PDP) and the energy-delay product (EDP) compared with the existing structures.

In the next section, we review the DLMS algorithm, and in Section III, we describe the proposed optimized architecture for its implementation. Section IV deals with fixed-point implementation considerations and simulation studies of the convergence of the algorithm. In Section V, we discuss the synthesis of the proposed architecture and its comparison with the existing architectures. Conclusions are given in Section VI.

II. REVIEW OF DELAYED LMS ALGORITHM

The weights of the LMS adaptive filter during the nth iteration are updated according to the following equations [2]:

wn+1 = wn + μ · en · xn (1a)

where

en = dn − yn, yn = wn^T · xn (1b)

where the input vector xn and the weight vector wn at the nth iteration are, respectively, given by

xn = [xn, xn−1, . . . , xn−N+1]^T
wn = [wn(0), wn(1), . . . , wn(N − 1)]^T.
Fig. 1. Structure of the conventional delayed LMS adaptive filter.

Fig. 3. Convergence performance of system identification with LMS and modified DLMS adaptive filters.
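As a purely behavioral illustration of updates (1) and (2), the sketch below runs system identification with a conventional LMS update (m = 0) and a delayed update (m > 0). The unknown system, step size, and iteration budget here are arbitrary stand-ins chosen for the sketch, not the band-pass experiment of this section, and the code models the algorithm only, not the hardware.

```python
import numpy as np

def dlms_identify(m, n_taps=16, mu=0.02, n_iter=5000, seed=1):
    """System identification with the DLMS update of (2):
    w_{n+1} = w_n + mu * e_{n-m} * x_{n-m}; m = 0 is conventional LMS."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n_taps)     # stand-in unknown system
    w = np.zeros(n_taps)                # adaptive weights w_n
    x = np.zeros(n_taps)                # delay line [x_n, ..., x_{n-N+1}]
    xs, es = [], []                     # histories needed for the delayed update
    mse = np.empty(n_iter)
    for n in range(n_iter):
        x = np.roll(x, 1)
        x[0] = rng.standard_normal()    # new input sample x_n
        e = h @ x - w @ x               # e_n = d_n - y_n (noise-free desired)
        xs.append(x.copy())
        es.append(e)
        mse[n] = e * e
        if n >= m:                      # update with the m-cycle-old error
            w += mu * es[n - m] * xs[n - m]
    return w, h, mse

w, h, mse = dlms_identify(m=5)          # DLMS with an adaptation delay of 5
```

With a sufficiently small μ, both variants converge to the same weights; the delay mainly slows convergence, consistent with the behavior shown in Fig. 3.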
Fig. 2. Structure of the modified delayed LMS adaptive filter.

dn is the desired response, yn is the filter output, and en denotes the error computed during the nth iteration. μ is the step size, and N is the number of weights used in the LMS adaptive filter.

In the case of pipelined designs with m pipeline stages, the error en becomes available after m cycles, where m is called the "adaptation delay." The DLMS algorithm therefore uses the delayed error en−m, i.e., the error corresponding to the (n − m)th iteration, for updating the current weight instead of the most recent error. The weight-update equation of the DLMS adaptive filter is given by

wn+1 = wn + μ · en−m · xn−m. (2)

The block diagram of the DLMS adaptive filter is shown in Fig. 1, where the adaptation delay of m cycles amounts to the delay introduced by the whole adaptive filter structure, consisting of finite impulse response (FIR) filtering and the weight-update process.

It is shown in [12] that the adaptation delay of conventional LMS can be decomposed into two parts: one part is the delay introduced by the pipeline stages in FIR filtering, and the other part is due to the delay involved in pipelining the weight-update process. Based on such a decomposition of delay, the DLMS adaptive filter can be implemented by the structure shown in Fig. 2.

Assuming that the latency of computation of the error is n1 cycles, the error computed by the structure at the nth cycle is en−n1, which is used with the input samples delayed by n1 cycles to generate the weight-increment term. The weight-update equation of the modified DLMS algorithm is given by

wn+1 = wn + μ · en−n1 · xn−n1 (3a)

where

en−n1 = dn−n1 − yn−n1 (3b)

and

yn = (wn−n2)^T · xn. (3c)

We notice that, during the weight update, the error with n1 delays is used, while the filtering unit uses the weights delayed by n2 cycles. The modified DLMS algorithm decouples the computations of the error-computation block and the weight-update block, and allows us to perform optimal pipelining by feedforward cut-set retiming of both these sections separately to minimize the number of pipeline stages and the adaptation delay.

Adaptive filters with different n1 and n2 are simulated for a system identification problem. The 10-tap band-pass filter with impulse response

hn = sin(wH(n − 4.5))/(π(n − 4.5)) − sin(wL(n − 4.5))/(π(n − 4.5)), for n = 0, 1, 2, . . . , 9; otherwise hn = 0 (4)

is used as the unknown system, as in [10]. wH and wL represent the high and low cutoff frequencies of the passband, and are set to wH = 0.7π and wL = 0.3π, respectively. The step size μ is set to 0.4. A 16-tap adaptive filter identifies the unknown system with Gaussian random input xn of zero mean and unit variance. In all cases, the outputs of the known system are of unity power and contaminated with white Gaussian noise of −70 dB strength. Fig. 3 shows the learning curve of the MSE of the error signal en, obtained by averaging 20 runs, for the conventional LMS adaptive filter (n1 = 0, n2 = 0) and DLMS adaptive filters with (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2). It can be seen that, as the total number of delays increases, the convergence slows down, while the steady-state MSE remains almost the same in all cases. In this example, the MSE difference between the cases (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2) after 2000 iterations is less than 1 dB, on average.

III. PROPOSED ARCHITECTURE

As shown in Fig. 2, there are two main computing blocks in the adaptive filter architecture: 1) the error-computation block, and 2) the weight-update block.
Fig. 4. Proposed structure of the error-computation block.

Fig. 5. Proposed structure of PPG. AOC stands for AND / OR cell.

In this section, we discuss the design strategy of the proposed structure to minimize the adaptation delay in the error-computation block, followed by the weight-update block.

A. Pipelined Structure of the Error-Computation Block

The proposed structure for the error-computation unit of an N-tap DLMS adaptive filter is shown in Fig. 4. It consists of N 2-b partial product generators (PPGs), corresponding to N multipliers, and a cluster of L/2 binary adder trees followed by a single shift–add tree. Each subblock is described in detail below.

1) Structure of PPG: The structure of each PPG is shown in Fig. 5. It consists of L/2 2-to-3 decoders and the same number of AND/OR cells (AOCs).¹ Each of the 2-to-3 decoders takes a 2-b digit (u1u0) as input and produces three outputs, b0 = u0 · ū1, b1 = ū0 · u1, and b2 = u0 · u1, such that b0 = 1 for (u1u0) = 1, b1 = 1 for (u1u0) = 2, and b2 = 1 for (u1u0) = 3. The decoder outputs b0, b1, and b2, along with w, 2w, and 3w, are fed to an AOC, where w, 2w, and 3w are in 2's complement representation and sign-extended to (W + 2) bits each. To take care of the sign of the input samples while computing the partial product corresponding to the most significant digit (MSD), i.e., (uL−1uL−2), of the input sample, the AOC (L/2 − 1) is fed with w, −2w, and −w as input, since (uL−1uL−2) can have the four possible values 0, 1, −2, and −1.

¹We have assumed the word length of the input, L, to be even, which is valid in most practical cases.

2) Structure of AOCs: The structure and function of an AOC are depicted in Fig. 6. Each AOC consists of three AND cells and two OR cells. The structure and function of the AND cells and OR cells are depicted in Fig. 6(b) and (c), respectively.
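The digit decoding just described can be mimicked word by word in Python. This is a behavioral sketch with hypothetical helper names, not a gate-level model of the decoder and AOC: each 2-b digit of the two's-complement input selects 0, w, 2w, or 3w, and the most significant digit selects 0, w, −2w, or −w.

```python
def ppg_partial_products(w, u, L=8):
    """Word-level model of one PPG (Fig. 5): split the L-bit two's-complement
    multiplier u (given as an unsigned L-bit pattern) into L/2 2-b digits and
    select a multiple of w for each digit."""
    assert L % 2 == 0                              # footnote 1: L assumed even
    digits = [(u >> (2 * j)) & 0b11 for j in range(L // 2)]
    aoc = {0: 0, 1: w, 2: 2 * w, 3: 3 * w}         # ordinary digits
    msd = {0: 0, 1: w, 2: -2 * w, 3: -w}           # signed most significant digit
    return [msd[d] if j == L // 2 - 1 else aoc[d]
            for j, d in enumerate(digits)]

def multiply_via_ppg(w, u, L=8):
    """Shift-add of the L/2 partial products recovers the product w * u."""
    return sum(p * 4 ** j for j, p in enumerate(ppg_partial_products(w, u, L)))
```

Adding the digit products with 2-b relative shifts reproduces w · u exactly for every two's-complement u, which is what the adder trees of Fig. 4 compute across all N taps, so no individual product word is ever formed.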
TABLE I
LOCATION OF PIPELINE LATCHES FOR L = 8 AND N = 8, 16, AND 32

        Error-Computation Block              Weight-Update Block
N       Adder Tree    Shift–add Tree         Shift–add Tree
8       Stage-2       Stage-1 and 2          Stage-1
16      Stage-3       Stage-1 and 2          Stage-1
32      Stage-3       Stage-1 and 2          Stage-2
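For orientation, the latch and stage counts discussed around Table I, i.e., L(N − 1)/2 + L/2 − 1 latches in log2 N + log2 L − 1 stages if a latch followed every addition, are simple arithmetic; the helper below just evaluates those two expressions for power-of-2 N and L (illustrative only).

```python
from math import log2

def full_pipelining_cost(N, L):
    """Stages and latches if every adder output in the error-computation
    block were registered (N taps, L-bit inputs, both powers of 2)."""
    stages = int(log2(N)) + int(log2(L)) - 1
    latches = L * (N - 1) // 2 + L // 2 - 1
    return stages, latches

for n in (8, 16, 32):
    print(n, full_pipelining_cost(n, 8))
```

For L = 8, this gives 31, 63, and 127 latches for N = 8, 16, and 32, which is the overhead that the retimed latch placement of Table I avoids.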

Fig. 6. Structure and function of the AND/OR cell. The binary operators · and + in (b) and (c) are implemented using AND and OR gates, respectively.

Each AND cell takes an n-bit input D and a single-bit input b, and consists of n AND gates. It distributes all the n bits of input D to its n AND gates as one of the inputs. The other input of each of the n AND gates is fed with the single-bit input b. As shown in Fig. 6(c), each OR cell similarly takes a pair of n-bit input words and has n OR gates. A pair of bits in the same bit position in B and D is fed to the same OR gate.

The output of an AOC is w, 2w, or 3w, corresponding to the decimal values 1, 2, and 3 of the 2-b input (u1u0), respectively. The decoder along with the AOC thus performs a multiplication of the input operand w with a 2-b digit (u1u0), such that the PPG of Fig. 5 performs L/2 parallel multiplications of the input word w with 2-b digits to produce the L/2 partial products of the product word wu.

3) Structure of the Adder Tree: Conventionally, we would perform the shift-add operation on the partial products of each PPG separately to obtain each product value, and then add all the N product values to compute the desired inner product. However, the shift-add operation to obtain the product value increases the word length and, consequently, the adder size of the N − 1 additions of the product values. To avoid such an increase in the word size of the adders, we add all the N partial products of the same place value from all the N PPGs by one adder tree.

All the L/2 partial products generated by each of the N PPGs are thus added by L/2 binary adder trees. The outputs of the L/2 adder trees are then added by a shift–add tree according to their place values. Each of the binary adder trees requires log2 N stages of adders to add N partial products, and the shift–add tree requires log2 L − 1 stages of adders to add the L/2 outputs of the L/2 binary adder trees.² The addition scheme for the error-computation block for a four-tap filter and input word size L = 8 is shown in Fig. 7. For N = 4 and L = 8, the adder network requires four binary adder trees of two stages each and a two-stage shift–add tree. In this figure, we have shown all possible locations of pipeline latches by dashed lines, to reduce the critical path to one addition time. If we were to introduce pipeline latches after every addition, it would require L(N − 1)/2 + L/2 − 1 latches in log2 N + log2 L − 1 stages, which would lead to a high adaptation delay and introduce a large overhead of area and power consumption for large values of N and L. On the other hand, some of those pipeline latches are redundant in the sense that they are not required to maintain a critical path of one addition time. The final adder in the shift–add tree contributes the maximum delay to the critical path. Based on that observation, we have identified the pipeline latches that do not contribute significantly to the critical path and could exclude them without any noticeable increase of the critical path. The locations of pipeline latches for filter lengths N = 8, 16, and 32 and for input size L = 8 are shown in Table I. The pipelining is performed by a feedforward cut-set retiming of the error-computation block [15].

²When L is not a power of 2, log2 L should be replaced by ⌈log2 L⌉.

B. Pipelined Structure of the Weight-Update Block

The proposed structure for the weight-update block is shown in Fig. 8. It performs N multiply-accumulate (MAC) operations of the form (μ × e) × xi + wi to update the N filter weights. The step size μ is taken as a negative power of 2, so that the multiplication with the recently available error is realized by a shift operation only. Each of the MAC units therefore performs the multiplication of the shifted value of the error with the delayed input samples xi, followed by the addition with the corresponding old weight value wi. All the N multiplications for the MAC operations are performed by N PPGs, followed by N shift–add trees. Each of the PPGs generates L/2 partial products corresponding to the product of the recently shifted error value μ × e with the L/2 2-b digits of the input word xi, where the subexpression 3μ × e is shared within the multiplier. Since the scaled error (μ × e) is multiplied with all the N delayed input values in the weight-update block, this subexpression can be shared across all the multipliers as well. This leads to a substantial reduction of the adder complexity. The final outputs of the MAC units constitute the desired updated weights, to be used as inputs to the error-computation block as well as to the weight-update block for the next iteration.

C. Adaptation Delay

As shown in Fig. 2, the adaptation delay is decomposed into n1 and n2. The error-computation block generates the error delayed by n1 − 1 cycles, as shown in Fig. 4, which is fed to the weight-update block shown in Fig. 8 after scaling by μ; then the input is delayed by 1 cycle before the PPG to make the total delay introduced by FIR filtering n1. In Fig. 8, the weight-update block generates wn−1−n2, and the weights are delayed by n2 + 1 cycles.
Fig. 7. Adder-structure of the filtering unit for N = 4 and L = 8.

Fig. 8. Proposed structure of the weight-update block.

However, it should be noted that the delay by 1 cycle is due to the latch before the PPG, which is included in the delay of the error-computation block, i.e., n1. Therefore, the delay generated in the weight-update block becomes n2. If the locations of pipeline latches are decided as in Table I, n1 becomes 5, where three latches are in the error-computation block, one latch is after the subtraction in Fig. 4, and the other latch is before the PPG in Fig. 8. Also, n2 is set to 1, from a latch in the shift–add tree in the weight-update block.

IV. FIXED-POINT IMPLEMENTATION, OPTIMIZATION, SIMULATION, AND ANALYSIS

In this section, we discuss the fixed-point implementation and optimization of the proposed DLMS adaptive filter. A bit-level pruning of the adder tree is also proposed to reduce the hardware complexity without noticeable degradation of the steady-state MSE.
Fig. 9. Fixed-point representation of a binary number (Xi: integer word length; Xf: fractional word length).

TABLE II
FIXED-POINT REPRESENTATION OF THE SIGNALS OF THE PROPOSED DLMS ADAPTIVE FILTER (μ = 2^−(Li+log2 N))

Signal Name    Fixed-Point Representation
x              (L, Li)
w              (W, Wi)
p              (W + 2, Wi + 2)
q              (W + 2 + log2 N, Wi + 2 + log2 N)
y, d, e        (W, Wi + Li + log2 N)
μe             (W, Wi)
r              (W + 2, Wi + 2)
s              (W, Wi)

x, w, p, q, y, d, and e can be found in the error-computation block of Fig. 4. μe, r, and s are defined in the weight-update block in Fig. 8. Note that all the subscripts and time indices of signals are omitted for simplicity of notation.

A. Fixed-Point Design Considerations

For fixed-point implementation, the choice of word lengths and radix points for the input samples, weights, and internal signals needs to be decided. Fig. 9 shows the fixed-point representation of a binary number. Let (X, Xi) be a fixed-point representation of a binary number, where X is the word length and Xi is the integer length. The word length and location of the radix point of xn and wn in Fig. 4 need to be predetermined by the hardware designer, taking design constraints, such as the desired accuracy and hardware complexity, into consideration. Assuming (L, Li) and (W, Wi), respectively, as the representations of the input signals and filter weights, all the other signals in Figs. 4 and 8 can be decided as shown in Table II.

The signal pij, which is the output of the PPG block (shown in Fig. 4), has at most three times the value of the input coefficients. Thus, we can add two more bits to the word length and to the integer length of the coefficients to avoid overflow. The output of each stage in the adder tree in Fig. 7 is one bit more than the size of its input signals, so that the fixed-point representation of the output of the adder tree with log2 N stages becomes (W + log2 N + 2, Wi + log2 N + 2). Accordingly, the output of the shift–add tree would be of the form (W + L + log2 N, Wi + Li + log2 N), assuming that no truncation of any least significant bits (LSBs) is performed in the adder tree or the shift–add tree. However, the output of the shift–add tree is designed to have W bits: the most significant W bits need to be retained out of the (W + L + log2 N) bits, which results in the fixed-point representation (W, Wi + Li + log2 N) for y, as shown in Table II. Let the representation of the desired signal d be the same as that of y, even though its quantization is usually given as the input. For this purpose, specific scaling/sign extension and truncation/zero padding are required.

Since the LMS algorithm performs learning so that y has the same sign as d, the error signal e can also be set to have the same representation as y without overflow after the subtraction.

It is shown in [4] that the convergence of an N-tap DLMS adaptive filter with n1 adaptation delay is ensured if

0 < μ < 2/[(σx²(N − 2) + 2n1 − 2)σx²] (5)

where σx² is the average power of the input samples. Furthermore, if the value of μ is defined as a power of 2, 2^−n, where n ≤ Wi + Li + log2 N, the multiplication with μ is equivalent to a change of location of the radix point. Since the multiplication with μ does not need any arithmetic operation, it does not introduce any truncation error. If we need to use a smaller step size, i.e., n > Wi + Li + log2 N, some of the LSBs of en need to be truncated. If we assume that n = Li + log2 N, i.e., μ = 2^−(Li+log2 N), as in Table II, the representation of μen should be (W, Wi) without any truncation. The weight-increment term s (shown in Fig. 8), which is equivalent to μen xn, is required to have the fixed-point representation (W + L, Wi + Li). However, only the Wi MSBs in the computation of the shift–add tree of the weight-update circuit are to be retained, while the rest of the more significant bits need to be discarded. This is in accordance with the assumption that, as the weights converge toward the optimal value, the weight-increment terms become smaller, and the MSB end of the error term contains more zeros. Also, in our design, the L − Li LSBs of the weight-increment terms are truncated so that the terms have the same fixed-point representation as the weight values. We also assume that no overflow occurs during the addition for the weight update; otherwise, the word length of the weights would need to be increased at every iteration, which is not desirable. The assumption is valid since the weight-increment terms are small when the weights have converged. Also, when overflow occurs during the training period, the weight updating is not appropriate and will lead to additional iterations to reach convergence. Accordingly, the updated weight can be computed in truncated form (W, Wi) and fed into the error-computation block.

B. Computer Simulation of the Proposed DLMS Filter

The proposed fixed-point DLMS adaptive filter is used for the system identification problem of Section II. μ is set to 0.5, 0.25, and 0.125 for filter lengths 8, 16, and 32, respectively, such that the multiplication with μ does not require any additional circuits. For the fixed-point simulation, the word length and radix point of the input and coefficients are set to L = 16, Li = 2, W = 16, Wi = 0, and the Gaussian random input xn of zero mean and unit variance is scaled down to fit the representation of (16, 2). The fixed-point data types of all the other signals are obtained from Table II. Each learning curve is averaged over 50 runs to obtain a clean curve. The proposed design was coded in C++ using the SystemC fixed-point library for different orders of the band-pass filter, that is, N = 8, N = 16, and N = 32.
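The authors' fixed-point study uses the SystemC fixed-point library; as a rough stand-in, truncation to the two's-complement (X, Xi) format of Fig. 9 can be modeled in Python as below. The helper name and the saturate-on-overflow guard are our own assumptions, not something the paper specifies.

```python
import math

def to_fixed(v, X, Xi):
    """Truncate a real value to a two's-complement (X, Xi) format:
    X total bits, Xi integer bits, X - Xi fractional bits (Fig. 9)."""
    scale = 1 << (X - Xi)                 # one LSB corresponds to 1/scale
    q = math.floor(v * scale)             # truncation drops the low bits
    lo, hi = -(1 << (X - 1)), (1 << (X - 1)) - 1
    q = max(lo, min(hi, q))               # assumed saturation on overflow
    return q / scale

x = to_fixed(0.123456789, 16, 2)          # input type (L, Li) = (16, 2)
```

Quantizing x, w, and the internal signals to the types of Table II inside a simulation loop like the one in Section II reproduces learning curves of the kind shown in Fig. 10.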
The corresponding convergence behaviors are obtained as shown in Fig. 10. It is found that, as the filter order increases, not only does the convergence become slower, but the steady-state MSE also increases.

Fig. 10. Mean squared error of the fixed-point DLMS filter output for system identification for N = 8, 16, and 32.

TABLE III
ESTIMATED AND SIMULATED STEADY-STATE MSES OF THE FIXED-POINT DLMS ADAPTIVE FILTER (L = W = 16)

Filter Length    Step Size (μ)    Simulation    Analysis
N = 8            2^−1             −71.01 dB     −70.97 dB
N = 16           2^−2             −64.84 dB     −64.97 dB
N = 32           2^−3             −58.72 dB     −58.95 dB

C. Steady-State Error Estimation

In this section, the MSE of the output of the proposed DLMS adaptive filter due to fixed-point quantization is analyzed. Based on the models introduced in [16] and [17], the MSE of the output in the steady state is derived in terms of the parameters listed in Table II. Let us denote by primed symbols the truncated quantities due to the fixed-point representation, so that the input and the desired signals can be written as

x′n = xn + αn (6)
d′n = dn + βn (7)

where αn and βn are the input quantization noise vector and the quantization noise of the desired signal, respectively. The weight vector can be written as

w′n = wn + ρn (8)

where ρn is the error vector of the current weights due to the finite precision. The output signal y′n and the weight-update equation can accordingly be modified, respectively, to the forms

y′n = w′n^T x′n + ηn (9)
w′n+1 = w′n + μe′n x′n + γn (10)

where ηn and γn are the errors due to the truncation of the outputs from the shift–add trees in the error-computation block and the weight-update block, respectively. The steady-state MSE in the fixed-point representation can be expressed as

E|dn − y′n|² = E|en|² + E|αn^T wn|² + E|ηn|² + E|ρn^T xn|² (11)

where E|·| is the operator of mathematical expectation, and the terms en, αn^T wn, ηn, and ρn^T xn are assumed to be uncorrelated.

The first term E|en|², where en = dn − yn, is the excess MSE from infinite-precision computation, whereas the other three terms are due to finite-precision arithmetic.

The second term can be calculated as

E|αn^T wn|² = |wn*|² (mαn² + σαn²) (12)

where wn* is the optimal Wiener vector, and mαn and σαn² are defined as the mean and variance of αn when xn is truncated to the fixed-point type (L, Li), as listed in Table II. αn can be modeled as a uniform distribution with the following mean and variance:

mαn = 2^−(L−Li)/2 (13a)
σαn² = 2^−2(L−Li)/12. (13b)

For the calculation of the third term E|ηn|² in (11), we use the fact that the output from the shift–add tree in the error-computation block is of the type (W, Wi + Li + log2 N) after the final truncation. Therefore,

E|ηn|² = mηn² + σηn² (14)

where

mηn = 2^−(W−(Wi+Li+log2 N))/2 (15a)
σηn² = 2^−2(W−(Wi+Li+log2 N))/12. (15b)

The last term E|ρn^T xn|² in (11) can be obtained by using the derivation proposed in [17] as

E|ρn^T xn|² = (mγn²/μ²) Σi Σk (R^−1)ki + N(σγn² − mγn²)/(2μ) (16)

where Rki represents the (k, i)th entry of the matrix E(xn xn^T). For the weight update in (10), the first operation is to multiply e′n with μ, which is equivalent to moving only the location of the radix point and, therefore, does not introduce any truncation error. Only the truncation after the multiplication of μe′n with x′n needs to be considered in order to evaluate γn. Then, we have

mγn = 2^−(W−Wi)/2 (17a)
σγn² = 2^−2(W−Wi)/12. (17b)

For a large μ, the truncation error ηn from the error-computation block becomes the dominant error source, and (11) can be approximated as E|ηn|². The MSE values are estimated from the analytical expressions as well as from the simulation results by averaging over 50 experiments. Table III shows that the steady-state MSE computed from the analytical expression matches that of the simulation of the proposed architecture.
Fig. 11. Dot diagram for optimization of the adder tree in the case of N = 4, L = 8, and W = 8.

Fig. 12. Steady-state MSE versus k1 for N = 8, 16, and 32 (L = W = 16).

D. Adder-Tree Optimization

The adder tree and shift–add tree for the computation of yn can be pruned for further optimization of the area, delay, and power complexity. To illustrate the proposed pruning optimization of the adder tree and shift–add tree for the computation of the filter output, we take a simple example of filter length N = 4, considering the word lengths L and W to be 8. The dot diagram of the adder tree is shown in Fig. 11. Each row of the dot diagram contains 10 dots, which represent the partial products generated by the PPG unit, for W = 8. We have four sets of partial products corresponding to the four partial products of each multiplier, since L = 8. Each set of partial products of the same weight value contains four terms, since N = 4. The final sum without truncation would be 18 b. However, we use only 8 b in the final sum, and the remaining 10 b are finally discarded. To reduce the computational complexity, some of the LSBs of the inputs of the adder tree can be truncated, while some guard bits can be used to minimize the impact of the truncation on the error performance of the adaptive filter. In Fig. 11, four bits are taken as guard bits and the remaining six LSBs are truncated. To obtain more hardware saving, the bits to be truncated are not generated by the PPGs, so the complexity of the PPGs also gets reduced.

ηn defined in (9) increases if we prune the adder tree, and the worst-case error is caused when all the truncated bits are 1. For the calculation of the sum of the truncated values in the worst case, let us denote by k1 the bit location of the MSB of the truncated bits and by Nk2 the number of rows that are affected by the truncation. In the example of Fig. 11, k1 and k2 are set to 5 and 3, respectively, since the bit positions from 0 to 5 are truncated and a total of 12 rows are affected by the truncation for N = 4. Also, k2 can be derived using k1 as

k2 = ⌊k1/2⌋ + 1 for k1 < W, otherwise k2 = W/2 (18)

since the number of truncated bits is reduced by 2 for every successive set of partial products. The sum of the truncated values for the worst case can be formulated as

bworst = N Σ(j=0 to k2−1) Σ(i=2j to k1) 2^i = N[k2 · 2^(k1+1) − (4^k2 − 1)/3]. (19)

In the example of Fig. 11, bworst amounts to 684. Meanwhile, the LSB weight of the output of the adder tree after the final truncation is 2^10 in the example. Therefore, there might be a one-bit difference in the output of the adder tree due to pruning. The truncation error from each row (in total 12 rows, from row p00 to row p32 in Fig. 11) has a uniform distribution, and if the individual errors are assumed to be independent of each other, the mean and variance of the total error introduced can be calculated as the sum of the means and variances of the individual random variables. However, it is unlikely that the outputs from the same PPG are uncorrelated, since they are generated from the same input sample, so it would not be straightforward to estimate the distribution of the error from the pruning. Nevertheless, as the value of bworst becomes closer to or larger than the LSB weight of the output after the final truncation, the pruning affects the overall error more. Fig. 12 illustrates the steady-state MSE in terms of k1 for N = 8, 16, and 32 when L = W = 16, to show how much the pruning affects the output MSE. When k1 is less than 10 for N = 8, the MSE deterioration is less than 1 dB compared with the case when pruning is not applied.

V. COMPLEXITY CONSIDERATIONS

The hardware and time complexities of the proposed design, those of the structure of [11], and the best of the systolic structures [10] are listed in Table IV. The original DLMS structure proposed in [4] is also listed in this table. It is found that the proposed design has a shorter critical path of one addition time, as in [11], and a lower adaptation delay than the others. If we consider each multiplier to have (L − 1) adders, then the existing designs involve 16N adders, while the proposed one involves 10N + 2 adders for L = 8. Similarly, it involves fewer delay registers than the others.

We have coded the proposed designs in VHDL and synthesized them with the Synopsys Design Compiler using a CMOS 65-nm library for different filter orders. The word lengths of the input samples and weights are chosen to be 8, i.e., L = W = 8.
group of N rows as shown in Fig. 11. Using k1 , k2 , and N, the The step size μ is chosen to be 1/2 L i +log2 N to realize
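The pruning parameters above can be checked numerically. The closed form for bworst in the sketch below is an editorial reconstruction from the worked example (it is not quoted from the paper): in the worst case every truncated bit is 1, and the j-th group of N rows has bit positions 2j through k1 truncated.

```python
# Sanity check of the pruning parameters in Section V-D (L = W = 8, N = 4).

def k2_of(k1, W):
    """Number of affected partial-product groups, per (18)."""
    return k1 // 2 + 1 if k1 < W else W // 2

def b_worst(k1, N, W):
    """Reconstructed worst-case truncation sum: all truncated bits set to 1,
    summed over k2 groups of N rows, group j losing bits 2j..k1."""
    k2 = k2_of(k1, W)
    return N * sum((1 << (k1 + 1)) - (1 << (2 * j)) for j in range(k2))

k1, N, W = 5, 4, 8
print(k2_of(k1, W))               # 3 groups, i.e., N*k2 = 12 affected rows
print(b_worst(k1, N, W))          # 684, as quoted in the text
print(b_worst(k1, N, W) < 2**10)  # True: below the LSB weight of the output
```

With k1 = 5, the three affected groups contribute 63, 60, and 48 per row, giving 4 × 171 = 684, consistent with the quoted value and below the 2^10 LSB weight of the truncated output.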
370 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 2, FEBRUARY 2014
TABLE IV
COMPARISON OF HARDWARE AND TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES FOR L = 8

Design             | Critical Path | n1         | n2 | No. of Adders | No. of Multipliers | No. of Registers
Long et al. [4]    | TM + TA       | log2 N + 1 | 0  | 2N            | 2N                 | 3N + 2 log2 N + 1
Ting et al. [11]   | TA            | log2 N + 5 | 3  | 2N            | 2N                 | 10N + 8
Van and Feng [10]  | TM + TA       | N/4 + 3    | 0  | 2N            | 2N                 | 5N + 3
Proposed Design    | TA            | 5          | 1  | 10N + 2       | 0                  | 2N + 14 + E†

† E = 24, 40, and 48 for N = 8, 16, and 32, respectively. Besides, the proposed design needs an additional 24N AND cells and 16N OR cells. The 2's-complement operator in Figs. 5 and 8 is counted as one adder, and it is assumed that the multiplication with the step size does not need a multiplier in any of the structures.
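To make the adder-count comparison concrete, the expressions in Table IV can be evaluated for a few filter lengths. The script below is an illustrative check, not part of the original paper; the 16N figure for the existing designs assumes (L − 1) adders per multiplier plus 2N structural adders, as stated in the text.

```python
# Adder counts for L = 8: existing designs use 2N multipliers
# (L - 1 = 7 adders each) plus 2N adders; the proposed design
# replaces the multipliers and needs 10N + 2 adders in total.
L = 8

def adders_existing(N, L=L):
    return 2 * N * (L - 1) + 2 * N   # = 16N for L = 8

def adders_proposed(N):
    return 10 * N + 2

for N in (8, 16, 32):
    print(N, adders_existing(N), adders_proposed(N))
# N = 8: 128 vs. 82; N = 16: 256 vs. 162; N = 32: 512 vs. 322
```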
TABLE V
PERFORMANCE COMPARISON OF DLMS ADAPTIVE FILTERS BASED ON SYNTHESIS RESULTS USING A CMOS 65-nm LIBRARY

Design             | N  | DAT (ns) | Latency (cycles) | Area (sq. μm) | Leakage Power (mW) | EPS (mW×ns) | ADP (sq. μm×ns) | EDP (mW×ns²) | ADP Reduction | EDP Reduction
Ting et al. [11]   | 8  | 1.14     | 8                | 24204         | 0.13               | 18.49       | 26867           | 21.08        | −             | −
                   | 16 | 1.19     | 9                | 48049         | 0.27               | 36.43       | 55737           | 43.35        | −             | −
                   | 32 | 1.25     | 10               | 95693         | 0.54               | 72.37       | 116745          | 90.47        | −             | −
Van and Feng [10]  | 8  | 1.35     | 5                | 13796         | 0.08               | 7.29        | 18349           | 9.84         | −             | −
                   | 16 | 1.35     | 7                | 27739         | 0.16               | 14.29       | 36893           | 19.30        | −             | −
                   | 32 | 1.35     | 11               | 55638         | 0.32               | 27.64       | 73998           | 37.31        | −             | −
Proposed Design-I  | 8  | 1.14     | 5                | 14029         | 0.07               | 8.53        | 15572           | 9.72         | 15.13%        | 1.14%
                   | 16 | 1.19     | 5                | 26660         | 0.14               | 14.58       | 31192           | 17.34        | 15.45%        | 10.10%
                   | 32 | 1.25     | 5                | 48217         | 0.27               | 21.00       | 58824           | 26.25        | 20.50%        | 29.64%
Proposed Design-II | 8  | 0.99     | 5                | 12765         | 0.06               | 8.42        | 12382           | 8.33         | 32.51%        | 15.22%
                   | 16 | 1.15     | 5                | 24360         | 0.13               | 14.75       | 27526           | 16.96        | 25.38%        | 12.07%
                   | 32 | 1.15     | 5                | 43233         | 0.24               | 21.23       | 48853           | 24.41        | 33.98%        | 34.55%

DAT: data-arrival time; ADP: area–delay product; EPS: energy per sample; EDP: energy–delay product. The ADP and EDP reductions in the last two columns are the improvements of the proposed designs over [10], in percentage. Proposed Design-I: without optimization; Proposed Design-II: after optimization of the adder tree with k1 = 5.
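The reduction columns of Table V can be reproduced from the ADP values themselves. The short script below is an editorial cross-check, not part of the paper; it recomputes the per-N ADP reductions of the proposed designs over [10] and their averages, which agree with the ∼17% and ∼31% figures quoted in the text.

```python
# ADP (sq. um x ns) from Table V, indexed by filter length N.
adp_van = {8: 18349, 16: 36893, 32: 73998}      # Van and Feng [10]
adp_prop1 = {8: 15572, 16: 31192, 32: 58824}    # Proposed Design-I
adp_prop2 = {8: 12382, 16: 27526, 32: 48853}    # Proposed Design-II

def reductions(ref, new):
    """Percentage ADP reduction of `new` over `ref` for each N."""
    return {n: 100.0 * (ref[n] - new[n]) / ref[n] for n in ref}

r1 = reductions(adp_van, adp_prop1)   # ~15.13%, 15.45%, 20.51%
r2 = reductions(adp_van, adp_prop2)   # ~32.52%, 25.39%, 33.98%
print(sum(r1.values()) / 3)           # ~17% on average, as quoted
print(sum(r2.values()) / 3)           # ~31% on average, as quoted
```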
TABLE VI
FPGA IMPLEMENTATIONS OF PROPOSED DESIGNS FOR L = 8 AND N = 8, 16, AND 32

Design   | Proposed Design-I | Proposed Design-II
         | NOS   | MUF       | NOS   | MUF
Xilinx Virtex-4 (XC4VSX35-10FF668)
N = 8    | 1024  | 148.7     | 931   | 151.6
N = 16   | 2036  | 121.2     | 1881  | 124.6
N = 32   | 4036  | 121.2     | 3673  | 124.6
Xilinx Spartan-3A DSP (XC3SD1800A-4FG676)
N = 8    | 1025  | 87.2      | 966   | 93.8
N = 16   | 2049  | 70.3      | 1915  | 74.8
N = 32   | 4060  | 70.3      | 3750  | 75.7

NOS: number of slices. MUF: maximum usable frequency in MHz.

The word lengths of all the other signals are determined based on the types listed in Table II. We have also coded the structures proposed in [10] and [11] in VHDL, and synthesized them using the same library and synthesis options in the Design Compiler for a fair comparison. In Table V, we show the synthesis results of the proposed designs and the existing designs in terms of data-arrival time (DAT), area, energy per sample (EPS), ADP, and EDP, obtained for filter lengths N = 8, 16, and 32. The proposed design-I, before pruning of the adder tree, has the same DAT as the design in [11], since the critical paths of both designs are TA, as shown in Table IV, while the design in [10] has a longer DAT, equivalent to TA + TM. However, the proposed design-II, after pruning of the adder tree, has a slightly smaller DAT than the existing designs. Also, the proposed designs reduce the area by using a PPG based on common-subexpression sharing, compared with the existing designs. As shown in Table V, the reduction in area is more significant in the case of N = 32, since more sharing can be obtained in the case of large-order filters. The proposed designs achieve less area and more power reduction than [11] by removing redundant pipeline latches, which are not required to maintain the critical path of one addition time. It is found that the proposed design-I involves ∼17% less ADP and ∼14% less EDP than the best previous work of [10], on average, for filter lengths N = 8, 16, and 32. The proposed design-II, similarly, achieves ∼31% less ADP and ∼21% less EDP than the structure of [10] for the same filters. The optimization of the adder tree of the
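The FPGA figures in Table VI can likewise be cross-checked. The script below is an editorial check, not from the paper; it computes the slice-delay product (NOS/MUF) for each device and filter length, and averages the per-case reduction of design-II over design-I, reproducing the figure of nearly 11.86% quoted in the text.

```python
# Table VI data: (NOS, MUF in MHz) for each device and filter length N.
design1 = {("virtex4", 8): (1024, 148.7), ("virtex4", 16): (2036, 121.2),
           ("virtex4", 32): (4036, 121.2), ("spartan3a", 8): (1025, 87.2),
           ("spartan3a", 16): (2049, 70.3), ("spartan3a", 32): (4060, 70.3)}
design2 = {("virtex4", 8): (931, 151.6), ("virtex4", 16): (1881, 124.6),
           ("virtex4", 32): (3673, 124.6), ("spartan3a", 8): (966, 93.8),
           ("spartan3a", 16): (1915, 74.8), ("spartan3a", 32): (3750, 75.7)}

def sdp(entry):
    nos, muf = entry
    return nos / muf  # slice-delay product

# Average percentage reduction of design-II over design-I across all cases.
cuts = [100.0 * (1 - sdp(design2[k]) / sdp(design1[k])) for k in design1]
print(sum(cuts) / len(cuts))  # close to the 11.86% quoted in the text
```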
proposed structure with k1 = 5 offers ∼20% less ADP and ∼9% less EDP over the structure before optimization of the adder tree.

The proposed designs were also implemented on the field-programmable gate array (FPGA) platform of Xilinx devices. The number of slices (NOS) and the maximum usable frequency (MUF) using two different devices, Spartan-3A (XC3SD1800A-4FG676) and Virtex-4 (XC4VSX35-10FF668), are listed in Table VI. The proposed design-II, after the pruning, offers nearly 11.86% less slice-delay product, calculated as the average of NOS/MUF over N = 8, 16, and 32 and the two devices.

VI. CONCLUSION

We proposed an area-delay-power efficient low adaptation-delay architecture for the fixed-point implementation of the LMS adaptive filter. We used a novel PPG for efficient implementation of general multiplications and inner-product computation by common-subexpression sharing. Besides, we proposed an efficient addition scheme for the inner-product computation to reduce the adaptation delay significantly, in order to achieve faster convergence performance and to reduce the critical path to support high input-sampling rates. Aside from this, we proposed a strategy for optimized balanced pipelining across the time-consuming blocks of the structure to reduce the adaptation delay and the power consumption as well. The proposed structure involved significantly less adaptation delay and provided significant savings in ADP and EDP compared with the existing structures. We proposed a fixed-point implementation of the proposed architecture and derived the expression for the steady-state error. We found that the steady-state MSE obtained from the analytical result matched well with the simulation result. We also discussed a pruning scheme that provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning, without a noticeable degradation of the steady-state error performance. The highest sampling rate that could be supported by the ASIC implementation of the proposed design ranged from about 870 to 1010 MHz for filter orders 8 to 32. When the adaptive filter is required to operate at a lower sampling rate, one can use the proposed design with a clock slower than the maximum usable frequency and a lower operating voltage to reduce the power consumption further.

REFERENCES

[1] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 1985.
[2] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ, USA: Wiley, 2003.
[3] M. D. Meyer and D. P. Agrawal, "A modular pipelined implementation of a delayed LMS transversal adaptive filter," in Proc. IEEE Int. Symp. Circuits Syst., May 1990, pp. 1943–1946.
[4] G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 9, pp. 1397–1405, Sep. 1989.
[5] G. Long, F. Ling, and J. G. Proakis, "Corrections to 'The LMS algorithm with delayed coefficient adaptation'," IEEE Trans. Signal Process., vol. 40, no. 1, pp. 230–232, Jan. 1992.
[6] H. Herzberg and R. Haimi-Cohen, "A systolic array realization of an LMS adaptive filter and the effects of delayed adaptation," IEEE Trans. Signal Process., vol. 40, no. 11, pp. 2799–2803, Nov. 1992.
[7] M. D. Meyer and D. P. Agrawal, "A high sampling rate delayed LMS filter architecture," IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., vol. 40, no. 11, pp. 727–729, Nov. 1993.
[8] S. Ramanathan and V. Visvanathan, "A systolic architecture for LMS adaptive filtering with minimal adaptation delay," in Proc. Int. Conf. Very Large Scale Integr. (VLSI) Design, Jan. 1996, pp. 286–289.
[9] Y. Yi, R. Woods, L.-K. Ting, and C. F. N. Cowan, "High speed FPGA-based implementations of delayed-LMS filters," J. Very Large Scale Integr. (VLSI) Signal Process., vol. 39, nos. 1–2, pp. 113–131, Jan. 2005.
[10] L. D. Van and W. S. Feng, "An efficient systolic architecture for the DLMS adaptive filter and its applications," IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., vol. 48, no. 4, pp. 359–366, Apr. 2001.
[11] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 1, pp. 86–99, Jan. 2005.
[12] P. K. Meher and M. Maheshwari, "A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm," in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 121–124.
[13] P. K. Meher and S. Y. Park, "Low adaptation-delay LMS adaptive filter part-I: Introducing a novel multiplication cell," in Proc. IEEE Int. Midwest Symp. Circuits Syst., Aug. 2011, pp. 1–4.
[14] P. K. Meher and S. Y. Park, "Low adaptation-delay LMS adaptive filter part-II: An optimized architecture," in Proc. IEEE Int. Midwest Symp. Circuits Syst., Aug. 2011, pp. 1–4.
[15] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York, NY, USA: Wiley, 1999.
[16] C. Caraiscos and B. Liu, "A roundoff error analysis of the LMS adaptive algorithm," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 1, pp. 34–41, Feb. 1984.
[17] R. Rocher, D. Menard, O. Sentieys, and P. Scalart, "Accuracy evaluation of fixed-point LMS algorithm," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 237–240.

Pramod Kumar Meher (SM'03) is currently a Senior Scientist with the Institute for Infocomm Research, Singapore. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image, and video processing, communication, bioinformatics, and intelligent computing.
Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He served as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS during 2008–2011 and as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society during 2011–2012. He continues to serve as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the Journal of Circuits, Systems, and Signal Processing. He was a recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.

Sang Yoon Park (S'03–M'11) received the B.S., M.S., and Ph.D. degrees from the Department of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea, in 2000, 2002, and 2006, respectively.
He joined the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Research Fellow in 2007. Since 2008, he has been with the Institute for Infocomm Research, Singapore, where he is currently a Research Scientist. His research interests include dedicated/reconfigurable architectures and algorithms for low-power/low-area/high-performance digital signal processing and communication systems.