Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay
Abstract— In this paper, we present an efficient architecture for the implementation of a delayed least mean square (DLMS) adaptive filter. For achieving lower adaptation delay and an area-delay-power efficient implementation, we use a novel partial product generator and propose a strategy for optimized balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of the existing systolic structures, on average, for filter lengths N = 8, 16, and 32. We propose an efficient fixed-point implementation scheme of the proposed architecture, and derive the expression for the steady-state error. We show that the steady-state mean squared error obtained from the analytical result matches with the simulation result. Moreover, we propose a bit-level pruning of the proposed architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning, without noticeable degradation of steady-state-error performance.

Index Terms— Adaptive filters, circuit optimization, fixed-point arithmetic, least mean square (LMS) algorithms.

I. INTRODUCTION

THE LEAST MEAN SQUARE (LMS) adaptive filter is the most popular and most widely used adaptive filter, not only because of its simplicity but also because of its satisfactory convergence performance [1], [2]. The direct-form LMS adaptive filter involves a long critical path due to an inner-product computation to obtain the filter output. The critical path is required to be reduced by pipelined implementation when it exceeds the desired sample period. Since the conventional LMS algorithm does not support pipelined implementation because of its recursive behavior, it is modified to a form called the delayed LMS (DLMS) algorithm [3]–[5], which allows pipelined implementation of the filter.

A lot of work has been done to implement the DLMS algorithm in systolic architectures to increase the maximum usable frequency [3], [6], [7], but they involve an adaptation delay of ∼N cycles for filter length N, which is quite high for large-order filters. Since the convergence performance degrades considerably for a large adaptation delay, Visvanathan et al. [8] have proposed a modified systolic architecture to reduce the adaptation delay. A transpose-form LMS adaptive filter is suggested in [9], where the filter output at any instant depends on the delayed versions of weights, and the number of delays in weights varies from 1 to N. Van and Feng [10] have proposed a systolic architecture, where they have used relatively large processing elements (PEs) for achieving a lower adaptation delay with the critical path of one MAC operation. Ting et al. [11] have proposed a fine-grained pipelined design to limit the critical path to the maximum of one addition time, which supports high sampling frequency, but involves a lot of area overhead for pipelining and higher power consumption than in [10], due to its large number of pipeline latches. Further effort has been made by Meher and Maheshwari [12] to reduce the number of adaptation delays. Meher and Park have proposed a 2-bit multiplication cell, and used that with an efficient adder tree for pipelined inner-product computation to minimize the critical path and silicon area without increasing the number of adaptation delays [13], [14].

The existing work on the DLMS adaptive filter does not discuss the fixed-point implementation issues, e.g., location of the radix point, choice of word length, and quantization at various stages of computation, although they directly affect the convergence performance, particularly due to the recursive behavior of the LMS algorithm. Therefore, fixed-point implementation issues are given adequate emphasis in this paper. Besides, we present here the optimization of our previously reported design [13], [14] to reduce the number of pipeline delays along with the area, sampling period, and energy consumption. The proposed design is found to be more efficient in terms of the power-delay product (PDP) and energy-delay product (EDP) compared to the existing structures.

In the next section, we review the DLMS algorithm, and in Section III, we describe the proposed optimized architecture for its implementation. Section IV deals with fixed-point implementation considerations and simulation studies of the convergence of the algorithm. In Section V, we discuss the synthesis of the proposed architecture and comparison with the existing architectures. Conclusions are given in Section VI.
II. REVIEW OF DELAYED LMS ALGORITHM

The weights of the LMS adaptive filter during the nth iteration are updated according to the following equations [2]:

    w_{n+1} = w_n + \mu \cdot e_n \cdot x_n    (1a)

where

    e_n = d_n - y_n, \qquad y_n = w_n^T \cdot x_n    (1b)

where the input vector x_n and the weight vector w_n at the nth iteration are, respectively, given by

    x_n = [x_n, x_{n-1}, \ldots, x_{n-N+1}]^T
    w_n = [w_n(0), w_n(1), \ldots, w_n(N-1)]^T.
dn is the desired response, yn is the filter output, and en denotes the error computed during the nth iteration. μ is the step size, and N is the number of weights used in the LMS adaptive filter.

In the case of pipelined designs with m pipeline stages, the error en becomes available after m cycles, where m is called the "adaptation delay." The DLMS algorithm therefore uses the delayed error en−m, i.e., the error corresponding to the (n − m)th iteration, for updating the current weight instead of the recent-most error. The weight-update equation of the DLMS adaptive filter is given by

    w_{n+1} = w_n + \mu \cdot e_{n-m} \cdot x_{n-m}.    (2)

The block diagram of the DLMS adaptive filter is shown in Fig. 1, where the adaptation delay of m cycles amounts to the delay introduced by the whole of the adaptive filter structure, consisting of finite impulse response (FIR) filtering and the weight-update process.

Fig. 1. Structure of the conventional delayed LMS adaptive filter.

It is shown in [12] that the adaptation delay of conventional LMS can be decomposed into two parts: one part is the delay introduced by the pipeline stages in FIR filtering, and the other part is due to the delay involved in pipelining the weight-update process. Based on such a decomposition of delay, the DLMS adaptive filter can be implemented by the structure shown in Fig. 2.

Fig. 2. Structure of the modified delayed LMS adaptive filter, consisting of an error-computation block and a weight-update block.

Assuming that the latency of computation of the error is n1 cycles, the error computed by the structure at the nth cycle is en−n1, which is used with the input samples delayed by n1 cycles to generate the weight-increment term. The weight-update equation of the modified DLMS algorithm is given by

    w_{n+1} = w_n + \mu \cdot e_{n-n_1} \cdot x_{n-n_1}    (3a)

where

    e_{n-n_1} = d_{n-n_1} - y_{n-n_1}    (3b)

and

    y_n = w_{n-n_2}^T \cdot x_n.    (3c)

We notice that, during the weight update, the error with n1 delays is used, while the filtering unit uses the weights delayed by n2 cycles. The modified DLMS algorithm decouples the computations of the error-computation block and the weight-update block and allows us to perform optimal pipelining by feedforward cut-set retiming of both these sections separately to minimize the number of pipeline stages and the adaptation delay.

The adaptive filters with different n1 and n2 are simulated for a system identification problem. The 10-tap band-pass filter with impulse response

    h_n = \frac{\sin(w_H (n - 4.5))}{\pi (n - 4.5)} - \frac{\sin(w_L (n - 4.5))}{\pi (n - 4.5)}  for n = 0, 1, 2, \ldots, 9, otherwise h_n = 0    (4)

is used as the unknown system, as in [10]. wH and wL represent the high and low cutoff frequencies of the passband, and are set to wH = 0.7π and wL = 0.3π, respectively. The step size μ is set to 0.4. A 16-tap adaptive filter identifies the unknown system with Gaussian random input xn of zero mean and unit variance. In all cases, the outputs of the known system are of unity power and contaminated with white Gaussian noise of −70 dB strength. Fig. 3 shows the learning curve of the MSE of the error signal en obtained by averaging 20 runs for the conventional LMS adaptive filter (n1 = 0, n2 = 0) and DLMS adaptive filters with (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2). It can be seen that, as the total number of delays increases, the convergence slows down, while the steady-state MSE remains almost the same in all cases. In this example, the MSE difference between the cases (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2) after 2000 iterations is less than 1 dB, on average.

Fig. 3. Convergence performance of system identification with LMS and modified DLMS adaptive filters.
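The following Python sketch (not from the paper) shows where the n1 and n2 delays enter the recursion (3a)-(3c) for the system-identification setup described above. The zero padding of the 10-tap unknown system to N = 16 taps, the unit-norm scaling of h, the random seed, and the smaller step size used here (this floating-point sketch omits the paper's fixed-point signal scaling) are assumptions.

```python
import numpy as np

def bandpass_h(taps=10, w_h=0.7 * np.pi, w_l=0.3 * np.pi):
    # 10-tap band-pass impulse response of (4), used as the unknown system
    t = np.arange(taps) - 4.5
    return np.sin(w_h * t) / (np.pi * t) - np.sin(w_l * t) / (np.pi * t)

def dlms_mse(n1=5, n2=1, N=16, mu=0.05, iters=2000, noise_db=-70.0, runs=20, seed=1):
    """Learning curve (MSE in dB) of the modified DLMS recursion (3a)-(3c).

    Setting n1 = n2 = 0 reduces the recursion to the conventional LMS of (1a)-(1b).
    The step size is an assumption chosen for this sketch, not the paper's value.
    """
    h = np.concatenate([bandpass_h(), np.zeros(N - 10)])  # zero-pad to N taps (assumed)
    h /= np.linalg.norm(h)                                 # unity-power known-system output
    rng = np.random.default_rng(seed)
    mse = np.zeros(iters)
    for _ in range(runs):
        x = rng.standard_normal(iters)                     # zero-mean, unit-variance input
        v = 10.0 ** (noise_db / 20) * rng.standard_normal(iters)
        w = np.zeros((iters + 1, N))                       # w[k] holds the weight vector w_k
        e = np.zeros(iters)
        for n in range(iters):
            xv = np.zeros(N)
            xv[: min(n + 1, N)] = x[n::-1][:N]             # x_n = [x_n, ..., x_{n-N+1}]
            d = h @ xv + v[n]                              # desired response plus noise
            e[n] = d - w[max(n - n2, 0)] @ xv              # (3b), (3c): y_n uses w_{n-n2}
            k = n - n1                                     # (3a): update with e_{n-n1}, x_{n-n1}
            if k >= 0:
                xk = np.zeros(N)
                xk[: min(k + 1, N)] = x[k::-1][:N]
                w[n + 1] = w[n] + mu * e[k] * xk
            else:
                w[n + 1] = w[n]
        mse += e ** 2
    return 10 * np.log10(mse / runs)

# The three cases shown in Fig. 3: dlms_mse(0, 0), dlms_mse(5, 1), dlms_mse(7, 2)
```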
III. PROPOSED ARCHITECTURE

As shown in Fig. 2, there are two main computing blocks in the adaptive filter architecture: 1) the error-computation block, and 2) the weight-update block. In this section, we discuss the design strategy of the proposed structure to minimize the adaptation delay in the error-computation block, followed by the weight-update block.
A. Pipelined Structure of the Error-Computation Block

The proposed structure for the error-computation unit of an N-tap DLMS adaptive filter is shown in Fig. 4. It consists of N number of 2-b partial product generators (PPG) corresponding to N multipliers and a cluster of L/2 binary adder trees, followed by a single shift–add tree. Each subblock is described in detail.

1) Structure of PPG: The structure of each PPG is shown in Fig. 5. It consists of L/2 number of 2-to-3 decoders and the same number of AND/OR cells (AOC).¹ Each of the 2-to-3 decoders takes a 2-b digit (u1 u0) as input and produces three outputs b0 = u0 · ū1, b1 = ū0 · u1, and b2 = u0 · u1, such that b0 = 1 for (u1 u0) = 1, b1 = 1 for (u1 u0) = 2, and b2 = 1 for (u1 u0) = 3. The decoder outputs b0, b1, and b2, along with w, 2w, and 3w, are fed to an AOC, where w, 2w, and 3w are in 2's complement representation and sign-extended to have (W + 2) bits each. To take care of the sign of the input samples while computing the partial product corresponding to the most significant digit (MSD), i.e., (uL−1 uL−2) of the input sample, the AOC (L/2 − 1) is fed with w, −2w, and −w as input, since (uL−1 uL−2) can have four possible values 0, 1, −2, and −1.

¹We have assumed the word length of the input L to be even, which is valid in most practical cases.
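As a bit-level illustration of the PPG principle just described (a sketch, not the hardware description of Fig. 5), the following Python function multiplies a weight w by an L-bit two's-complement input by splitting the input into 2-b digits, selecting 0, w, 2w, or 3w for each digit, and treating the most significant digit as a signed value in {0, 1, -2, -1} before the shift-and-add accumulation. Word-length effects on w are not modeled.

```python
def ppg_multiply(w, x, L=8):
    """Multiply w by an L-bit two's-complement input x using 2-b digits.

    Each non-MSD digit selects 0, w, 2w, or 3w; the most significant digit is
    interpreted as a signed value in {0, 1, -2, -1}, mirroring the PPG text.
    """
    assert -(1 << (L - 1)) <= x < (1 << (L - 1)), "x must fit in L bits"
    u = x & ((1 << L) - 1)                 # two's-complement bit pattern of x
    acc = 0
    for i in range(L // 2):
        digit = (u >> (2 * i)) & 0b11      # 2-b digit (u1 u0)
        if i == L // 2 - 1 and digit >= 2:
            digit -= 4                     # MSD is signed: 2 -> -2, 3 -> -1
        acc += (digit * w) << (2 * i)      # partial product shifted to its digit position
    return acc

# Sanity check: the digit decomposition reproduces the exact product for all L = 8 inputs.
assert all(ppg_multiply(3, x) == 3 * x for x in range(-128, 128))
```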
2) Structure of AOCs: The structure and function of an AOC are depicted in Fig. 6. Each AOC consists of three AND cells and two OR cells. The structure and function of AND cells and
TABLE I
LOCATION OF PIPELINE LATCHES FOR L = 8 AND N = 8, 16, AND 32

         Error-Computation Block              Weight-Update Block
N        Adder Tree     Shift–add Tree        Shift–add Tree
8        Stage-2        Stage-1 and 2         Stage-1
16       Stage-3        Stage-1 and 2         Stage-1
32       Stage-3        Stage-1 and 2         Stage-2

delayed by n2 + 1 cycles. However, it should be noted that the delay by 1 cycle is due to the latch before the PPG, which is included in the delay of the error-computation block, i.e., n1. Therefore, the delay generated in the weight-update block becomes n2. If the locations of pipeline latches are decided as in Table I, n1 becomes 5, where three latches are in the error-computation block, one latch is after the subtraction in Fig. 4, and the other latch is before the PPG in Fig. 8. Also, n2 is set to 1 from a latch in the shift–add tree in the weight-update block.

IV. FIXED-POINT IMPLEMENTATION, OPTIMIZATION, SIMULATION, AND ANALYSIS

In this section, we discuss the fixed-point implementation and optimization of the proposed DLMS adaptive filter. A bit-level pruning of the adder tree is also proposed to reduce the
[Figure: mean squared error (dB) versus k1 for N = 8, 16, and 32.]

TABLE III
ESTIMATED AND SIMULATED STEADY-STATE MSES OF THE FIXED-POINT DLMS ADAPTIVE FILTER (L = W = 16)

Filter Length    Step Size (μ)    Simulation    Analysis
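As a rough illustration of the kind of bit-true simulation summarized in the "Simulation" column of Table III (a sketch under assumed choices, not the paper's exact scheme), the code below re-quantizes every signal of the LMS recursion to a 16-bit word. The radix-point position, rounding and saturation policy, and the power-of-two step size are assumptions, and the input and desired signals are assumed to be scaled within (-1, 1).

```python
import numpy as np

def quantize(v, word=16, frac=15):
    # Round to a word-bit two's-complement value with `frac` fractional bits,
    # saturating on overflow (assumed rounding/saturation policy).
    q = np.round(np.asarray(v) * (1 << frac))
    return np.clip(q, -(1 << (word - 1)), (1 << (word - 1)) - 1) / (1 << frac)

def fixed_point_lms(x, d, N=16, mu=2.0 ** -6, word=16, frac=15):
    """Conventional LMS of (1a)-(1b) with every signal re-quantized after each operation.

    mu is chosen as a power of two so the step-size multiplication reduces to a shift,
    consistent with the assumption noted under Table IV.
    """
    w = np.zeros(N)
    e = np.zeros(len(x))
    xq, dq = quantize(x, word, frac), quantize(d, word, frac)
    for n in range(N - 1, len(x)):
        xv = xq[n - N + 1:n + 1][::-1]                   # current input vector x_n
        y = quantize(np.dot(w, xv), word, frac)          # filter output y_n
        e[n] = quantize(dq[n] - y, word, frac)           # error e_n
        w = quantize(w + mu * e[n] * xv, word, frac)     # weight update
    return w, e

# A steady-state MSE of the kind reported in Table III can be estimated from
# np.mean(e[-500:] ** 2) once the learning curve has flattened.
```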
TABLE IV
COMPARISON OF HARDWARE AND TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES FOR L = 8

                                                    Hardware Elements
Design              Critical Path   n1            n2   No. of Adders   No. of Multipliers   No. of Registers
Long et al. [4]     TM + TA         log2 N + 1    0    2N              2N                   3N + 2 log2 N + 1
Ting et al. [11]    TA              log2 N + 5    3    2N              2N                   10N + 8
Van and Feng [10]   TM + TA         N/4 + 3       0    2N              2N                   5N + 3
Proposed Design     TA              5             1    10N + 2         0                    2N + 14 + E†

† E = 24, 40, and 48 for N = 8, 16, and 32, respectively. Besides, the proposed design needs an additional 24N AND cells and 16N OR cells. The 2's complement operator in Figs. 5 and 8 is counted as one adder, and it is assumed that the multiplication with the step size does not need the multiplier over all the structures.
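For a quick sanity check of the complexity expressions in Table IV, the short script below evaluates the adder, multiplier, and register counts for the filter lengths considered in the paper; the symbolic critical-path terms TM and TA are not evaluated, and the counts simply restate the table's formulas.

```python
from math import log2

# Extra registers of the proposed design, from the Table IV footnote.
E = {8: 24, 16: 40, 32: 48}

def table_iv_counts(N):
    # Returns (adders, multipliers, registers) per design for filter length N.
    return {
        "Long et al. [4]":   (2 * N, 2 * N, 3 * N + 2 * int(log2(N)) + 1),
        "Ting et al. [11]":  (2 * N, 2 * N, 10 * N + 8),
        "Van and Feng [10]": (2 * N, 2 * N, 5 * N + 3),
        # The proposed design additionally needs 24N AND cells and 16N OR cells (not counted here).
        "Proposed Design":   (10 * N + 2, 0, 2 * N + 14 + E[N]),
    }

for N in (8, 16, 32):
    print(N, table_iv_counts(N))
```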
TABLE V
PERFORMANCE COMPARISON OF DLMS ADAPTIVE FILTERS BASED ON SYNTHESIS RESULTS USING A CMOS 65-nm LIBRARY

Design    Filter Length, N    DAT (ns)    Latency (cycles)    Area (sq. μm)    Leakage Power (mW)    EPS (mW × ns)    ADP (sq. μm × ns)    EDP (mW × ns²)    ADP Reduction    EDP Reduction

DAT: data-arrival time; ADP: area–delay product; EPS: energy per sample; EDP: energy–delay product. ADP and EDP reductions in the last two columns are improvements of the proposed designs over [10] in percentage. Proposed Design-I: without optimization; Proposed Design-II: after optimization of the adder tree with k1 = 5.
The proposed structure with k1 = 5 offers ∼20% less ADP and ∼9% less EDP over the structure before optimization of the adder tree.

The proposed designs were also implemented on the field-programmable gate array (FPGA) platform of Xilinx devices. The number of slices (NOS) and the maximum usable frequency (MUF) using two different devices, Spartan-3A (XC3SD1800A-4FG676) and Virtex-4 (XC4VSX35-10FF668), are listed in Table VI. The proposed design-II, after the pruning, offers nearly 11.86% less slice-delay product, which is calculated as the average NOS/MUF, for N = 8, 16, 32, and the two devices.

VI. CONCLUSION

We proposed an area-delay-power efficient, low-adaptation-delay architecture for fixed-point implementation of the LMS adaptive filter. We used a novel PPG for efficient implementation of general multiplications and inner-product computation by common subexpression sharing. Besides, we proposed an efficient addition scheme for inner-product computation to reduce the adaptation delay significantly in order to achieve faster convergence performance and to reduce the critical path to support high input-sampling rates. Aside from this, we proposed a strategy for optimized balanced pipelining across the time-consuming blocks of the structure to reduce the adaptation delay and power consumption as well. The proposed structure involved significantly less adaptation delay and provided significant savings in ADP and EDP compared to the existing structures. We proposed a fixed-point implementation of the proposed architecture, and derived the expression for the steady-state error. We found that the steady-state MSE obtained from the analytical result matched well with the simulation result. We also discussed a pruning scheme that provides nearly 20% saving in the ADP and 9% saving in the EDP over the proposed structure before pruning, without a noticeable degradation of steady-state-error performance. The highest sampling rate that could be supported by the ASIC implementation of the proposed design ranged from about 870 to 1010 MHz for filter orders 8 to 32. When the adaptive filter is required to be operated at a lower sampling rate, one can use the proposed design with a clock slower than the maximum usable frequency and a lower operating voltage to reduce the power consumption further.

REFERENCES

[6] H. Herzberg and R. Haimi-Cohen, "A systolic array realization of an LMS adaptive filter and the effects of delayed adaptation," IEEE Trans. Signal Process., vol. 40, no. 11, pp. 2799–2803, Nov. 1992.
[7] M. D. Meyer and D. P. Agrawal, "A high sampling rate delayed LMS filter architecture," IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., vol. 40, no. 11, pp. 727–729, Nov. 1993.
[8] S. Ramanathan and V. Visvanathan, "A systolic architecture for LMS adaptive filtering with minimal adaptation delay," in Proc. Int. Conf. Very Large Scale Integr. (VLSI) Design, Jan. 1996, pp. 286–289.
[9] Y. Yi, R. Woods, L.-K. Ting, and C. F. N. Cowan, "High speed FPGA-based implementations of delayed-LMS filters," J. Very Large Scale Integr. (VLSI) Signal Process., vol. 39, nos. 1–2, pp. 113–131, Jan. 2005.
[10] L. D. Van and W. S. Feng, "An efficient systolic architecture for the DLMS adaptive filter and its applications," IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., vol. 48, no. 4, pp. 359–366, Apr. 2001.
[11] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 1, pp. 86–99, Jan. 2005.
[12] P. K. Meher and M. Maheshwari, "A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm," in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 121–124.
[13] P. K. Meher and S. Y. Park, "Low adaptation-delay LMS adaptive filter part-I: Introducing a novel multiplication cell," in Proc. IEEE Int. Midwest Symp. Circuits Syst., Aug. 2011, pp. 1–4.
[14] P. K. Meher and S. Y. Park, "Low adaptation-delay LMS adaptive filter part-II: An optimized architecture," in Proc. IEEE Int. Midwest Symp. Circuits Syst., Aug. 2011, pp. 1–4.
[15] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York, USA: Wiley, 1999.
[16] C. Caraiscos and B. Liu, "A roundoff error analysis of the LMS adaptive algorithm," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 1, pp. 34–41, Feb. 1984.
[17] R. Rocher, D. Menard, O. Sentieys, and P. Scalart, "Accuracy evaluation of fixed-point LMS algorithm," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 237–240.

Pramod Kumar Meher (SM'03) is currently a Senior Scientist with the Institute for Infocomm Research, Singapore. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image, and video processing, communication, bioinformatics, and intelligent computing.

Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He served as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS during 2008–2011 and as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society during 2011–2012. He continues to serve as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the Journal of Circuits, Systems, and Signal Processing. He was a recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.