Recursive Identification and Parameter Estimation

Han-Fu Chen
Wenxiao Zhao
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Chen, Hanfu.
Recursive identification and parameter estimation / authors, Han‑Fu Chen, Wenxiao
Zhao.
pages cm
Includes bibliographical references and index.
ISBN 978‑1‑4665‑6884‑6 (hardback)
1. Systems engineering‑‑Mathematics. 2. Parameter estimation. 3. Recursive
functions. I. Zhao, Wenxiao. II. Title.
TA168.C476 2014
620’.004201519536‑‑dc23 2014013190
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
Symbols
$\Omega$   basic space
$\omega$   element, or sample
$M^T$   transpose of matrix $M$
$A$, $B$   sets in $\Omega$
$\emptyset$   empty set
$\sigma(\xi)$   $\sigma$-algebra generated by random variable $\xi$
$P$   probability measure
$A \cup B$   union of sets $A$ and $B$
$A \cap B$   intersection of sets $A$ and $B$
$A \Delta B$   symmetric difference of sets $A$ and $B$
$I_A$   indicator function of set $A$
$E\xi$   mathematical expectation of random variable $\xi$
$\lambda_{\max}(X)$   maximal eigenvalue of matrix $X$
$\lambda_{\min}(X)$   minimal eigenvalue of matrix $X$
$\|X\|$   norm of matrix $X$ defined as $(\lambda_{\max}(X^T X))^{\frac12}$
$\xrightarrow{a.s.}$   almost sure convergence
$\xrightarrow{P}$   convergence in probability
$\xrightarrow{w}$   weak convergence
$R$   real line including $+\infty$ and $-\infty$
$\mathcal{B}$   1-dimensional Borel $\sigma$-algebra
$m(\xi)$   median of random variable $\xi$
$\|\nu\|_{\rm var}$   total variation norm of signed measure $\nu$
$a_N \sim b_N$   $c_1 b_N \le a_N \le c_2 b_N\ \forall\,N\ge1$ for some positive constants $c_1$ and $c_2$
$\otimes$   Kronecker product
$\det A$   determinant of matrix $A$
$\mathrm{Adj}\,A$   adjoint of matrix $A$
$n!$   factorial of $n$
$C_n^k$   combinatorial number of $k$ from $n$: $C_n^k = \frac{n!}{k!(n-k)!}$
$\mathrm{Re}\{a\}$   real part of complex number $a$
$\mathrm{Im}\{a\}$   imaginary part of complex number $a$
$M^+$   pseudo-inverse of matrix $M$
$[a]$   integer part of real number $a$
Abbreviations
AR   autoregression
ARMA   autoregressive and moving average
ARMAX   autoregressive and moving average with exogenous input
ARX   autoregression with exogenous input
DRPA   distributed randomized PageRank algorithm
EIV   errors-in-variables
ELS   extended least squares
GCT   general convergence theorem
LS   least squares
MA   moving average
MFD   matrix fraction description
MIMO   multi-input multi-output
NARX   nonlinear autoregression with exogenous input
PCA   principal component analysis
PE   persistent excitation
RM   Robbins–Monro
SA   stochastic approximation
SAAWET   stochastic approximation algorithm with expanding truncations
SISO   single-input single-output
SPR   strictly positive realness
a.s.   almost surely
iff   if and only if
iid   independent and identically distributed
i.o.   infinitely often
mds   martingale difference sequence
Preface
To build a mathematical model based on the observed data is a common task for
systems in diverse areas including not only engineering systems, but also physical
systems, social systems, biological systems and others. It may happen that there is
no a priori knowledge concerning the system under consideration; one then faces the
“black box” problem. In such a situation the “black box” is usually approximated by
a linear or nonlinear system, which is selected by minimizing a performance index
depending on the approximation error. However, in many cases, from physical or me-
chanical thinking or from human experiences one may have some a priori knowledge
about the system. For example, it may be known that the data are statically linearly
related, or they are related by a linear dynamic system but with unknown coefficients
and orders, or they can be fit into a certain type of nonlinear systems, etc. Then,
the problem of building a mathematical model is reduced to fixing the uncertainties
contained in the a priori knowledge by using the observed data, e.g., estimating coef-
ficients and orders of a linear system or identifying the nonlinear system on the basis
of the data. So, from a practical application point of view, when building a mathemat-
ical model, one first has to fix the model class the system belongs to on the basis of
available information. After this, one may apply an appropriate identification method
proposed by theoreticians to perform the task.
Therefore, for control theorists the topic of system identification consists of doing
the following things: 1) Assume the model class is known. It is required to design
an appropriate algorithm to identify the system from the given class by using the
available data. For example, if the class of linear stochastic systems is assumed, then
one has to propose an identification algorithm to estimate the unknown coefficients
and orders of the system on the basis of the input–output data of the system. 2) The
control theorists then have to justify that the proposed algorithm works well in the
sense that the estimates converge to the true ones as the data size increases, if the
applied data are really generated by a system belonging to the assumed class. 3) It
has also to be clarified what will happen if the data do not completely match the
assumed model class, either because the data are corrupted by errors or because the
true system is not exactly covered by the assumed class.
When the model class is parameterized, then the task of system identification
consists of estimating parameters characterizing the system generating the data and
also clarifying the properties of the derived estimates. If the data are generated by
a system belonging to the class of linear stochastic systems, then the identification
algorithm to be proposed should estimate the coefficients and orders of the system
and also the covariance matrix of the system noise. Meanwhile, properties such as
strong consistency, convergence rate and others of the estimates should be investigated.
Even if the model class is not completely parameterized, for example, the class of
Hammerstein systems, the class of Wiener systems, and the class of nonlinear ARX
systems, where each system in the class contains nonlinear functions and the purpose
of system identification includes identifying the nonlinear function f (·) concerned,
the identification task can still be transformed to a parameter estimation problem. The
obvious way is to parameterize f (·) by approximating it with a linear combination
of basis functions with unknown coefficients, which are to be estimated. However, it
may also be carried out in a nonparametric way. In fact, the value of f (x) at any fixed
x can be treated as a parameter to estimate, and then one can interpolate the obtained
estimates for f (x) at different x, and the resulting interpolating function may serve
as the estimate of f (·).
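As a rough illustration of this nonparametric route, the following sketch estimates f at a few fixed grid points by locally averaging noisy observations and then interpolates the pointwise estimates; the kernel weights, bandwidth, and grid below are assumptions of the sketch, not the specific estimator developed in the book.

```python
import numpy as np

# A minimal sketch of the nonparametric idea described above (not the book's
# specific estimator): the value of f at each fixed grid point is treated as a
# parameter, estimated here by a kernel-weighted average of noisy observations
# y_i = f(x_i) + noise, and the pointwise estimates are then interpolated.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x) + 0.3 * x          # unknown function (for simulation only)
x_obs = rng.uniform(-2, 2, 2000)
y_obs = f(x_obs) + 0.2 * rng.standard_normal(x_obs.size)

grid = np.linspace(-2, 2, 21)                  # fixed points where f(x) is "a parameter"
h = 0.15                                       # bandwidth (an assumption of this sketch)
weights = np.exp(-0.5 * ((x_obs[None, :] - grid[:, None]) / h) ** 2)
f_hat_grid = (weights * y_obs).sum(axis=1) / weights.sum(axis=1)

# Interpolate the pointwise estimates to obtain an estimate of f(.) on the interval
x_new = np.linspace(-2, 2, 400)
f_hat = np.interp(x_new, grid, f_hat_grid)
print(np.max(np.abs(f_hat - f(x_new))))        # rough uniform error of the estimate
```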
As will be shown in the book, not only system identification but also many prob-
lems from diverse areas such as adaptive filtering and other problems from signal
processing and communication, adaptive regulation and iterative learning control,
principal component analysis, some problems connected with network systems such
as consensus control of multi-agent systems, PageRank of web, and many others can
be transformed to parameter estimation problems.
In mathematical statistics there are various types of parameter estimates whose
behaviors basically depend on the statistical assumptions made on the data. In the
present book, with the possible exception of Sections 3.1 and 3.2 in Chapter 3, the
estimated parameter denoted by x0 is treated as a root of a function g(·), called the
regression function. It is clear that an infinite number of functions may serve as such
a g(·), e.g., g(x) = A(x − x0 ), g(x) = sin(x − x0 ), etc. Therefore, the original prob-
lem may be treated as root-seeking for a regression function g(·). Moreover, it is
desired that root-seeking can be carried out in a recursive way in the sense that the
(k + 1)th estimate xk+1 for x0 can easily be obtained from the previous estimate xk
by using the data Ok+1 available at time k + 1. It is important to note that any data
Ok+1 at time k + 1 may be viewed as an observation on g(xk ), because we can
always write Ok+1 = g(xk ) + εk+1 , where εk+1 ≜ Ok+1 − g(xk ) is treated as the
observation noise. It is understandable that the properties of {εk } depend upon not only
the uncertainties contained in {Ok } but also the selection of g(·). So, it is hard to
expect that {εk } can satisfy any condition required by the convergence theorems for
the classical root-seeking algorithms, say, for the Robbins–Monro (RM) algorithm.
This is why a modified version of the RM algorithm is introduced, which, in fact, is a
stochastic approximation algorithm with expanding truncations (SAAWET). It turns
out that SAAWET works very well to deal with the parameter estimation problems
transformed from various areas.
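To fix ideas, the following is a minimal numerical sketch of this root-seeking scheme: the plain RM recursion driven by noisy observations of a regression function, together with a caricature of the expanding-truncation device. The truncation bounds, the reset point, and the step sizes below are illustrative assumptions of the sketch; the precise SAAWET algorithm and its convergence conditions are developed later in the book.

```python
import numpy as np

# A schematic sketch of recursive root-seeking for a regression function g(.):
# Robbins-Monro (RM) steps plus a caricature of expanding truncations.
rng = np.random.default_rng(1)
x0_true = 1.5
g = lambda x: -(x - x0_true)                 # a regression function with root x0_true

def observe(x):
    # O_{k+1} = g(x_k) + eps_{k+1}: noisy observation of g at the current estimate
    return g(x) + rng.standard_normal()

def rm_with_expanding_truncations(n_steps=20000, x_init=50.0):
    x, sigma = x_init, 0                      # sigma counts the truncations so far
    M = lambda s: 10.0 * (s + 1)              # expanding truncation bounds (assumed form)
    for k in range(1, n_steps + 1):
        a_k = 1.0 / k                         # step sizes: sum = inf, sum of squares < inf
        x_next = x + a_k * observe(x)
        if abs(x_next) > M(sigma):            # estimate escaped: reset and enlarge the bound
            x_next, sigma = 0.0, sigma + 1
        x = x_next
    return x

print(rm_with_expanding_truncations())        # close to the root x0_true = 1.5
```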
In Chapter 1, the basic concept of probability theory and some information from
function defining the NARX system are estimated by SAAWET incorporated with
kernel functions, and the strong consistency of the estimates is established as well.
Chapter 5 addresses the problems arising from different areas that are solved by
SAAWET. We limit ourselves to present the most recent results including principal
component analysis, consensus control of the multi-agent system, adaptive regula-
tion for Hammerstein and Wiener systems, and PageRank of webs. As a matter of
fact, the proposed approach has successfully solved many other problems such as
adaptive filtering, blind identification, iterative learning control, adaptive stabiliza-
tion and adaptive control, adaptive pole assignment, etc. We decided not to include
all of them, because either they are not the newest results or some of them have been
presented elsewhere.
Some information concerning the nonnegative matrices is provided in Appendix
B, which is essentially used in Sections 5.2 and 5.4.
The book is written for students, researchers, and engineers working in systems and
control, signal processing, communication, and mathematical statistics. The aim of
the book is not only to present the results on system identification and parameter
estimation given in Chapters 3–5, but, more importantly, to demonstrate how to apply
the proposed approach to solve problems from different areas.
The support of the National Science Foundation of China, the National Center for
Mathematics and Interdisciplinary Sciences, and the Key Laboratory of Systems and
Control, Chinese Academy of Sciences is gratefully acknowledged. The authors
would like to express their gratitude to Professor Haitao Fang and Dr. Biqiang Mu
for their helpful discussions.
About the Authors
Having graduated from the Leningrad (St. Petersburg) State University, Han-Fu Chen
joined the Institute of Mathematics, Chinese Academy of Sciences (CAS). Since
1979, he has been with the Institute of Systems Science, now a part of the Academy
of Mathematics and Systems Science, CAS. He is a professor at the Key Laboratory
of Systems and Control of CAS. His research interests are mainly in stochastic sys-
tems, including system identification, adaptive control, and stochastic approximation
and its applications to systems, control, and signal processing. He has authored and
coauthored more than 200 journal papers and 7 books.
Professor Chen served as an IFAC Council member (2002–2005), president of
the Chinese Association of Automation (1993–2002), and a permanent member of
the Council of the Chinese Mathematics Society (1991–1999).
He is an IEEE fellow, IFAC fellow, a member of TWAS, and a member of CAS.
Wenxiao Zhao earned his BSc degree from the Department of Mathematics,
Shandong University, China in 2003 and a PhD degree from the Institute of Sys-
tems Science, AMSS, the Chinese Academy of Sciences (CAS) in 2008. After this
he was a postdoctoral student at the Department of Automation, Tsinghua University.
During this period he visited the University of Western Sydney, Australia, for nine
months. Dr. Zhao then joined the Institute of Systems Science, CAS in 2010. He
now is with the Key Laboratory of Systems and Control, CAS as an associate profes-
sor. His research interests are in system identification, adaptive control, and system
biology. He serves as the general secretary of the IEEE Control Systems Beijing
Chapter and an associate editor of the Journal of Systems Science and Mathematical
Sciences.
Chapter 1
Dependent Random
Vectors
CONTENTS
1.1 Some Concepts of Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Independent Random Variables, Martingales, and Martingale
Difference Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Markov Chains with State Space (Rm , B m ) . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Mixing Random Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.5 Stationary Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
and $\liminf_{n\to\infty} A_n \subset \limsup_{n\to\infty} A_n$, where the abbreviation "i.o." is to designate "infinitely often."
If $\liminf_{n\to\infty} A_n = \limsup_{n\to\infty} A_n = A$, then A is called the limit of the sequence $\{A_n\}_{n\ge1}$.
From the above definition we see that the random variables are in fact the mea-
surable functions from (Ω, F ) to (R, B). A measurable function f from (R, B) to
(R, B) is usually called the Borel measurable function.
It can be shown that the distribution function is nondecreasing and left-
continuous. If Fξ is differentiable, then its derivative $f_\xi(x) \triangleq dF_\xi(x)/dx$ is called
the probability density function of ξ , or, simply, density function.
The n-dimensional vector ξ = [ξ1 · · · ξn ]T is called a random vector if ξi is a
random variable for each i = 1, · · · , n. The n-dimensional distribution function and
density function are, respectively, defined by
$$F_\xi(x_1,\cdots,x_n) \triangleq P(\xi_1 < x_1,\cdots,\xi_n < x_n), \tag{1.1.8}$$
and
$$F_\xi(x_1,\cdots,x_n) = \int_{-\infty}^{x_1}\!\!\cdots\!\int_{-\infty}^{x_n} f_\xi(t_1,\cdots,t_n)\,dt_1\cdots dt_n. \tag{1.1.9}$$
for fixed scalars μ and σ with σ > 0. In the n-dimensional case the Gaussian density
function is defined as
$$\frac{1}{(2\pi)^{\frac{n}{2}}(\det\Sigma)^{\frac12}}\exp\Big(-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\Big), \quad x\in\mathbb{R}^n \tag{1.1.11}$$
It can be shown that $S_n$ converges as $n\to\infty$ and the limit is defined as the mathematical expectation of ξ, i.e., $E\xi = \int_\Omega \xi\,dP \triangleq \lim_{n\to\infty} S_n$.
In the following, for $A\in\mathcal{F}$, by $\int_A \xi\,dP$ we mean $\int_\Omega \xi I_A\,dP$, where $I_A$ is the indicator of A:
$$I_A(\omega) \triangleq \begin{cases}1, & \text{if } \omega\in A,\\ 0, & \text{otherwise.}\end{cases}$$
For a random variable ξ , define the nonnegative random variables
Theorem 1.1.4 (Fubini) If (Ω, F , P) is the product space of two probability spaces
(Ωi , Fi , Pi ), i = 1, 2 and X = X(ω1 , ω2 ) is a random variable on (Ω, F ) for which
the mathematical expectation exists, then
$$g(E\xi) \le Eg(\xi). \tag{1.1.16}$$
Hölder Inequality. Let $1 < p < \infty$, $1 < q < \infty$, and $\frac1p + \frac1q = 1$. For random variables ξ and η with $E|\xi|^p < \infty$ and $E|\eta|^q < \infty$, it holds that
$$E|\xi\eta| \le (E|\xi|^p)^{\frac1p}(E|\eta|^q)^{\frac1q}. \tag{1.1.18}$$
In the case p = q = 2, the Hölder inequality is also named as the Schwarz inequality.
Minkowski Inequality. If $E|\xi|^p < \infty$ and $E|\eta|^p < \infty$ for some p ≥ 1, then
$$(E|\xi+\eta|^p)^{\frac1p} \le (E|\xi|^p)^{\frac1p} + (E|\eta|^p)^{\frac1p}. \tag{1.1.19}$$
Cr -Inequality.
$$\Big(\sum_{i=1}^n |\xi_i|\Big)^r \le C_r \sum_{i=1}^n |\xi_i|^r, \tag{1.1.20}$$
where $C_r = \begin{cases}1, & \text{if } r < 1,\\ n^{r-1}, & \text{if } r \ge 1.\end{cases}$
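These inequalities are easy to check numerically; the following quick Monte Carlo sketch, with arbitrarily chosen p, q, r and distributions, merely illustrates them.

```python
import numpy as np

# A quick sanity check of the Holder and C_r inequalities above (illustration only).
rng = np.random.default_rng(2)
xi = rng.standard_normal(200000)
eta = rng.exponential(1.0, 200000)
p, q = 3.0, 1.5                                      # 1/p + 1/q = 1

lhs = np.mean(np.abs(xi * eta))
rhs = np.mean(np.abs(xi) ** p) ** (1 / p) * np.mean(np.abs(eta) ** q) ** (1 / q)
print(lhs <= rhs)                                    # Holder inequality: True

r, n = 3.0, 4
x = np.abs(rng.standard_normal((100000, n)))
Cr = n ** (r - 1) if r >= 1 else 1.0
print(np.all(x.sum(axis=1) ** r <= Cr * (x ** r).sum(axis=1)))   # C_r inequality: True
```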
We now introduce the concepts of convergence of random variables.
Definition 1.1.8 Let ξ and $\{\xi_n\}_{n\ge1}$ be random variables. The sequence $\{\xi_n\}_{n\ge1}$ is said to converge to ξ with probability one or almost surely, denoted by $\xi_n \xrightarrow[n\to\infty]{a.s.} \xi$, if $P\big(\omega: \xi_n(\omega)\xrightarrow[n\to\infty]{}\xi(\omega)\big) = 1$. $\{\xi_n\}_{n\ge1}$ is said to converge to ξ in probability, denoted by $\xi_n \xrightarrow[n\to\infty]{P} \xi$, if $P(|\xi_n-\xi| > \varepsilon) = o(1)$ for any ε > 0. $\{\xi_n\}_{n\ge1}$ is said to weakly converge or to converge in distribution to ξ, denoted by $\xi_n \xrightarrow[n\to\infty]{w} \xi$, if $F_{\xi_n}(x) \xrightarrow[n\to\infty]{} F_\xi(x)$ at any x where $F_\xi(x)$ is continuous. $\{\xi_n\}_{n\ge1}$ is said to converge to ξ in the mean square sense if $E|\xi_n-\xi|^2 \xrightarrow[n\to\infty]{} 0$.
In what follows iff is the abbreviation of “if and only if.” The relationship between
various types of convergence is demonstrated by the following theorem.
(ii) If $\xi_n \xrightarrow[n\to\infty]{P} \xi$, then $\xi_n \xrightarrow[n\to\infty]{w} \xi$.
(iii) If $E|\xi_n-\xi|^2 \xrightarrow[n\to\infty]{} 0$, then $\xi_n \xrightarrow[n\to\infty]{P} \xi$.
(iv) $\xi_n \xrightarrow[n\to\infty]{a.s.} \xi$ iff $\sup_{j\ge n}|\xi_j-\xi| \xrightarrow[n\to\infty]{P} 0$ iff $\sup_{m>n}|\xi_m-\xi_n| \xrightarrow[n\to\infty]{P} 0$.
(v) $\xi_n \xrightarrow[n\to\infty]{P} \xi$ iff $\sup_{m>n} P(|\xi_m-\xi_n| > \varepsilon) = o(1)\ \forall\,\varepsilon > 0$.
$$\int_A \eta\,dP = \int_A \xi\,dP \quad \forall\, A\in\mathcal{F}_1. \tag{1.1.21}$$
Definition 1.2.1 The events $A_i\in\mathcal{F}$, $i=1,\cdots,n$ are said to be mutually independent if $P\big(\bigcap_{j=1}^m A_{i_j}\big) = \prod_{j=1}^m P(A_{i_j})$ for any subset $[i_1 < \cdots < i_m] \subset [1,\cdots,n]$. The σ-algebras $\mathcal{F}_i\subset\mathcal{F}$, $i=1,\cdots,n$ are said to be mutually independent if $P\big(\bigcap_{j=1}^m A_{i_j}\big) = \prod_{j=1}^m P(A_{i_j})$ for any $A_{i_j}\in\mathcal{F}_{i_j}$, $j=1,\cdots,m$ with $[i_1<\cdots<i_m]$ being any subset of $[1,\cdots,n]$. The random variables $\{\xi_1,\cdots,\xi_n\}$ are called mutually independent if the σ-algebras $\sigma(\xi_i)$ generated by $\xi_i$, $i=1,\cdots,n$ are mutually independent. Let $\{\xi_i\}_{i\ge1}$ be a sequence of random variables. $\{\xi_i\}_{i\ge1}$ is called mutually independent if for any $n\ge1$ and any set of indices $\{i_1,\cdots,i_n\}$, the random variables $\{\xi_{i_k}\}_{k=1}^n$ are mutually independent.
Definition 1.2.2 The tail σ-algebra of a sequence $\{\xi_k\}_{k\ge1}$ is $\bigcap_{k=1}^\infty \sigma\{\xi_j, j\ge k\}$. The sets of the tail σ-algebra are called tail events and the random variables measurable with respect to the tail σ-algebra are called tail variables.
Theorem 1.2.1 (Kolmogorov Zero–One Law) Tail events of an iid sequence {ξk }k≥1
have probabilities either zero or one.
where
$$g(x) = \begin{cases} Ef(x,\eta), & \text{if } Ef(x,\eta) \text{ exists},\\ 0, & \text{otherwise.}\end{cases}$$
$$\frac{\sum_{k=1}^n \xi_k - cn}{n^{\frac1p}} \xrightarrow[n\to\infty]{} 0 \ \text{ a.s.}, \quad p\in(0,2) \tag{1.2.5}$$
if and only if $E|\xi_k|^p < \infty$, where the constant $c = E\xi_k$ if $p\in[1,2)$, while c is arbitrary if $p\in(0,1)$.
As will be seen in the later chapters, the convergence analysis of many identifica-
tion algorithms relies on the almost sure convergence of a series of random vectors,
which may not satisfy the independence assumption, and thus Theorem 1.2.3 is not
directly applicable. Therefore, we need results on a.s. convergence for the sum of
dependent random variables, which are summarized in what follows.
We now introduce the concept of martingale, which is a generalization of the
sum of zero-mean mutually independent random variables, and is widely applied in
diverse research areas.
Definition 1.2.3 Let {ξk }k≥1 be a sequence of random variables and {Fk }k≥1 be
a sequence of nondecreasing σ -algebras. If ξk is Fk -measurable for each k ≥ 1,
then we call {ξk , Fk }k≥1 an adapted process. An adapted process {ξk , Fk }k≥1 with
E|ξk | < ∞ ∀k ≥ 1 is called a submartingale if E[ξn |Fm ] ≥ ξm a.s. ∀n ≥ m, a su-
permartingale if E[ξn |Fm ] ≤ ξm a.s. ∀n ≥ m, and a martingale if it is both a su-
permartingale and a submartingale, i.e., E[ξn |Fm ] = ξm a.s. ∀n ≥ m. An adapt-
ed process {ξk , Fk }k≥1 is named as a martingale difference sequence (mds) if
E[ξk+1 |Fk ] = 0 a.s. ∀k ≥ 1.
Theorem 1.2.5 (Doob maximal inequality) Assume {ξk }k≥1 is a nonnegative sub-
martingale. Then for any λ > 0,
$$P\Big(\max_{1\le j\le n}\xi_j \ge \lambda\Big) \le \frac{1}{\lambda}\int_{\{\max_{1\le j\le n}\xi_j \ge \lambda\}} \xi_n\,dP. \tag{1.2.6}$$
Further, for any p > 1,
$$\Big(E\max_{1\le j\le n}\xi_j^p\Big)^{\frac1p} \le \frac{p}{p-1}\big(E\xi_n^p\big)^{\frac1p}. \tag{1.2.7}$$
$$\{\omega: T(\omega)=k\}\in\mathcal{F}_k \quad \forall\, k\ge1. \tag{1.2.8}$$
Lemma 1.2.1 Let $\{\xi_k,\mathcal{F}_k\}_{k\ge1}$ be adapted, T a stopping time, and B a Borel set. Let $T_B$ be the first time at which the process $\{\xi_k\}_{k\ge1}$ hits the set B after time T, i.e.,
$$T_B \triangleq \begin{cases}\inf\{k: k>T,\ \xi_k\in B\}, & \\ \infty, & \text{if } \xi_k\notin B \text{ for all } k>T.\end{cases} \tag{1.2.9}$$
Then $T_B$ is a stopping time.
Proof. The conclusion follows from the following expression:
$$[T_B=k] = \bigcup_{i=0}^{k-1}\big\{[T=i]\cap[\xi_{i+1}\notin B,\cdots,\xi_{k-1}\notin B,\ \xi_k\in B]\big\}\in\mathcal{F}_k \quad \forall\, k\ge1.$$
Let $\{\xi_k,\mathcal{F}_k\}$, $k=1,\cdots,N$ be a submartingale. For a nonempty interval (a, b), define
$$T_0 \triangleq 0,$$
$$T_1 \triangleq \begin{cases}\min\{1\le k\le N: \xi_k\le a\}, & \\ N+1, & \text{if } \xi_k>a,\ k=1,\cdots,N,\end{cases}$$
$$T_2 \triangleq \begin{cases}\min\{T_1<k\le N: \xi_k\ge b\}, & \\ N+1, & \text{if } \xi_k<b\ \forall k: T_1<k\le N, \text{ or } T_1=N+1,\end{cases}$$
$$\vdots$$
$$T_{2m-1} \triangleq \begin{cases}\min\{T_{2m-2}<k\le N: \xi_k\le a\}, & \\ N+1, & \text{if } \xi_k>a\ \forall k: T_{2m-2}<k\le N, \text{ or } T_{2m-2}=N+1,\end{cases}$$
$$T_{2m} \triangleq \begin{cases}\min\{T_{2m-1}<k\le N: \xi_k\ge b\}, & \\ N+1, & \text{if } \xi_k<b\ \forall k: T_{2m-1}<k\le N, \text{ or } T_{2m-1}=N+1.\end{cases}$$
The largest m for which $\xi_{T_{2m}}\ge b$ is called the number of up-crossings of the interval (a, b) by the submartingale $\{\xi_k,\mathcal{F}_k\}_{k=1}^N$ and is denoted by β(a, b).
Theorem 1.2.6 (Doob) For the submartingale $\{\xi_k,\mathcal{F}_k\}_{k=1}^N$ the following inequalities hold
$$E\beta(a,b) \le \frac{E(\xi_N-a)^+}{b-a} \le \frac{E(\xi_N)^+ + |a|}{b-a}, \tag{1.2.10}$$
where $(\xi_N)^+$ is defined by (1.1.13).
Proof. See Appendix A.
Theorem 1.2.7 (Doob) Let {ξk , Fk }k≥1 be a submartingale with supk E(ξk )+ < ∞.
Then there is a random variable ξ with E|ξ | < ∞ such that
$$\lim_{k\to\infty}\xi_k = \xi \ \text{ a.s.} \tag{1.2.11}$$
We have presented some results on the a.s. convergence of some random series
and sub- or super-martingales. However, a martingale or an mds may converge not
on the whole space Ω but on its subset. In the following we present the set where a
martingale or an mds converges.
Let $\{\xi_k,\mathcal{F}_k\}_{k\ge0}$ with $\xi_k\in\mathbb{R}^m$ be an adapted sequence, and let G be a Borel set in $\mathcal{B}^m$. Then the first exit time T of $\{\xi_k\}_{k\ge0}$ from G defined by
$$T = \begin{cases}\min\{k: \xi_k\notin G\}, & \\ \infty, & \text{if } \xi_k\in G\ \ \forall\, k\ge0\end{cases}$$
$$A \triangleq \Big\{\omega: \sum_{k=1}^\infty E\big(\xi_k^2 \mid \mathcal{F}_{k-1}\big) < \infty\Big\}. \tag{1.2.12}$$
Proof. It suffices to prove (i) since (ii) is reduced to (i) if ξk is replaced by −ξk .
The detailed proof is given in Appendix A.
where c is a positive constant. Then $\eta_k = \sum_{i=1}^k y_i$ converges on S as $k\to\infty$.
Theorem 1.2.13 generalizes Theorem 1.2.8. For its proof we refer to Appendix A.
For analyzing the asymptotical properties of stochastic systems we often need to
know the behavior of partial sums of an mds with weights. In the sequel we introduce
such a result to be frequently used in later chapters.
For a sequence of matrices $\{M_k\}_{k\ge1}$ and a sequence of nondecreasing positive numbers $\{b_k\}_{k\ge1}$, by $M_k = O(b_k)$ we mean $\limsup_{k\to\infty}\|M_k\|/b_k < \infty$, and by $M_k = o(b_k)$,
$$\lim_{k\to\infty}\|M_k\|/b_k = 0.$$
We introduce a technical lemma, known as the Kronecker lemma.
Lemma 1.2.4 (Kronecker lemma) If $\{b_k\}_{k\ge1}$ is a sequence of positive numbers nondecreasingly diverging to infinity and if for a sequence of matrices $\{M_k\}_{k\ge1}$,
$$\sum_{k=1}^\infty \frac{1}{b_k}M_k < \infty, \tag{1.2.22}$$
then $\sum_{i=1}^k M_i = o(b_k)$.
$$\sum_{i=0}^k M_i\xi_{i+1} = O\Big(s_k(\alpha)\big(\log(s_k^\alpha(\alpha)+e)\big)^{\frac{1}{\alpha}+\eta}\Big) \ \text{ a.s.} \quad \forall\,\eta>0, \tag{1.2.23}$$
where $s_k(\alpha) = \Big(\sum_{i=0}^k \|M_i\|^\alpha\Big)^{\frac{1}{\alpha}}$.
Example 1.2.1 In the one-dimensional case if $\{\xi_k\}_{k\ge1}$ is iid with $E\xi_k = 0$ and $E\xi_k^2 < \infty$, then from Theorem 1.2.14 we have $\sum_{i=1}^k \xi_i = O\big(k^{\frac12}(\log k)^{\frac12+\eta}\big)$ a.s. $\forall\,\eta>0$. Thus the estimate given by Theorem 1.2.14 is not as sharp as that given by the law of the iterated logarithm, but the conditions required here are much more general.
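A small simulation, illustrative only, shows the partial sums of an iid zero-mean sequence staying within the stated order of magnitude along a sample path.

```python
import numpy as np

# Monte Carlo illustration of the bound in Example 1.2.1: for iid zero-mean,
# square-integrable xi_k the partial sums stay within O(k^{1/2}(log k)^{1/2+eta}).
# This only illustrates the order of magnitude; it proves nothing.
rng = np.random.default_rng(3)
eta = 0.1
k = np.arange(2, 10**6 + 1)
partial_sums = np.cumsum(rng.standard_normal(k.size))
bound = np.sqrt(k) * np.log(k) ** (0.5 + eta)
print(np.max(np.abs(partial_sums) / bound))   # the ratio stays bounded along the path
```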
where uk and yk are the system input and output, respectively, and {εk } is a sequence
of mutually independent zero-mean random variables. Define
$$
\varphi_k = \begin{bmatrix} y_k\\ \vdots\\ y_{k+1-p}\\ u_k\\ \vdots\\ u_{k+1-q}\end{bmatrix},\qquad
A = \begin{bmatrix}
a_1 & \cdots & \cdots & a_p & b_1 & \cdots & \cdots & b_q\\
1 & 0 & \cdots & 0 & 0 & \cdots & \cdots & 0\\
\vdots & \ddots & \ddots & \vdots & \vdots & & & \vdots\\
0 & \cdots & 1 & 0 & 0 & \cdots & \cdots & 0\\
0 & \cdots & \cdots & 0 & 0 & \cdots & \cdots & 0\\
\vdots & & & \vdots & 1 & \ddots & & \vdots\\
\vdots & & & \vdots & & \ddots & \ddots & \vdots\\
0 & \cdots & \cdots & 0 & 0 & \cdots & 1 & 0
\end{bmatrix},\qquad
\xi_k = \begin{bmatrix}\varepsilon_k\\ 0\\ \vdots\\ 0\\ u_k\\ 0\\ \vdots\\ 0\end{bmatrix}
\tag{1.3.2}
$$
and hence the regressor sequence {ϕk }k≥0 is a Markov chain valued in (R p+q ,B p+q )
provided {ξk }k≥0 is a sequence of mutually independent random vectors. Thus, in
a certain sense, the analysis of the system (1.3.1) can resort to investigating the
properties of the chain {ϕk }k≥0 .
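For concreteness, a small simulation sketch of this companion-form structure is given below. It assumes the underlying ARX equation y_{k+1} = a_1 y_k + ... + a_p y_{k+1-p} + b_1 u_k + ... + b_q u_{k+1-q} + ε_{k+1} behind (1.3.2), consistent with the first row of A, and the example coefficient values are chosen only for illustration.

```python
import numpy as np

# Sketch of the companion-form recursion behind (1.3.2): with the assumed ARX
# equation above, the regressor obeys phi_{k+1} = A phi_k + xi_{k+1}, which is
# exactly the Markov-chain structure referred to in the text.
rng = np.random.default_rng(4)
a = np.array([0.5, -0.2])            # a_1,...,a_p  (example values, stable)
b = np.array([1.0, 0.4, 0.1])        # b_1,...,b_q  (example values)
p, q = a.size, b.size

A = np.zeros((p + q, p + q))
A[0, :p], A[0, p:] = a, b            # first row: the ARX coefficients
A[1:p, 0:p-1] = np.eye(p - 1)        # shift the past outputs
A[p+1:, p:p+q-1] = np.eye(q - 1)     # shift the past inputs (row p is all zeros)

phi = np.zeros(p + q)
for k in range(1000):
    u_next, eps_next = rng.standard_normal(), 0.1 * rng.standard_normal()
    xi = np.zeros(p + q)
    xi[0], xi[p] = eps_next, u_next  # xi_{k+1} = [eps_{k+1},0,...,0,u_{k+1},0,...,0]^T
    phi = A @ phi + xi               # phi_{k+1} = [y_{k+1},...,u_{k+1},...]^T
print(phi)
```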
space, will frequently be used in the later chapters to establish the a.s. convergence
of recursive algorithms.
Assume $\{x_k\}_{k\ge0}$ is a sequence of random vectors valued in $\mathbb{R}^m$. If
$$P(x_{k+1}\in A \mid x_0,\cdots,x_k) = P(x_{k+1}\in A \mid x_k) \ \text{ a.s.} \tag{1.3.4}$$
for any $A\in\mathcal{B}^m$, then the sequence $\{x_k\}_{k\ge0}$ is said to be a Markov chain with the
state space (Rm , B m ). Further, if the right-hand side of (1.3.4) does not depend on
the time index k, i.e.,
for some initial probability measure P0 (·) of x0 . Then by (1.3.6) and Theorem 1.1.4,
it can inductively be proved that for any A ∈ B m ,
and further,
Pk (A) = P0 (A), k ≥ 1.
Dependent Random Vectors 17
The initial probability measure P0 (·) of x0 satisfying (1.3.7) is called the invariant
probability measure of the chain {xk }k≥0 .
It should be noted that for a chain {xk }k≥0 , its invariant probability measure does
not always exist, and if it exists, it may not be unique.
Denote the total variation norm of a signed measure ν(·) on $(\mathbb{R}^m,\mathcal{B}^m)$ by $\|\nu\|_{\rm var}$, i.e.,
$$\|\nu\|_{\rm var} = \int_{\mathbb{R}^m}\nu_+(dx) + \int_{\mathbb{R}^m}\nu_-(dx),$$
Definition 1.3.1 The chain $\{x_k\}_{k\ge0}$ is called ergodic if there exists a probability measure $P_{IV}(\cdot)$ on $(\mathbb{R}^m,\mathcal{B}^m)$ such that
$$\|P_k(x,\cdot) - P_{IV}(\cdot)\|_{\rm var} \xrightarrow[k\to\infty]{} 0$$
for any $x\in\mathbb{R}^m$. Further, if there exist constants 0 < ρ < 1 and M > 0 possibly depending on x, i.e., M = M(x), such that
$$\|P_k(x,\cdot) - P_{IV}(\cdot)\|_{\rm var} \le M\rho^k \quad \forall\, k\ge1,$$
then the chain $\{x_k\}_{k\ge0}$ is called geometrically ergodic.
The probability measure PIV (·) is, in fact, the invariant probability measure of
{xk }k≥0 . It is clear that if the chain {xk }k≥0 is ergodic, then its invariant probability
measure is unique. In what follows we introduce criteria for ergodicity and geometric
ergodicity of the chain {xk }k≥0 valued in (Rm , B m ). For this, we first introduce some
definitions and related results, which the ergodicity of Markov chains is essentially
based on.
Definition 1.3.2 The chain $\{x_k\}_{k\ge0}$ valued in $(\mathbb{R}^m,\mathcal{B}^m)$ is called μ-irreducible if there exists a measure μ(·) on $(\mathbb{R}^m,\mathcal{B}^m)$ such that
$$\sum_{k=1}^\infty P_k(x,A) > 0 \tag{1.3.10}$$
for any $x\in\mathbb{R}^m$ and any $A\in\mathcal{B}^m$ with μ(A) > 0. The measure μ(·) is called the maximal irreducibility measure of $\{x_k\}_{k\ge0}$ if
Formula (1.3.10) indicates that for the μ -irreducible chain {xk }k≥0 , starting from
any initial state x0 = x ∈ Rm , the probability that in a finite number of steps the
sequence {xk }k≥0 enters any set A with positive μ -measure is always positive. In the
following, when we say that the chain {xk }k≥0 is μ -irreducible, we implicitly assume
that μ (·) is the maximal irreducibility measure of {xk }k≥0 .
Definition 1.3.3 Suppose A1 , · · · , Ad are disjoint sets in B m . For the chain {xk }k≥0 ,
if
(i) P(x, Ai+1 ) = 1 ∀ x ∈ Ai , i = 1, · · · , d − 1,
and
(ii) P(x, A1 ) = 1 ∀ x ∈ Ad ,
then {A1 , · · · , Ad } is called a d-cycle of {xk }k≥0 .
The d-cycle is called maximal if
(iii) there exists a measure ν(·) on $(\mathbb{R}^m,\mathcal{B}^m)$ such that
$$\nu(A_i) > 0,\ \ i=1,\cdots,d, \quad\text{and}\quad \nu\Big(\mathbb{R}^m \Big\backslash \bigcup_{i=1}^d A_i\Big) = 0, \tag{1.3.11}$$
and
(iv) for any sets $\{A'_1,\cdots,A'_{d'}\}$ satisfying (i) and (ii) with d replaced by d′, d′ must divide d.
The integer d is called the period of {xk }k≥0 if the d-cycle of {xk }k≥0 is maximal.
When the period equals 1, the chain {xk }k≥0 is called aperiodic.
The small set is another concept related to ergodicity of Markov chains valued
in (Rm , B m ). Let us first recall the ergodic criterion for Markov chains valued in a
countable state space.
Suppose that the chain {ϕk }k≥0 takes values in {1, 2, 3, · · · } and its transition
probability is denoted by pi j = P{ϕk+1 = j|ϕk = i}, i, j = 1, 2, 3, · · · . It is known that
if {ϕk }k≥0 is irreducible, aperiodic, and there exist a finite set C ⊂ {1, 2, 3, · · · }, a
nonnegative function g(·), and constants K > 0 and δ > 0 such that
where Pn (x, ·) is the n-step transition probability of the chain {xk }k≥0 , s(x) is a mea-
surable function on (Rm , B m ), and ν (·) is a measure on (Rm , B m ).
Definition 1.3.4 Assume {xk }k≥0 is a μ -irreducible chain. We say that {xk }k≥0 sat-
isfies the minorization condition M(m0 , β , s, ν ), where m0 ≥ 1 is an integer, β > 0 a
constant, s(x) a nonnegative measurable function on (Rm , B m ) with Eμ (s) > 0, and
ν (·) a probability measure on (Rm , B m ), if
The function s(x) and the probability measure ν (·) are called the small function and
small measure, respectively. If s(x) equals some indicator function, i.e., s(x) = IC (x)
for some C ∈ B m with μ (C) > 0 and
Lemma 1.3.1 Suppose that the μ -irreducible chain {xk }k≥0 satisfies the minoriza-
tion condition M(m0 , β , s, ν ). Then
(i) the small measure ν (·) is also an irreducibility measure for {xk }k≥0 , and
(ii) the set $C \triangleq \{x: s(x)\ge\gamma\}$ for any constant γ > 0 is small, whenever it is μ-positive.
Lemma 1.3.2 Suppose that the chain {xk }k≥0 is μ -irreducible. Then,
(i) for any set B with μ (B) > 0, there exists a small set C ⊂ B;
(ii) if s(x) is small, so is E(s(xn )|x0 = x) ∀ n ≥ 1; and
(iii) if both s(·) and s (·) are small functions, so is s(·) + s (·).
Theorem 1.3.1 Suppose that the chain {xk }k≥0 is μ -irreducible. If either
(i) there exists a small set C ∈ B m with μ (C) > 0 and an integer n, possibly
depending on C, such that
or
(ii) there exists a set A ∈ B m with μ (A) > 0 such that for any B ⊂ A, B ∈ B m
with μ (B) > 0 and for some positive integer n possibly depending on B,
Theorem 1.3.2 Suppose that {xk }k≥0 is a μ -irreducible, aperiodic Markov chain
valued in (Rm , B m ).
(i) Let s(x) be a small function. Then, any set C with μ (C) > 0 satisfying
$$\inf_{x\in C}\sum_{k=0}^{l} E\big(s(x_k) \mid x_0=x\big) > 0 \tag{1.3.20}$$
$$\inf_{x\in C}\sum_{k=0}^{l} P_k(x,B) > 0, \tag{1.3.21}$$
Theorem 1.3.3 Assume that the chain {xk }k≥0 is irreducible and aperiodic. If there
exist a nonnegative measurable function g(·), a small set S, and constants ρ ∈ (0, 1),
c1 > 0, and c2 > 0 such that
then there exist a probability measure PIV (·) and a nonnegative measurable function
M(x) such that
Under assumptions different from those required in Theorem 1.3.3 we have dif-
ferent kinds of ergodicity. To this end, we introduce the following definition.
Definition 1.3.5 For the chain {xk }k≥0 , the following property is called the Doeblin
condition: There exist a probability measure ν (·) and some constants 0 < ε < 1, 0 <
δ < 1 such that Pk0 (x, A) ≥ δ ∀ x ∈ Rm for an integer k0 whenever ν (A) > ε .
Theorem 1.3.4 Suppose that the chain {xk }k≥0 is irreducible and aperiodic. If
{xk }k≥0 satisfies the Doeblin condition, then there exist a probability measure PIV (·)
and constants M > 0, 0 < ρ < 1 such that
Theorem 1.3.5 Assume that the chain {xk }k≥0 is irreducible and aperiodic. If there
exist a nonnegative measurable function g(·), a small set S, and constants c1 > 0 and
c2 > 0 such that
Theorems 1.3.3, 1.3.4, and 1.3.5 are usually called the geometrically ergodic cri-
terion, the uniformly ergodic criterion, and the ergodic criterion, respectively. It is
clear that the geometrical ergodicity is stronger than the ergodicity, but weaker than
the uniform ergodicity. Next, we show that for a large class of stochastic dynamic
systems the geometrical ergodicity takes place if a certain stability condition holds.
By μn (·) we denote the Lebesgue measure on (Rn , B n ).
Let us consider the ergodicity of the single-input single-output (SISO) nonlinear ARX (NARX) system
$$y_{k+1} = f(y_k,\cdots,y_{k+1-p_0},u_k,\cdots,u_{k+1-q_0}) + \varepsilon_{k+1}, \tag{1.3.29}$$
where uk and yk are the system input and output, respectively, εk is the noise, (p0 , q0 )
are the known system orders, and f (·) is a nonlinear function.
The NARX system (1.3.29) is a straightforward generalization of the linear ARX
system and covers a large class of dynamic phenomena. This point will be made clear
in the later chapters.
By denoting
and
the NARX system (1.3.29) is transformed to the following state space model
Thus {xk }k≥0 is a Markov chain if {ξk }k≥0 satisfies certain probability condi-
tions, e.g., if {ξk }k≥0 is a sequence of mutually independent random variables. Er-
godicity of {xk }k≥0 can be investigated by the results given in the preceding sections.
To better understand the essence of the approach, let us consider the first order (i.e.,
p = q = 1) NARX system:
A1.3.1 Let the input {uk }k≥0 be a sequence of iid random variables with Euk =
0, Eu2k < ∞, and with a probability density function denoted by fu (·), which
is positive and continuous on R.
A1.3.2 {εk }k≥0 is a sequence of iid random variables with E εk = 0, E εk2 < ∞, and
with a density function fε (·), which is assumed to be positive and uniformly
continuous on R;
A1.3.3 {εk }k≥0 and {uk }k≥0 are mutually independent;
A1.3.4 f (·, ·) is continuous on R2 and there exist constants 0 < λ < 1, c1 > 0, c2 > 0,
and l > 0 such that | f (ξ1 , ξ2 )| ≤ λ |ξ1 | + c1 |ξ2 |l + c2 ∀ ξ = [ξ1 ξ2 ]T ∈ R2 ,
where λ , c1 , c2 , and l may be unknown;
A1.3.5 E|uk |l < ∞ and the initial value y0 satisfies E|y0 | < ∞.
Under the conditions A1.3.1–A1.3.3, it is clear that the state vector sequence
{xk }k≥0 defined by (1.3.32) is a time-homogeneous Markov chain valued in
(R2 , B 2 ). As to be seen in what follows, Assumption A1.3.4 is a kind of stability
condition to guarantee ergodicity of {xk }k≥0 .
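As a rough illustration of such a chain, the following simulation sketch takes the state as x_k = [y_k u_k]^T, a contraction-type f satisfying a condition of the form A1.3.4, and iid Gaussian input and noise; all of these specific choices are assumptions of the sketch, not taken from the text. The time averages of a bounded functional of the state settle down regardless of the initial condition, which is the kind of behavior the ergodicity results below make precise.

```python
import numpy as np

# Simulation sketch of a first-order NARX chain. Assumptions of this sketch:
# state x_k = [y_k, u_k]^T, f(y, u) = 0.5*y + sin(u) (so |f(y,u)| <= 0.5|y| + |u|,
# a condition of the type A1.3.4 with lambda = 0.5), and u_k, eps_k iid N(0,1),
# which have positive continuous densities as required by A1.3.1-A1.3.2.
rng = np.random.default_rng(5)
f = lambda y, u: 0.5 * y + np.sin(u)

n = 200000
y = np.empty(n + 1)
u = rng.standard_normal(n + 1)
eps = rng.standard_normal(n + 1)
y[0] = 10.0                                   # a deliberately "bad" initial condition
for k in range(n):
    y[k + 1] = f(y[k], u[k]) + eps[k + 1]

# Time averages of a bounded functional of the state settle to a constant,
# irrespective of the initial condition -- the behavior ergodicity guarantees.
h = np.cos(y[1:]) * np.exp(-u[1:] ** 2)
print(h[:1000].mean(), h[:10000].mean(), h.mean())
```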
Lemma 1.3.3 If A1.3.1–A1.3.3 hold, then the chain {xk }k≥0 defined by (1.3.32)
is μ2 -irreducible and aperiodic, and μ2 is the maximal irreducibility measure of
{xk }k≥0 . Further, any bounded set A ∈ B 2 with μ2 (A) > 0 is a small set.
(i) there exist a probability measure $P_{IV}(\cdot)$ on $(\mathbb{R}^2,\mathcal{B}^2)$, a nonnegative measurable function M(x), and a constant ρ ∈ (0, 1) such that
$$\|P_n(x,\cdot) - P_{IV}(\cdot)\|_{\rm var} \le M(x)\rho^n \quad \forall\, x\in\mathbb{R}^2; \tag{1.3.33}$$
(iii) PIV (·) is with probability density fIV (·, ·) which is positive on R2 , and
Proof. We first prove (i). Define the Lyapunov function $g(x) \triangleq |\xi_1| + \beta|\xi_2|^l$, where $x = [\xi_1\ \xi_2]^T\in\mathbb{R}^2$ and β > 0 is a constant to be determined.
By A1.3.1–A1.3.5, we have
Noticing (1.3.37) and (1.3.38) and applying Theorem 1.3.3, we see that (1.3.33)
holds.
We now prove (ii). By Theorem 1.3.3, the measurable function M(x) actually can
be taken as a + bg(x), where a and b are positive constants and g(x) is the Lyapunov
function defined above. To prove (ii) we first verify that
and
$$\|P_n(\cdot) - P_{IV}(\cdot)\|_{\rm var} = \sup_{A\in\mathcal{B}^2}\big(P_n(A) - P_{IV}(A)\big) - \inf_{A\in\mathcal{B}^2}\big(P_n(A) - P_{IV}(A)\big),$$
by (A.56) we have
According to A1.3.4, we have $\sup_{\|x\|\le K}|f(x_1,x_2)| < \infty$ for any fixed K > 0. As both fu (·) and fε (·) are positive, for a large enough K > 0 it follows that
A1.3.5’ $E|u_k|^l < \infty$ and $E\|Y_0\| < \infty$, where $Y_0 \triangleq [y_0, y_{-1},\cdots,y_{1-p}]^T$ is the initial value.
The probabilistic properties of {xk }k≥0 such as irreducibility, aperiodicity, and
ergodicity for the case p > 1, q > 1 can be established as those for the first order
system. In fact, we have the following theorem.
Theorem 1.3.7 If A1.3.1–A1.3.3, A1.3.4’, and A1.3.5’ hold, then the chain $\{x_k\}_{k\ge0}$ defined by (1.3.30) is $\mu_{p+q}$-irreducible, aperiodic, and
(i) there exist a probability measure $P_{IV}(\cdot)$ on $(\mathbb{R}^{p+q},\mathcal{B}^{p+q})$, a nonnegative measurable function M(x), and a constant 0 < ρ < 1 such that $\|P_n(x,\cdot) - P_{IV}(\cdot)\|_{\rm var} \le M(x)\rho^n\ \forall\, x\in\mathbb{R}^{p+q}$;
(ii) $\sup_n \int_{\mathbb{R}^{p+q}} M(x)P_n(dx) < \infty$ and $\|P_n(\cdot) - P_{IV}(\cdot)\|_{\rm var} \le c\rho^n$ for some constants c > 0 and 0 < ρ < 1.
Further, $P_{IV}(\cdot)$ is with probability density, which is positive on $\mathbb{R}^{p+q}$.
Theorem 1.3.7 can be proved similarly to Lemma 1.3.3 and Theorem 1.3.6. Here
we only give some remarks.
Remark 1.3.2 We note that (1.3.34) gives the expression of the invariant probability density of the first order NARX system $\big((p,q)=(1,1)\big)$. For the general case p > 1 and q > 1, the invariant probability density and its properties can similarly be obtained from the $n_0$-step transition probability $P_{n_0}(x,\cdot)$ with $n_0 = \max(p,q)$. For example, for the case (p, q) = (2, 1), by investigating the two-step transition probability, we find that the invariant probability density is expressed as follows:
$$f_{IV}(s_1,s_2,s_3) = \int_{\mathbb{R}^3}\Big[\int_{-\infty}^{\infty} f_\varepsilon\big(s_1 - f(s_2,x_1,t)\big)f_u(t)\,dt\Big]\, f_\varepsilon\big(s_2 - f(x_1,x_2,x_3)\big)\,P_{IV}(dx)\, f_u(s_3),$$
while for the case (p, q) = (3, 2), considering the three-step transition probability leads to the invariant probability density
$$f_{IV}(s_1,s_2,s_3,s_4,s_5) = \int_{\mathbb{R}^5}\Big[\int_{-\infty}^{\infty} f_\varepsilon\big(s_1 - f(s_2,s_3,x_1,s_5,t)\big)\, f_\varepsilon\big(s_2 - f(s_3,x_1,x_2,t,x_4)\big)\,f_u(t)\,dt\Big]\, f_\varepsilon\big(s_3 - f(x_1,x_2,x_3,x_4,x_5)\big)\,P_{IV}(dx)\, f_u(s_4)\,f_u(s_5).$$
The properties of $f_{IV}(s_1,s_2)$, $f_{IV}(s_1,s_2,s_3)$, and $f_{IV}(s_1,s_2,s_3,s_4,s_5)$ are derived from the above formulas by using the assumptions made in Theorem 1.3.7.
Remark 1.3.3 In A1.3.4’, a vector norm rather than the Euclidean norm is adopted.
This is because such a norm is more general than the Euclidean norm and λ in
(1.3.43) for many NARX systems in such a norm may be taken smaller than 1. The
fact that λ ∈ (0, 1) is of crucial importance for establishing stability and ergodicity
of the NARX system (see the proof of Theorem 1.3.6). It is natural to ask what will
happen if λ ≥ 1. Let us consider the following example:
$$y_{k+1} = y_k + \varepsilon_{k+1},$$
where $\{\varepsilon_k\}$ is iid. It is clear that $y_{k+1} = \sum_{i=1}^{k+1}\varepsilon_i$ if the initial value $y_0 = 0$. It is seen that for the above system, the constant λ equals 1 and $\{y_k\}_{k\ge1}$ is not ergodic. So, in a certain sense, the condition λ ∈ (0, 1) is necessary for ergodicity of the NARX system.
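A quick simulation contrasting the two cases, an illustration only with arbitrarily chosen sample sizes, shows the time averages settling down when λ < 1 and wandering when λ = 1.

```python
import numpy as np

# Contrast of the two cases in the remark: for y_{k+1} = lambda*y_k + eps_{k+1}
# with lambda < 1 the time averages settle down, while for lambda = 1 (the
# random walk above) they keep wandering -- no ergodicity.
rng = np.random.default_rng(6)
n = 100000
eps = rng.standard_normal(n)

for lam in (0.5, 1.0):
    y, s, averages = 0.0, 0.0, []
    for k in range(n):
        y = lam * y + eps[k]
        s += y
        if (k + 1) % 20000 == 0:
            averages.append(round(s / (k + 1), 3))
    print(lam, averages)
```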
For ergodicity of nonlinear systems, we assume that both {uk }k≥0 and {εk }k≥0
are with positive probability density functions. In fact, these assumptions are suf-
ficient but not necessary for ergodicity of stochastic systems. Let us consider the
following linear process:
A1.3.8 {εk }k≥0 is iid with density which is positive and continuous on a set U ∈ B r
satisfying μr (U) > 0, where μr (·) is the Lebesgue measure on (Rr , B r ).
Theorem 1.3.8 Assume that A1.3.6–A1.3.8 hold. Then the chain {xk }k≥0 defined by
(1.3.44) is geometrically ergodic.
Lemma 1.3.4 Given a matrix $A\in\mathbb{R}^{n\times n}$ and any ε > 0, there exists a vector norm $\|\cdot\|_v$ such that
$$\|Ax\|_v \le \big(\max_i|\lambda_i(A)| + \varepsilon\big)\|x\|_v \quad \forall\, x\in\mathbb{R}^n,$$
where $\lambda_i(A)$, $i=1,\cdots,n$ are the eigenvalues of A.
Proof. First, for the matrix A there exists a unitary matrix U such that
$$U^{-1}AU = \begin{bmatrix}\lambda_1 & t_{12} & t_{13} & \cdots & t_{1n}\\ 0 & \lambda_2 & t_{23} & \cdots & t_{2n}\\ \vdots & \ddots & \ddots & \ddots & \vdots\\ \vdots & & \ddots & \ddots & t_{n-1,n}\\ 0 & \cdots & \cdots & 0 & \lambda_n\end{bmatrix}. \tag{1.3.46}$$
For the given ε > 0, we can choose δ > 0 small enough such that
$$\sum_{j=i+1}^n |t_{ij}|\,\delta^{j-i} < \varepsilon, \quad i=1,\cdots,n-1. \tag{1.3.49}$$
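A numerical sketch of this construction is given below; it uses a standard scaling device consistent with (1.3.46) and (1.3.49) and SciPy's Schur decomposition, and is not necessarily identical to the remaining steps of the book's proof.

```python
import numpy as np
from scipy.linalg import schur

# Numerical sketch behind Lemma 1.3.4: from the Schur form U^{-1} A U = T in
# (1.3.46) and a scaling D = diag(1, delta, delta^2, ...), the norm
# ||x||_v = ||D^{-1} U^{-1} x||_inf has induced matrix norm at most
# max_i |lambda_i| + eps once delta satisfies the row-sum condition (1.3.49).
def scaled_schur_norm_matrix(A, eps):
    T, U = schur(A, output="complex")              # A = U T U^H, T upper triangular
    n = A.shape[0]
    delta = 1.0
    while any(sum(abs(T[i, j]) * delta ** (j - i) for j in range(i + 1, n)) >= eps
              for i in range(n - 1)):               # enforce (1.3.49)
        delta *= 0.5
    D_inv = np.diag([delta ** (-i) for i in range(n)])
    return D_inv @ U.conj().T                       # S with ||x||_v = ||S x||_inf

A = np.array([[0.9, 50.0], [0.0, 0.8]])             # spectral radius 0.9, huge 2-norm
eps = 0.05
S = scaled_schur_norm_matrix(A, eps)
induced = np.linalg.norm(S @ A @ np.linalg.inv(S), np.inf)   # induced norm of A in ||.||_v
print(induced, max(abs(np.linalg.eigvals(A))) + eps)         # induced <= rho(A) + eps
```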
It can be shown that {xk }k≥0 defined by (1.3.44) is a Markov chain valued in
(E, C m ). Carrying out a discussion similar to that for Theorems 1.3.6 and 1.3.7 and
noticing that the distribution of εk is absolutely continuous with respect to μm (·), we
can show that {xk }k≥0 is μm (·)-irreducible, aperiodic, and any bounded set in C m
with a positive μm -measure is a small set.
By Lemma 1.3.4 for the matrix F there exist a vector norm $\|\cdot\|_v$ and 0 < λ < 1 such that $\|Fx\|_v \le \lambda\|x\|_v\ \forall\, x\in\mathbb{R}^n$. Then by choosing the Lyapunov function $g(\cdot) = \|\cdot\|_v$ and by applying Theorem 1.3.3, it is shown that $\{x_k\}_{k\ge0}$ is geometrically ergodic.
Suppose that {uk }k≥0 and {εk }k≥0 are mutually independent and each of them is
a sequence of iid random variables. Further, assume A(z) = 1 − a1 z − · · · − a p z p is
stable, i.e., all roots of A(z) lie strictly outside the unit disk. It is clear that {y1,k }k≥0
and {y2,qk+l }k≥0 are iid sequences for each l = 0, 1, · · · , q − 1. But, this does not hold
for $\{y_{3,k}\}_{k\ge0}$, since for each k, $y_{3,k}$ depends on the past inputs $\{u_i\}_{i=0}^{k-1}$ and noises $\{\varepsilon_i\}_{i=0}^{k}$. However, since A(z) is stable, we can show that as l tends to infinity, $y_{3,k}$
Definition 1.4.1 The process {ϕk }k≥0 is called an α -mixing or strong mixing if
The sequences {α (k)}k≥0 , {β (k)}k≥0 , and {φ (k)}k≥0 are called the mixing co-
efficients. It can be shown that
$$|E\xi\eta - E\xi E\eta| \le 10(\alpha(n))^{1-\frac1p-\frac1q}\,(E|\xi|^p)^{\frac1p}(E|\eta|^q)^{\frac1q}. \tag{1.4.8}$$
(ii) Assume $\{\varphi_k\}_{k\ge0}$ is φ-mixing. For $\xi\in\mathcal{F}_0^k$ and $\eta\in\mathcal{F}_{n+k}^\infty$, if $E[|\xi|^p + |\eta|^q] < \infty$ for some p > 1, q > 1, and $\frac1p+\frac1q=1$, then
$$|E\xi\eta - E\xi E\eta| \le 2(\phi(n))^{\frac1p}\,(E|\xi|^p)^{\frac1p}(E|\eta|^q)^{\frac1q}. \tag{1.4.9}$$
Definition 1.4.2 Let {Fk }k≥0 be a sequence of nondecreasing σ -algebras. The se-
quence {ϕk , Fk }k≥0 is called a simple mixingale if ϕk is Fk -measurable and if
for two sequences of nonnegative constants {ck }k≥0 and {ψm }m≥0 with ψm → 0
as m → ∞, the following conditions are satisfied:
(i) $\big(E|E(\varphi_k \mid \mathcal{F}_{k-m})|^2\big)^{\frac12} \le \psi_m c_k \quad \forall\, k\ge0$ and $\forall\, m\ge0$,
(ii) $E\varphi_k = 0$,
where $\mathcal{F}_k \triangleq \{\emptyset,\Omega\}$ if k ≤ 0.
From the definition, we see that $\{c_k\}_{k\ge0}$ and $\{\psi_m\}_{m\ge0}$ reflect the moment and mixing coefficients of $\{\varphi_k\}_{k\ge0}$, which are important for the almost sure convergence of $\sum_{k=0}^\infty \varphi_k$. In fact, we have the following result.
and
$$\sum_{k=1}^\infty (\log k)(\log\log k)^{1+\gamma}\,\psi_k^2\sum_{j=k}^\infty c_j^2 < \infty \ \text{ for some } \gamma>0. \tag{1.4.11}$$
Then
$$\sum_{k=1}^\infty \varphi_k < \infty \ \text{ a.s.} \tag{1.4.12}$$
Theorem 1.4.2 Assume that $\{\varphi_k\}_{k\ge0}$ is α-mixing with mixing coefficients denoted by $\{\alpha(k)\}_{k\ge0}$. Let $\{\Phi_k(\cdot)\}_{k\ge0}$ be a sequence of functions $\Phi_k(\cdot): \mathbb{R}\to\mathbb{R}$ with $E\Phi_k(\varphi_k) = 0$. If there exist constants ε > 0 and γ > 0 such that
$$\sum_{k=1}^\infty \big(E|\Phi_k(\varphi_k)|^{2+\varepsilon}\big)^{\frac{2}{2+\varepsilon}} < \infty \tag{1.4.13}$$
and
$$\sum_{k=1}^\infty \log k\,(\log\log k)^{1+\gamma}\,(\alpha(k))^{\frac{\varepsilon}{2+\varepsilon}} < \infty, \tag{1.4.14}$$
then
$$\sum_{k=1}^\infty \Phi_k(\varphi_k) < \infty \ \text{ a.s.} \tag{1.4.15}$$