$$\hat{\theta} = \Big(\sum_{t=1}^{N} u_t^2\Big)^{-1}\Big(\sum_{t=1}^{N} u_t y_t\Big) = \frac{1}{N}\sum_{t=1}^{N} y_t,$$
where $N$ is the number of data points. The estimation error

$$\hat{\theta} - \theta^0 = \frac{1}{N}\sum_{t=1}^{N} y_t - \theta^0 = \frac{1}{N}\sum_{t=1}^{N}(\theta^0 + w_t) - \theta^0 = \frac{1}{N}\sum_{t=1}^{N} w_t$$

does converge to zero as $N \to \infty$ under natural assumptions on $w_t$. However, for any finite $N$, $\frac{1}{N}\sum_{t=1}^{N} w_t$ cannot be expected to be zero and a system-model mismatch is present. □
If a probabilistic description of the uncertain elements is adopted, under general circumstances the only probabilistic claim we are in a position to make is that

$$\Pr\{M = S\} = 0,$$

clearly a useless statement if our intention is that of crediting the model with reliability. Thus, when an identification procedure only returns a nominal model, we certainly cannot trust it to coincide with the true system and - when this model is used for any purpose - it is done in the hope that the system-model mismatch does not affect the final result too badly. While this way of proceeding is practical, it is not grounded on a solid theoretical basis.

A scientific use of an identified nominal model requires instead that this model be complemented with additional information, information able to certify the model accuracy.
1.2 Model-set identification

A way to obtain certified results is to move from nominal-model to model-set identification. An example explains the idea.
Example 2. (continuation of Example 1). In system (1), suppose that $w_t$ is white and Gaussian with zero mean and unitary variance. Then, $\hat{\theta} - \theta^0 = \frac{1}{N}\sum_{t=1}^{N} w_t$ is a Gaussian random variable with variance equal to $1/N$ and we can easily attach a probability to the event $\{|\hat{\theta} - \theta^0| \le \varepsilon\}$ for any given finite $N$ by looking up a Gaussian table. The practical use of this result is that - after estimating $\hat{\theta}$ from data - an interval $[\hat{\theta} - \varepsilon, \hat{\theta} + \varepsilon]$ can be computed having guaranteed probability to contain the true parameter value. □
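As a concrete instance (a standard normal-distribution computation, not part of the original example): since $\hat{\theta} - \theta^0 \sim N(0, 1/N)$, one has

$$\Pr\{|\hat{\theta} - \theta^0| \le \varepsilon\} = 2\Phi(\varepsilon\sqrt{N}) - 1,$$

where $\Phi$ is the standard Gaussian distribution function, so that, e.g., $\varepsilon = 1.96/\sqrt{N}$ gives a confidence of approximately 0.95.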
In more general terms, the situation can be depicted as in Figure 1: Given any finite $N$, the parameter estimate $\hat{\theta}$ is affected by random fluctuation, so that it has probability zero to exactly hit the system parameter value. Considering a region around $\hat{\theta}$, however, elevates the probability that $\theta^0$ belongs to the region to a nonzero - and therefore meaningful - value.

Fig. 1. Random fluctuation in parameter estimate.
This observation points to a simple, but important fact:

exploiting the information conveyed by a finite number of data points can in standard situations at best generate a model set to which the system belongs. Any further attempt to provide sharper estimation results (e.g. one single model) goes beyond the available information level and generates results that cannot be guaranteed.

As a consequence, considering identification procedures returning parameter sets - as opposed to single parameter values - is a sensible approach in the attempt to provide guaranteed results.
Having made this conceptual point, it is also important to observe that this in no way denies the importance of nominal models: Nominal models do play an important practical role in many areas such as simulation, prediction and control. Thus, constructing nominal models is often a significant step towards finding a solution of a problem. Yet, in order to judge the quality of the solution one should also account for uncertainty, as given by a model uncertainty set.
1.3 The role of a-priori information
Any system identification process is based on two sources of information:

(i) a-priori information, that is we a-priori know that S belongs to a system set S;
(ii) a-posteriori information, that is data collected from the system.

Without any a-priori information we can hardly generate any guaranteed result. In Example 2, we assumed quite a bit of a-priori information: The system structure was known; and the fact that the noise was Gaussian with variance 1 was exploited in the construction of the confidence region.
Having said that a-priori information cannot be totally renounced, it is also important to point out that a good theory should demand as little prior as possible. Indeed:

- stringent prior conditions reduce the theory's applicability;
- in real applications, stringent prior conditions may be difficult to verify even when they are actually satisfied.

Going back to Example 2, let us ask the following question: Can we reduce prior knowledge and still provide guaranteed results? Some answer is provided in the following example continuation.
Example 3. (continuation of Example 2). Suppose now that the variance of $w_t$ is $\sigma^2$ and that it is unknown and no upper bound on its value is available. We still want to quantify the probability of the event $\{|\hat{\theta} - \theta^0| \le \varepsilon\}$ and we also want this probability to be guaranteed for any situation allowed by our a-priori knowledge, that is for any value of $\sigma^2$.

Since $\hat{\theta} - \theta^0$ is Gaussian with variance $\sigma^2/N$, for any given $\varepsilon > 0$ (even very large), $\inf_{\sigma^2} \Pr\{|\hat{\theta} - \theta^0| \le \varepsilon\} = 0$, so that the only statement valid for all possible $\sigma^2$ is

$$\Pr\{|\hat{\theta} - \theta^0| \le \varepsilon\} \ge 0,$$

which is evidently a void statement.
A natural question is: Is the situation hopeless and do we have to give up on finding guaranteed results or can we attempt some other approach? To answer, let us try to put the problem under a different light: A well known fact from statistics is that

$$\frac{\hat{\theta} - \theta^0}{\sqrt{\frac{1}{N(N-1)}\sum_{t=1}^{N}(y_t - \hat{\theta})^2}}$$

has a Student t-distribution (Richmond (1964)) with $N-1$ degrees of freedom, independently of the value of $\sigma^2$. Thus, given a $\delta$, using tables of the Student t-distribution one can determine an $\alpha$ such that

$$\Pr\left\{|\hat{\theta} - \theta^0| \le \delta\sqrt{\tfrac{1}{N(N-1)}\sum_{t=1}^{N}(y_t - \hat{\theta})^2}\right\} = \alpha,$$

where the important fact is that the result holds no matter what $\sigma^2$ is. Hence:

$$\Pr\left\{|\hat{\theta} - \theta^0| \le \varepsilon(y)\right\} = \alpha \qquad (2)$$

with

$$\varepsilon(y) = \delta\sqrt{\frac{1}{N(N-1)}\sum_{t=1}^{N}(y_t - \hat{\theta})^2}.$$

The latter is the desired accuracy evaluation result: One selects a $\delta$ and computes the corresponding accuracy variable $\varepsilon(y)$ using data. The table of the Student t-distribution is then used to find the probability of the confidence interval. □
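A minimal numerical sketch of this accuracy evaluation (illustrative only, not code from the paper; NumPy and SciPy are assumed available, and the function and variable names are ours):

```python
import numpy as np
from scipy import stats

def student_t_accuracy(y, delta):
    """Given data y and a chosen delta, return the estimate, the accuracy
    variable eps(y), and the probability alpha of the confidence interval."""
    N = len(y)
    theta_hat = np.mean(y)
    # eps(y) = delta * sqrt( (1/(N(N-1))) * sum_t (y_t - theta_hat)^2 )
    eps = delta * np.sqrt(np.sum((y - theta_hat) ** 2) / (N * (N - 1)))
    # probability that a Student t variable with N-1 degrees of freedom
    # falls in [-delta, delta]; read from the distribution instead of a table
    alpha = 2 * stats.t.cdf(delta, df=N - 1) - 1
    return theta_hat, eps, alpha

rng = np.random.default_rng(0)
y = 1.0 + 3.0 * rng.standard_normal(20)          # noise variance unknown to the user
theta_hat, eps, alpha = student_t_accuracy(y, delta=2.09)
print(f"[{theta_hat - eps:.3f}, {theta_hat + eps:.3f}] with probability {alpha:.3f}")
```

The interval and its probability are valid whatever the (unknown) noise variance is, which is exactly the pivotal property discussed above.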
Two remarks on Example 3 are in order:

(1) If $\sigma^2$ is unknown, we have seen that no accuracy result - save the void one - can be established for a fixed deterministic $\varepsilon$. Yet, by allowing $\varepsilon$ to be data dependent (i.e. by substituting $\varepsilon$ with $\varepsilon(y)$) a meaningful conclusion like (2) can be derived. This fact tells us that we need to let the data speak and the uncertainty set size has to depend on observed data.

(2) The random variable $(\hat{\theta} - \theta^0)\big/\sqrt{\frac{1}{N(N-1)}\sum_{t=1}^{N}(y_t - \hat{\theta})^2}$ is a function of data, and the distribution of the data themselves depends on the noise variance $\sigma^2$. Despite this, the distribution of this random variable is independent of $\sigma^2$. In the statistical literature, this is called a pivotal variable because its distribution does not depend on the variable elements of the problem. The existence of a pivotal variable is crucial in this example to establish a result that is guaranteed for any $\sigma^2$.
Unfortunately, finding pivotal variables is generally hard even for very simple examples:
Example 4. Consider now the autoregressive system

$$y_t + \theta^0 y_{t-1} = w_t,$$

where again $w_t$ is zero-mean Gaussian with unknown variance $\sigma^2$. Finding a pivotal variable for this situation is already a very hard problem. □
Thus, the approach outlined in Example 3 does
not appear to be easy to generalize. Nevertheless,
the concept of pivotal distribution - or, more
generally, of pivotal result - has a general appeal
and we shall come back to it at a later stage in
this paper.
1.4 Content of the present paper
In this paper, we provide an overview of LSCR as a methodology to address the above described problems of determining guaranteed model sets in system identification. LSCR delivers data-based confidence sets that contain the true parameter value with guaranteed probability for any finite data sample and it requires minimal knowledge on the noise affecting the system.

The main idea behind the LSCR method is to compute empirical correlation functions and to leave out those regions in parameter space where the correlation functions take on positive or negative values too many times. This principle, which is the reason for the name of the method, is based on the fact that for the true parameter value the correlation functions are sums of zero mean random variables and, therefore, it is unlikely that nearly all of them will be positive or nearly all of them will be negative.
Part I of the paper deals with the case where the system S belongs to the model class M.

In many cases, however, the structure of S is only partially known, and - even when S is known to belong to a certain class S - we at times deliberately look for a model in a restricted model class M because S is too large to work with. In these cases, asking for a probability that S ∈ M loses any meaning and we should instead ask whether M contains a suitable approximation or projection of S. Part II of this paper contains results for systems with unmodeled dynamics. Further generalizations to a nonlinear set-up are given in Part III.

The presentation is mainly based on examples to help readability, while general results are only sketched.
It goes without saying that the perspective of this paper of reviewing LSCR is a matter of choice and other techniques exist to deal with finite sample identification. Without any claim or attempt of completeness, some of these techniques are briefly mentioned in the next section.
2. A SHORT LITERATURE REVIEW OF
FINITE SAMPLE SYSTEM IDENTIFICATION
The pioneering work of Vapnik and Chervonenkis (1968, 1971) provided the foundations of what became the field of statistical learning theory, e.g. Vapnik (1998), Cherkassky and Mulier (1998), Vidyasagar (2002). By using exponential inequalities such as the Hoeffding inequality (Hoeffding (1963)) and combinatorial arguments, they derived rigorous uniform probabilistic bounds on the difference between expected values and empirical means computed using a finite number of data points. An underlying assumption was that the data points were independent and identically distributed, an assumption not often satisfied in a system identification setting. Since the 1990s results have appeared which extend the uniform convergence results to dependent data sequences such as M-dependent sequences, and α-, β- and φ-mixing sequences, e.g. Yu (1994), Bosq (1998), Karandikar and Vidyasagar (2002). These ideas were applied to problems in identification and prediction, developing non-asymptotic bounds on the estimation and prediction accuracies. See e.g. Modha and Masry (1996, 1998), Campi and Kumar (1998), Goldenshluger (1998), Weyer et al. (1999), Weyer (2000), Meir (2000), Venkatesh and Dahleh (2001), Campi and Weyer (2002), Weyer and Campi (2002), Vidyasagar and Karandikar (2002, 2006), Bartlett (2003).
The finite sample results with roots in learning theory gave bounds on the difference between expected and empirical means, and the bounds depended on the number of data points, but not on the actually seen data. For this reason they could be quite conservative for the particular system at hand. A way around conservatism is to make active use of the data and construct the bounds using data coming from the system under investigation. Data based methods for evaluation of model quality using bootstrap and subsampling (e.g. Efron and Tibshirani (1993), Shao and Tu (1995), Politis (1998), Politis et al. (1999)) have been explored in Tjärnström and Ljung (2002), Bittanti and Lovera (2000) and Dunstan and Bitmead (2003). However, few truly rigorous finite sample results for model quality assessment are available for these techniques. Using subsampling techniques, Campi et al. (2004) obtained some guaranteed results for generalised FIR systems. See also Hjalmarsson and Ninness (2004) and Ninness and Hjalmarsson (2004) for non-asymptotic variance expressions for frequency function estimates.
Data based finite sample results have also been derived in the context of model validation by Smith and Dullerud (1996), and in the set membership approach to system identification, e.g. Milanese and Vicino (1991), Vicino and Zappa (1996), Giarre et al. (1997a,b), Garulli et al. (2000, 2002), Milanese and Taragna (2005).

Along a different line of thought, finite sample properties of worst case identification in deterministic frameworks were studied in Dahleh et al. (1993, 1995), Poolla and Tikku (1994) and Harrison et al. (1996). Spall (1995) considered uncertainty bounds for M-estimators with small sample sizes. Non-parametric identification methods with a finite number of data points have been studied by Welsh and Goodwin (2002) (see also Heath (2001)), while Ding and Chen (2005) have studied the recursive least squares method in a finite sample context.
The LSCR method presented in this paper makes use of subsampling techniques, and in particular it extends the results of Hartigan (1969, 1970) to a dynamical setting. Loosely speaking, one could view LSCR as a stochastic set membership approach to system identification, where the setting we consider is the standard stochastic setting for system identification from e.g. Ljung (1999) or Söderström and Stoica (1988), but where the outcomes are sets of models as in set membership identification.
PART I: Known system structure
3. LSCR: A PRELIMINARY EXAMPLE
We start with a preliminary example that readily
provides some insight in the LSCR technique.
More general results and comments are given in
the next section.
Consider again the system of Example 4:

$$y_t + \theta^0 y_{t-1} = w_t. \qquad (3)$$

Assume we know that $w_t$ is an independent process and that it has a symmetric distribution around zero. Apart from this, no knowledge on the noise is assumed: It can have any (unknown) distribution: Gaussian; uniform; flat with small-area spikes at high-value locations describing the chance of outliers; etc. Its variance can be any (unknown) number, from very small to very large. We do not even make any stationarity assumption on $w_t$ and allow that its distribution varies with time. The assumption that $w_t$ is independent can be interpreted by saying that we know the system structure: It is an autoregressive system of order 1.

9 data points were generated according to (3) and they are shown in Figure 2. The values of $\theta^0$ and $w_t$ are for the moment not revealed to the reader (they will be disclosed later). Our goal is to form a confidence region for $\theta^0$ from the available data set.
Fig. 2. Data for the preliminary example.
We next adopt a user's point of view and describe the procedure in order to solve this problem with LSCR. Later on, comments regarding the obtained results will be provided.
Rewrite the system as a model with generic parameter $\theta$:

$$y_t + \theta y_{t-1} = w_t.$$

The predictor and prediction error associated with the model are

$$\hat{y}_t(\theta) = -\theta y_{t-1}, \qquad \varepsilon_t(\theta) = y_t - \hat{y}_t(\theta) = y_t + \theta y_{t-1}.$$

Next we compute the prediction errors $\varepsilon_t(\theta)$ for $t = 1, \ldots, 8$, and calculate

$$f_{t-1}(\theta) = \varepsilon_{t-1}(\theta)\,\varepsilon_t(\theta), \qquad t = 2, \ldots, 8.$$

Note that the $f_{t-1}(\theta)$ are functions of $\theta$ that can indeed be computed from the available data set. Then, we take the average of some of these functions in many different ways. Precisely, we form 8 averages of the form:

$$g_i(\theta) = \frac{1}{4}\sum_{k \in I_i} f_k(\theta), \qquad i = 1, \ldots, 8, \qquad (4)$$

where the sets $I_i$ are subsets of $\{1, \ldots, 7\}$ containing the elements highlighted by a bullet in the table below. For instance: $I_1 = \{1, 2, 4, 5\}$, $I_2 = \{1, 3, 4, 6\}$, etc. The last set, $I_8$, is an exceptional set: It is empty and we let $g_8(\theta) = 0$. The functions $g_i(\theta)$, $i = 1, \ldots, 7$, can be interpreted as empirical 1-step correlations of the prediction error.
        1   2   3   4   5   6   7
  I_1   •   •       •   •
  I_2   •       •   •       •
  I_3       •   •       •   •
  I_4   •   •                •   •
  I_5   •       •       •        •
  I_6       •   •   •            •
  I_7               •   •   •    •
  I_8
The functions $g_i(\theta)$, $i = 1, \ldots, 7$, obtained for the data in Figure 2 are displayed in Figure 3.

Fig. 3. The $g_i(\theta)$ functions (empirical correlations).
Now, a simple reasoning leads us to conclude that these $g_i(\theta)$ functions have a tendency to intersect the $\theta$-axis near $\theta^0$ and that, for $\theta = \theta^0$, they take on positive or negative value with equal probability. Why is it so? Let us re-write one of these functions, say $g_1(\theta)$, as follows:

$$g_1(\theta) = \frac{1}{4}\sum_{k \in \{1,2,4,5\}} [y_k + \theta y_{k-1}][y_{k+1} + \theta y_k]$$
$$= \frac{1}{4}\sum_{k \in \{1,2,4,5\}} \big[(y_k + \theta^0 y_{k-1}) + (\theta - \theta^0) y_{k-1}\big]\,\big[(y_{k+1} + \theta^0 y_k) + (\theta - \theta^0) y_k\big]$$
$$= \frac{1}{4}\sum_{k \in \{1,2,4,5\}} \big[w_k + (\theta - \theta^0) y_{k-1}\big]\,\big[w_{k+1} + (\theta - \theta^0) y_k\big]$$
$$= (\theta - \theta^0)^2\,\frac{1}{4}\sum_{k \in \{1,2,4,5\}} y_{k-1} y_k \;+\; (\theta - \theta^0)\,\frac{1}{4}\sum_{k \in \{1,2,4,5\}} w_k y_k \;+\; (\theta - \theta^0)\,\frac{1}{4}\sum_{k \in \{1,2,4,5\}} y_{k-1} w_{k+1} \;+\; \frac{1}{4}\sum_{k \in \{1,2,4,5\}} w_k w_{k+1}.$$

If $\frac{1}{4}\sum_{k \in \{1,2,4,5\}} w_k w_{k+1} = 0$, the intersection with the $\theta$-axis is obtained for $\theta = \theta^0$. The vertical displacement $\frac{1}{4}\sum_{k \in \{1,2,4,5\}} w_k w_{k+1}$ is a random variable with equal probability of being positive or negative; moreover, due to averaging, vertical dispersion caused by noise is de-emphasized (it would be more de-emphasized with more data). So, we would like to claim that $\theta^0$ will be somewhere near where the average functions intersect the $\theta$-axis. Moreover - following the above reasoning - we recognize that for $\theta = \theta^0$ it is very unlikely that almost all the $g_i(\theta)$, $i = 1, \ldots, 7$, functions have the same sign, and we therefore discard the rightmost and leftmost regions where at most one function is less than zero or greater than zero. The resulting interval $[0.04, 0.48]$ is the confidence region for $\theta^0$.
A deeper reasoning (given in Appendix A for not breaking the continuity of discourse here) reveals that a fairly strong claim on the obtained interval can be made:

RESULT: the confidence region constructed this way has exact probability $1 - 2 \cdot 2/8 = 0.5$ to contain the true parameter value $\theta^0$.
A few comments are in order:

(1) The interval is stochastic because it depends on data; the true parameter value $\theta^0$ is not, and it has a fixed location that does not depend on any random element. Thus, what the above RESULT says is that the interval is subject to random fluctuation and covers the true parameter value $\theta^0$ in 50% of the cases.

Fig. 4. 10 more trials.

To better understand the nature of the result, we performed 10 more simulation trials obtaining the results in Figure 4. Note that $\theta^0$ and $w_t$ were as follows: $\theta^0 = 0.2$, $w_t$ independent with uniform distribution between $-1$ and $+1$.
(2) In this example, probability is low (50%) and the interval is rather large. With more data, we obtain smaller intervals with higher probability (see also the next section).

(3) The LSCR algorithm was applied with no knowledge on the noise level or distribution and, yet, it returned an interval whose probability was exact, not an upper bound. What is key here is that the above RESULT is a pivotal result, as the probability remains the same no matter what the noise characteristics are.

(4) The result was established along a totally different inference principle from standard Prediction Error Minimization (PEM) methods. In particular - differently from the asymptotic theory of PEM - LSCR does not construct the confidence region by quantifying the variability in an estimate.

(5) We also mention a technical aspect: The pivotal RESULT holds because $\{I_i,\ i = 1, \ldots, 8\}$ form a group under the symmetric difference operation, that is $(I_j \cup I_k) - (I_j \cap I_k)$ returns another set in $\{I_i,\ i = 1, \ldots, 8\}$ for any $j$ and $k$. For instance, $(I_1 \cup I_2) - (I_1 \cap I_2) = I_3$.
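To make the construction above concrete, here is a minimal Python sketch of the whole procedure (a sketch under the assumptions stated in this section, not the authors' code; the data generation and grid are illustrative):

```python
import numpy as np

theta0 = 0.2                                     # value disclosed in the text
rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=9)               # independent, symmetric noise
y = np.zeros(9)
for t in range(1, 9):                            # y_t + theta0*y_{t-1} = w_t
    y[t] = -theta0 * y[t - 1] + w[t]

# the group of index sets of the table above (I_8 is empty, g_8 = 0)
I = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
     {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]

def g_functions(theta):
    eps = y[1:] + theta * y[:-1]                 # eps_t(theta), t = 1,...,8
    f = eps[:-1] * eps[1:]                       # f_{t-1}(theta), t = 2,...,8
    return np.array([f[[k - 1 for k in Ii]].mean() for Ii in I])

# keep theta where at least q = 2 of the g_i are > 0 and at least 2 are < 0
grid = np.linspace(-1.0, 1.0, 2001)
region = [th for th in grid
          if (g_functions(th) > 0).sum() >= 2 and (g_functions(th) < 0).sum() >= 2]
if region:                                       # printed as one interval for simplicity
    print(f"confidence interval: [{region[0]:.2f}, {region[-1]:.2f}]")
```

By the RESULT, an interval obtained this way covers $\theta^0$ in 50% of the noise realizations, whatever the noise distribution.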
4. LSCR FOR GENERAL LINEAR SYSTEMS
4.1 Data generating system
Consider now the general linear system in Figure
5.
Fig. 5. The system.
We assume that $w_t$ and $u_t$ are independent processes. This does not mean however that we are confining ourselves to an open-loop configuration since closed-loop systems can be reduced to the set-up in Figure 5 by regarding $w_t$ and $u_t$ as external signals, see Figure 6.

$G(\theta^0)$ and $H(\theta^0)$ are stable rational transfer functions. $H(\theta^0)$ is monic and has a stable inverse. $w_t$ is a zero-mean independent sequence (noise). No a-priori knowledge of the noise level is assumed.

The basic assumption we make is that the system structure is known and, correspondingly, we take a full-order model class of the form:
Fig. 6. Closed-loop recast as open-loop.
$$y_t = G(\theta) u_t + H(\theta) w_t,$$

such that the true transfer functions $G(\theta^0)$ and $H(\theta^0)$ are obtained for $\theta = \theta^0$ and for no other parameter than this. We assume that $\theta$ is restricted to a set $\Theta$ such that $H(\theta)$ is monic and $G(\theta)$, $H(\theta)$ and $H^{-1}(\theta)$ are stable for all $\theta \in \Theta$.

Our goal is to construct an algorithm that works in the following way (see Figure 7): A finite input-output data set is given to the algorithm together with a probability $p$. The algorithm is required to return a confidence region that contains the true $\theta^0$ with probability $p$ under assumptions on the noise as general as possible.

Fig. 7. The algorithm.
4.2 Construction of confidence regions

We start by describing procedures for the determination of confidence sets $\hat{\Theta}_r$ based on correlation properties of $\varepsilon$ (the prediction error) at different time instants (this generalizes the preliminary example in Section 3) and of confidence sets $\hat{\Theta}_s^u$ based on cross-correlation properties of $\varepsilon$ and $u$. In general, confidence regions $\hat{\Theta}$ for $\theta^0$ can be constructed by taking the intersection of a few of the $\hat{\Theta}_r$ and $\hat{\Theta}_s^u$ sets and this is discussed at the end of this section.
Procedure for the construction of $\hat{\Theta}_r$

(1) Compute the prediction errors

$$\varepsilon_t(\theta) = y_t - \hat{y}_t(\theta) = H^{-1}(\theta)\, y_t - H^{-1}(\theta) G(\theta)\, u_t$$

for a finite number of values of $t$, say $t = 1, 2, \ldots, K$;

(2) Select an integer $r \ge 1$. For $t = 1+r, \ldots, N+r = K$, compute

$$f_{t-r,r}(\theta) = \varepsilon_{t-r}(\theta)\,\varepsilon_t(\theta);$$

(3) Let $I = \{1, \ldots, N\}$ and consider a collection $\mathcal{G}$ of subsets $I_i \subseteq I$, $i = 1, \ldots, M$, forming a group under the symmetric difference operation (i.e. $(I_i \cup I_j) - (I_i \cap I_j) \in \mathcal{G}$, if $I_i, I_j \in \mathcal{G}$). Compute

$$g_{i,r}(\theta) = \sum_{k \in I_i} f_{k,r}(\theta), \qquad i = 1, \ldots, M;$$

(4) Select an integer $q$ in the interval $[1, (M+1)/2)$ and find the region $\hat{\Theta}_r$ such that at least $q$ of the $g_{i,r}(\theta)$ functions are bigger than zero and at least $q$ are smaller than zero. □
The above procedure is the same as the one used for construction of the confidence set in the preliminary example in Section 3. In that example we had $H^{-1}(\theta) = 1 + \theta z^{-1}$, $G(\theta) = 0$, and $K = 8$, $N = 7$, $r = 1$, $M = 8$ and $q = 2$. The normalization $\frac{1}{4}$ in the preliminary example was introduced for the purpose of interpreting the $g_i(\theta)$ functions as empirical averages, but it could have been dropped, similarly to point 3 in the above procedure, without affecting the final result.
In the procedure, the group $\mathcal{G}$ can be freely selected. Thus, if e.g. $I = \{1, 2, 3, 4\}$, a suitable group is $\mathcal{G} = \{\{1, 2\}, \{3, 4\}, \emptyset, \{1, 2, 3, 4\}\}$; another one is $\mathcal{G} = \{\{1\}, \{2, 3, 4\}, \emptyset, \{1, 2, 3, 4\}\}$; yet another one is $\mathcal{G} =$ all subsets of $I$. While the theory presented holds for any choice and the region $\hat{\Theta}_r$ is guaranteed to be a confidence region in any case (see Theorem 1 below), the feasible choices are limited by computational considerations. For example, the set of all subsets cannot normally be chosen as it is a truly large set. Gordon (1974) discusses how to construct groups of moderate size where the subsets contain approximately half of the elements in $I$, and such a procedure is also summarized in Appendix B. These sets are particularly well suited for use in point 3 of the above procedure.
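The group requirement is easy to check directly. The following small snippet (illustrative only, not from the paper) verifies closure under symmetric difference for a candidate collection:

```python
def is_symmetric_difference_group(collection):
    """True if the collection of index sets contains the empty set and is
    closed under the symmetric difference operation."""
    sets = [frozenset(s) for s in collection]
    if frozenset() not in sets:
        return False
    return all((a ^ b) in sets for a in sets for b in sets)

G1 = [{1, 2}, {3, 4}, set(), {1, 2, 3, 4}]      # a valid group
G2 = [{1, 2}, {2, 3}, set(), {1, 2, 3, 4}]      # not closed: {1,2} ^ {2,3} = {1,3}
print(is_symmetric_difference_group(G1), is_symmetric_difference_group(G2))
```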
The intuitive idea behind the construction in the procedure is that, for $\theta = \theta^0$, the functions $g_{i,r}(\theta)$ assume positive or negative value at random, so that it is unlikely that almost all of them are positive or that almost all of them are negative. Since point 4 in the construction of $\hat{\Theta}_r$ discards regions where all $g_{i,r}(\theta)$'s but a small fraction ($q$ should be taken to be small compared to $M$) are of the same sign, we expect that $\theta^0 \in \hat{\Theta}_r$ with high probability. This is put on solid mathematical grounds in Theorem 1 below, showing that the probability that $\theta^0 \in \hat{\Theta}_r$ is actually $1 - 2q/M$. Thus, $q$ is a tuning parameter that has to be selected such that a desired probability of the confidence region is obtained. Moreover, as $q$ increases, we exclude larger and larger regions of $\Theta$ and hence $\hat{\Theta}_r$ shrinks and the probability that $\theta^0 \in \hat{\Theta}_r$ decreases.
The procedure for construction of the sets $\hat{\Theta}_s^u$ is in the same spirit, the only difference being that the empirical auto-correlations in point 2 are replaced by empirical cross-correlations between the input signal and the prediction error.

Procedure for the construction of $\hat{\Theta}_s^u$

(1) Compute the prediction errors

$$\varepsilon_t(\theta) = y_t - \hat{y}_t(\theta) = H^{-1}(\theta)\, y_t - H^{-1}(\theta) G(\theta)\, u_t$$

for a finite number of values of $t$, say $t = 1, 2, \ldots, K$;

(2) Select an integer $s \ge 0$. For $t = 1+s, \ldots, N+s = K$, compute

$$f^u_{t-s,s}(\theta) = u_{t-s}\,\varepsilon_t(\theta);$$

(3) Let $I = \{1, \ldots, N\}$ and consider a collection $\mathcal{G}$ of subsets $I_i \subseteq I$, $i = 1, \ldots, M$, forming a group under the symmetric difference operation. Compute

$$g^u_{i,s}(\theta) = \sum_{k \in I_i} f^u_{k,s}(\theta), \qquad i = 1, \ldots, M;$$

(4) Select an integer $q$ in the interval $[1, (M+1)/2)$ and find the region $\hat{\Theta}_s^u$ such that at least $q$ of the $g^u_{i,s}(\theta)$ functions are bigger than zero and at least $q$ are smaller than zero. □
The next theorem gives the exact probability that the true parameter $\theta^0$ belongs to one particular of the above constructed sets. The proof of this theorem - as well as comments on the technical assumption on densities - can be found in Campi and Weyer (2005).

Theorem 1. Assume that the variables $w_{t-r}\,w_t$ and $u_{t-s}\,w_t$ satisfy the technical assumption on densities mentioned above. Then, the sets $\hat{\Theta}_r$ and $\hat{\Theta}_s^u$ constructed above are such that:

$$\Pr\{\theta^0 \in \hat{\Theta}_r\} = 1 - 2q/M, \qquad (5)$$
$$\Pr\{\theta^0 \in \hat{\Theta}_s^u\} = 1 - 2q/M. \qquad (6)$$
The following comments pinpoint some important aspects of this result:

(1) The procedures return regions of guaranteed probability despite the fact that no a-priori knowledge on the noise level is assumed: The noise level enters the procedures through data only. This could be phrased by saying that the procedures let the data speak, without a-priori assuming what they have to tell us.

(2) As expected, the noise level does impact the final result, as the shape and size of the region depend on the noise via the data.

(3) Evaluations (5) and (6) are nonconservative in the sense that $1 - 2q/M$ is the exact probability, not a lower bound of it.
Each one of the sets $\hat{\Theta}_r$ and $\hat{\Theta}_s^u$ is a non-asymptotic confidence set for $\theta^0$. However, each one of these sets is based on one correlation only and will usually be unbounded in some directions of the parameter space, and therefore not particularly useful. A general, practically useful confidence set $\hat{\Theta}$ can be obtained by intersecting a number of the sets $\hat{\Theta}_r$ and $\hat{\Theta}_s^u$, i.e.

$$\hat{\Theta} = \bigcap_{r=1}^{n_\varepsilon} \hat{\Theta}_r \;\cap\; \bigcap_{s=1}^{n_u} \hat{\Theta}_s^u. \qquad (7)$$

An obvious question is how to choose $n_\varepsilon$ and $n_u$ in order to obtain well shaped confidence sets that are bounded and concentrated around the true parameter $\theta^0$. It turns out that the answer depends on the particular model class under consideration and this issue will be further discussed in Section 6.
We conclude this section with a fact which is immediate from Theorem 1.

Theorem 2. Under the assumptions of Theorem 1,

$$\Pr\{\theta^0 \in \hat{\Theta}\} \ge 1 - (n_\varepsilon + n_u)\,2q/M,$$

where $\hat{\Theta}$ is given by (7).
The inequality in the theorem is due to the fact that the events $\{\theta^0 \notin \hat{\Theta}_r\}$, $\{\theta^0 \notin \hat{\Theta}_s^u\}$, $r = 1, \ldots, n_\varepsilon$, $s = 1, \ldots, n_u$, may be overlapping.
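Spelled out, this is just the standard union bound (restated here for completeness, using the exact probabilities of Theorem 1):

$$\Pr\{\theta^0 \notin \hat{\Theta}\} \le \sum_{r=1}^{n_\varepsilon} \Pr\{\theta^0 \notin \hat{\Theta}_r\} + \sum_{s=1}^{n_u} \Pr\{\theta^0 \notin \hat{\Theta}_s^u\} = (n_\varepsilon + n_u)\,\frac{2q}{M}.$$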
Theorem 2 can be used in connection with robust design procedures: If a problem solution is robust with respect to $\hat{\Theta}$ in the sense that a certain property is achieved for any $\theta \in \hat{\Theta}$, then such a property is also guaranteed for the true system with the selected probability $1 - (n_\varepsilon + n_u)\,2q/M$.
5. EXAMPLES
Two examples illustrate the developed methodology. The first one is simple and permits an easy illustration of the method. The second is more challenging.
5.1 First order ARMA system
Consider the ARMA system

$$y_t + a^0 y_{t-1} = w_t + c^0 w_{t-1}, \qquad (8)$$

where $a^0 = 0.5$, $c^0 = 0.2$ and $w_t$ is an independent sequence of zero mean Gaussian random variables with variance 1. 1025 data points were generated according to (8). As a model class we used $y_t + a y_{t-1} = w_t + c w_{t-1}$, $|a| < 1$, $|c| < 1$, with associated predictor and prediction error given by

$$\hat{y}_t(a, c) = -c\,\hat{y}_{t-1}(a, c) + (c - a)\,y_{t-1},$$
$$\varepsilon_t(a, c) = y_t - \hat{y}_t(a, c) = y_t + a\,y_{t-1} - c\,\varepsilon_{t-1}(a, c).$$
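As an aside, this prediction-error recursion is straightforward to compute; a minimal sketch (illustrative code, not from the paper, assuming zero initial conditions for the first samples) is:

```python
import numpy as np

def arma_prediction_errors(y, a, c):
    """Prediction errors eps_t(a, c) = y_t + a*y_{t-1} - c*eps_{t-1}(a, c)."""
    eps = np.zeros(len(y))
    for t in range(1, len(y)):
        eps[t] = y[t] + a * y[t - 1] - c * eps[t - 1]
    return eps
```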
In order to form a confidence region for $\theta^0 = (a^0, c^0)$ we calculated

$$f_{t-1,1}(a, c) = \varepsilon_{t-1}(a, c)\,\varepsilon_t(a, c), \qquad t = 2, \ldots, 1024,$$
$$f_{t-2,2}(a, c) = \varepsilon_{t-2}(a, c)\,\varepsilon_t(a, c), \qquad t = 3, \ldots, 1025,$$

and then computed

$$g_{i,1}(a, c) = \sum_{k \in I_i} f_{k,1}(a, c), \qquad i = 1, \ldots, 1024,$$
$$g_{i,2}(a, c) = \sum_{k \in I_i} f_{k,2}(a, c), \qquad i = 1, \ldots, 1024,$$

using the group in Appendix B. Next we discarded those values of $a$ and $c$ for which zero was among the 12 largest and smallest values of $g_{i,1}(a, c)$ and $g_{i,2}(a, c)$. Then, according to Theorem 2, $(a^0, c^0)$ belongs to the constructed region with probability at least $1 - 2 \cdot 2 \cdot 12/1024 = 0.9531$. The obtained confidence region is the blank area in Figure 8.
The area marked with x is where 0 is among the 12 smallest values of $g_{i,1}$, the area marked with + is where 0 is among the 12 largest values of $g_{i,1}$. Likewise for $g_{i,2}$, with the squares representing when 0 belongs to the 12 largest elements and the circles the 12 smallest. The true value $(a^0, c^0)$ is marked with a star. As we can see, each step in the construction of the confidence region excludes a particular region.
Using the algorithm for the construction of $\hat{\Theta}$ we have obtained a bounded confidence set with a guaranteed probability based on a finite number of data points. As no asymptotic theory is involved, this is a rigorous finite sample result. For comparison, we have in Figure 8 also plotted the 95% confidence ellipsoid obtained using the asymptotic theory (Ljung (1999), Chapter 9). The two confidence regions are of similar shape and size, confirming that the non-asymptotic confidence sets are practically useful, and - unlike the asymptotic confidence ellipsoids - they do have guaranteed probability for a finite sample size.
5.2 A closed-loop system
The following example was originally introduced in Garatti et al. (2004) to demonstrate that the asymptotic theory of PEM can at times deliver misleading results even with a large amount of data points. It is reconsidered here to show how LSCR works in this challenging situation.

Fig. 8. Non-asymptotic confidence region for $(a^0, c^0)$ (blank region) and asymptotic confidence ellipsoid; the true parameter and the parameter estimated using a prediction error method are marked.
Consider the system of Figure 9, where

Fig. 9. The closed-loop system.

$$F(\theta^0) = \frac{b^0 z^{-1}}{1 + a^0 z^{-1}}, \qquad a^0 = 0.7,\ b^0 = 0.3,$$
$$H(\theta^0) = 1 + h^0 z^{-1}, \qquad h^0 = 0.5,$$

$w_t$ is white Gaussian noise with variance 1 and the reference $u_t$ is also white Gaussian, with variance $10^{-6}$. Note that the variance of the reference signal is very small as compared to the noise variance, that is, there is poor excitation. It is perhaps interesting to note that the present situation - though admittedly artificial - is a simplification of what often happens in practical identification, where poor excitation is due to the closed-loop operation of the system. 2050 measurements of $u$ and $y$ were generated to be used in identification.
We first describe what we obtained using PEM identification.

A full order model was identified. The amplitude Bode diagrams of the transfer function from $u$ to $y$ of the identified model and of the real system are plotted in Figure 10. From the plot, a big mismatch between the real plant and the identified model is apparent, a fact that does not come as too much of a surprise considering that the reference signal is poorly exciting. An analysis conducted in Garatti et al. (2004) shows that, when $u_t = 0$, the asymptotic PEM identification cost has two isolated global minimizers, one is $\theta^0$ and a second one is a spurious parameter, say $\bar{\theta}$; when $u_t \ne 0$ but small, as is the case in our actual experiment, the PEM estimate can end up close to $\bar{\theta}$, generating a totally wrong identified model.
Fig. 10. The identified $u$ to $y$ transfer function.
But let us now see what we obtained as a 90% confidence region with the asymptotic theory. Figure 11 displays the confidence region in the frequency domain: Surprisingly, it concentrates around the identified model, so that in a real identification procedure where the true transfer function is not known we would conclude that the estimated model is reliable, a totally misleading result. We will come back to this point later and discuss a bit further the theoretical reason for such a bad behavior.
Return now to the LSCR approach. LSCR was used in a totally blind manner, that is, with no concern at all for the identification set-up characteristics; in particular, we did not pay any attention to the existence of local minima: The method is guaranteed by the theory and it will work in all possible situations covered by the theory.

Fig. 11. 90% confidence region for the identified $u$ to $y$ transfer function obtained with the asymptotic theory.

In the present setting, the prediction error is given by

$$\varepsilon_t(\theta) = \frac{1}{1 + h z^{-1}}\, y_t - \frac{b z^{-1}}{(1 + a z^{-1})(1 + h z^{-1})}\,(u_t - y_t) = \frac{1 + (a + b) z^{-1}}{(1 + a z^{-1})(1 + h z^{-1})}\, y_t - \frac{b z^{-1}}{(1 + a z^{-1})(1 + h z^{-1})}\, u_t.$$
The group was constructed as in Appendix B ($2^l = 2048$), and we computed

$$g_{i,r}(\theta) = \sum_{k \in I_i} \varepsilon_{k-r}(\theta)\,\varepsilon_k(\theta), \qquad r = 1, 2, 3,$$

in the parameter space, making the standard assumptions that $G(\theta)$ and $H(\theta)$ (i.e. the $u$ to $y$ and $w$ to $y$ closed-loop transfer functions) were stable ($|a + b| < 1$) and that $H(\theta)$ has a stable inverse ($|a| < 1$, $|h| < 1$). We excluded the regions in the parameter space where 0 was among the 34 smallest or largest values of any of the three correlations above to obtain a $1 - 3 \cdot 2 \cdot 34/2048 = 0.9004$ confidence set. The confidence set is shown in Figure 12. The set consists of two separate regions, one around the true parameter $\theta^0$ and one around the spurious parameter $\bar{\theta}$, while the asymptotic confidence ellipsoid all concentrates around this spurious $\bar{\theta}$ (- -).

Fig. 14. Close-up of the non-asymptotic confidence region around $\theta^0$.

The asymptotic theory thus delivers a confidence region unable to capture the real model uncertainty. The reader is referred to Garatti et al. (2004) for more details.
6. LSCR PROPERTIES
As we have seen in Section 4, Theorems 1 and 2 quantify the probability that $\theta^0$ belongs to the constructed regions. However, these theorems deal only with one side of the story. In fact, a good evaluation method must have two properties:

- the provided region must have guaranteed probability (and this is what Theorems 1 and 2 deliver);
- the region must be bounded, and, in particular, it should concentrate around $\theta^0$ as the number of data points increases.

We next discuss how this second property can be achieved by choosing $n_\varepsilon$ and $n_u$ in (7). It turns out that the choice depends on the model class, and we here consider ARMA and ARMAX models, while general linear model classes are dealt with in Campi and Weyer (2005).
6.1 ARMA models
Data generating system and model class

The data generating system is given by

$$y_t = \frac{C(\theta^0)}{A(\theta^0)}\, w_t,$$

where

$$A(\theta^0) = 1 + a_1^0 z^{-1} + \cdots + a_n^0 z^{-n},$$
$$C(\theta^0) = 1 + c_1^0 z^{-1} + \cdots + c_p^0 z^{-p},$$

and $\theta^0 = [a_1^0 \cdots a_n^0\ c_1^0 \cdots c_p^0]^T$. In addition to the assumptions in Section 4.1 and in Theorem 1, we assume that $A(\theta^0)$ and $C(\theta^0)$ have no common factors and that $w_t$ is wide-sense stationary with spectral density $\Phi_w(\omega) = \sigma_w^2 > 0$.
The model class is

$$y_t = \frac{C(\theta)}{A(\theta)}\, w_t,$$

where

$$A(\theta) = 1 + a_1 z^{-1} + \cdots + a_n z^{-n},$$
$$C(\theta) = 1 + c_1 z^{-1} + \cdots + c_p z^{-p},$$

$\theta = [a_1 \cdots a_n\ c_1 \cdots c_p]^T$, and the assumptions in Section 4.1 are in place.
Confidence regions for ARMA models

We next give a result taken from Campi and Weyer (2005) which shows how a confidence region which concentrates around the true parameter as the number of data points increases can be obtained for ARMA systems.

Theorem 3. Let $\varepsilon_t(\theta) = \frac{A(\theta)}{C(\theta)}\, y_t$ be the prediction error associated with the ARMA model class. Then, $\theta = \theta^0$ is the unique solution to the set of equations:

$$E[\varepsilon_{t-r}(\theta)\,\varepsilon_t(\theta)] = 0, \qquad r = 1, \ldots, n + p.$$

Theorem 3 shows that if we simultaneously impose $n + p$ correlation conditions, where $n$ and $p$ are the orders of the $A(\theta)$ and $C(\theta)$ polynomials, then the only solution is the true $\theta^0$. Guided by this idea, we consider $n + p$ sample correlation conditions, and let $n_\varepsilon = n + p$ in (7):

$$\hat{\Theta} = \bigcap_{r=1}^{n+p} \hat{\Theta}_r.$$

Theorem 2 guarantees that this set contains $\theta^0$ with probability at least $1 - 2(n + p)q/M$, and Theorem 3 entails that the confidence set concentrates around $\theta^0$.
6.2 ARMAX models
Data generating system and model class

Consider now the system

$$y_t = \frac{B(\theta^0)}{A(\theta^0)}\, u_t + \frac{C(\theta^0)}{A(\theta^0)}\, w_t,$$

where

$$A(\theta^0) = 1 + a_1^0 z^{-1} + \cdots + a_n^0 z^{-n},$$
$$B(\theta^0) = b_1^0 z^{-1} + \cdots + b_m^0 z^{-m},$$
$$C(\theta^0) = 1 + c_1^0 z^{-1} + \cdots + c_p^0 z^{-p},$$

and $\theta^0 = [a_1^0 \cdots a_n^0\ b_1^0 \cdots b_m^0\ c_1^0 \cdots c_p^0]^T$. In addition to the assumptions in Section 4.1 and in Theorem 1, we assume that $A(\theta^0)$ and $B(\theta^0)$ have no common factors and - similarly to the ARMA case - we assume a stationary environment. Precisely, $w_t$ is wide-sense stationary with spectral density $\Phi_w(\omega) = \sigma_w^2 > 0$ and $u_t$ is wide-sense stationary too and independent of $w_t$.
The model class is

$$y_t = \frac{B(\theta)}{A(\theta)}\, u_t + \frac{C(\theta)}{A(\theta)}\, w_t,$$

where $A(\theta)$, $B(\theta)$, $C(\theta)$ have the same structure as for the true system.
Confidence regions for ARMAX models

The next theorem, taken from Campi and Weyer (2005), shows that we can choose correlation equations such that the solution is unique and equal to $\theta^0$, provided that the input signal $u_t$ is white.

Theorem 4. Let $\varepsilon_t(\theta) = \frac{A(\theta)}{C(\theta)}\, y_t - \frac{B(\theta)}{C(\theta)}\, u_t$ be the prediction error associated with the ARMAX model class. If $u_t$ is white with spectral density $\Phi_u(\omega) = \sigma_u^2 > 0$, then $\theta = \theta^0$ is the unique solution to the set of equations:

$$E[u_{t-s}\,\varepsilon_t(\theta)] = 0, \qquad s = 1, \ldots, n + m,$$
$$E[\varepsilon_{t-r}(\theta)\,\varepsilon_t(\theta)] = 0, \qquad r = 1, \ldots, p.$$

Guided by this result, we choose $n_\varepsilon = p$ and $n_u = n + m$ in (7) to arrive at the following confidence region for ARMAX models:

$$\hat{\Theta} = \bigcap_{r=1}^{p} \hat{\Theta}_r \;\cap\; \bigcap_{s=1}^{n+m} \hat{\Theta}_s^u.$$
Interestingly enough, the conclusion of Theorem 4 does not hold true for colored input sequences, see Campi and Garatti (2003). On the other hand, assuming that $u_t$ is white is often unrealistic. This impasse can be circumvented by resorting to suitable prefiltering actions, as indicated in Campi and Weyer (2005).
6.3 Properties of LSCR
To summarize, LSCR has the following properties:

- for suitable selections of the correlations, the region shrinks around $\theta^0$;
- for any sample size, $\theta^0$ belongs to the constructed region with given probability, despite the fact that no assumption on the level of noise is made.
7. COMPLEMENTS
In this Part I, the only restrictive assumption on the noise was that it had a symmetric distribution around zero. This assumption can be relaxed as briefly discussed here.

(i) Suppose that $w_t$ in Section 4 has median 0, i.e. $\Pr\{w_t \ge 0\} = \Pr\{w_t < 0\} = 0.5$ (note that this is a relaxation of the symmetric distribution condition). Then, the theory goes through by considering everywhere $\mathrm{sign}(\varepsilon_t(\theta))$ instead of $\varepsilon_t(\theta)$, where sign is the signum function: $\mathrm{sign}(x) = 1$ if $x > 0$, $\mathrm{sign}(x) = -1$ if $x < 0$ and $\mathrm{sign}(x) = 0$ if $x = 0$.

(ii) When $w_t$ is independent and identically but not symmetrically distributed, we can obtain symmetrically distributed data by considering the difference between two subsequent data points, that is $(y_t - y_{t-1}) = G(\theta)(u_t - u_{t-1}) + H(\theta)(w_t - w_{t-1})$; here, $w_t - w_{t-1}$, $t = 2, 4, 6, \ldots$, are independent and symmetrically distributed around 0 and we can refer to this difference system to construct confidence regions.
PART II: Presence of unmodeled dynamics
In this second part, we discuss the possibility to deal with unmodeled dynamics within the LSCR framework: Despite the fact that the true system is not within the model class, we would like to derive guaranteed results for some parts of the system.

We start by describing the problem of identifying a full-order $u$ to $y$ transfer function, without deriving a model for the noise. Then, we turn to also consider unmodeled dynamics in the $u$ to $y$ transfer function. General ideas are only discussed by means of simple examples.
8. IDENTIFICATION WITHOUT NOISE
DESCRIPTION: AN EXAMPLE
Consider the system

$$y_t = \theta^0 u_t + n_t.$$

Suppose that the structure of the $u$ to $y$ transfer function is known. Instead, the noise $n_t$ describes all other sources of variation in $y_t$ apart from $u_t$ and we do not want to make any assumption on how $n_t$ is generated. Correspondingly, we want our results regarding the value of $\theta^0$ to be valid for any (unknown) deterministic noise sequence $n_t$ with no constraints whatsoever. When the noise is stochastic, the result will then hold for any realization of the noise, that is, surely.
In this section, we assume that we have access to the system for experiment: We are allowed to generate a finite number, say 7, of input data and - based on the collected outputs - we are asked to construct a confidence interval $\hat{\Theta}$ for $\theta^0$ of guaranteed probability.

The problem looks very challenging indeed: Since the noise can be whatever, it seems that the observed data are unable to give us a hand in constructing a confidence region. In fact, for any given $\theta^0$ and $u_t$, a suitable choice of the noise sequence can lead to any observed output signal! Let us see how this problem can be circumvented.
Before proceeding, we feel it advisable to make clear what is meant here by guaranteed probability. We said that $n_t$ is regarded as a deterministic sequence, and the result is required to hold true for any $n_t$, that is, uniformly in $n_t$. The stochastic element is instead the input sequence: We will select $u_t$ according to a random generation mechanism and we require that $\theta^0 \in \hat{\Theta}$ with a given probability value, where the probability is with respect to the random choice of $u_t$.

We first indicate the input design and then the procedure for construction of the confidence interval $\hat{\Theta}$.
Input design

Let $u_t$, $t = 1, \ldots, 7$, be independent and identically distributed with distribution

$$u_t = \begin{cases} +1, & \text{with probability } 0.5 \\ -1, & \text{with probability } 0.5. \end{cases}$$
Procedure for construction of the confidence interval $\hat{\Theta}$

(1) Compute the prediction errors $\varepsilon_t(\theta) = y_t - \theta u_t$, $t = 1, \ldots, 7$;
(2) compute $f_t(\theta) = u_t\,\varepsilon_t(\theta)$, $t = 1, \ldots, 7$.

The rest of the construction is the same as for the preliminary example of Section 3: We consider the same group of subsets as given in the bullet table in that example and construct the $g_i(\theta)$ functions as in (4). Then, we extract the interval where at least two functions are below zero and at least two are above zero. The reader can verify that the theoretical analysis for the example in Section 3 goes through here to conclude that the obtained interval has probability 0.5 to contain the true $\theta^0$.
Interestingly, the properties of $w_t$ of being independent and symmetrically distributed have been replaced here by analogous properties of the input signal; the advantage is that - if the experiment can be designed - these properties can be easily enforced and no restrictive conditions on the noise are required anymore.
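A minimal Python sketch of this experiment-design construction (illustrative code with our own variable names; the disturbance below mirrors the biased noise of the simulation example that follows, but any fixed sequence could be used):

```python
import numpy as np

theta0 = 1.0
rng = np.random.default_rng(2)
n = 0.5 + np.sqrt(0.1) * rng.standard_normal(7)   # arbitrary (here biased) disturbance
u = rng.choice([-1.0, 1.0], size=7)               # randomized +/-1 input design
y = theta0 * u + n

I = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
     {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]    # same group as in Section 3

def g_functions(theta):
    f = u * (y - theta * u)                       # f_t(theta) = u_t * eps_t(theta)
    return np.array([f[[k - 1 for k in Ii]].mean() for Ii in I])

grid = np.linspace(-2.0, 4.0, 3001)
region = [th for th in grid
          if (g_functions(th) > 0).sum() >= 2 and (g_functions(th) < 0).sum() >= 2]
if region:
    print(f"[{region[0]:.2f}, {region[-1]:.2f}] contains theta0 with probability 0.5")
```

The probability statement refers to the random choice of the inputs only; the disturbance is held fixed.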
A simulation example was run where $\theta^0 = 1$ and the noise was the sequence shown in Figure 15. This noise sequence was obtained as a realization of a biased independent Gaussian process with mean 0.5 and variance 0.1. The obtained $g_i(\theta)$ functions and the corresponding confidence region are given in Figure 16.

Fig. 15. Noise sequence.

Fig. 16. The $g_i(\theta)$ functions (empirical correlations).
9. UNMODELED DYNAMICS IN THE U TO
Y TRANSFER FUNCTION: AN EXAMPLE
Suppose that a system has structure

$$y_t = \theta_0^0 u_t + \theta_1^0 u_{t-1} + n_t,$$

while - for estimation purposes - we use the reduced order model

$$y_t = \theta u_t + n_t.$$

The noise has been indicated with a generic $n_t$ to signify that it can be whatever, and not just a white signal. In fact, a perspective similar to the previous section is taken and we regard $n_t$ as a generic unknown deterministic signal.
After determining a region for the parameter $\theta$, one sensible question to ask is: Does this region contain with a given probability the system parameter $\theta_0^0$ linking $u_t$ to $y_t$?

Reinterpreting the above question, we are asking whether the projection of the true transfer function $\theta_0^0 + \theta_1^0 z^{-1}$ onto the 1-dimensional space spanned by constant transfer functions is contained in the estimated set with a certain probability.
We generate an input signal $u_t$ in the same way as in the previous section, this time over the time interval $t = 0, \ldots, 7$, and inject it into the system. Then, the predictor and prediction error are constructed the same way as in the previous section, while we add a sign function to $f_t(\theta)$:

$$f_t(\theta) = \mathrm{sign}(u_t\,\varepsilon_t(\theta)), \qquad t = 1, \ldots, 7. \qquad (9)$$

Corresponding to the true parameter value, i.e. $\theta = \theta_0^0$, an easy inspection reveals that $\mathrm{sign}(u_t\,\varepsilon_t(\theta_0^0)) = \mathrm{sign}(u_t\,(\theta_1^0 u_{t-1} + n_t))$ is an independent and symmetrically distributed process (it is in fact a Bernoullian process taking on values $\pm 1$ with probability 0.5 each). Thus, with $f_t(\theta)$ as in (9), the situation is similar to what we had in the previous section and again the theory goes through to prove that an interval for $\theta$ constructed as in the previous section has probability 0.5 to contain $\theta_0^0$, despite the presence of unmodeled dynamics.
A simulation example was run with $\theta_0^0 = 1$, $\theta_1^0 = 0.5$ and where the noise was again the realization of the biased Gaussian process given in Figure 15. As $\mathrm{sign}(u_t\,\varepsilon_t(\theta))$ can only take on the values $1$, $-1$ and $0$, it is possible that two or more of the $g_i(\theta)$ functions will take on the same value on an interval (in technical terms, the assumption on the existence of densities in Theorem 1 does not hold). This tie can be broken by introducing a random ordering (e.g. by adding a random constant number between $-0.1$ and $0.1$ to the $g_i(\theta)$ functions) and one can see that the theory remains valid. The obtained $g_i(\theta)$ functions and confidence region are in Figure 17.

Fig. 17. The $g_i(\theta)$ functions.
Though presented via simple examples, the approaches illustrated in Sections 8 and 9 to deal with unmodeled dynamics bear a general breadth of applicability.
PART III: Nonlinear systems
Interestingly, the identification set-up developed in the previous sections in a linear setting generalizes naturally to nonlinear systems. Such an extension is presented in this Part III with reference to a simple example.
10. IDENTIFICATION OF NONLINEAR
SYSTEMS: AN EXAMPLE
Consider the following nonlinear system

$$y_t = \theta^0 (y_{t-1}^2 - 1) + w_t, \qquad (10)$$

where $w_t$ is an independent and symmetrically distributed sequence and $\theta^0$ is an unknown parameter.

This system can be made explicit with respect to $w_t$ as follows:

$$w_t = y_t - \theta^0 (y_{t-1}^2 - 1),$$

and - by substituting $\theta^0$ with a generic $\theta$ and renaming the so-obtained right-hand side as $w_t(\theta)$ - we have

$$w_t(\theta) = y_t - \theta (y_{t-1}^2 - 1).$$
Note that $w_t(\theta)$ coincides in this example with the prediction error for the model with parameter $\theta$. It is not true, however, that the above construction generates the prediction error for any nonlinear system. For example, if we make explicit the system $y_t = \theta^0 y_{t-1} w_t$ with respect to $w_t$ we get $w_t = y_t/(\theta^0 y_{t-1})$, and further substituting a generic $\theta$ we have $w_t(\theta) = y_t/(\theta y_{t-1})$. This is not the prediction error since the predictor is in this case $\hat{y}_t(\theta) = 0$, so that the prediction error is here given by $y_t - \hat{y}_t = y_t$. Note also that for linear systems $w_t(\theta)$ is always equal to the prediction error $\varepsilon_t(\theta)$, so that the construction suggested here for the generation of $w_t(\theta)$ generalizes the construction of $\varepsilon_t(\theta)$ for linear systems.
Now, an inspection of the proof in Appendix A reveals that the only property that has a role in determining the result that the confidence region contains $\theta^0$ with a given probability is that $\varepsilon_t(\theta^0) = w_t$. Since this same property holds here for $w_t(\theta)$, i.e. $w_t(\theta^0) = w_t$, we can argue that proceeding in the same way as for the preliminary example of Section 3, where $\varepsilon_t(\theta)$ is replaced by $w_t(\theta)$, still generates in the present context a guaranteed confidence region.

Is it all so simple? It is indeed, as far as the guarantee that $\theta^0$ is in the region is concerned. The dark side of the medal is that second order statistics are in general not enough to spot the real parameter value for nonlinear systems, so that results like those in Section 6 do not apply to conclude that the region shrinks around $\theta^0$.
In actual fact, if e.g. $\theta^0 = 0$ and $E[w_t^2] = 1$, some easy computation reveals that $E[w_{t-r}(\theta)\,w_t(\theta)] = 0$ for any $\theta$ and for any $r > 0$, so that the second-order statistics are useless.
Now, the good news is that LSCR can be upgraded to higher-order statistics with little effort. A general presentation of the related results can be found in Dalai et al. (2005). Here, it suffices to say that we can e.g. consider the third-order statistic $E[w_t(\theta)^2\, w_{t+1}(\theta)]$ and the theory goes through unaltered.
As an example, we generated 9 samples of $y_t$, $t = 0, \ldots, 8$, for system (10) where $w_t$ is zero-mean Gaussian with variance 1. Then, we constructed

$$g_i(\theta) = \frac{1}{4}\sum_{k \in I_i} w_k(\theta)^2\, w_{k+1}(\theta), \qquad i = 1, \ldots, 8,$$

where the sets $I_i$ are as in Section 3. These functions are displayed in Figure 18. The interval marked in blue where at least two functions are below zero and at least two are above zero has probability 0.5 to contain $\theta^0 = 0$.

Fig. 18. The $g_i(\theta)$ functions (empirical 3rd-order correlations).
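A minimal Python sketch of this third-order construction (illustrative code, not from the paper; the group of Section 3 and $\theta^0 = 0$ are used as in the example):

```python
import numpy as np

theta0 = 0.0
rng = np.random.default_rng(1)
w = rng.standard_normal(9)
y = np.zeros(9)
for t in range(1, 9):                             # y_t = theta0*(y_{t-1}^2 - 1) + w_t
    y[t] = theta0 * (y[t - 1] ** 2 - 1) + w[t]

I = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
     {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]

def g_functions(theta):
    wt = y[1:] - theta * (y[:-1] ** 2 - 1)        # w_t(theta), t = 1,...,8
    f = wt[:-1] ** 2 * wt[1:]                     # w_k(theta)^2 * w_{k+1}(theta)
    return np.array([f[[k - 1 for k in Ii]].mean() for Ii in I])

grid = np.linspace(-1.0, 1.0, 2001)
region = [th for th in grid
          if (g_functions(th) > 0).sum() >= 2 and (g_functions(th) < 0).sum() >= 2]
if region:
    print(f"[{region[0]:.2f}, {region[-1]:.2f}] contains theta0 with probability 0.5")
```

Only the definition of the statistic changes with respect to the linear case; the group and the sign-counting rule are exactly those of Section 3.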
11. CONCLUSIONS
In this paper we have provided an overview of the LSCR method for system identification. Most of the existing theory for system identification is asymptotic in the number of data points, while in practice one will only have a finite number of samples available. Although the asymptotic theory often delivers sensible results when applied heuristically to a finite number of data points, the results are not guaranteed. The LSCR method delivers guaranteed finite sample results, and it produces a set of models to which the true system belongs with a given probability for any finite number of data points.

As illustrated by the simulation examples, the method is not only grounded on a solid theoretical basis, but it also delivers practically useful confidence sets.

The LSCR method takes a global approach and can, when the situation requires, produce a confidence set which consists of disjoint regions, and hence it has advantages over confidence ellipsoids based on the asymptotic theory.

By allowing the user to choose the input signal, the LSCR method can be extended to deal with unmodeled dynamics, and it can also be extended to non-linear systems by considering higher-order statistics.
REFERENCES
Bartlett, P.L. (2003). Prediction algorithms: complexity, concentration and convexity. Proc. of the 13th IFAC Symposium on System Identification, Rotterdam, The Netherlands, pp. 1507-1517.
Bittanti, S. and M. Lovera (2000). Bootstrap-based estimates of uncertainty in subspace identification methods. Automatica, Vol. 36, pp. 1605-1615.
Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes - Estimation and Prediction. Lecture Notes in Statistics 110. Springer Verlag.
Campi, M.C. and S. Garatti (2003). Correlation approach for ARMAX model identification: A counterexample to the uniqueness of the asymptotic solution. Internal report of the University of Brescia.
Campi, M.C. and P.R. Kumar (1998). Learning Dynamical Systems in a Stationary Environment. Systems and Control Letters, Vol. 34, pp. 125-132.
Campi, M.C. and E. Weyer (2002). Finite sample properties of system identification methods. IEEE Trans. on Automatic Control, Vol. 47, pp. 1329-1334.
Campi, M.C. and E. Weyer (2005). Guaranteed non-asymptotic confidence regions in system identification. Automatica, Vol. 41, pp. 1751-1764.
Campi, M.C., S.K. Ooi and E. Weyer (2004). Non-asymptotic quality assessment of generalised FIR models with periodic inputs. Automatica, Vol. 40, pp. 2029-2041.
Cherkassky, V. and F. Mulier (1998). Learning from data. John Wiley.
Dahleh, M.A., T.V. Theodosopoulos and J.N. Tsitsiklis (1993). The sample complexity of worst-case identification of FIR linear systems. Systems and Control Letters, Vol. 20, pp. 157-166.
Dahleh, M.A., E.D. Sontag, D.N.C. Tse and J.N. Tsitsiklis (1995). Worst-case identification of nonlinear fading memory systems. Automatica, Vol. 31, pp. 503-508.
Dalai, M., E. Weyer and M.C. Campi (2005). Parametric identification of nonlinear systems: Guaranteed confidence regions. In Proc. of the 44th IEEE CDC, Seville, Spain, pp. 6418-6423.
Ding, F. and T. Chen (2005). Performance bounds on forgetting factor least-squares algorithms for time-varying systems with finite measurement data. IEEE Trans. on Circuits and Systems-I, Vol. 52, pp. 555-566.
Dunstan, W.J. and R.R. Bitmead (2003). Empirical estimation of parameter distributions in system identification. In Proc. of the 13th IFAC Symposium on System Identification, Rotterdam, The Netherlands.
Efron, B. and R.J. Tibshirani (1993). An introduction to the bootstrap. Chapman and Hall.
Garatti, S., M.C. Campi and S. Bittanti (2004). Assessing the quality of identified models through the asymptotic theory - When is the result reliable? Automatica, Vol. 40, pp. 1319-1332.
Garatti, S., M.C. Campi and S. Bittanti (2006). The asymptotic model quality assessment for instrumental variable identification revisited. Systems and Control Letters, to appear.
Garulli, A., L. Giarre and G. Zappa (2002). Identification of approximated Hammerstein models in a worst-case setting. IEEE Trans. on Automatic Control, Vol. 47, pp. 2046-2050.
Garulli, A., A. Vicino and G. Zappa (2000). Conditional central algorithms for worst-case set membership identification and filtering. IEEE Trans. on Automatic Control, Vol. 45, pp. 14-23.
Giarre, L., B.Z. Kacewicz and M. Milanese (1997a). Model quality evaluation in set membership identification. Automatica, Vol. 33, pp. 1133-1139.
Giarre, L., M. Milanese and M. Taragna (1997b). H-infinity identification. Systems and Control Letters, Vol. 27, pp. 255-260.
Hartigan, J.A. (1969). Using subsample values as typical values. Journal of American Statistical Association, Vol. 64, pp. 1303-1317.
Hartigan, J.A. (1970). Exact confidence intervals in regression problems with independent symmetric errors. Annals of Mathematical Statistics, Vol. 41, pp. 1992-1998.
Heath, W.P. (2001). Bias of indirect non-parametric transfer function estimates for plants in closed loop. Automatica, Vol. 37, pp. 1529-1540.
Hjalmarsson, H. and B. Ninness (2004). An exact finite sample variance expression for a class of frequency function estimates. In Proc. of the 43rd IEEE CDC, Bahamas.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., Vol. 58, pp. 13-30.
Karandikar, R. and M. Vidyasagar (2002). Rates of uniform convergence of empirical means with mixing processes. Statistics and Probability Letters, Vol. 58, pp. 297-307.
Meir, R. (2000). Nonparametric Time Series Prediction Through Adaptive Model Selection. Machine Learning, Vol. 39, pp. 5-34.
Ljung, L. (1999). System Identification - Theory for the User. 2nd Ed., Prentice Hall.
Milanese, M. and M. Taragna (2005). H-infinity set membership identification. A survey. Automatica, Vol. 41, pp. 2019-2032.
Milanese, M. and A. Vicino (1991). Optimal estimation theory for dynamic systems with set membership uncertainty: an overview. Automatica, Vol. 27, pp. 997-1009.
Modha, D.S. and E. Masry (1996). Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Information Theory, Vol. 42, pp. 2133-2145.
Modha, D.S. and E. Masry (1998). Memory-Universal Prediction of Stationary Random Processes. IEEE Trans. Information Theory, Vol. 44, pp. 117-133.
Ninness, B. and H. Hjalmarsson (2004). Variance error quantifications that are exact for finite model order. IEEE Trans. on Automatic Control, Vol. 49, pp. 1275-1291.
Politis, D.N. (1998). Computer-intensive methods in statistical analysis. IEEE Signal Processing Magazine, Vol. 15, pp. 39-55.
Politis, D.N., J.P. Romano and M. Wolf (1999). Subsampling. Springer.
Poolla, K. and A. Tikku (1994). On the Time Complexity of Worst-Case System Identification. IEEE Trans. on Automatic Control, Vol. 39, pp. 944-950.
Richmond, S. (1964). Statistical Analysis. 2nd Ed., The Ronald Press Company, N.Y.
Shao, J. and D. Tu (1995). The Jackknife and Bootstrap. Springer.
Smith, R. and G.E. Dullerud (1996). Continuous-time control model validation using finite experimental data. IEEE Trans. on Automatic Control, Vol. 41, pp. 1094-1105.
Söderström, T. and P. Stoica (1988). System Identification. Prentice Hall.
Spall, J.C. (1995). Uncertainty Bounds for Parameter Identification with Small Sample Sizes. Proc. of the 34th IEEE CDC, New Orleans, Louisiana, USA, pp. 3504-3515.
Tjärnström, F. and L. Ljung (2002). Using the Bootstrap to estimate the variance in the case of undermodeling. IEEE Trans. on Automatic Control, Vol. 47, pp. 395-398.
Vapnik, V. (1998). The nature of statistical learning theory. Springer.
Vapnik, V.N. and A.Ya. Chervonenkis (1968). Uniform convergence of the frequencies of occurrence of events to their probabilities. Soviet Math. Doklady, Vol. 9, pp. 915-968.
Vapnik, V.N. and A.Ya. Chervonenkis (1971). On the uniform convergence of relative frequencies to their probabilities. Theory of Prob. and its Appl., Vol. 16, pp. 264-280.
Venkatesh, S.R. and M.A. Dahleh (2001). On system identification of complex systems from finite data. IEEE Trans. on Automatic Control, Vol. 46, pp. 235-257.
Vicino, A. and G. Zappa (1996). Sequential approximation of feasible parameter sets for identification with set membership uncertainty. IEEE Trans. on Automatic Control, Vol. 41, pp. 774-785.
Vidyasagar, M. (2002). A Theory of Learning and Generalization. 2nd Ed., Springer Verlag.
Vidyasagar, M. and R. Karandikar (2002). A learning theory approach to system identification and stochastic adaptive control. In IFAC Symp. on Adaptation and Learning.
Vidyasagar, M. and R. Karandikar (2006). A learning theory approach to system identification and stochastic adaptive control. In G. Calafiore and F. Dabbene (Eds.) Probabilistic and randomized methods for design under uncertainty. Springer.
Welsh, J.S. and G.C. Goodwin (2002). Finite sample properties of indirect nonparametric closed-loop identification. IEEE Trans. on Automatic Control, Vol. 47, pp. 1277-1292.
Weyer, E. (2000). Finite sample properties of system identification of ARX models under mixing conditions. Automatica, Vol. 36, pp. 1291-1299.
Weyer, E. and M.C. Campi (2002). Non-asymptotic confidence ellipsoids for the least squares estimate. Automatica, Vol. 38, pp. 1539-1547.
Weyer, E., R.C. Williamson and I.M.Y. Mareels (1999). Finite sample properties of linear model identification. IEEE Trans. on Automatic Control, Vol. 44, pp. 1370-1383.
Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, Vol. 22, pp. 94-116.
Appendix A. PROOF OF THE RESULT
In the confidence region construction, we eliminate the regions in parameter space where all functions $g_i(\theta)$, $i = 1, \ldots, 7$, are above zero, or at most one of them is below zero, and where all functions are below zero, or at most one of them is above zero. Therefore, the true parameter value $\theta^0$ falls outside the confidence region when - corresponding to $\theta = \theta^0$ - all functions $g_i(\theta)$, $i = 1, \ldots, 7$, are bigger than $g_8(\theta) = 0$, or at most one of them is smaller than $g_8(\theta)$, and when all functions are smaller than $g_8(\theta)$, or at most one of them is bigger than $g_8(\theta)$. We claim that each one of these events has probability 1/8 to happen, so that the total probability that $\theta^0$ falls outside the confidence region is $4 \cdot 1/8 = 0.5$, as claimed in the RESULT.
We next concentrate on one condition only and compute the probability that all functions $g_i(\theta^0)$, $i = 1, \ldots, 7$, are bigger than $g_8(\theta^0) = 0$. The probability of the other conditions can be derived similarly.
The considered condition writes

$$\begin{aligned}
w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6 &> 0\\
w_1 w_2 + w_3 w_4 + w_4 w_5 + w_6 w_7 &> 0\\
w_2 w_3 + w_3 w_4 + w_5 w_6 + w_6 w_7 &> 0\\
w_1 w_2 + w_2 w_3 + w_6 w_7 + w_7 w_8 &> 0\\
w_1 w_2 + w_3 w_4 + w_5 w_6 + w_7 w_8 &> 0\\
w_2 w_3 + w_3 w_4 + w_4 w_5 + w_7 w_8 &> 0\\
w_4 w_5 + w_5 w_6 + w_6 w_7 + w_7 w_8 &> 0.
\end{aligned} \qquad (A.1)$$
To compute the probability that all these 7 inequalities are simultaneously true, let us ask the following question: What would we have written if, instead of comparing $g_i(\theta^0)$, $i = 1, \ldots, 7$, with $g_8(\theta^0)$, we had compared $g_i(\theta^0)$, $i = 2, \ldots, 8$, with $g_1(\theta^0)$? The conditions would have been:
$$\begin{aligned}
w_1 w_2 + w_3 w_4 + w_4 w_5 + w_6 w_7 &> w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6\\
w_2 w_3 + w_3 w_4 + w_5 w_6 + w_6 w_7 &> w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6\\
w_1 w_2 + w_2 w_3 + w_6 w_7 + w_7 w_8 &> w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6\\
w_1 w_2 + w_3 w_4 + w_5 w_6 + w_7 w_8 &> w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6\\
w_2 w_3 + w_3 w_4 + w_4 w_5 + w_7 w_8 &> w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6\\
w_4 w_5 + w_5 w_6 + w_6 w_7 + w_7 w_8 &> w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6\\
0 &> w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6,
\end{aligned}$$
or, moving everything to the left-hand side,

$$\begin{aligned}
-w_2 w_3 + w_3 w_4 - w_5 w_6 + w_6 w_7 &> 0\\
-w_1 w_2 + w_3 w_4 - w_4 w_5 + w_6 w_7 &> 0\\
-w_4 w_5 - w_5 w_6 + w_6 w_7 + w_7 w_8 &> 0\\
-w_2 w_3 + w_3 w_4 - w_4 w_5 + w_7 w_8 &> 0\\
-w_1 w_2 + w_3 w_4 - w_5 w_6 + w_7 w_8 &> 0\\
-w_1 w_2 - w_2 w_3 + w_6 w_7 + w_7 w_8 &> 0\\
-w_1 w_2 - w_2 w_3 - w_4 w_5 - w_5 w_6 &> 0.
\end{aligned}$$
If we now let $\tilde{w}_2 = -w_2$, $\tilde{w}_5 = -w_5$, the latter condition re-writes as

$$\begin{aligned}
\tilde{w}_2 w_3 + w_3 w_4 + \tilde{w}_5 w_6 + w_6 w_7 &> 0\\
w_1 \tilde{w}_2 + w_3 w_4 + w_4 \tilde{w}_5 + w_6 w_7 &> 0\\
w_4 \tilde{w}_5 + \tilde{w}_5 w_6 + w_6 w_7 + w_7 w_8 &> 0\\
\tilde{w}_2 w_3 + w_3 w_4 + w_4 \tilde{w}_5 + w_7 w_8 &> 0\\
w_1 \tilde{w}_2 + w_3 w_4 + \tilde{w}_5 w_6 + w_7 w_8 &> 0\\
w_1 \tilde{w}_2 + \tilde{w}_2 w_3 + w_6 w_7 + w_7 w_8 &> 0\\
w_1 \tilde{w}_2 + \tilde{w}_2 w_3 + w_4 \tilde{w}_5 + \tilde{w}_5 w_6 &> 0.
\end{aligned} \qquad (A.2)$$
Except for the $\sim$ showing up here and there, this latter set of inequalities is the same as the original set of inequalities (A.1) (the order in which the inequalities are listed is changed, but the inequalities altogether are the same - this is a consequence of the group property of the sets $I_i$). Moreover, since the $w_t$ variables are symmetrically distributed, the change of sign implied by the $\sim$ is immaterial as far as the probability of satisfaction of the inequalities in (A.2) is concerned, so that we can conclude that (A.1) and (A.2) are satisfied with the same probability.
Now, instead of comparing the $g_i(\theta^0)$'s with $g_1(\theta^0)$, we could have compared with $g_2(\theta^0)$, or with $g_3(\theta^0)$, or ... or with $g_7(\theta^0)$, arriving each time at the similar conclusion that the probability does not change. Since these 8 events ($g_1(\theta^0)$ is the smallest, $g_2(\theta^0)$ is the smallest, etc.) are disjoint and cover all possibilities, and all of them have the same probability, we finally draw the conclusion that each and every event has exactly probability 1/8 to happen. It remains therefore proven that the initial condition (A.1) is satisfied with probability 1/8 and this concludes the proof.
Appendix B. GORDON'S CONSTRUCTION OF THE INCIDENCE MATRIX OF A GROUP

Given $I = \{1, \ldots, N\}$, the incidence matrix for a group $\{I_i\}$ of subsets of $I$ is a matrix whose $(i, j)$ element is 1 if $j \in I_i$ and zero otherwise. In Gordon (1974), the following construction procedure for an incidence matrix $R$ is proposed, where $I = \{1, \ldots, 2^l - 1\}$ and the group has $2^l$ elements.
Let $R(1) = [1]$, and recursively compute ($k = 2, 3, \ldots, l$)

$$R(2^k - 1) =
\begin{bmatrix}
R(2^{k-1}-1) & R(2^{k-1}-1) & 0\\
R(2^{k-1}-1) & J - R(2^{k-1}-1) & e\\
0^T & e^T & 1
\end{bmatrix},$$

where $J$ and $e$ are, respectively, a matrix and a vector of all ones, and $0$ is a vector of all zeros. Then, let

$$R = \begin{bmatrix} R(2^l - 1)\\ 0^T \end{bmatrix}.$$
Gordon (1974) also gives constructions of groups when the number of data points is different from $2^l - 1$.
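A minimal Python sketch of this recursion (illustrative only; it builds the incidence matrix and appends the all-zero row corresponding to the empty set):

```python
import numpy as np

def gordon_incidence_matrix(l):
    """Incidence matrix of a group of 2**l subsets of {1, ..., 2**l - 1}."""
    R = np.array([[1]], dtype=int)                          # R(1)
    for _ in range(2, l + 1):
        n = R.shape[0]                                      # n = 2**(k-1) - 1
        top = np.hstack([R, R, np.zeros((n, 1), dtype=int)])
        mid = np.hstack([R, 1 - R, np.ones((n, 1), dtype=int)])
        bot = np.hstack([np.zeros((1, n), dtype=int),
                         np.ones((1, n), dtype=int),
                         np.ones((1, 1), dtype=int)])
        R = np.vstack([top, mid, bot])                      # R(2**k - 1)
    return np.vstack([R, np.zeros((1, R.shape[1]), dtype=int)])

print(gordon_incidence_matrix(3))   # rows give 8 subsets of {1,...,7}, as in Section 3
```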