Low Rank Approximation: Algorithms, Implementation, Applications
Ivan Markovsky
School of Electronics & Computer Science
University of Southampton
Southampton, UK
[email protected]
Preface

Mathematical models are obtained from first principles (natural laws, interconnection, etc.) and experimental data. Modeling from first principles is common in
natural sciences, while modeling from data is common in engineering. In engineer-
ing, often experimental data are available and a simple approximate model is pre-
ferred to a complicated detailed one. Indeed, although optimal prediction and con-
trol of a complex (high-order, nonlinear, time-varying) system is currently difficult
to achieve, robust analysis and design methods, based on a simple (low-order, lin-
ear, time-invariant) approximate model, may achieve sufficiently high performance.
This book addresses the problem of data approximation by low-complexity models.
A unifying theme of the book is low rank approximation: a prototypical data
modeling problem. The rank of a matrix constructed from the data corresponds to
the complexity of a linear model that fits the data exactly. The data matrix being full
rank implies that there is no exact low complexity linear model for that data. In this
case, the aim is to find an approximate model. One approach for approximate mod-
eling, considered in the book, is to find small (in some specified sense) modification
of the data that renders the modified data exact. The exact model for the modified
data is an optimal (in the specified sense) approximate model for the original data.
The corresponding computational problem is low rank approximation. It allows the
user to trade off accuracy vs. complexity by varying the rank of the approximation.
The distance measure for the data modification is a user choice that specifies the
desired approximation criterion or reflects prior knowledge about the accuracy of the
data. In addition, the user may have prior knowledge about the system that generates
the data. Such knowledge can be incorporated in the modeling problem by imposing
constraints on the model. For example, if the model is known (or postulated) to be
a linear time-invariant dynamical system, the data matrix has Hankel structure and
the approximating matrix should have the same structure. This leads to a Hankel
structured low rank approximation problem.
A tenet of the book is: the estimation accuracy of the basic low rank approx-
imation method can be improved by exploiting prior knowledge, i.e., by adding
constraints that are known to hold for the data generating system. This path of de-
velopment leads to weighted, structured, and other constrained low rank approxi-
mation problems. The theory and algorithms of these new classes of problems are
interesting in their own right and, being application driven, are practically relevant.
Stochastic estimation and deterministic approximation are two complementary
aspects of data modeling. The former aims to find from noisy data, generated by a
low-complexity system, an estimate of that data generating system. The latter aims
to find from exact data, generated by a high complexity system, a low-complexity
approximation of the data generating system. In applications both the stochastic
estimation and deterministic approximation aspects are likely to be present. The
data are likely to be imprecise due to measurement errors and are likely to be gener-
ated by a complicated phenomenon that is not exactly representable by a model in
the considered model class. The development of data modeling methods in system
identification and signal processing, however, has been dominated by the stochas-
tic estimation point of view. If considered, the approximation error is represented
in the mainstream data modeling literature as a random process. This is not natural
because the approximation error is by definition deterministic and even if consid-
ered as a random process, it is not likely to satisfy standard stochastic regularity
conditions such as zero mean, stationarity, ergodicity, and Gaussianity.
An exception to the stochastic paradigm in data modeling is the behavioral ap-
proach, initiated by J.C. Willems in the mid-1980s. Although the behavioral ap-
proach is motivated by the deterministic approximation aspect of data modeling, it
does not exclude the stochastic estimation approach. In this book, we use the behav-
ioral approach as a language for defining different modeling problems and present-
ing their solutions. We emphasize the importance of deterministic approximation in
data modeling, however, we formulate and solve stochastic estimation problems as
low rank approximation problems.
Many well known concepts and problems from systems and control, signal pro-
cessing, and machine learning reduce to low rank approximation. Generic exam-
ples in system theory are model reduction and system identification. The principal
component analysis method in machine learning is equivalent to low rank approx-
imation, which suggests that related dimensionality reduction, classification, and
information retrieval problems can be phrased as low rank approximation problems.
Sylvester structured low rank approximation has applications in computations with
polynomials and is related to methods from computer algebra.
The developed ideas lead to algorithms, which are implemented in software.
The algorithms clarify the ideas and the software implementation clarifies the al-
gorithms. Indeed, the software is the ultimate unambiguous description of how the
ideas are put to work. In addition, the provided software allows the reader to re-
produce the examples in the book and to modify them. The exposition reflects the
sequence
theory → algorithms → implementation.
Correspondingly, the text is interwoven with code that generates the numerical ex-
amples being discussed.
A common feature of the current research activity in all areas of science and engi-
neering is the narrow specialization. In this book, we pick applications in the broad
area of data modeling, posing and solving them as low rank approximation prob-
lems. This unifies seemingly unrelated applications and solution techniques by em-
phasising their common aspects (e.g., complexity–accuracy trade-off) and abstract-
ing from the application specific details, terminology, and implementation details.
Despite the fact that applications in systems and control, signal processing, machine learning, and computer vision are used as examples, the only real prerequisite for following the presentation is knowledge of linear algebra.
The book is intended to be used for self study by researchers in the area of data
modeling and by advanced undergraduate/graduate level students as a complemen-
tary text for a course on system identification or machine learning. In either case,
the expected knowledge is undergraduate level linear algebra. In addition, MATLAB code is used, so that familiarity with the MATLAB programming language is required.
Passive reading of the book gives a broad perspective on the subject. Deeper un-
derstanding, however, requires active involvement, such as supplying missing justi-
fication of statements and specific examples of the general concepts, application and
modification of presented ideas, and solution of the provided exercises and practice
problems. There are two types of practice problem: analytical, asking for a proof
of a statement clarifying or expanding the material in the book, and computational,
asking for experiments with real or simulated data of specific applications. Most of
the problems are easy to medium difficulty. A few problems (marked with stars) can
be used as small research projects.
The code in the book, available from
http://extra.springer.com/
has been tested with MATLAB 7.9, running under Linux, and uses the Optimization Toolbox 4.3, Control System Toolbox 8.4, and Symbolic Math Toolbox 5.3. A version of the code that is compatible with Octave (a free alternative to MATLAB) is
also available from the book’s web page.
Acknowledgements
A number of individuals and the European Research Council contributed and sup-
ported me during the preparation of the book. Oliver Jackson—Springer’s edi-
tor (engineering)—encouraged me to embark on the project. My colleagues in
ESAT/SISTA, K.U. Leuven and ECS/ISIS, Southampton, UK created the right en-
vironment for developing the ideas in the book. In particular, I am indebted to Jan C. Willems (SISTA) for his personal guidance and example of critical thinking. The behavioral approach that Jan initiated in the early 1980s is present in this book.
Maarten De Vos, Diana Sima, Konstantin Usevich, and Jan Willems proofread
chapters of the book and suggested improvements. I gratefully acknowledge funding
from the European Research Council under the European Union’s Seventh Frame-
work Programme (FP7/2007–2013)/ERC Grant agreement number 258581 “Struc-
tured low-rank approximation: Theory, algorithms, and applications”.
Southampton, UK Ivan Markovsky
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Classical and Behavioral Paradigms for Data Modeling . . . . . . 1
1.2 Motivating Example for Low Rank Approximation . . . . . . . . . 3
1.3 Overview of Applications . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Overview of Algorithms . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Literate Programming . . . . . . . . . . . . . . . . . . . . . . . . 24
1.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1 Introduction

1.1 Classical and Behavioral Paradigms for Data Modeling

Data modeling problems are classically posed as the approximate solution of an overdetermined system of linear equations

AX ≈ B,   (LSE)
where the matrices A and B are constructed from the given data and the matrix X
parametrizes the model. In this classical paradigm, the main tools are the ordinary
linear least squares method and its variations—regularized least squares, total least
squares, robust least squares, etc. The least squares method and its variations are
mainly motivated by their applications for data fitting, but they invariably consider
solving approximately an overdetermined system of equations.
The underlying premise in the classical paradigm is that existence of an exact lin-
ear model for the data is equivalent to existence of solution X to a system AX = B.
Such a model is a linear map: the variables corresponding to the A matrix are inputs
(or causes) and the variables corresponding to the B matrix are outputs (or conse-
quences) in the sense that they are determined by the inputs and the model. Note
that in the classical paradigm the input/output partition of the variables is postulated
a priori. Unless the model is required to have the a priori specified input/output
partition, imposing such structure in advance is ad hoc and leads to undesirable
theoretical and numerical features of the modeling methods derived.
An alternative to the classical paradigm that does not impose an a priori fixed
input/output partition is the behavioral paradigm. In the behavioral paradigm, fitting
linear models to data is equivalent to the problem of approximating a matrix D,
constructed from the data, by a matrix D̂ of lower rank. Indeed, existence of an
exact linear model for D is equivalent to D being rank deficient. Moreover, the rank
of D is related to the complexity of the model. This fact is the tenet of the book and
is revisited in the following chapters in the context of applications from systems
and control, signal processing, computer algebra, and machine learning. Also its
implication to the development of numerical algorithms for data fitting is explored.
To see that existence of a low-complexity exact linear model is equivalent to rank
deficiency of the data matrix, let the columns d1 , . . . , dN of D be the observations
and the elements d1j , . . . , dqj of dj be the observed variables. We assume that there
are at least as many observations as observed variables, i.e., q ≤ N . A linear model
for D declares that there are linear relations among the variables, i.e., there are
vectors r_k, such that

r_k^⊤ d_j = 0,   for j = 1, . . . , N.

If there are p independent linear relations, then D has rank less than or equal to m := q − p and the observations belong to an at most m-dimensional subspace B of R^q.
We identify the model for D, defined by the linear relations r1 , . . . , rp ∈ Rq , with
the set B ⊂ Rq . Once a model B is obtained from the data, all possible input/output
partitions can be enumerated, which is an analysis problem for the identified model.
Therefore, the choice of an input/output partition in the behavioral paradigm to data
modeling can be incorporated, if desired, in the modeling problem and thus need not be hypothesized, as is necessarily done in the classical paradigm.
The classical and behavioral paradigms for data modeling are related but not
equivalent. Although existence of solution of the system AX = B implies that the
matrix [A B] is low rank, it is not true that [A B] having a sufficiently low rank im-
plies that the system AX = B is solvable. This lack of equivalence causes ill-posed
(or numerically ill-conditioned) data fitting problems in the classical paradigm,
which have no solution (or are numerically difficult to solve). In terms of the data
fitting problem, ill-conditioning of the problem (LSE) means that the a priori fixed
input/output partition of the variables is not corroborated by the data. In the be-
havioral setting without the a priori fixed input/output partition of the variables, ill-
conditioning of the data matrix D implies that the data approximately satisfy linear
relations, so that nearly rank deficiency is a good feature of the data.
The classical paradigm is included in the behavioral paradigm as a special case
because approximate solution of an overdetermined system of equations (LSE) is
a possible approach to achieve low rank approximation. Alternatively, low rank
approximation can be achieved by approximating the data matrix with a matrix
that has an at least p-dimensional left null space, or an at most m-dimensional column space.
Parametrizing the null space and the column space by sets of basis vectors, the al-
ternative approaches are:
1. kernel representation: there is a full row rank matrix R ∈ R^{p×q}, such that

R D̂ = 0,

2. image representation: there are matrices P ∈ R^{q×m} and L ∈ R^{m×N}, such that

D̂ = P L.
The approaches using kernel and image representations are equivalent to the origi-
nal low rank approximation problem. Next, the use of AX = B, kernel, and image
representations is illustrated on the simplest data fitting problem—line fitting.
1.2 Motivating Example for Low Rank Approximation

Given a set of N data points in the plane, d_j = col(a_j, b_j) ∈ R^2, j = 1, . . . , N, the classical approach to line fitting is to define the vectors

a := col(a_1, . . . , a_N)   and   b := col(b_1, . . . , b_N)

(“:=” stands for “by definition”, see page 245 for a list of notation) and solve approximately the overdetermined system

a x = b   (lse)

by the least squares method. Let x_ls be the least squares solution to (lse). Then the least squares fitting line is

B_ls := { d = col(a, b) ∈ R^2 | a x_ls = b }.
Geometrically, Bls minimizes the sum of the squared vertical distances from the
data points to the fitting line.
The left plot in Fig. 1.1 shows a particular example with N = 10 data points. The
data points d1 , . . . , d10 are the circles in the figure, the fit Bls is the solid line, and
the fitting errors e := axls − b are the dashed lines. Visually one expects the best fit
to be the vertical axis, so minimizing vertical distances does not seem appropriate
in this example.
Note that by solving (lse), a (the first components of the d_j) is treated differently from b (the second components): b is assumed to be a function of a. This is an arbitrary choice; the data can be fitted also by solving approximately the system

b x′ = a   (lse′)

in which case a is assumed to be a function of b. Let x′_ls be the least squares solution to (lse′). It gives the fitting line

B′_ls := { d = col(a, b) ∈ R^2 | a = b x′_ls },

which minimizes the sum of the squared horizontal distances (see the right plot in Fig. 1.1). The line B′_ls happens to achieve the desired fit in the example.
In the classical approach for data fitting, i.e., solving approximately a linear
system of equations in the least squares sense, the choice of the model repre-
sentation affects the fitting criterion.
An alternative criterion, which does not depend on the choice of the representation, is to minimize the sum of the squared orthogonal distances from the data points to the fitting line. Using the representation (lse), this leads to the total least squares problem

minimize  over x ∈ R, â ∈ R^N, and b̂ ∈ R^N   ‖ [a b] − [â b̂] ‖_F
subject to  â_j x = b̂_j,   for j = 1, . . . , N.   (tls)
However, for the data in Fig. 1.1 the total least squares problem has no solution.
Informally, the approximate solution is xtls = ∞, which corresponds to a fit by a
vertical line. Formally,
the total least squares problem (tls) may have no solution and therefore fail to
give a model.
The use of (lse) in the definition of the total least squares line fitting problem
restricts the fitting line to be a graph of a function ax = b for some x ∈ R. Thus,
the vertical line is a priori excluded as a possible solution. In the example, the line
minimizing the sum of the squared orthogonal distances happens to be the vertical
line. For this reason, xtls does not exist.
Any line B passing through the origin can be represented as an image and a
kernel, i.e., there exist matrices P ∈ R2×1 and R ∈ R1×2 , such that
B = image(P) := { d = P ℓ ∈ R^2 | ℓ ∈ R }

and

B = ker(R) := { d ∈ R^2 | R d = 0 }.
Using the image representation of the model, the line fitting problem of minimizing
the sum of the squared orthogonal distances is
N
minimize over P ∈ R2×1 and 1 · · · N ∈ R1×N dj − dj 2
2 (lraP )
j =1
subject to dj = P j , for j = 1, . . . , N.
With

D := [ d_1 ⋯ d_N ],   D̂ := [ d̂_1 ⋯ d̂_N ],

and ‖·‖_F the Frobenius norm

‖E‖_F := ‖vec(E)‖_2 = ‖ [ e_11 ⋯ e_q1 ⋯ e_1N ⋯ e_qN ] ‖_2,   for all E ∈ R^{q×N},

problem (lraP) can be written compactly as minimization of ‖D − D̂‖_F^2 over P and D̂ = P L.
Similarly, using a kernel representation, the line fitting problem, minimizing the sum
of squares of the orthogonal distances is
minimize  over R ∈ R^{1×2}, R ≠ 0, and D̂ ∈ R^{2×N}   ‖ D − D̂ ‖_F^2   (lraR)
subject to  R D̂ = 0.
Contrary to the total least squares problem (tls), problems (lraP ) and (lraR ) always
have (nonunique) solutions. In the example, solutions are, e.g., P ∗ = col(0, 1) and
R ∗ = [1 0], which describe the vertical line
B∗ := image(P∗) = ker(R∗).
The constraints

D̂ = P L,  with P ∈ R^{2×1}, L ∈ R^{1×N},   and   R D̂ = 0,  with R ∈ R^{1×2}, R ≠ 0,

are equivalent to the constraint rank(D̂) ≤ 1.
Thus, (lraP ) and (lraR ) are instances of one and the same
abstract problem: approximate the data matrix D by a low rank matrix D̂.
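The following MATLAB fragment (not from the book; the data are synthetic and the variable names are chosen here for illustration) contrasts the two approaches: the least squares solution of (lse) and the rank-1 approximation of the data matrix, computed from the singular value decomposition, which yields image and kernel parameters of the orthogonal fit.

% illustration only: least squares line fitting (lse) vs. rank-1 approximation (lraP)
a = randn(10, 1); b = 2 * a + 0.05 * randn(10, 1); % synthetic data points d_j = col(a_j, b_j)
D = [a'; b'];                                      % 2 x N data matrix
x_ls = a \ b;                                      % (lse): minimizes the vertical distances
[U, S, V] = svd(D);                                % (lraP)/(lraR): minimize the orthogonal distances
P = U(:, 1); L = S(1, 1) * V(:, 1)';               % image representation of the fitted line
Dh = P * L;                                        % optimal rank-1 approximation of D
R = U(:, 2)';                                      % kernel representation: R * Dh is zero up to rounding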
In Chap. 2, the observations made in the line fitting example are generalized to
modeling of q-dimensional data. The underlying goal is to pose and solve data modeling problems as low rank approximation problems. In this setting, the approximate solution of an overdetermined system of equations becomes a low rank approximation problem, where the nongeneric cases (and the related issues of ill-conditioning and need of regularization) are avoided.
Behind every data modeling problem there is a (hidden) low rank approxima-
tion problem: the model imposes relations on the data which render a matrix
constructed from exact data rank deficient.
Although an exact data matrix is low rank, a matrix constructed from observed
data is generically full rank due to measurement noise, unaccounted effects, and as-
sumptions about the data generating system that are not satisfied in practice. There-
fore, generically, the observed data do not have an exact low-complexity model.
This leads to the problem of approximate modeling, which can be formulated as a
low rank approximation problem as follows. Modify the data as little as possible,
so that the matrix constructed from the modified data has a specified low rank. The
modified data matrix being low rank implies that there is an exact model for the
modified data. This model is by definition an approximate model for the given data.
The transition from exact to approximate modeling is an important step in building
a coherent theory for data modeling and is emphasized in this book.
In all applications, the exact modeling problem is discussed before the practi-
cally more important approximate modeling problem. This is done because (1) ex-
act modeling is simpler than approximate modeling, so that it is the right starting
place, and (2) exact modeling is a part of optimal approximate modeling and sug-
gests ways of solving such problems suboptimally. Indeed, small modifications of
exact modeling algorithms lead to effective approximate modeling algorithms. Well
known examples of the transition from exact to approximate modeling in systems
theory are the progressions from realization theory to model reduction and from
deterministic subspace identification to approximate and stochastic subspace iden-
tification.
1.3 Overview of Applications
Fig. 1.2 Transitions among exact deterministic, approximate deterministic, exact stochastic, and
approximate stochastic modeling problems. The arrows show progression from simple to complex
The applications can be read in any order or skipped without loss of continu-
ity.
When there is no exact finite dimensional realization of the data or the exact
realization is of high order, one may want to find an approximate realization of a
specified low order n. These, respectively, approximate realization and model re-
duction problems naturally lead to Hankel structured low rank approximation.
The deterministic system realization and model reduction problems are further
considered in Sects. 2.2, 3.1, and 4.2.
Let y be the output of an nth order linear time-invariant system, driven by white
noise (a stochastic system) and let E be the expectation operator. The sequence
R = ( R(0), R(1), . . . , R(t), . . . ),

defined by

R(τ) := E y(t) y^⊤(t − τ),
is called the autocorrelation sequence of y. Stochastic realization theory is con-
cerned with the problem of finding a state representation of a stochastic system
that could have generated the observed output y, i.e., a linear time-invariant system
driven by white noise, whose output correlation sequence is equal to R.
An important result in stochastic realization theory is that R is the output cor-
relation sequence of an nth order stochastic system if and only if the Hankel ma-
trix H (R) constructed from R has rank n, i.e.,
rank H (R) = order of a minimal stochastic realization of R.
System Identification
Realization theory considers a system representation problem: pass from one repre-
sentation of a system to another. Alternatively, it can be viewed as a special exact
identification problem: find from impulse response data (a special trajectory of the
system) a state space representation of the data generating system. The exact identi-
fication problem (also called deterministic identification problem) is to find from a
general response of a system, a representation of that system. Let
w = col(u, y),   where u = ( u(1), . . . , u(T) ) and y = ( y(1), . . . , y(T) ),

be a trajectory of a linear time-invariant system with m inputs and order at most nmax. Then the block Hankel matrix Hnmax+1(w), with nmax + 1 block rows, constructed from the trajectory w, is rank deficient:

rank Hnmax+1(w) ≤ rank Hnmax+1(u) + order of the system.   (SYSID)
Conversely, if the Hankel matrix Hnmax +1 (w) has rank (nmax + 1)m + n and the
matrix H2nmax +1 (u) is full row rank (persistency of excitation of u), then w is a
trajectory of a controllable linear time-invariant system of order n. Under the above
assumptions, the data generating system can be identified from a rank revealing
factorization of the matrix Hnmax +1 (w).
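As a rough numerical illustration (this is not the book's code; the system, the order bound, and the trajectory length are chosen arbitrarily here), the following MATLAB fragment simulates an exact trajectory of a second order single-input single-output system and checks the rank property (SYSID) of the block Hankel matrix.

T = 50; n = 2; n_max = 4;                    % trajectory length, true order, order bound (assumed)
u = randn(T, 1);                             % persistently exciting input
y = filter([0 0.5 0.3], [1 -1.2 0.4], u);    % exact output of a 2nd order system
w = [u'; y'];                                % q x T trajectory, q = 2 variables
q = size(w, 1); i = n_max + 1; j = T - i + 1;
H = zeros(q * i, j);                         % block Hankel matrix with i block rows
for k = 1:i, H((k - 1) * q + (1:q), :) = w(:, k:k + j - 1); end
rank(H)                                      % (n_max + 1) * m + n = 7 < q * i = 10, with m = 1 input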
When there are measurement errors or the data generating system is not a low-
complexity linear time-invariant system, the data matrix Hnmax +1 (w) is generi-
cally full rank. In such cases, an approximate low-complexity linear time-invariant
model for w can be derived by finding a Hankel structured low rank approximation
of Hnmax +1 (w). Therefore, the Hankel structured low rank approximation problem
can be applied also for approximate system identification. Linear time-invariant sys-
tem identification is a main topic of the book and appears frequently in the following
chapters.
Similarly to the analogy between deterministic and stochastic system realization,
there is an analogy between deterministic and stochastic system identification. The
latter analogy suggests an application of Hankel structured low rank approximation
to stochastic system identification.
Greatest Common Divisor Computation

A polynomial c is a common divisor of the polynomials p and q if there are polynomials r and s, such that

p = r c   and   q = s c.

Associated with p and q, of degrees n and m, respectively, is the (n + m) × (n + m) Sylvester matrix R(p, q), built from shifted copies of the coefficient vectors of p and q, see (R) on p. 11. (By convention, in this book, all missing entries in a matrix are assumed to be zeros.) A well known fact in algebra is that the degree of the greatest common divisor of p and q is equal to the rank deficiency (corank) of R(p, q), i.e.,

degree(c) = n + m − rank( R(p, q) ).   (GCD)
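A minimal MATLAB sketch of this fact is given below (illustration only; the example polynomials are chosen here, and the row ordering of the Sylvester matrix may differ from the definition (R) in the book by a permutation or transposition, which does not affect the rank).

p = conv([1 2], [1 3]);        % p(z) = (z + 2)(z + 3), degree n = 2
q = conv([1 2], [1 -1]);       % q(z) = (z + 2)(z - 1), degree m = 2
n = length(p) - 1; m = length(q) - 1;
R = zeros(n + m);              % Sylvester matrix: shifted copies of the coefficients
for k = 1:m, R(k, k:k + n) = p; end
for k = 1:n, R(m + k, k:k + m) = q; end
d = n + m - rank(R)            % degree of the greatest common divisor, here d = 1 (common root -2)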
Suppose that p and q have a greatest common divisor of degree d > 0, but the
coefficients of the polynomials p and q are imprecise, resulting in perturbed polyno-
mials pd and qd . Generically, the matrix R(pd , qd ), constructed from the perturbed
polynomials, is full rank, implying that the greatest common divisor of pd and qd
has degree zero. The problem of finding an approximate common divisor of pd
and qd with degree d, can be formulated as follows. Modify the coefficients of pd and qd, as little as possible, so that the resulting polynomials, say, p̂ and q̂, have a greatest common divisor of degree d. This problem is a Sylvester structured low rank approximation problem. Therefore, Sylvester structured low rank approximation can be applied for computing an approximate common divisor with a specified degree. The approximate greatest common divisor ĉ for the perturbed polynomials pd and qd is the exact greatest common divisor of p̂ and q̂.
The approximate greatest common divisor problem is considered in Sect. 3.2.
An array of antennas or sensors is used for direction of arrival estimation and adap-
tive beamforming. Consider q antennas in a fixed configuration and a wave propa-
gating from distant sources, see Fig. 1.3.
Consider, first, the case of a single source. The source intensity ℓ_1 (the signal) is a function of time. Let w(t) ∈ R^q be the response of the array at time t (w_i being the response of the ith antenna). Assuming that the source is far from the array (relative to the array’s length), the array’s response is proportional to the source intensity

w(t) = p_1 ℓ_1(t − τ_1),

where τ_1 is the time needed for the wave to travel from the source to the array and p_1 ∈ R^q is the array’s response to the source emitting at unit intensity. The vector p_1 depends only on the array geometry and the source location and is therefore constant in time. Measurements of the antenna at time instants t = 1, . . . , T give a data matrix

D := [ w(1) ⋯ w(T) ] = p_1 [ ℓ_1(1 − τ_1) ⋯ ℓ_1(T − τ_1) ] = p_1 ℓ_1,

which has rank equal to one. With m distant sources, by superposition,

D = [ w(1) ⋯ w(T) ] = Σ_{k=1}^m p_k [ ℓ_k(1 − τ_k) ⋯ ℓ_k(T − τ_k) ] = P L,

so that the rank of the data matrix D is at most m, the number of sources.
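The rank fact is easy to check numerically; the sketch below (not from the book, with arbitrarily chosen dimensions) generates a synthetic response of an array to m sources and verifies that the rank of the data matrix equals the number of sources.

q = 8; T = 100; m = 3;       % number of antennas, samples, and sources (assumed values)
P = randn(q, m);             % array responses to the individual unit-intensity sources
L = randn(m, T);             % (delayed) source intensities, sampled over time
D = P * L;                   % array data matrix
rank(D)                      % equals m = 3 for generic P and L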
Applications in Chemometrics
Multivariate Calibration
The spectrum of a mixture of m chemical components, measured at q wavelengths, is (to a first approximation) a linear combination of the spectra p_1, . . . , p_m ∈ R^q of the components, weighted by their concentrations ℓ_k^(j) in the mixture, so the measured spectra d_1, . . . , d_N of N mixtures form a data matrix

D := [ d_1 ⋯ d_N ] = Σ_{k=1}^m p_k [ ℓ_k^(1) ⋯ ℓ_k^(N) ] = P L.   (RRF)

Therefore, the rank of D is less than or equal to the number of components m. Assuming that q > m, N > m, the spectral responses p_1, . . . , p_m of the components are linearly independent, and the concentration vectors ℓ_1, . . . , ℓ_m are linearly independent, we have rank(D) = m, i.e., the rank of the data matrix equals the number of chemical components in the mixtures.
Applications in Psychometrics
Factor Analysis
The psychometric data are test scores and biometrics of a group of people. The test
scores can be organized in a data matrix D, whose rows correspond to the scores
and the columns correspond to the group members. Factor analysis is a popular
method that explains the data as a linear combination of a small number of abilities
of the group members. These abilities are called factors and the weights by which
they are combined in order to reproduce the data are called loadings. Factor analysis
is based on the assumption that the exact data matrix is low rank with rank being
equal to the number of factors. Indeed, the factor model can be written as D = P L,
where the columns of P correspond to the factors and the rows of L correspond to
the loadings. In practice, the data matrix is full rank because the factor model is an
idealization of the way test data are generated. Despite the fact that the factor model
is a simplification of the reality, it can be used as an approximation of the way
humans perform on tests. Low rank approximation then is a method for deriving
optimal in a specified sense approximate psychometric factor models.
The factor model, explained above, is used to assess candidates at US universities. An important element of the acceptance decision in US universities for
undergraduate study is the Scholastic Aptitude Test, and for postgraduate study, the
Graduate Record Examination. These tests report three independent scores: writ-
ing, mathematics, and critical reading for the Scholastic Aptitude Test; and verbal,
quantitative, and analytical for the Graduate Record Examination. The three scores
assess what are believed to be the three major factors for, respectively, undergradu-
ate and postgraduate academic performance. In other words, the premise on which
the tests are based is that the ability of a prospective student to do undergraduate
and postgraduate study is predicted well by a combination of the three factors. Of
course, in different areas of study, the weights by which the factors are taken into
consideration are different. Even in pure subjects, such as mathematics, however,
the verbal as well as quantitative and analytical ability play a role.
Many graduate-school advisors have noted that an applicant for a mathematics fellowship
with a high score on the verbal part of the Graduate Record Examination is a better bet as a
Ph.D. candidate than one who did well on the quantitative part but badly on the verbal.
Halmos (1985, p. 5)
Latent Semantic Analysis

Let d_j be the vector of term frequencies related to the jth document and let ℓ_k^(j) be the relevance of the kth concept to the jth document. Then, according to the latent semantic analysis model, the term–document frequencies for the N documents form a data matrix, satisfying (RRF). Therefore, the rank of the data matrix is less than or equal to the number of concepts m. Assuming that m is smaller than the number of terms q, m is smaller than the number of documents N, the term frequencies p_1, . . . , p_m are linearly independent, and the relevances ℓ_1, . . . , ℓ_m of the concepts are linearly independent, we have rank(D) = m, i.e., the rank of the data matrix equals the number of concepts.
Recommender System
The main issue underlying the abstract low rank approximation problem and the
applications reviewed up to now is data approximation. In the recommender system
problem, the main issue is the one of missing data: given ratings of some items by
some users, infer the missing ratings. Unique recovery of the missing data is impos-
sible without additional assumptions. The underlying assumption in many recom-
mender system problems is that the complete matrix of the ratings is of low rank.
Consider q items and N users and let dij be the rating of the ith item by the j th
user. As in the psychometrics example, it is assumed that there is a “small” num-
ber m of “typical” (or characteristic, or factor) users, such that all user ratings can
be obtained as linear combinations of the ratings of the typical users. This implies
that the complete matrix D = [d_ij] of the ratings has rank m, i.e.,

rank(D) = number of “typical” users.
Then exploiting the prior knowledge that the number of “typical” users is small,
the missing data recovery problem can be posed as the following matrix completion
problem
minimize  over D̂   rank(D̂)
subject to  D̂_ij = D_ij for all (i, j), where D_ij is given.   (MC)
This gives a procedure for solving the exact modeling problem (the given elements
of D are assumed to be exact). The corresponding solution method can be viewed as the equivalent, for the case of incomplete data, of the rank revealing factorization approach used for exact modeling with complete data.
Of course, the rank minimization problem (MC) is much harder to solve than the
rank revealing factorization problem. Moreover, theoretical justification and addi-
tional assumptions (about the number and distribution of the given elements of D)
are needed for a solution D of (MC) to be unique and to coincide with the com-
plete true matrix D. It turns out, however, that under certain specified assumptions
exact recovery is possible by solving the convex optimization problem obtained by replacing rank(D̂) in (MC) with the nuclear norm

‖D̂‖_∗ := sum of the singular values of D̂.
The importance of the result is that under the specified assumptions the hard prob-
lem (MC) can be solved efficiently and reliably by convex optimization methods.
In real-life applications of recommender systems, however, the additional problem of data approximation occurs. In this case the constraint D̂_ij = D_ij of (MC) has to be relaxed, e.g., replacing it by

D̂_ij = D_ij + ΔD_ij,

where ΔD_ij are corrections, accounting for the data uncertainty. The corrections are additional optimization variables. Taking into account the prior knowledge that the corrections are small, a term λ‖ΔD‖_F is added to the cost function. The resulting matrix approximation problem is

minimize  over D̂ and ΔD   rank(D̂) + λ‖ΔD‖_F
subject to  D̂_ij = D_ij + ΔD_ij for all (i, j), where D_ij is given.   (AMC)

In a stochastic setting, the term λ‖ΔD‖_F corresponds to the assumption that the true data D is perturbed with noise that is zero mean, Gaussian, independent, and with equal variance.
Again the problem can be relaxed to a convex optimization problem by replac-
ing rank with nuclear norm. The choice of the λ parameter reflects the trade-off
between complexity (number of identified “typical” users) and accuracy (size of the
correction ΔD) and depends in the stochastic setting on the noise variance.
Nuclear norm and low rank approximation methods for estimation of missing
values are developed in Sects. 3.3 and 5.1.
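As a rough illustration of the convex relaxation idea (this is not one of the book's methods; it is a simple singular value soft-thresholding iteration, with data, mask, and parameters chosen arbitrarily here), the following MATLAB fragment completes a synthetic low rank matrix from a subset of its entries.

q = 20; N = 30; m = 2;
D0 = randn(q, m) * randn(m, N);      % true rank-2 matrix of ratings
mask = rand(q, N) < 0.5;             % positions (i, j) of the given entries
D = D0 .* mask;                      % observed data (zeros at the missing positions)
tau = 1; Dh = zeros(q, N);           % soft-thresholding level and initial estimate
for iter = 1:500
    [U, S, V] = svd(mask .* D + ~mask .* Dh, 'econ'); % keep given entries, impute the rest
    Dh = U * max(S - tau, 0) * V';                    % soft-threshold the singular values
end
norm(Dh - D0, 'fro') / norm(D0, 'fro')                % relative recovery error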
Multidimensional Scaling
Consider a set of N points in the plane,

X := { x_1, . . . , x_N } ⊂ R^2,

and let d_ij denote the squared distance between the ith and the jth point,

d_ij := ‖ x_i − x_j ‖_2^2.

The N × N matrix D = [d_ij] of the pair-wise distances, called in what follows the distance matrix (for the set of points X), has rank at most 4. Indeed,

d_ij = x_i^⊤ x_i − 2 x_i^⊤ x_j + x_j^⊤ x_j,

so that

D = col(1, . . . , 1) [ x_1^⊤ x_1 ⋯ x_N^⊤ x_N ] − 2 [ x_1 ⋯ x_N ]^⊤ [ x_1 ⋯ x_N ] + col(x_1^⊤ x_1, . . . , x_N^⊤ x_N) [ 1 ⋯ 1 ],   (∗)

where the three terms have rank at most 1, 2, and 1, respectively.
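The rank bound is easy to verify numerically; the sketch below (not from the book, with randomly generated point locations) constructs the distance matrix of N points in the plane and checks its rank.

N = 10; X = randn(2, N);                       % randomly placed points in the plane
D = zeros(N);
for i = 1:N
    for j = 1:N
        D(i, j) = norm(X(:, i) - X(:, j))^2;   % squared pair-wise distances
    end
end
rank(D)                                        % at most 4 (generically exactly 4)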
The localization problem from pair-wise distances is: given the distance matrix D,
find the locations {x1 , . . . , xN } of the points up to a rigid transformation, i.e., up
to translation, rotation, and reflection of the points. Note that rigid transformations
preserve the pair-wise distances, so that the distance matrix D alone is not sufficient
to locate the points uniquely.
With exact data, the problem can be posed and solved as a rank revealing factor-
ization problem (∗). With noisy measurements, however, the matrix D is generically
full rank. In this case, the relative (up to rigid transformation) point locations can be
estimated by approximating D by a rank-4 matrix D̂. In order to be a valid distance matrix, however, D̂ must have the structure

D̂ = col(1, . . . , 1) [ x̂_1^⊤ x̂_1 ⋯ x̂_N^⊤ x̂_N ] − 2 X̂^⊤ X̂ + col(x̂_1^⊤ x̂_1, . . . , x̂_N^⊤ x̂_N) [ 1 ⋯ 1 ],   (MDS)

for some X̂ = [ x̂_1 ⋯ x̂_N ], i.e., the estimation problem is a bilinearly structured low rank approximation problem.
In the applications reviewed so far, the low rank approximation problem was applied
to linear data modeling. Nonlinear data modeling, however, can also be formulated
as a low rank approximation problem. The key step—“linearizing” the problem—
involves preprocessing the data by a nonlinear function defining the model struc-
ture. In the machine learning literature, where nonlinear data modeling is a common
practice, the nonlinear function is called the feature map and the resulting modeling
methods are referred to as kernel methods.
As a specific example, consider the problem of fitting data by a conic section,
i.e., given a set of points in the plane { d_1, . . . , d_N } ⊂ R^2, find a conic section that fits them.
Fig. 1.4 Conic section fitting. Left: N = 4 points (circles) have nonunique fit (two fits are shown
in the figure with solid and dashed lines). Right: N = 5 different points have a unique fit (solid
line)
A conic section with parameters A = A^⊤ ∈ R^{2×2}, b ∈ R^2, and c ∈ R is the set

B(A, b, c) := { d ∈ R^2 | d^⊤ A d + b^⊤ d + c = 0 }.

Collecting the parameters in a vector θ ∈ R^6 and mapping a data point d to the extended data vector d_ext ∈ R^6 of monomials of degree up to two in the components of d (the precise definition of the map is given in (d_ext)), we have

d ∈ B(θ) = B(A, b, c)   ⟺   θ^⊤ d_ext = 0.
(In the machine learning terminology, the map d → dext , defined by (dext ), is the
feature map for the conic section model.) Consequently, all data points d1 , . . . , dN
are fitted by the model if
θ^⊤ [ d_ext,1 ⋯ d_ext,N ] = 0   ⟺   rank(D_ext) ≤ 5,   (CSF)

where D_ext := [ d_ext,1 ⋯ d_ext,N ].
Therefore, for N > 5 data points, exact fitting is equivalent to rank deficiency of
the extended data matrix Dext . For N < 5 data points, there is nonunique exact fit
independently of the data. For N = 5 different points, the exact fitting conic section
is unique, see Fig. 1.4.
With N > 5 noisy data points, the extended data matrix Dext is generically full rank, so that an exact fit does not exist. A problem called geometric fitting is to
minimize the sum of squared distances from the data points to the conic section.
The problem is equivalent to quadratically structured low rank approximation.
Generalization of the conic section fitting problem to algebraic curve fitting and
solution methods for the latter are presented in Chap. 6.
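A minimal MATLAB sketch of exact conic section fitting is given below (illustration only; the particular feature map, the monomials of degree up to two, is one possible choice and need not coincide entry for entry with the definition (d_ext) in the book).

D = [0.2 0.8 0.2 0.8 0.9;                     % the five points of Exercise 1.2 below
     0.2 0.2 0.8 0.8 0.5];
ext = @(d) [d(1)^2; d(1) * d(2); d(2)^2; d(1); d(2); 1]; % assumed feature map d -> d_ext
N = size(D, 2); Dext = zeros(6, N);
for j = 1:N, Dext(:, j) = ext(D(:, j)); end
rank(Dext)                                    % 5, so an exact fit exists and is unique
[U, S, V] = svd(Dext);
theta = U(:, end)                             % theta' * Dext = 0 defines the fitted conic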
Exercise 1.1 Find and plot another conic section that fits the points

d_1 = col(0.2, 0.2),   d_2 = col(0.8, 0.2),   d_3 = col(0.2, 0.8),   and   d_4 = col(0.8, 0.8),

shown in Fig. 1.4, left.
Exercise 1.2 Find the parameters (A, b, c) in the representation (B(A, b, c)) of the ellipse in Fig. 1.4, right. The data points are:

d_1 = col(0.2, 0.2),   d_2 = col(0.8, 0.2),   d_3 = col(0.2, 0.8),   d_4 = col(0.8, 0.8),   and   d_5 = col(0.9, 0.5).

Answer:  A = [ 3.5156  0 ; 0  2.7344 ],   b = col(−3.5156, −2.7344),   c = 1.
A scene is captured by two cameras at fixed locations (stereo vision) and N match-
ing pairs of points
are located in the resulting images. Corresponding points u and v of the two images
satisfy what is called an epipolar constraint

col(u, 1)^⊤ F col(v, 1) = 0,   for some F ∈ R^{3×3}, with rank(F) = 2.   (EPI)

The constraint is linear in the elements of F: collecting the products of the components of col(u, 1) and col(v, 1) in an extended data vector d_ext ∈ R^9, it can be written as

vec^⊤(F) d_ext = 0.
Note that, as in the application for conic section fitting, the original data (u, v) are
mapped to an extended data vector dext via a nonlinear function (a feature map). In
this case, however, the function is bilinear.
Taking into account the epipolar constraints for all data points, we obtain the
matrix equation
vec (F )Dext = 0, where Dext := dext,1 · · · dext,N . (FME)
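Since vec(F) is a nonzero vector annihilating Dext from the left, Dext is rank deficient for exact data, and with noisy data F can be estimated by (structured) low rank approximation of Dext. The fragment below is not the book's method; it is an unnormalized sketch with simulated data, in which the rank-2 constraint on F is imposed afterwards by truncation.

N = 20;
F0 = randn(3); [Uf, Sf, Vf] = svd(F0);
F0 = Uf * diag([Sf(1, 1), Sf(2, 2), 0]) * Vf';      % true rank-2 fundamental matrix
U = randn(2, N); V = zeros(2, N); Dext = zeros(9, N);
for j = 1:N
    r = [U(:, j); 1]' * F0;                         % epipolar line of the j-th point (r(2) ~= 0 assumed)
    V(1, j) = randn; V(2, j) = -(r(1) * V(1, j) + r(3)) / r(2);
    Dext(:, j) = kron([V(:, j); 1], [U(:, j); 1]);  % vec'(F) * dext = [u; 1]' * F * [v; 1]
end
[Un, Sn, Vn] = svd(Dext);
F = reshape(Un(:, end), 3, 3);                      % minimizes |vec'(F) * Dext| over unit vec(F)
[Uf, Sf, Vf] = svd(F);
F = Uf * diag([Sf(1, 1), Sf(2, 2), 0]) * Vf'        % rank-2 projection of the estimate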
Summary of Applications
Table 1.1 summarizes the reviewed applications: given data, data matrix constructed from the original data, structure of the data matrix, and meaning of the rank of the data matrix in the context of the application. More applications are mentioned in the notes and references section at the end of the chapter.
Table 1.1 Meaning of the rank of the data matrix in the applications
Application Data Data matrix Structure Rank = Ref.
1.4 Overview of Algorithms

On the practical level, the low rank problem formulation allows one to translate
the abstract data modeling problem to different concrete parametrized problems by
choosing the model representation. Different representations naturally lend them-
selves to different analytical and numerical methods. For example, a controllable
linear time-invariant system can be represented by a transfer function, state space,
convolution, etc. representations. The analysis tools related to these representations
are rather different and, consequently, the obtained solutions differ despite the
fact that they solve the same abstract problem. Moreover, the parameter optimiza-
tion problems, resulting from different model representations, lead to algorithms
and numerical implementations, whose robustness properties and computational ef-
ficiency differ. Although, often in practice, there is no universally “best” algorithm
or software implementation, having a wider set of available options is an advantage.
Independently of the choice of the rank representation, only a few special low rank
approximation problems have analytic solutions. So far, the most important spe-
cial case with an analytic solution is the unstructured low rank approximation in
the Frobenius norm. The solution in this case can be obtained from the singular
value decomposition of the data matrix (Eckart–Young–Mirsky theorem). Exten-
sions of this basic solution are problems known as generalized and restricted low
rank approximation, where some columns or, more generally submatrices of the ap-
proximation, are constrained to be equal to given matrices. The solutions to these
problems are given by, respectively, the generalized and restricted singular value de-
compositions. Another example of low rank approximation problem with analytic
solution is the circulant structured low rank approximation, where the solution is
expressed in terms of the discrete Fourier transform of the data.
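The circulant case can be illustrated in a few lines of MATLAB (not the book's code; a random circulant matrix is generated here). The eigenvalues of a circulant matrix are the discrete Fourier transform of its first column, so a rank-r circulant approximation is obtained by keeping the r dominant frequencies; note that for a real-valued optimal approximation conjugate frequency pairs should be kept or discarded together, a detail glossed over in this sketch.

n = 8; r = 3;
c = randn(n, 1);                                 % first column of the circulant matrix
C = toeplitz(c, [c(1); flipud(c(2:end))]);       % the circulant matrix itself
f = fft(c);                                      % eigenvalues of C
[~, idx] = sort(abs(f), 'descend');
f(idx(r + 1:end)) = 0;                           % keep only the r dominant frequencies
ch = ifft(f);                                    % first column of the approximation (complex in general)
Ch = toeplitz(ch, [ch(1); flipud(ch(2:end))]);
rank(Ch)                                         % r, up to rounding errors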
In general, low rank approximation problems are NP-hard. There are three fun-
damentally different solution approaches for the general low rank approximation
problem:
• heuristic methods based on convex relaxations,
• local optimization methods, and
• global optimization methods.
From the class of heuristic methods the most popular ones are the subspace meth-
ods. The approach used in the subspace type methods is to relax the difficult low
rank approximation problem to a problem with an analytic solution in terms of the
singular value decomposition, e.g., ignore the structure constraint of a structured
low rank approximation problem. The subspace methods are found to be very ef-
fective in model reduction, system identification, and signal processing. The class
of the subspace system identification methods is based on the unstructured low rank
approximation in the Frobenius norm (i.e., singular value decomposition) while the
original problems are Hankel structured low rank approximation.
The methods based on local optimization split into two main categories:
• alternating projections and
• variable projections
type algorithms. Both alternating projections and variable projections exploit the
bilinear structure of the low rank approximation problems.
In order to explain the ideas underlying the alternating projections and variable projections methods, consider the unstructured low rank approximation problem in the image representation: minimize over P ∈ R^{q×m} and L ∈ R^{m×N} the approximation error ‖D − P L‖_F. The term P L is bilinear in the unknowns, so for fixed P the problem is a linear least squares problem in L, and for fixed L it is a linear least squares problem in P.
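A minimal sketch of the alternating projections idea for the unstructured problem is given below (illustration only, assuming random data and a fixed number of iterations): with L fixed, the best P is a linear least squares solution, and vice versa, so the approximation error is non-increasing over the iterations.

q = 10; N = 50; m = 2;                        % sizes and rank specification (assumed values)
D = randn(q, N);                              % data matrix to be approximated
P = randn(q, m);                              % initial guess for the image parameter
for iter = 1:100
    L = P \ D;                                % best L for the current P (linear least squares)
    P = D / L;                                % best P for the current L (linear least squares)
end
norm(D - P * L, 'fro')                        % approximation error after the iterations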
1.5 Literate Programming

The ideas presented in the book are best expressed as algorithms for solving
data modeling problems. The algorithms, in turn, are practically useful when imple-
mented in ready-to-use software. The gap between the theoretical discussion of data
modeling methods and the practical implementation of these methods is bridged by
using a literate programming style. The software implementation (M ATLAB code)
is interwoven in the text, so that the full implementation details are available in a
human readable format and they come in the appropriate context of the presentation.
A literate program is composed of interleaved code segments, called chunks,
and text. The program can be split into chunks in any way and the chunks can be
presented in any order, deemed helpful for the understanding of the program. This
allows us to focus on the logical structure of the program rather than the way a
computer executes it. The actual computer executable code is tangled from a web
of the code chunks by skipping the text and putting the chunks in the right order. In
addition, literate programming allows us to use a powerful typesetting system such
as LATEX (rather than plain text) for the documentation of the code.
The noweb system for literate programming is used. Its main advantage over
alternative systems is independence of the programming language being used.
Next, some typographic conventions are explained. The code is typeset in a small typewriter font and consists of a number of code chunks. The code chunks begin
with tags enclosed in angle brackets (e.g., ⟨code tag⟩) and are sequentially numbered
by the page number and a letter identifying them on the page. Thus the chunk’s
identification number (given to the left of the tag) is also used to locate the chunk in
the text. For example,
25a ⟨Print a figure 25a⟩≡
function print_fig(file_name)
xlabel(’x’), ylabel(’y’), title(’t’),
set(gca, ’fontsize’, 25)
eval(sprintf(’print -depsc %s.eps’, file_name));
Defines:
print_fig, used in chunks 101b, 114b, 126b, 128, 163c, 192f, 213, and 222b.
(a function exporting the current figure to an encapsulated postscript file with a
specified name, using default labels and font size) has identification number 25a,
locating the code as being on p. 25.
If a chunk is included in, is a continuation of, or is continued by other chunk(s),
its definition has references to the related chunk(s). The syntax convention for doing
this is best explained by an example. Consider the function blkhank, which constructs the block Hankel matrix

H_{i,j}(w) := [ w(1) w(2) ⋯ w(j) ; w(2) w(3) ⋯ w(j+1) ; ⋯ ; w(i) w(i+1) ⋯ w(i+j−1) ],

where

w = ( w(1), . . . , w(T) ),   with w(t) ∈ R^{q×N}.
The definition of the function, showing its input and output arguments, is
25b ⟨Hankel matrix constructor 25b⟩≡ 26a
function H = blkhank(w, i, j)
Defines:
blkhank, used in chunks 76a, 79, 88c, 109a, 114a, 116a, 117b, 120b, 125b, 127b, and 211.
(The reference to the right of the identification tag shows that the definition is contin-
ued in chunk number 26a.) The third input argument of blkhank—the number of block columns j—is optional. Its default value is the maximal number of block columns j = T − i + 1.
(The reference to the right of the identification tag now shows that this chunk is
included in other chunks.)
Two cases are distinguished, depending on whether w is a vector (N = 1) or
matrix (N > 1) valued trajectory.
26a ⟨Hankel matrix constructor 25b⟩+≡ 25b
if length(size(w)) == 3
  ⟨matrix valued trajectory w 26b⟩
else
  ⟨vector valued trajectory w 26e⟩
end
(The reference to the right of the identification tag shows that this chunk is a contin-
uation of chunk 25b and is not followed by other chunks. Its body includes chunks
26b and 26e.)
• If w is a matrix valued trajectory, the input argument w should be a 3 dimensional
tensor, constructed as follows:
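The chunks ⟨matrix valued trajectory w 26b⟩ and ⟨vector valued trajectory w 26e⟩ are not reproduced here. Purely as an illustration of what the vector valued case involves (this is not the book's blkhank and the function name is invented), a minimal block Hankel constructor is:

function H = blkhank_sketch(w, i, j)
% block Hankel matrix with i block rows built from a q x T trajectory w
[q, T] = size(w);
if nargin < 3, j = T - i + 1; end            % default: maximal number of block columns
H = zeros(i * q, j);
for k = 1:i
    H((k - 1) * q + (1:q), :) = w(:, k:k + j - 1);
end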
Exercise 1.4 Write a literate program for constructing the Sylvester matrix R(p, q),
defined in (R) on p. 11. Use your program to find the degree d of the greatest com-
mon divisor of the polynomials
(Answer: d = 1)
1.6 Notes
Methods for solving overdetermined systems of linear equations (i.e., data modeling
methods using the classical input/output representation paradigm) are reviewed in
Appendix A. The behavioral paradigm for data modeling was put forward by Jan C.
Willems in the early 1980s. It became firmly established with the publication of the
three part paper Willems (1986a, 1986b, 1987). Other landmark publications on the
behavioral paradigm are Willems (1989, 1991, 2007), and the book Polderman and
Willems (1998).
The numerical linear algebra problem of low rank approximation is a computa-
tional tool for data modeling, which fits the behavioral paradigm as “a hand fits a
glove”. Historically the low rank approximation problem is closely related to the
singular value decomposition, which is a method for computing low rank approx-
imations and is a main tool in many algorithms for data modeling. A historical
account of the development of the singular value decomposition is given in Stew-
art (1993). The Eckart–Young–Mirsky matrix low rank approximation theorem is
proven in Eckart and Young (1936).
Applications
For details about the realization and system identification problems, see Sects. 2.2,
3.1, and 4.3. Direction of arrival and adaptive beamforming problems are discussed
in Kumaresan and Tufts (1983), Krim and Viberg (1996). Low rank approxima-
tion methods (alternating least squares) for estimation of mixture concentrations
in chemometrics are proposed in Wentzell et al. (1997). An early reference on
the approximate greatest common divisor problem is Karmarkar and Lakshman
(1998). Efficient optimization based methods for approximate greatest common di-
visor computation are discussed in Sect. 3.2. Other computer algebra problems that
reduce to structured low rank approximation are discussed in Botting (2004).
Many problems for information retrieval in machine learning, see, e.g., Shawe-
Taylor and Cristianini (2004), Bishop (2006), Fierro and Jiang (2005), are low rank
approximation problems and the corresponding solution techniques developed in
the machine learning community are methods for solving low rank approximation
problems. For example, clustering problems have been related to low rank approxi-
mation problems in Ding and He (2004), Kiers (2002), Vichia and Saporta (2009).
Machine learning problems, however, are often posed in a stochastic estimation set-
ting which obscures their deterministic approximation interpretation. For example,
principal component analysis (Jolliffe 2002; Jackson 2003) and unstructured low
rank approximation with Frobenius norm are equivalent optimization problems. The
principal component analysis problem, however, is motivated in a statistical setting
and for this reason may be considered as a different problem. In fact, principal com-
ponent analysis provides another (statistical) interpretation of the low rank approx-
imation problem.
The conic section fitting problem has extensive literature, see Chap. 6 and the
tutorial paper (Zhang 1997). The kernel principal component analysis method is
developed in the machine learning and statistics literature (Schölkopf et al. 1999).
Despite the close relation between kernel principal component analysis and conic section fitting, the corresponding literatures are disjoint.
Closely related to the estimation of the fundamental matrix problem in two-view
computer vision is the shape from motion problem (Tomasi and Kanade 1993; Ma
et al. 2004).
Matrix factorization techniques have been used in the analysis of microarray data
in Alter and Golub (2006) and Kim and Park (2007). Alter and Golub (2006) pro-
pose a principal component projection to visualize high dimensional gene expres-
sion data and show that some known biological aspects of the data are visible in a
two dimensional subspace defined by the first two principal components.
Distance Problems
The low rank approximation problem aims at finding the “smallest” correction of a
given matrix that makes the corrected matrix rank deficient. This is a special case of
distance problems: find the “nearest” matrix with a specified property to a given
matrix. For an overview of distance problems, see Higham (1989). In Byers (1988),
an algorithm for computing the distance of a stable matrix (Hurwitz matrix in the
case of continuous-time and Schur matrix in the case of discrete-time linear time-
invariant dynamical system) to the set of unstable matrices is presented. Stability
radius for structured perturbations and its relation to the algebraic Riccati equation
is presented in Hinrichsen and Pritchard (1986).
Structured Linear Algebra

Related to the topic of distance problems is the grand idea that the whole of linear algebra (solution of systems of equations, matrix factorization, etc.) can be generalized
to uncertain data. The uncertainty is described as structured perturbation on the data
and a solution of the problem is obtained by correcting the data with a correction of
the smallest size that renders the problem solvable for the corrected data. Two of the
first references on the topic of structured linear algebra are El Ghaoui and Lebret
(1997), Chandrasekaran et al. (1998).
Structured Pseudospectra
Let λ(A) be the set of eigenvalues of A ∈ C^{n×n} and let M be a set of structured matrices

M := { S(p) | p ∈ R^{n_p} },

with a given structure specification S. The ε-structured pseudospectrum (Graillat 2006; Trefethen and Embree 1999) of A is defined as the set

λ_ε(A) := { z ∈ C | z ∈ λ(Â), Â ∈ M, and ‖A − Â‖_2 ≤ ε }.
Using the structured pseudospectra, one can determine the structured distance of A to singularity as the minimum of the following optimization problem:

minimize  over Â   ‖A − Â‖_2   subject to  Â is singular and Â ∈ M.

This is a special structured low rank approximation problem for a square data matrix and rank reduction by one. Related to structured pseudospectra is the structured condition number problem for a system of linear equations, see Rump (2003).
Statistical Properties
Related to low rank approximation are the orthogonal regression (Gander et al.
1994), errors-in-variables (Gleser 1981), and measurement errors methods in the
statistical literature (Carroll et al. 1995; Cheng and Van Ness 1999). Classic pa-
pers on the univariate errors-in-variables problem are Adcock (1877, 1878), Pear-
son (1901), Koopmans (1937), Madansky (1959), York (1966). Closely related to
the errors-in-variables framework for low rank approximation is the probabilistic
principal component analysis framework of Tipping and Bishop (1999).
Reproducible Research
The reproducible research concept is at the core of all sciences. In applied fields
such as data modeling, however, algorithms’ implementation, availability of data,
and reproducibility of the results obtained by the algorithms on data are often ne-
glected. This leads to a situation, described in Buckheit and Donoho (1995) as a
scandal. See also Kovacevic (2007).
A quick and easy way of making computational results obtained in MATLAB
reproducible is to use the function publish. Better still, the code and the obtained
results can be presented in a literate programming style.
Literate Programming
The creation of the literate programming is a byproduct of the TEX project, see
Knuth (1992, 1984). The original system, called web is used for documentation of
the TEX program (Knuth 1986) and is for the Pascal language. Later a version cweb
for the C language was developed. The web and cweb systems are followed by
many other systems for literate programming that target specific languages. Unfor-
tunately this leads to numerous literate programming dialects.
The noweb system for literate programming, created by N. Ramsey in the mid
90’s, is not bound to any specific programming language and text processing sys-
tem. A tutorial introduction is given in Ramsey (1994). The noweb syntax is also
adopted in the babel part of Emacs org-mode (Dominik 2010)—a package for keep-
ing structured notes that includes support for organization and automatic evaluation
of computer code.
References
Adcock R (1877) Note on the method of least squares. The Analyst 4:183–184
Adcock R (1878) A problem in least squares. The Analyst 5:53–54
Alter O, Golub GH (2006) Singular value decomposition of genome-scale mRNA lengths distri-
bution reveals asymmetry in RNA gel electrophoresis band broadening. Proc Natl Acad Sci
103:11828–11833
Bishop C (2006) Pattern recognition and machine learning. Springer, Berlin
Botting B (2004) Structured total least squares for approximate polynomial operations. Master’s
thesis, School of Computer Science, University of Waterloo
Buckheit J, Donoho D (1995) Wavelab and reproducible research. In: Wavelets and statistics.
Springer, Berlin/New York
Byers R (1988) A bisection method for measuring the distance of a stable matrix to the unstable
matrices. SIAM J Sci Stat Comput 9(5):875–881
Carroll R, Ruppert D, Stefanski L (1995) Measurement error in nonlinear models. Chapman &
Hall/CRC, London
Chandrasekaran S, Golub G, Gu M, Sayed A (1998) Parameter estimation in the presence of
bounded data uncertainties. SIAM J Matrix Anal Appl 19:235–252
Cheng C, Van Ness JW (1999) Statistical regression with measurement error. Arnold, London
Ding C, He X (2004) K-means clustering via principal component analysis. In: Proc int conf ma-
chine learning, pp 225–232
Dominik C (2010) The org mode 7 reference manual. Network Theory Ltd. URL http://orgmode.org/
Eckart G, Young G (1936) The approximation of one matrix by another of lower rank. Psychome-
trika 1:211–218
El Ghaoui L, Lebret H (1997) Robust solutions to least-squares problems with uncertain data.
SIAM J Matrix Anal Appl 18:1035–1064
Fierro R, Jiang E (2005) Lanczos and the Riemannian SVD in information retrieval applications.
Numer Linear Algebra Appl 12:355–372
Gander W, Golub G, Strebel R (1994) Fitting of circles and ellipses: least squares solution. BIT
34:558–578
Gleser L (1981) Estimation in a multivariate “errors in variables” regression model: large sample
results. Ann Stat 9(1):24–44
Graillat S (2006) A note on structured pseudospectra. J Comput Appl Math 191:68–76
Halmos P (1985) I want to be a mathematician: an automathography. Springer, Berlin
Higham N (1989) Matrix nearness problems and applications. In: Gover M, Barnett S (eds) Appli-
cations of matrix theory. Oxford University Press, Oxford, pp 1–27
Hinrichsen D, Pritchard AJ (1986) Stability radius for structured perturbations and the algebraic
Riccati equation. Control Lett 8:105–113
Jackson J (2003) A user’s guide to principal components. Wiley, New York
Jolliffe I (2002) Principal component analysis. Springer, Berlin
Karmarkar N, Lakshman Y (1998) On approximate GCDs of univariate polynomials. J Symb Com-
put 26:653–666
Kiers H (2002) Setting up alternating least squares and iterative majorization algorithms for solving
various matrix optimization problems. Comput Stat Data Anal 41:157–170
Kim H, Park H (2007) Sparse non-negative matrix factorizations via alternating non-negativity-
constrained least squares for microarray data analysis. Bioinformatics 23:1495–1502
Knuth D (1984) Literate programming. Comput J 27(2):97–111
Knuth D (1986) Computers & typesetting, Volume B: TeX: The program. Addison-Wesley, Read-
ing
Knuth D (1992) Literate programming. Cambridge University Press, Cambridge
Koopmans T (1937) Linear regression analysis of economic time series. De Erven F Bohn
Kovacevic J (2007) How to encourage and publish reproducible research. In: Proc IEEE int conf
acoustics, speech signal proc, pp 1273–1276
Krim H, Viberg M (1996) Two decades of array signal processing research. IEEE Signal Process
Mag 13:67–94
Kumaresan R, Tufts D (1983) Estimating the angles of arrival of multiple plane waves. IEEE Trans
Aerosp Electron Syst 19(1):134–139
The smallest possible number of generators, i.e., col dim(P), such that (IMAGE0) holds, is invariant of the representation and is equal to m := dim(B)—the dimension of B. Integers, such as m and p := q − m, that are invariant of the representation characterize properties of the model (rather than of the representation) and are called model invariants. The integers m and p have a data modeling interpretation as the number of inputs and the number of outputs, respectively. Indeed, m variables are free (unassigned by the model) and the other p variables are determined by the model and the inputs. The numbers of inputs and outputs of the model B are denoted by m(B) and p(B), respectively.
The model class of linear static models with q variables, at most m of which are inputs, is denoted by L^q_{m,0}. With col dim(P) = m, the columns of P form a basis
for B. The smallest possible row dim(R), such that ker(R) = B is invariant of the
representation and is equal to the number of outputs of B. With row dim(R) = p,
the rows of R form a basis for the orthogonal complement B ⊥ of B. Therefore,
without loss of generality we can assume that P ∈ Rq×m and R ∈ Rp×q .
The transition from one model representation to another gives insight into the prop-
erties of the model. These analysis problems need to be solved before the more com-
plicated modeling problems are considered. The latter can be viewed as synthesis
problems since the model is created or synthesized from data and prior knowledge.
If the parameters R, P , and (X, Π) describe the same system B, then they are
related. We show the relations that the parameters must satisfy as well as code that
does the transition from one representation to another. Before we describe the transi-
tion among the parameters, however, we need an efficient way to store and multiply
by permutation matrices and a tolerance for computing rank numerically.
In the input/output model representation Bi/o (X, Π), the partitioning of the vari-
ables d ∈ Rq into inputs u ∈ Rm and outputs y ∈ Rp is specified by a permutation
matrix Π ,
d = Π col(u, y),   col(u, y) = Π^⊤ d,   with Π^{−1} = Π^⊤,   π := Π col(1, . . . , q).
Clearly, the vector π contains the same information as the matrix Π and one can
reconstruct Π from π by permuting the rows of the identity matrix. (In the code io
is the variable corresponding to the vector π and Pi is the variable corresponding
to the matrix Π .)
37a π → Π 37a≡ (38a)
Pi = eye(length(io)); Pi = Pi(io,:);
Permuting the elements of a vector is done more efficiently by direct reordering of
the elements, instead of a matrix-vector multiplication. If d is a variable correspond-
ing to a vector d, then d(io) corresponds to the vector Πd.
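A small numerical check of this equivalence (the example values below are chosen for illustration only):

q = 4; io = [3 1 4 2];             % an example permutation vector
Pi = eye(q); Pi = Pi(io, :);       % the corresponding permutation matrix
d  = (1:q)';                       % a test vector
norm(Pi * d - d(io))               % zero: indexing and multiplication agree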
The default value for Π is the identity matrix I, corresponding to the first m variables of d being inputs and the remaining p variables being outputs, i.e., d = col(u, y).
37b default input/output partition 37b≡ (39 40a 42a)
if ~exist('io') || isempty(io), io = 1:q; end
In case the inverse permutation Π^{−1} = Π^⊤ is needed, the corresponding “permutation vector” is

π_inv = Π^⊤ col(1, . . . , q) ∈ {1, . . . , q}^q.
In the code inv_io is the variable corresponding to the vector π_inv, and the transition from the original variables d to the partitioned variables uy = [u; y] via io and inv_io is done by the following indexing operations:

d = uy(io),   uy = d(inv_io).
i.e., by taking the tolerance to be equal to zero, the numerical rank reduces to the
theoretical rank. A nonzero tolerance ε makes the numerical rank robust to pertur-
bations of size (measured in the induced 2-norm) less than ε. Therefore, ε reflects
the size of the expected errors in the matrix. The default tolerance is set to a small value, which corresponds to numerical errors due to double precision arithmetic.
(In the code tol is the variable corresponding to ε.)
38b default tolerance tol 38b≡ (41–43 76b)
if ~exist('tol') || isempty(tol), tol = 1e-12; end
Note 2.2 The numerical rank definition (num rank) is the solution of an unstructured rank minimization problem: find a matrix Â of minimal rank, such that ‖A − Â‖₂ < ε.
The relation

ker(R) = image(P) = B ∈ L^q_{m,0}   ⟹   R P = 0      (R ↔ P)
gives a link between the parameters P and R. In particular, a minimal image rep-
resentation image(P ) = B can be obtained from a given kernel representation
ker(R) = B by computing a basis for the null space of R. Conversely, a minimal
kernel representation ker(R) = B can be obtained from a given image representa-
tion image(P ) = B by computing a basis for the left null space of P .
40b R → P 40b≡
function p = r2p(r), p = null(r);
Defines:
r2p, used in chunks 44 and 45a.
40c P → R 40c≡
function r = p2r(p), r = null(p')';
Defines:
p2r, used in chunk 45a.
The kernel and image representations obtained from xio2r, xio2p, p2r, and r2p
are minimal. In general, however, a given kernel or image representation can be non-minimal, i.e., R may have redundant rows and P may have redundant columns.
The kernel representation, defined by R, is minimal if and only if R is full row rank.
Similarly, the image representation, defined by P , is minimal if and only if P is full
column rank.
The problems of converting kernel and image representations to minimal ones
are equivalent to the problem of finding a full rank matrix that has the same kernel
or image as a given matrix. A numerically reliable way to solve this problem is to
use the singular value decomposition.
Consider a model B ∈ L^q_{m,0} with parameters R ∈ R^{g×q} and P ∈ R^{q×g} of, respectively, kernel and image representations. Let

R = U Σ V^⊤
be the singular value decomposition of R and let p be the rank of R. With the partitioning

V =: [V₁  V₂],   where V₁ ∈ R^{q×p},

we have

image(R^⊤) = image(V₁).

Therefore,

ker(R) = ker(V₁^⊤)   and   V₁^⊤ is full row rank,

so that V₁^⊤ is a parameter of a minimal kernel representation of B.
Similarly, let

P = U Σ V^⊤

be the singular value decomposition of P. With the partitioning

U =: [U₁  U₂],   where U₁ ∈ R^{q×m},

we have

image(P) = image(U₁).
Since U1 is full rank, it is a parameter of a minimal image representation of B.
In the numerical implementation, the rank is replaced by the numerical rank with
respect to a user defined tolerance tol.
41 R → minimal R 41≡
function r = minr(r, tol)
[p, q] = size(r); default tolerance tol 38b
[u, s, v] = svd(r, 'econ'); pmin = sum(diag(s) > tol);
if pmin < p, r = v(:, 1:pmin)'; end
Defines:
minr, used in chunks 42b and 43c.
Exercise 2.3 Write a function minp that implements the transition P → minimal P .
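One possible solution sketch, mirroring minr above (the function name minp and its interface are assumed here for illustration and are not taken from the book's code):

function p = minp(p, tol)
% MINP  image representation P -> minimal image representation
if ~exist('tol', 'var') || isempty(tol), tol = 1e-12; end  % default tolerance
g = size(p, 2);
[u, s, v] = svd(p, 'econ'); mmin = sum(diag(s) > tol);     % numerical rank of P
if mmin < g, p = u(:, 1:mmin); end                         % keep a basis for image(P)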
Partitioning R Π =: [R_u  R_y], with R_u ∈ R^{p×m} and R_y ∈ R^{p×p}, and Π^⊤ P =: col(P_u, P_y), with P_u ∈ R^{m×m} and P_y ∈ R^{p×m}, we have

ker( [R_u  R_y] Π^⊤ ) = B_i/o( −R_y^{−1} R_u, Π )

and

image( Π col(P_u, P_y) ) = B_i/o( P_y P_u^{−1}, Π ).
42c (R, Π) → X 42a+≡ 42b
[u, s, v] = svd(ry); s = diag(s);
if s(end) < tol
warning('Computation of X is ill conditioned.');
x = NaN;
else
x = -( v * diag(1 ./ s) * u’ * ru )’;
end
Singularity of the blocks Ry and Pu implies that the input/output representation with
a permutation matrix Π is not possible. In such cases, the function rio2x issues a
warning message and returns NaN value for X.
The function r2io uses rio2x in order to find all possible input/output parti-
tions for a model specified by a kernel representation.
43a R → Π 43a≡ 43b
function IO = r2io(r, tol)
q = size(r, 2); default tolerance tol 38b
Defines:
r2io, used in chunk 43d.
The search is exhaustive over all input/output partitionings of the variables (i.e., all
choices of m elements of the set of variable indices {1, . . . , q}), so that the compu-
tation is feasible only for a small number of variables (say less than 6).
43b R → Π 43a+≡ 43a 43c
IO = perms(1:q); nio = size(IO, 1);
The parameter X for each candidate partition is computed. If the computation of X
is ill conditioned, the corresponding partition is not consistent with the model and
is discarded.
43c R → Π 43a+≡ 43b
not_possible = []; warning_state = warning('off');
r = minr(r, tol);
for i = 1:nio
x = rio2x(r, IO(i, :), tol);
if isnan(x), not_possible = [not_possible, i]; end
end
warning(warning_state); IO(not_possible, :) = [];
Uses minr 41 and rio2x 42a.
Example 2.4 Consider the linear static model with three variables and one input

B = image( col(1, 0, 0) ) = ker( [0 1 0; 0 0 1] ).
correctly computes the input/output partitionings from the parameter R in a kernel
representation of the model
ans =
1 2 3
1 3 2
Exercise 2.5 Write functions pio2x and p2io that implement the transitions
(P , Π) → X and P → Π.
Exercise 2.6 Explain how to check that two models, specified by kernel or image
representation, are equivalent.
Figure 2.1 summarizes the links among the parameters R, P , and (X, Π) of a linear
static model B and the functions that implement the transitions.
Numerical Example
In order to test the functions for transition among the kernel, image, and input/output model representations, we choose a random linear static model, specified by a kernel representation,
44 Test model transitions 44≡ 45a
m = 2; p = 3; R = rand(p, m + p); P = r2p(R);
Uses r2p 40b.
and traverse the diagram in Fig. 2.1 clockwise and anticlockwise
45a Test model transitions 44+≡ 44 45b
R_ = xio2r(pio2x(r2p(R))); P_ = xio2p(rio2x(p2r(P)));
Uses p2r 40c, r2p 40b, rio2x 42a, xio2p 40a, and xio2r 39.
As a verification of the software, we check that after traversing the loop, equivalent
models are obtained.
45b Test model transitions 44+≡ 45a
norm(rio2x(R) - rio2x(R_)), norm(pio2x(P) - pio2x(P_))
Uses rio2x 42a.
The answers are around the machine precision, which confirms that the models are
the same.
A linear static model is a finite dimensional subspace. The dimension m of the sub-
space is equal to the number of inputs and is invariant of the model representation.
The integer constant m quantifies the model complexity: the model is more complex
when it has more inputs. The rationale for this definition of model complexity is
that inputs are “unexplained” variables by the model, so the more inputs the model
has, the less it “explains” the modeled phenomenon. In data modeling the aim is to
obtain low-complexity models, a principle generally referred to as Occam’s razor.
Note 2.7 (Computing the model complexity is a rank estimation problem) Comput-
ing the model complexity, implicitly specified by (exact) data, or by a non-minimal
kernel or image representation is a rank computation problem; see Sect. 2.3 and the
function minr.
In this book, we consider the special class of finite dimensional linear time-
invariant dynamical models. By definition, a model B is linear if it is a subspace of
the data space (Rq )T . In order to define the time-invariance property, we introduce
the shift operator σ τ . Acting on a signal w, σ τ produces a signal σ τ w, which is the
backwards shifted version of w by τ time units, i.e.,
(σ^τ w)(t) := w(t + τ),   for all t ∈ T.

The model B is time-invariant if

σ^τ B = B,   for all τ ∈ T.
The model B is finite dimensional if it is a closed subset (in the topology of point-
wise convergence). Finite dimensionality is equivalent to the property that at any
time t the future behavior of the model is deterministically independent of the past
behavior, given a finite dimensional vector, called a state of the model. Intuitively,
the state is the information (or memory) of the past that is needed in order to predict
the future. The smallest state dimension is an invariant of the system, called the
order. We denote the set of finite dimensional linear time-invariant models with q
variables and order at most n by L q,n and the order of B by n(B).
A finite dimensional linear time-invariant model B ∈ L q,n admits a representa-
tion by a difference or differential equation
R₀ w + R₁ λw + · · · + R_l λ^l w = ( R₀ + R₁ λ + · · · + R_l λ^l ) w =: R(λ) w = 0,      (DE)

where λ is the unit shift operator σ in the discrete-time case and the differential operator d/dt in the continuous-time case. Therefore, the model B is the kernel

B := ker( R(λ) ) = { w | (DE) holds }.      (KER)

The order of the model is bounded by n(B) ≤ p(B) l(B), where l(B) is the lag of B.
As in the static case, the smallest possible number of rows g of the polyno-
mial matrix R in a kernel representation (KER) of a finite dimensional linear time-
invariant system B is the invariant p(B)—number of outputs of B. Finding an
input/output partitioning for a model specified by a kernel representation amounts
to selection of a non-singular p × p submatrix of R. The resulting input/output representation is

B = B_i/o(P, Q, Π) := ker( [Q(λ)  −P(λ)] Π^⊤ ).      (I/O)
Fig. 2.2 Data, input/output model representations, and links among them
• transfer function,
B = B_i/o(H, Π) := { w = Π col(u, y) | F(y) = H(z) F(u) },      (TF)
System Realization
w = col(u, y)
Note 2.10 (Discrete-time vs. continuous-time system realization) There are some
differences between the discrete and continuous-time realization theory. Next, we
consider the discrete-time case. It turns out, however, that the discrete-time algorithms can be used for realization of continuous-time systems by applying them to the sequence of Markov parameters (H(0), (d/dt)H(0), . . .) of the system.
The sequence
H = H (0), H (1), H (2), . . . , H (t), . . . , where H (t) ∈ Rp×m
is a one-sided infinite matrix-valued time series. Acting on H, the shift operator σ removes the first sample, i.e.,
σ H = H (1), H (2), . . . , H (t), . . . .
H(i + j − 1) = C A^{i+j−2} B = C A^{i−1} A^{j−1} B.
Let
𝒪_t(A, C) := col( C, CA, . . . , CA^{t−1} )      (O)
be the extended observability matrix of the pair (A, C) and
𝒞_t(A, B) := [ B  AB  · · ·  A^{t−1}B ]      (C)
be the extended controllability matrix of the pair (A, B). With O(A, C) and
C (A, B) being the infinite observability and controllability matrices, we have
(⇐=) In this direction, the proof is constructive and results in an algorithm for
computation of the minimal realization of H in L^{q,n}_m, where n = rank(𝓗(σH)).
A realization algorithm is presented in Sect. 3.1.
This suggests a method to find the order n(B) of the minimal realization of H: compute the rank of the finite Hankel matrix 𝓗_{i,j}(σH), where n_max := min(pi, mj) is an upper bound on the order. Algorithms for computing the order and parameters of the minimal realization are presented in Chap. 3.
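As a small numerical illustration of this rank test for exact, scalar data (a sketch: it relies on the standard Control System Toolbox functions drss and impulse and is not part of the book's code):

n0 = 3; sys0 = drss(n0);            % random discrete-time system of order 3
h  = impulse(sys0, 25);             % samples of the impulse response H(0), H(1), ...
sH = h(2:end); T = length(sH);      % sigma H: drop the first sample H(0)
i  = ceil(T / 2);
H  = hankel(sH(1:i), sH(i:T));      % finite Hankel matrix H_{i,j}(sigma H), j = T - i + 1
rank(H)                             % equals the order n0 for exact data (generically)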
In order to treat static, dynamic, linear, and nonlinear modeling problems with uni-
fied terminology and notation, we need an abstract setting that is general enough to
accommodate all envisaged applications. Such a setting is described in this section.
The data D and a model B for the data are subsets of a universal set U of pos-
sible observations. In static modeling problems, U is a real q-dimensional vector
space Rq , i.e., the observations are real valued vectors. In dynamic modeling prob-
lems, U is a function space (Rq )T , with T being Z in the discrete-time case and R
in the continuous-time case.
Note 2.12 (Categorical data and finite automata) In modeling problems with cate-
gorical data and finite automata, the universal set U is discrete and may be finite.
D = {wd,1 , . . . , wd,N } ⊂ U .
In dynamic problems, the data D often consists of a single trajectory wd,1 , in which
case the subscript index 1 is skipped and D is identified with wd . In static modeling
problems, an observation wd,j is a vector and the alternative notation dj = wd,j is
used in order to emphasize the fact that the observations do not depend on time.
Note 2.13 (Given data vs. general trajectory) In order to distinguish a general tra-
jectory w of the system from the given data wd (a specific trajectory) we use the
subscript “d” in the notation of the given data.
Complexities are compared in this book by the lexicographic ordering, i.e., two
complexities are compared by comparing their corresponding elements in the in-
creasing order of the indices. The first time an index is larger, the corresponding
complexity is declared larger. For linear time-invariant dynamic models, this con-
vention and the ordering of the elements in c(B) imply that a model with more
inputs is always more complex than a model with fewer inputs, irrespective of their
orders.
The complexity c of a model class M is the largest complexity of a model in
the model class. Of interest is the restriction of the generic model classes M to
subclasses M_{c_max} of models with bounded complexity, e.g., L^q_{m,l_max}, with m < q.
Problem 2.14 (Exact data modeling) Given data D ⊂ U and a model class M_{c_max} ∈ 2^U, find a model B̂ in M_{c_max} that contains the data and has minimal (in the lexicographic ordering) complexity, or assert that such a model does not exist, i.e.,
(Existence of exact model) Under what conditions on the data D and the
model class Mcmax does a solution to problem (EM) exist?
If a solution exists, it is unique. This unique solution is called the most pow-
erful unfalsified model for the data D in the model class Mcmax and is denoted
by Bmpum (D). (The model class Mcmax is not a part of the notation Bmpum (D) and
is understood from the context.)
Suppose that the data D are generated by a model B0 in the model class Mcmax ,
i.e.,
D ⊂ B0 ∈ Mcmax .
Then, the exact modeling problem has a solution in the model class Mcmax , however,
the solution B_mpum(D) may not be the data generating model B₀. The question arises:
Example 2.15 (Exact data fitting by a linear static model) Existence of a linear static
model B of bounded complexity m for the data D is equivalent to rank deficiency
of the matrix
Φ(D) := [ d₁  · · ·  d_N ] ∈ R^{q×N},
composed of the data. (Show this.) Moreover, the rank of the matrix Φ(D) is equal to the minimal dimension of an exact model for D:

existence of B̂ ∈ L^q_{m,0}, such that D ⊂ B̂   ⟺   rank( Φ(D) ) ≤ m.      (∗)
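A small numerical check of the test (∗), with exact data generated by a random model with m = 1 (the variable names below are chosen for this illustration only):

q = 3; N = 10; m = 1;
P = rand(q, m);                      % image representation of a random model B0
D = P * rand(m, N);                  % N exact observations d_j, contained in B0
rank(D) <= m                         % true: an exact model of complexity m exists
rank(D + 1e-3 * randn(q, N)) <= m    % generically false after a perturbation of the data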
When an exact model does not exist in the considered model class, an approximate
model that is in some sense “close” to the data is aimed at instead. Closeness is
measured by a suitably defined criterion. This leads to the following approximate
data modeling problem.
The algebraic distance depends on the choice of the parameter R in a kernel rep-
resentation of the model, while the geometric distance is representation invariant. In
addition, the algebraic distance is not invariant to a rigid transformation. However,
a modification of the algebraic distance that is invariant to a rigid transformation is
presented in Sect. 6.3.
Example 2.18 (Geometric distance for linear and quadratic models) The two plots
in Fig. 2.3 illustrate the geometric distance (dist) from a set of eight data points
D = { d_i = (x_i, y_i) | i = 1, . . . , 8 }
in the plane to, respectively, linear B1 and quadratic B2 models. As its name sug-
gests, dist(D, B) has geometric interpretation—in order to compute the geometric
distance, we project the data points on the models. This is a simple task (linear least
squares problem) for linear models but a nontrivial task (nonconvex optimization
problem) for nonlinear models. In contrast, the algebraic “distance” (not visualized
in the figure) has no simple geometrical interpretation but is easy to compute for
linear and nonlinear models alike.
Fig. 2.3 Geometric distance from eight data points to a linear (left) and quadratic (right) models
Note 2.19 (Approximate modeling in the case of exact data) If an exact model B
exists in the model class Mcmax , then B is a global optimum point of the approxi-
mate modeling problem (AM) (irrespective of the approximation criterion f being
used). Indeed,
(AM) has a nonunique solution. In the next section, we present a more general
approximate data modeling problem formulation that minimizes simultaneously the
complexity as well as the fitting error.
The terminology “geometric” and “algebraic” distance comes from the computer
vision application of the methods for fitting curves and surfaces to data. In the sys-
tem identification community, the geometric fitting method is related to the misfit
approach and the algebraic fitting criterion is related to the latency approach. Misfit
and latency computation are data smoothing operations. For linear time-invariant
systems, the misfit and latency can be computed efficiently by Riccati type recur-
sions. In the statistics literature, the geometric fitting is related to errors-in-variable
estimation and the algebraic fitting is related to classical regression estimation, see
Table 2.2.
w_{d,j} = w_{0,j} + w̃_j,      (EIV)
Table 2.2 Correspondence among terms for data fitting criteria in different fields
Computer vision System identification Statistics Mathematics
where
D0 := {w0,1 , . . . , w0,N } ⊂ B0
is the true data and
D̃ := { w̃₁, . . . , w̃_N }
is the measurement noise, which is assumed to be a set of independent, zero mean,
Gaussian random vectors, with covariance matrix σ 2 I .
Example 2.21 (Algebraic fit by a linear model and regression) A linear model
class, defined by the input/output representation Bi/o (Θ) and algebraic fitting crite-
rion (dist ), where
The statistical setting for the least squares approximation problem (LS) is the clas-
sical regression model
R(wd,j ) = ej , (REG)
where e1 , . . . , eN are zero mean independent and identically distributed random
variables. Gauss–Markov’s theorem states that the least squares approximate so-
lution is the best linear unbiased estimator for the regression model (REG).
Complexity–Accuracy Trade-off
Data modeling is a mapping from a given data set D , to a model B in a given model
class M :
data modeling problem
data set D ⊂ U −−−−−−−−−−−−−→ model B ∈ M ∈ 2U .
A data modeling problem is defined by specifying the model class M and one or
more modeling criteria. Basic criteria in any data modeling problem are:
Given a data set D ⊂ U and a measure F for the fitting error, solve the multi-objective optimization problem:

minimize   over B̂ ∈ M   col( c(B̂), F(D, B̂) ).      (DM)
Next we consider the special cases of linear static models and linear time-
invariant dynamic models with F (D, B) being dist(D, B) or dist (D, B). The
model class assumption implies that dim(B) is a complexity measure in both static
and dynamic cases.
The data set D can be parametrized by a real vector p ∈ R^{n_p}. (Think of the vector p as a representation of the data in the computer memory.) For a linear model B and exact data D, there is a relation between the model complexity and the rank of a data matrix S(p):

c( B_mpum(D) ) = rank( S(p) ).      (∗)
The mapping
S : Rnp → Rm×n
from the data parameter vector p to the data matrix S (p) depends on the applica-
tion. For example, S (p) = Φ(D) is unstructured in the case of linear static mod-
eling (see Example 2.15) and S (p) = H (wd ) is Hankel structured in the case of
autonomous linear time-invariant dynamic model identification,
Let p be the parameter vector for the data D and p̂ be the parameter vector for the data approximation D̂. The geometric distance dist(D, B) can be expressed in terms of the parameter vectors p and p̂ as

minimize   over p̂   ‖p − p̂‖₂
subject to   D̂ ⊂ B.
Moreover, the norm in the parameter space Rnp can be chosen as weighted 1-, 2-,
and ∞-(semi)norms:
‖p̂‖_{w,1} := ‖w ⊙ p̂‖₁ = Σ_{i=1}^{n_p} |w_i p̂_i|,

‖p̂‖_{w,2} := ‖w ⊙ p̂‖₂ = √( Σ_{i=1}^{n_p} (w_i p̂_i)² ),      (‖·‖_w)

‖p̂‖_{w,∞} := ‖w ⊙ p̂‖_∞ = max_{i=1,...,n_p} |w_i p̂_i|,

where w is a vector with nonnegative elements, specifying the weights, and ⊙ is the element-wise (Hadamard) product.
Using the data parametrization (∗) and one of the distance measures (‖·‖_w), the data modeling problem (DM) becomes the biobjective matrix approximation problem:

minimize   over p̂   col( rank( S(p̂) ), ‖p − p̂‖ ).      (DM’)
Two possible ways to scalarize the biobjective problem (DM’) are:
1. Misfit minimization subject to a bound r on the model complexity
minimize   over p̂   ‖p − p̂‖   subject to   rank( S(p̂) ) ≤ r.      (SLRA)
to applications. The low rank approximation view of the problem provides a link to computational algorithms for solving the problem.

The matrix D̂∗ is an optimal rank-m (or lower) approximation of D with respect to the given norm ‖·‖.
The special case of (LRA) with ‖·‖ being the weighted 2-norm

‖D‖_W := √( vec^⊤(D) W vec(D) ),   for all D,      (‖·‖_W)

where W ∈ R^{qN×qN} is a positive definite matrix, is called the weighted low rank approximation problem. In turn, special cases of the weighted low rank approximation
problem are obtained when the weight matrix W has diagonal, block diagonal, or
some other structure.
• Element-wise weighting: W is diagonal, W = diag(w₁, . . . , w_{qN}), with w_i > 0.
• Column-wise weighting: W = diag(W₁, . . . , W_N), where W_j ∈ R^{q×q}, W_j > 0, for j = 1, . . . , N.
• Row-wise weighting: W̄ = diag(W̄₁, . . . , W̄_q), where W̄_i ∈ R^{N×N}, W̄_i > 0, for i = 1, . . . , q, and W is a matrix, such that ‖D‖_W = ‖D^⊤‖_{W̄}, for all D.
Figure 2.4 shows the hierarchy of weighted low rank approximation problems, ac-
cording to the structure of the weight matrix W . Exploiting the structure of the
weight matrix allows more efficient solution of the corresponding weighted low rank
approximation problems compared to the general problem with unstructured W .
As shown in the next section, left/right weighting with equal weight matrix for all rows/columns corresponds to the approximation criteria ‖√W_l D‖_F and ‖D √W_r‖_F. The approximation problem with criterion ‖√W_l D √W_r‖_F is called two-sided weighted and is also known as the generalized low rank approximation problem. This latter problem allows an analytic solution in terms of the singular value decomposition of the data matrix.
An extreme special case of the weighted low rank approximation problem is the “unweighted” case, i.e., weight matrix a multiple of the identity, W = v^{−1} I, for some v > 0. Then ‖·‖_W is proportional to the Frobenius norm ‖·‖_F and the low rank approximation problem has an analytic solution in terms of the singular value decomposition of D. The result is known as the Eckart–Young–Mirsky theorem or the matrix approximation lemma. In view of its importance, we refer to this case as the basic low rank approximation problem.
D = U Σ V^⊤

and partition

U =: [U₁  U₂],   Σ =: [Σ₁  0; 0  Σ₂],   V =: [V₁  V₂],
where U₁ ∈ R^{q×m}, Σ₁ ∈ R^{m×m}, and V₁ ∈ R^{N×m}.

Then the rank-m matrix, obtained from the truncated singular value decomposition,

D̂∗ = U₁ Σ₁ V₁^⊤,

is such that

‖D − D̂∗‖_F = min_{rank(D̂) ≤ m} ‖D − D̂‖_F = √( σ²_{m+1} + · · · + σ²_q ).
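A minimal sketch of this computation via the truncated singular value decomposition (the book's function lra, defined in chunk 64, serves the same purpose; the example data below are arbitrary):

D = rand(5, 10); m = 2;                             % example data matrix and rank
[U, S, V] = svd(D);
Dh = U(:, 1:m) * S(1:m, 1:m) * V(:, 1:m)';          % optimal rank-m approximation
sv = svd(D);
norm(D - Dh, 'fro') - sqrt(sum(sv(m+1:end) .^ 2))   % approximately zero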
Note 2.24 (Unitarily invariant norms) Theorem 2.23 holds for any norm · that is
invariant under orthogonal transformations, i.e., satisfying the relation
Fig. 2.4 Hierarchy of weighted low rank approximation problems according to the structure of the
weight matrix W . On the left side are weighted low rank approximation problems with row-wise
weighting and on the right side are weighted low rank approximation problems with column-wise
weighting. The generality of the problem reduces from top to bottom
Note 2.25 (Approximation in the spectral norm) For a matrix D, let ‖D‖₂ be the spectral (2-norm induced) matrix norm, ‖D‖₂ = σ_max(D).
Then

min_{rank(D̂) = m} ‖D − D̂‖₂ = σ_{m+1},
i.e., the optimal rank-m spectral norm approximation error is equal to the first ne-
glected singular value. The truncated singular value decomposition yields an opti-
mal approximation with respect to the spectral norm, however, in this case a mini-
mizer is not unique even when the singular values σm and σm+1 are different.
Corollary 2.26 An optimal in the Frobenius norm approximate model for the data D in the model class L^q_{m,0}, i.e., B̂∗ := B_mpum(D̂∗), is unique if and only if the singular values σ_m and σ_{m+1} of D are different, in which case

B̂∗ = ker( U₂^⊤ ) = image( U₁ ).
Corollary 2.27 (Nested approximations) The optimal in the Frobenius norm approximate models B̂∗_m for the data D in the model classes L^q_{m,0}, where m = 1, . . . , q, are nested, i.e.,

B̂∗_q ⊆ B̂∗_{q−1} ⊆ · · · ⊆ B̂∗_1.
and the singular value decomposition of R₁ is a more efficient alternative for finding B̂ than computing the singular value decomposition of D.
and
(A1 ⊗ B1 )(A2 ⊗ B2 ) = (A1 A2 ) ⊗ (B1 B2 ),
we have

‖D − D̂‖_{W_r ⊗ W_l} = √( vec^⊤(ΔD) (W_r ⊗ W_l) vec(ΔD) )
                    = ‖ (√W_r ⊗ √W_l) vec(ΔD) ‖₂
                    = ‖ vec( √W_l ΔD √W_r ) ‖₂
                    = ‖ √W_l ΔD √W_r ‖_F,   where ΔD := D − D̂.
Therefore, the low rank approximation problem (LRA) with norm (‖·‖_W) and weight matrix W = W_r ⊗ W_l is equivalent to the two-sided weighted (or generalized) low rank approximation problem

minimize   over D̂   ‖ √W_l (D − D̂) √W_r ‖_F
subject to   rank(D̂) ≤ m.      (WLRA2)
Theorem 2.29 (Two-sided weighted low rank approximation) Define the modified data matrix

D_m := √W_l D √W_r,
and let D̂∗_m be the optimal (unweighted) low rank approximation of D_m. Then

D̂∗ := ( √W_l )^{−1} D̂∗_m ( √W_r )^{−1}

is a solution of the two-sided weighted low rank approximation problem (WLRA2).
Exercise 2.30 Using the result in Theorem 2.29, write a function that solves the
two-sided weighted low rank approximation problem (WLRA2).
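One possible sketch along these lines, assuming positive definite weight matrices Wl and Wr (the function name wlra2 and its interface are chosen here for illustration; this is not the book's code):

function dh = wlra2(d, m, wl, wr)
% WLRA2  two-sided weighted low rank approximation via Theorem 2.29
sql = sqrtm(wl); sqr = sqrtm(wr);               % matrix square roots of the weights
dm  = sql * d * sqr;                            % modified data matrix D_m
[u, s, v] = svd(dm);                            % unweighted low rank approximation of D_m
dmh = u(:, 1:m) * s(1:m, 1:m) * v(:, 1:m)';
dh  = sql \ dmh / sqr;                          % map the solution back to the original problem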
The following problem is the approximate modeling problem (AM) for the model class of linear static models, i.e., M_{c_max} = L^q_{m,0}, with the orthogonal distance approximation criterion, i.e., f(D, B) = dist(D, B). The norm ‖·‖ in the definition of dist, however, in the present context is a general vector norm, rather than the 2-norm.
{d1 , . . . , dN } ⊂ Rq ,
and D̃ is the measurement error that is assumed to be a random matrix with zero
mean and normal distribution. The true matrix D0 is “generated” by a model B0 ,
with a known complexity bound m. The model B0 is the object to be estimated in
the errors-in-variables setting.
Note, however, that σ² is not given, so that the probability density function of D̃ is not completely specified. Proposition 2.32 shows that the problem of computing the maximum likelihood estimator in the errors-in-variables setting is equivalent to Problem 2.22 with the weighted norm ‖·‖_W. Maximum likelihood estimation for
density functions other than normal leads to low rank approximation with norms
other than the weighted 2-norm.
We showed that some weighted unstructured low rank approximation problems have
global analytic solution in terms of the singular value decomposition. Similar result
exists for circulant structured low rank approximation. If the approximation crite-
rion is a unitarily invariant matrix norm, the unstructured low rank approximation
(obtained for example from the truncated singular value decomposition) is unique.
In the case of a circulant structure, it turns out that this unique minimizer also has
circulant structure, so the structure constraint is satisfied without explicitly enforc-
ing it in the approximation problem.
An efficient computational way of obtaining the circulant structured low rank
approximation is the fast Fourier transform. Consider the scalar case and let
P_k := Σ_{j=1}^{n_p} p_j e^{−i 2π k j / n_p}

be the discrete Fourier transform of the structure parameter vector p. Let K be a set of indices of the dominant (largest in absolute value) Fourier coefficients, i.e.,

k ∈ K and k′ ∉ K   ⟹   |P_k| > |P_{k′}|.
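A sketch of the resulting fast-Fourier-transform computation (it assumes, consistently with the references cited in Sect. 2.6, that the optimal circulant approximation of rank r is obtained by keeping the r dominant Fourier coefficients and setting the others to zero; for real p the dominant coefficients generically come in conjugate pairs, so the result is real provided r keeps such pairs together):

% p : structure parameter vector of a (scalar) circulant matrix S(p),  r : desired rank
P  = fft(p);                                      % discrete Fourier transform of p
[~, ind] = sort(abs(P), 'descend');               % order the coefficients by magnitude
Ph = zeros(size(P)); Ph(ind(1:r)) = P(ind(1:r));  % keep the r dominant coefficients
ph = real(ifft(Ph));                              % parameter of the circulant approximation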
The reason to consider the more general structured low rank approximation is that
D = S (p) being low rank and Hankel structured is equivalent to p being generated
by a linear time-invariant dynamic model. To show this, consider first the special
case of a scalar Hankel structure
H_{l+1}(p) := [ p₁        p₂        · · ·   p_{n_p−l}
                p₂        p₃        · · ·   p_{n_p−l+1}
                ⋮         ⋮                 ⋮
                p_{l+1}   p_{l+2}   · · ·   p_{n_p} ].
being rank deficient implies that there is a nonzero vector R = [R0 R1 · · · Rl ], such
that
R H_{l+1}(p̂) = 0.

Due to the Hankel structure, this system of equations can be written as

R₀ p̂_t + R₁ p̂_{t+1} + · · · + R_l p̂_{t+l} = 0,   for t = 1, . . . , n_p − l,
i.e., a homogeneous constant coefficients difference equation. Therefore, p̂ is a trajectory of an autonomous linear time-invariant system, defined by (KER). Recall that for an autonomous system B,
The scalar Hankel low rank approximation problem is then equivalent to the following dynamic modeling problem. Given T samples of a scalar signal w_d ∈ R^T, a signal norm ‖·‖, and a model complexity n,

minimize   over B̂ and ŵ   ‖w_d − ŵ‖
subject to   ŵ ∈ B̂|_{[1,T]} and dim(B̂) ≤ n.      (AM L_{0,l})

In the general, multivariable case, the corresponding modeling problem is

minimize   over B̂ and ŵ   ‖w_d − ŵ‖
subject to   ŵ ∈ B̂|_{[1,T]} and B̂ ∈ L^q_{m,l}.      (AM L_{m,l})
of the optimization problem, used for Problem SLRA, so that a representation of the
model is actually obtained directly from the optimization solver.
Similarly to the static modeling problem, the dynamic modeling problem has a
maximum likelihood interpretation in the errors-in-variables setting.
2.6 Notes
The concept of the most powerful unfalsified model is introduced in Willems (1986,
Definition 4). See also Antoulas and Willems (1993), Kuijper and Willems (1997),
Kuijper (1997) and Willems (1997). Kung’s method for approximate system real-
ization is presented in Kung (1978).
Modeling by the orthogonal distance fitting criterion (misfit approach) is initi-
ated in Willems (1987) and further on developed in Roorda and Heij (1995), Ro-
orda (1995a, 1995b), Markovsky et al. (2005b), where algorithms for solving the
problems are developed. A proposal for combination of misfit and latency for linear
time-invariant system identification is made in Lemmerling and De Moor (2001).
Weighted low rank approximation methods are developed in Gabriel and Zamir
(1979), De Moor (1993), Wentzell et al. (1997), Markovsky et al. (2005a), Manton
et al. (2003), Srebro (2004), Markovsky and Van Huffel (2007). The analytic so-
lution of circulant structured low rank approximation problem is derived indepen-
dently in the optimization community (Beck and Ben-Tal 2006) and in the systems
and control community (Vanluyten et al. 2005).
From this algorithmic point of view, the equivalence of principal component analysis and the low rank approximation problem is a basic linear algebra fact: the space spanned by the first m principal vectors of D coincides with the model B̂ = image(Φ(D̂)), where D̂ is a solution of the low rank approximation problem (LRA).
References
Abelson H, diSessa A (1986) Turtle geometry. MIT Press, New York
Antoulas A, Willems JC (1993) A behavioral approach to linear exact modeling. IEEE Trans Au-
tomat Control 38(12):1776–1802
Beck A, Ben-Tal A (2006) A global solution for the structured total least squares problem with
block circulant matrices. SIAM J Matrix Anal Appl 27(1):238–255
De Moor B (1993) Structured total least squares and L2 approximation problems. Linear Algebra
Appl 188–189:163–207
Gabriel K, Zamir S (1979) Lower rank approximation of matrices by least squares with any choice
of weights. Technometrics 21:489–498
Kuijper M (1997) An algorithm for constructing a minimal partial realization in the multivariable
case. Control Lett 31(4):225–233
Kuijper M, Willems JC (1997) On constructing a shortest linear recurrence relation. IEEE Trans
Automat Control 42(11):1554–1558
Kung S (1978) A new identification method and model reduction algorithm via singular value
decomposition. In: Proc 12th asilomar conf circuits, systems, computers, Pacific Grove, pp
705–714
Lemmerling P, De Moor B (2001) Misfit versus latency. Automatica 37:2057–2067
Manton J, Mahony R, Hua Y (2003) The geometry of weighted low-rank approximations. IEEE
Trans Signal Process 51(2):500–514
Markovsky I, Van Huffel S (2007) Left vs right representations for solving weighted low rank
approximation problems. Linear Algebra Appl 422:540–552
Markovsky I, Rastello ML, Premoli A, Kukush A, Van Huffel S (2005a) The element-wise
weighted total least squares problem. Comput Stat Data Anal 50(1):181–209
Markovsky I, Willems JC, Van Huffel S, De Moor B, Pintelon R (2005b) Application of structured
total least squares for system identification and model reduction. IEEE Trans Automat Control
50(10):1490–1500
Roorda B (1995a) Algorithms for global total least squares modelling of finite multivariable time
series. Automatica 31(3):391–404
Roorda B (1995b) Global total least squares—a method for the construction of open approximate
models from vector time series. PhD thesis, Tinbergen Institute
Roorda B, Heij C (1995) Global total least squares modeling of multivariate time series. IEEE
Trans Automat Control 40(1):50–63
Srebro N (2004) Learning with matrix factorizations. PhD thesis, MIT
Vanluyten B, Willems JC, De Moor B (2005) Model reduction of systems with symmetries. In:
Proc 44th IEEE conf dec control, Seville, Spain, pp 826–831
Wentzell P, Andrews D, Hamilton D, Faber K, Kowalski B (1997) Maximum likelihood principal
component analysis. J Chemom 11:339–366
Willems JC (1986) From time series to linear system—Part II. Exact modelling. Automatica
22(6):675–694
Willems JC (1987) From time series to linear system—Part III. Approximate modelling. Automat-
ica 23(1):87–115
Willems JC (1997) On interconnections, control, and feedback. IEEE Trans Automat Control
42:326–339
Chapter 3
Algorithms
Realization Algorithms
𝓗_{n+1,n+1}(σH) = Γ Δ = (Γ T)(T^{−1} Δ) =: Γ̄ Δ̄
is obtained with the same inner dimension. The nonuniqueness of the factoriza-
tion corresponds to the nonuniqueness of the input/state/output representation of
the minimal realization due to a change of the state space bases:
B_i/s/o(A, B, C, D) = B_i/s/o( T^{−1}AT, T^{−1}B, CT, D ).
(σ^{−1}Γ) A = σΓ,      (SE₁)
where, acting on a block matrix, σ and σ −1 remove, respectively, the first and the
last block elements.
74b (Γ, Δ) → (A, B, C) 74a+≡ (76d) 74a
a = O(1:end - p, :) \ O((p + 1):end, :);
Note 3.1 (Solution of the shift equation) When a unique solution exists, the code in
chunk 74a computes the exact solution. When a solution A of (SE1 ) does not exist,
the same code computes a least squares approximate solution.
Equivalently (in the case of exact data), A can be computed from the Δ factor
A (σ^{−1}Δ) = σΔ.      (SE₂)
In the case of noisy data (approximate realization problem) or data from a high
order system (model reduction problem), (SE1 ) and (SE2 ) generically have no exact
solutions and their least squares approximate solutions are different.
Implementation
An as-square-as-possible Hankel matrix 𝓗_{i,j}(σH) is formed, using all data points, i.e.,

i = ⌈ Tm / (m + p) ⌉   and   j = T − i.      (i, j)
75a dimension of the Hankel matrix 75a≡ (76d 81)
if ~exist('i', 'var') | ~isreal(i) | isempty(i)
i = ceil(T * m / (m + p));
end
if ~exist('j', 'var') | ~isreal(j) | isempty(j)
j = T - i;
elseif j > T - i
error('Not enough data.')
end
The choice (i, j) for the dimension of the Hankel matrix maximizes the order of the
realization that can be computed. Indeed, a realization of order n can be computed
from the matrix Hi,j (σ H ) provided
n ≤ nmax := min(pi − 1, mj ).
75b check n < min(pi − 1, mj ) 75b≡ (76d)
if n > min(i * p - 1, j * m), error('Not enough data'), end
The minimal number of samples T of the impulse response that allows identification
of a system of order n is
T_min := ⌈ n/p ⌉ + ⌈ n/m ⌉ + 1.      (T_min)
The key computational step of the realization algorithm is the factorization of the
Hankel matrix. In particular, this step involves rank determination. In finite preci-
sion arithmetic, however, rank determination is a nontrivial problem. A numerically
reliable way of computing rank is the singular value decomposition
𝓗_{i,j}(σH) = U Σ V^⊤.
With the partitioning

U =: [U₁  U₂],   Σ =: [Σ₁  0; 0  Σ₂],   and   V =: [V₁  V₂],
where U₁ and V₁ have n columns and Σ₁ ∈ R^{n×n},
the factors Γ and Δ of the rank revealing factorization are chosen as follows
Γ := U₁ √Σ₁   and   Δ := √Σ₁ V₁^⊤.      (Γ, Δ)
76c define Δ and Γ 76c≡ (76d)
sqrt_s = sqrt(s(1:n))’;
O = sqrt_s(ones(size(U, 1), 1), :) .* U(:, 1:n);
C = (sqrt_s(ones(size(V, 1), 1), :) .* V(:, 1:n))’;
This choice leads to a finite-time balanced realization of B_i/s/o(A, B, C, D), i.e., the finite-time observability and controllability Gramians

𝒪_i^⊤(A, C) 𝒪_i(A, C) = Γ^⊤Γ   and   𝒞_j(A, B) 𝒞_j^⊤(A, B) = ΔΔ^⊤

are equal,

Γ^⊤Γ = ΔΔ^⊤ = Σ₁.
Note 3.2 (Kung’s algorithm) The combination of the described realization al-
gorithm with the singular value decomposition-based rank revealing factoriza-
tion (Γ, Δ), i.e., unstructured low rank approximation, is referred to as Kung’s al-
gorithm.
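As an illustration of Kung's algorithm on exact data (a sketch: it assumes the Control System Toolbox functions drss and impulse and the book's function h2ss, defined in chunk 76d):

n = 3; sys0 = drss(n);                 % random discrete-time system of order 3
h = impulse(sys0, 25);                 % samples of its impulse response
sysh = h2ss(h, n);                     % realization by Kung's algorithm
norm(impulse(sysh, 25) - h)            % approximately zero for exact data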
h(:, :, t) = H (t),
Note 3.4 (Suboptimality of Kung’s algorithm) Used as a method for Hankel struc-
tured low rank approximation, Kung’s algorithm is suboptimal. The reason for this
is that the factorization
Hi,j (σ H ) ≈ Γ Δ,
performed by the singular value decomposition is unstructured low rank approxi-
mation and unless the data are exact, Γ and Δ are not extended observability and
controllability matrices, respectively. As a result, the shift equations (SE1 ) and (SE2 )
do not have solutions and Kung’s algorithm computes an approximate solution in
the least squares sense.
Note 3.5 (Unstructured vs. structure enforcing methods) The two levels of approx-
imation:
In the realization problem the given data are a special trajectory—the impulse re-
sponse of the model. Therefore, the realization problem is a special exact identifica-
tion problem. In this section, the general exact identification problem:
w_d  −−exact identification−→  B_mpum(w_d)

is considered. It is solved in two steps:

w_d  −−impulse response computation (w2h)−→  H_mpum  −−realization (h2ss)−→  B_mpum(w_d).      (w_d → H → B)
First, the impulse response Hmpum of the most powerful unfalsified model
Bmpum (wd ) is computed from the given general trajectory wd . Then, an in-
put/state/output representation of Bmpum (wd ) is computed from Hmpum by a re-
alization algorithm, e.g., Kung’s algorithm.
The key observation in finding an algorithm for the computation of the impulse
response is that the image of the Hankel matrix Hi,j (wd ), with j > qi, constructed
from the data wd is the restriction Bmpum (wd )|[1,i] of the most powerful unfalsified
model on the interval [1, i], i.e.,
B_mpum(w_d)|_{[1,i]} = span( 𝓗_i(w_d) ).      (DD)
w = Hi (wd )g
Hf,y gk = hk .
Choosing the least norm solution as a particular solution and using matrix notation,
the first i samples of the impulse response are given by
H = 𝓗_{f,y} [ 𝓗_p ; 𝓗_{f,u} ]⁺ col( 0_{ql×m}, I_m, 0_{(i−1)m×m} ),
approximation. Empirical results suggest that the default choice (i, j ) gives best ap-
proximation.
81 Most powerful unfalsified model in Lmq,n 80c+≡ 80d
T = Th; dimension of the Hankel matrix 75a
sys = h2ss(w2h(w, m, n, Th), n, [], i, j);
Uses h2ss 76d and w2h 80b.
As discussed in Sect. 1.4, different methods for solving the problem are obtained by
choosing different combinations of
• rank parametrization, and
• optimization method.
In this section, we choose the kernel representation for the rank constraint
rank( S(p̂) ) ≤ r   ⟺   there is R ∈ R^{(m−r)×m}, such that R S(p̂) = 0 and R R^⊤ = I_{m−r},      (rank_R)
and the variable projections approach (in combination with standard methods for
nonlinear least squares optimization) for solving the resulting parameter optimiza-
tion problem.
The developed method is applicable for the general affinely structured and
weighted low rank approximation problem (SLRA). The price paid for the
generality, however, is lack of efficiency compared to specialized methods
exploiting the structure of the data matrix S (p) and the weight matrix W .
The inner minimization (computation of f (R)) is over the correction p and the outer
minimization is over the model parameter R ∈ R(m−r)×m . The inner minimization
problem can be given the interpretation of projecting the columns of S (p) onto the
model B := ker(R), for a given matrix R. However, the projection depends on the
parameter R, which is the variable in the outer minimization problem.
For affine structures S, the constraint R S(p̂) = 0 is bilinear in the optimization variables R and p̂. Then, the evaluation of the cost function f for the outer minimization problem is a linear least norm problem.
mization problem is a linear least norm problem. Direct solution has computational
complexity O(np3 ), where np is the number of structure parameters. Exploiting the
structure of the problem (inherited from S ), results in computational methods with
cost O(np2 ) or O(np ), depending on the type of structure. For a class of structures,
which includes block Hankel, block Toeplitz, and block Sylvester ones, efficient
O(np ) cost function evaluation can be done by Cholesky factorization of a block-
Toeplitz banded matrix.
Structure Specification
82 (S0, S, p̂) → D̂ = S(p̂) 82≡
dh = s0 + reshape(bfs * ph, m, n);
Note 3.7 In many applications the matrices Sk are sparse, so that, for efficiency,
they can be stored and manipulated as sparse matrices.
In (S), each element of the structured matrix S (p) is equal to the corresponding
element of the matrix S0 or to the Sij th element of the parameter vector p. The
structure is then specified by the matrices S0 and S. Although (S) is a special case
of the general affine structure (S (
p )), it covers all linear modeling problems con-
sidered in this book and will therefore be used in the implementation of the solution
method.
In the implementation of the algorithm, the matrix S corresponds to a vari-
able tts and the extended parameter vector pext corresponds to a variable pext.
Since in MATLAB indices are positive integers (a zero index is not allowed), in all indexing operations of pext, the index is incremented by one. Given the matrices S0 and S, specifying the structure, and a structure parameter vector p̂, the structured matrix S(p̂) is constructed by
83a (S0, S, p̂) → D̂ = S(p̂) 83a≡ (85d 98a)
phext = [0; ph(:)]; dh = s0 + phext(tts + 1);
The matrix dimensions m, n, and the number of parameters np are obtained from S
as follows:
83b S → (m, n, np ) 83b≡ (84c 85e 97 98c)
[m, n] = size(tts); np = max(max(tts));
The transition from the specification of (S) to the specification in the general affine
case (S (p )) is done by
83c S → S 83c≡ (84c 85e 98c)
vec_tts = tts(:); NP = 1:np;
bfs = vec_tts(:, ones(1, np)) == NP(ones(m * n, 1), :);
Conversely, for a linear structure of the type (S), defined by S (and m, n), the ma-
trix S is constructed by
83d S → S 83d≡
tts = reshape(bfs * (1:np)’, m, n);
In most applications that we consider, the structure S is linear, so that s0 is an
optional input argument to the solvers with default value the zero matrix.
83e default s0 83e≡ (84c 85e 97 98c)
if ~exist('s0', 'var') | isempty(s0), s0 = zeros(m, n); end
The default weight matrix W in the approximation criterion is the identity matrix.
83f default weight matrix 83f≡ (84c 85e 98c)
if ~exist('w', 'var') | isempty(w), w = eye(np); end
Minimization over p
Define the correction Δp := p − p̂.
Then, the constraint is written as a system of linear equations with unknown Δp:

R S(p̂) = 0   ⟺   R S(p − Δp) = 0
             ⟺   R S(p) − R S(Δp) + R S₀ = 0
             ⟺   vec( R S(Δp) ) = vec( R S(p) ) + vec( R S₀ )
             ⟺   [ vec(R S₁)  · · ·  vec(R S_{n_p}) ] Δp = G(R) p + vec( R S₀ )
             ⟺   G(R) Δp = h(R),

where G(R) := [ vec(R S₁)  · · ·  vec(R S_{n_p}) ] and h(R) := G(R) p + vec(R S₀).
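The inner minimization is then a linear least norm problem: minimize ‖Δp‖_W subject to G(R)Δp = h(R). A direct, unstructured evaluation (a sketch with O(n_p³) complexity; it is not the book's function misfit_slra, which exploits the structure):

% G, h : as defined above, for a given R;   w : positive definite weight matrix
wG = w \ G';                      % W^{-1} G'
dp = wG * ((G * wG) \ h);         % least norm correction Delta p
f  = dp' * w * dp;                % cost function value f(R) (squared weighted norm)
ph = p - dp;                      % corresponding structure parameter approximation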
Minimization over R
General purpose constrained optimization methods are used for the outer minimiza-
tion problem in (SLRAR ), i.e., the minimization of f over R, subject to the con-
straint R R^⊤ = I. This is a non-convex optimization problem, so that there is no
guarantee that a globally optimal solution is found.
85a set optimization solver and options 85a≡ (85c 88b 191d)
prob = optimset();
prob.solver = 'fmincon';
prob.options = optimset('disp', 'off');
85b call optimization solver 85b≡ (85c 88b 191d)
[x, fval, flag, info] = fmincon(prob); info.M = fval;
85c nonlinear optimization over R 85c≡ (85e)
set optimization solver and options 85a
prob.x0 = Rini; inv_w = inv(w);
prob.objective = ...
@(R) misfit_slra(R, tts, p, w, s0, bfs, inv_w);
prob.nonlcon = @(R) deal([], [R * R’ - eye(size(R, 1))]);
call optimization solver 85b, R = x;
Uses misfit_slra 84c.
If not specified, the initial approximation is computed from a heuristic that ig-
nores the structure and replaces the weighted norm by the Frobenius norm, so that
the resulting problem can be solved by the singular value decomposition (function
lra).
85d default initial approximation 85d≡ (85e)
if ~exist('Rini') | isempty(Rini)
ph = p; (S0, S, p̂) → D̂ = S(p̂) 83a, Rini = lra(dh, r);
end
Uses lra 64.
The resulting function is:
85e Structured low rank approximation 85e≡
function [R, ph, info] = slra(tts, p, r, w, s0, Rini)
S → (m, n, np ) 83b, S → S 83c
default s0 83e, default weight matrix 83f
default initial approximation 85d
nonlinear optimization over R 85c
if nargout > 1,
[M, ph] = misfit_slra(R, tts, p, w, s0, bfs, inv_w);
end
Defines:
slra, used in chunks 100e and 108a.
Uses misfit_slra 84c.
Exercise 3.8 Use slra and r2x to solve approximately an overdetermined system
of linear equations AX ≈ B in the least squares sense. Check the accuracy of the
answer by using the analytical expression.
Exercise 3.9 Use slra to solve the basic low rank approximation problem
minimize   over D̂   ‖D − D̂‖_F   subject to   rank(D̂) ≤ m.
Exercise 3.10 Use slra to solve the weighted low rank approximation problem
(unstructured approximation in the weighted norm (‖·‖_W), defined on p. 61). Check
the accuracy of the answer in the special case of two-sided weighted low rank ap-
proximation, using Theorem 2.29.
Misfit Computation
= TT (P ),
w
where
T_T(P) := [ P₀  P₁  · · ·  P_l
                 P₀  P₁  · · ·  P_l
                      ⋱    ⋱         ⋱
                           P₀  P₁  · · ·  P_l ]  ∈ R^{qT×(T+l)}.      (T)
Misfit Minimization
Numerical Example
p(z) := p0 + p1 z + · · · + pn zn
p := col(p0 , p1 , . . . , pn ) ∈ Rn+1
The polynomial c is an optimal (in the specified sense) approximate common divisor
of p and q.
Note 3.14 The object of interest in solving Problem 3.13 is the approximate com-
mon divisor c. The approximating polynomials p and
q are auxiliary variables in-
troduced for the purpose of defining c.
Note 3.15 Problem 3.13 has the following system theoretic interpretation. Consider
the single input single output linear time-invariant system B = Bi/o (p, q). The
system B is controllable if and only if p and q have no common factor. Therefore,
Problem 3.13 finds the nearest uncontrollable system B̂ = B_i/o(p̂, q̂) to the given system B. The bigger the approximation error is, the more robust the controllability property of B is. In particular, with zero approximation error, B is uncontrollable.
By definition, the polynomial c is a common divisor of p̂ and q̂ if there are polynomials u and v, such that

p̂ = u c   and   q̂ = v c.      (GCD)
With the auxiliary variables u and v, Problem 3.13 becomes the following optimiza-
tion problem:
minimize   over p̂, q̂, u, v, and c   dist( col(p, q), col(p̂, q̂) )
subject to   p̂ = u c,  q̂ = v c,  and  degree(c) = d.      (AGCD)
where

f(c) := trace( [p  q]^⊤ ( I − T_{n−d+1}^⊤(c) ( T_{n−d+1}(c) T_{n−d+1}^⊤(c) )^{−1} T_{n−d+1}(c) ) [p  q] ),
Since

f(c) = dist( col(p, q), col(p̂, q̂) ),
the value of the cost function f(c) shows the approximation error in taking c as an approximate common divisor of p and q. Optionally, Algorithm 1 returns a “certificate” p̂ and q̂ for c being an approximate common divisor of p and q with approximation error f(c).
In order to complete Algorithm 1, next, we specify the computation of the initial
approximation cini . Also, the fact that the analytic expression for f (c) involves the
highly structured matrix Tn−d+1 (c) suggests that f (c) (and its derivatives) can be
evaluated efficiently.
The most expensive operation in the cost function evaluation is solving the least
squares problem
[p  q] = T_{n−d+1}^⊤(c) [u  v].
Since Tn−d+1 (c) is an upper triangular, banded, Toeplitz matrix, this operation can
be done efficiently. One approach is to compute efficiently the QR factorization
of Tn−d+1 (c), e.g., via the generalized Schur algorithm. Another approach is to
solve the normal system of equations
Tn+1 (c) p q = Tn−d+1 (c)Tn−d+1 (c) u v ,
exploiting the fact that T_{n−d+1}(c) T_{n−d+1}^⊤(c) is banded and Toeplitz structured. The
first approach is implemented in the function MB02ID from the SLICOT library.
Once the least squares problem is solved, the product

T_{n−d+1}^⊤(c) [u  v] = [ cu   cv ]
is computed efficiently by the fast Fourier transform. The resulting algorithm has
computational complexity of O(n) operations. The first derivative f′(c) can also be evaluated in O(n) operations, so, assuming that d ≪ n, the overall cost per iteration for Algorithm 1 is O(n).
Initial Approximation
degree(u) = n − d,
has an analytic solution in terms of the singular value decomposition of Rd (p, q).
The vector r ∈ R2(n−d+1) corresponding to the optimal solution of (LRAr ) is equal
to the right singular vector of Rd (p, q) corresponding to the smallest singular value.
The vector col(v, −u) composed of the coefficients of the approximate divisors v
and −u is up to a scaling factor (that enforces the normalization constraint un−d+1 =
1) equal to r. This gives Algorithm 2 as a method for computing a suboptimal initial
approximation.
Numerical Examples
Example 4.2, Case 1, from Zhi and Yang (2004), originally given in
Karmarkar and Lakshman (1998)
p(z) = (1 − z)(5 − z) = 5 − 6z + z2
q(z) = (1.1 − z)(5.2 − z) = 5.72 − 6.3z + z2
The nuclear norm of a matrix is the sum of the matrix’s singular values
Recall the mapping S, see (S(p̂)) on p. 82, from a structure parameter space R^{n_p} to the set of matrices R^{m×n}. Regularized nuclear norm minimization

minimize   over p̂   ‖S(p̂)‖_∗ + γ ‖p − p̂‖
subject to   G p̂ ≤ h      (NNM)
is a convex optimization problem and can be solved globally and efficiently. Using
the fact
‖D‖_∗ < μ   ⟺   there exist U and V, such that   ½( trace(U) + trace(V) ) < μ   and   [ U  D ; D^⊤  V ] ⪰ 0,
Consider the affine structured low rank approximation problem (SLRA). Due to
the rank constraint, this problem is non-convex. Replacing the rank constraint by a
constraint on the nuclear norm of the affine structured matrix, however, results in a
convex relaxation of (SLRA)
minimize   over p̂   ‖p − p̂‖   subject to   ‖S(p̂)‖_∗ ≤ μ.      (RLRA)
The motivation for this heuristic of solving (SLRA) is that approximation with an appropriately chosen bound on the nuclear norm tends to give solutions S(p̂) of low (but nonzero) rank. Moreover, the nuclear norm is the tightest convex relaxation of the rank function, in the same way the ℓ₁-norm is the tightest convex relaxation of the function mapping a vector to the number of its nonzero entries.
Problem (RLRA) can also be written in the equivalent unconstrained form

minimize   over p̂   ‖S(p̂)‖_∗ + γ ‖p − p̂‖.      (RLRA’)
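For an unstructured matrix, problem (RLRA’) can be stated in CVX in a few lines (a sketch for illustration only; the book's function nucnrm below handles the general affine structure (S) and the constrained problem (NNM)):

% p : given parameter vector (here simply vec(D) of an unstructured m x n matrix)
% gamma : regularization parameter trading off rank against fitting error
cvx_begin quiet
    variable ph(m * n)
    minimize( norm_nuc(reshape(ph, m, n)) + gamma * norm(p - ph) )
cvx_end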
Literate Programs
The CVX package is used in order to automatically translate problem (NNM’) into a
standard convex optimization problem and solve it by existing optimization solvers.
97 Regularized nuclear norm minimization 97≡ 98a
function [ph, info] = nucnrm(tts, p, gamma, nrm, w, s0, g, h)
S → (m, n, np ) 83b, default s0 83e
Defines:
nucnrm, used in chunks 98d and 104.
The following function finds suboptimal solution of the structured low rank approx-
imation problem by solving the relaxation problem (RLRA’). Affine structures of
the type (S) are considered.
98c Structured low rank approximation using the nuclear norm 98c≡ 99a
function [ph, gamma] = slra_nn(tts, p, r, gamma, nrm, w, s0)
S → (m, n, np ) 83b, S → S 83c, default s0 83e, default weight matrix 83f
if ~exist('gamma', 'var'), gamma = []; end % default gamma
if ~exist('nrm', 'var'), nrm = 2; end % default norm
Defines:
slra_nn, used in chunks 100d, 101b, 125b, and 127b.
If a parameter γ is supplied, the convex relaxation (RLRA’) is completely specified
and can be solved by a call to nucnrm.
98d solve the convex relaxation (RLRA’) for given γ parameter 98d≡ (99)
ph = nucnrm(tts, p, gamma, nrm, w, s0);
Uses nucnrm 97.
Large values of γ lead to solutions p̂ with small approximation error ‖p − p̂‖_W, but potentially high rank. Vice versa, small values of γ lead to solutions p̂ with low rank, but potentially high approximation error ‖p − p̂‖_W. If not given as an input argument, a value of γ which gives an approximation matrix S(p̂) with numerical
rank r can be found by bisection on an a priori given interval [γmin , γmax ]. The in-
terval can be supplied via the input argument gamma, in which case it is a vector
[γmin , γmax ].
99a Structured low rank approximation using the nuclear norm 98c+≡ 98c
if ~isempty(gamma) & isscalar(gamma)
solve the convex relaxation (RLRA’) for given γ parameter 98d
else
if ~isempty(gamma)
gamma_min = gamma(1); gamma_max = gamma(2);
else
gamma_min = 0; gamma_max = 100;
end
parameters of the bisection algorithm 99c
bisection on γ 99b
end
On each iteration of the bisection algorithm, the convex relaxation (RLRA’) is solved for γ equal to the mid point (γ_min + γ_max)/2 of the interval and the numerical rank of the approximation S(p̂) is checked by computing the singular values of the approximation. If the numerical rank is higher than r, γ_max is redefined to the mid point, so that the search continues on smaller values of γ (which have the potential of decreasing the rank). Otherwise, γ_min is redefined to the mid point, so that the search continues on higher values of γ (which have the potential of increasing the rank). The search continues until the interval [γ_min, γ_max] is sufficiently small or a maximum number of iterations is exceeded.
99b bisection on γ 99b≡ (99a)
iter = 0;
while ((gamma_max - gamma_min)
/ gamma_max > rel_gamma_tol) ...
& (iter < maxiter)
gamma = (gamma_min + gamma_max) / 2;
solve the convex relaxation (RLRA’) for given γ parameter 98d
sv = svd(ph(tts));
if (sv(r + 1) / sv(1) > rel_rank_tol) ...
& (sv(1) > abs_rank_tol)
gamma_max = gamma;
else
gamma_min = gamma;
end
iter = iter + 1;
end
The rank test and the interval width test involve a priori set tolerances.
99c parameters of the bisection algorithm 99c≡ (99a)
rel_rank_tol = 1e-6; abs_rank_tol = 1e-6;
rel_gamma_tol = 1e-5; maxiter = 20;
Examples
The function slra_nn for affine structured low rank approximation by the nuclear
norm heuristic is tested on randomly generated Hankel structured and unstructured
examples. A rank deficient “true” data matrix is constructed, where the rank r0 is a
simulation parameter. In the case of a Hankel structure, the “true” structure param-
eter vector p0 is generated as the impulse response (skipping the first sample) of a
discrete-time random linear time-invariant system of order r0 . This ensures that the
“true” Hankel structured data matrix S (p0 ) has the desired rank r0 .
100a Test slra_nn 100a≡ 100b
initialize the random number generator 89e
if strcmp(structure, 'hankel')
np = m + n - 1; tts = hankel(1:m, m:np);
p0 = impulse(drss(r0), np + 1); p0 = p0(2:end);
Defines:
test_slra_nn, used in chunk 102.
In the unstructured case, the data matrix is generated by multiplication of random
m × r0 and r0 × n factors of a rank revealing factorization of the data matrix S (p0 ).
100b Test slra_nn 100a+≡ 100a 100c
else % unstructured
np = m * n; tts = reshape(1:np, m, n);
p0 = rand(m, r0) * rand(r0, n); p0 = p0(:);
end
The data parameter p, passed to the low rank approximation function, is a noisy
version of the true data parameter p0, where the relative noise level nl
is a simulation parameter.
100c Test slra_nn 100a+≡ 100b 100d
e = randn(np, 1); p = p0 + nl * e / norm(e) * norm(p0);
The results obtained by slra_nn
100d Test slra_nn 100a+≡ 100c 100e
[ph, gamma] = slra_nn(tts, p, r0);
Uses slra_nn 98c.
are compared with those of alternative methods by checking the singular values
of S(p̂), which indicate the numerical rank, and the fitting error ‖p − p̂‖.
In the case of a Hankel structure, the alternative methods used are Kung's
method (implemented in the function h2ss) and the method based on local opti-
mization (implemented in the function slra).
100e Test slra_nn 100a+≡ 100d 101a
if strcmp(structure, 'hankel')
sysh = h2ss([0; p], r0);
ph2 = impulse(sysh, np + 1); ph2 = ph2(2:end);
tts_ = hankel(1:(r0 + 1), (r0 + 1):np);
cost =
     0    2.3359    1.2482

The nuclear norm heuristic gives more than two times larger approximation error. The corresponding
trade-off curve is shown in Fig. 3.1, right plot.
A random rank deficient “true” data matrix is constructed, where the matrix dimen-
sions m × n and its rank r0 are simulation parameters.
103a Test missing data 103a≡ 103b
initialize the random number generator 89e
p0 = rand(m, r0) * rand(r0, n); p0 = p0(:);
Defines:
test_missing_data, used in chunk 104e.
The true matrix is unstructured:
103b Test missing data 103a+≡ 103a 103c
np = m * n; tts = reshape(1:np, m, n);
and is perturbed by a sparse matrix with sparsity level od, where od is a simula-
tion parameter. The perturbation has constant values nl (a simulation parameter) in
order to simulate outliers.
103c Test missing data 103a+≡ 103b 104a
pt = zeros(np, 1); no = round(od * np);
I = randperm(np); I = I(1:no);
pt(I) = nl * ones(no, 1); p = p0 + pt;
Under certain conditions, derived in Candès et al. (2009), the problem of recover-
ing the true values from the perturbed data can be solved exactly by the regularized
nuclear norm heuristic with 1-norm regularization and regularization parameter

   γ = 1 / √(min(m, n)).
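Assuming the nucnrm interface shown earlier (and m, n, tts, p as constructed in this test), the recovery step can be sketched as a single call with the 1-norm option and this value of γ:

% sketch: 1-norm regularization (nrm = 1), gamma = 1 / sqrt(min(m, n))
ph = nucnrm(tts, p, 1 / sqrt(min(m, n)), 1);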
ans =
1.0e-09 *
0.4362 0.5492
showing that the method indeed recovers the missing data exactly.
3.4 Notes
Efficient Software for Structured Low Rank Approximation
The structured low rank approximation method is used in a software package (Markovsky et al. 2005; Markovsky and Van Huffel
2005) for solving structured low rank approximation problems, based on the vari-
able projections approach (Golub and Pereyra 2003) and Levenberg–Marquardt’s
algorithm, implemented in MINPACK. This algorithm is globally convergent with
a superlinear convergence rate.
The nuclear norm relaxation for solving rank minimization problems (RM) was
proposed in Fazel (2002). It is a generalization of the 1-norm heuristic from sparse
vector approximation problems to low rank matrix approximation problems. The
CVX package is developed and maintained by Grant and Boyd (2008a), see also
Grant and Boyd (2008b). A Python version is also available (Dahl and Vanden-
berghe 2010).
The computational engines of CVX are SDPT3 and SeDuMi. These solvers can
deal with a few tens of parameters (np < 100). An efficient interior point method
for solving (NNM), which can deal with up to 500 parameters, is presented in Liu
and Vandenberghe (2009). The method is implemented in Python.
References
Candès E, Li X, Ma Y, Wright J (2009) Robust principal component analysis? www-stat.stanford.
edu/~candes/papers/RobustPCA.pdf
Dahl J, Vandenberghe L (2010) CVXOPT: Python software for convex optimization. abel.ee.ucla.
edu/cvxopt
Fazel M (2002) Matrix rank minimization with applications. PhD thesis, Elec. Eng. Dept., Stanford
University
Golub G, Pereyra V (2003) Separable nonlinear least squares: the variable projection method and
its applications. Inst. Phys., Inverse Probl. 19:1–26
Grant M, Boyd S (2008a) CVX: Matlab software for disciplined convex programming.
stanford.edu/~boyd/cvx
Grant M, Boyd S (2008b) Graph implementations for nonsmooth convex programs. In: Blondel V,
Boyd S, Kimura H (eds) Recent advances in learning and control. Springer, Berlin, pp 95–110.
stanford.edu/~boyd/graph_dcp.html
4.1 Introduction
The assumptions underlying the statistical setting are that the data p are generated
by a true model that is in the considered model class with additive noise p̃ that
is a stochastic process satisfying certain additional assumptions. Model–data mis-
match, however, is often due to a restrictive linear time-invariant model class being
used and not (only) due to measurement noise. This implies that the approximation
aspect of the method is often more important than the stochastic estimation one.
The problems reviewed in this chapter are stated as deterministic approximation
problems although they can be given also the interpretation of defining maximum
likelihood estimators under appropriate stochastic assumptions.
All problems are special cases of Problem SLRA for certain specified choices
of the norm ‖·‖, structure S, and rank r. In all problems the structure is affine,
so that the algorithm and corresponding function slra, developed in Chap. 3, for
solving affine structured low rank approximation problems can be used for solving
the problems in this chapter.
108a solve Problem SLRA 108a≡ (108b 113 115b 117a 120a)
[R, ph, info] = slra(tts, par, r, [], s0);
Uses slra 85e.
Approximate Realization
σ (H )(t) = H (t + 1).
Acting on a finite time series (H (0), H (1), . . . , H (T )), σ removes the first sam-
ple H (0).
The optimal approximate model B̂∗ does not depend on the shape of the Han-
kel matrix, as long as the Hankel matrix dimensions are sufficiently large: at least
p⌈(n + 1)/p⌉ rows and at least m⌈(n + 1)/m⌉ columns. However, solving the low
rank approximation problem for a data matrix H_{l′+1}(σH_d), where l′ > l, one
needs to achieve rank reduction by p(l′ − l + 1) instead of by p. Larger rank re-
duction leads to more difficult computational problems: on one hand, the cost per
iteration gets higher and, on the other hand, the search space gets higher dimensional,
which makes the optimization algorithm more susceptible to local minima. The vari-
able l1 corresponds to l + 1. It is computed from the specified order n.
109b define l1 109b≡ (109a 116a 117b)
l1 = ceil((n + 1) / p);
The mapping p̂ → Ĥ from the solution p̂ of the structured low rank approxi-
mation problem to the optimal approximation Ĥ of the noisy impulse response H
is reshaping the vector p̂ into the p × m × T tensor hh, representing the sequence
Ĥ(1), ..., Ĥ(T − 1), and setting Ĥ(0) = H(0) (since D̂ = H(0)).
109c  p̂ → Ĥ 109c≡  (108b)
hh = zeros(p, m, T); hh(:, :, 1) = h(:, :, 1);
hh(:, :, 2:end) = reshape(ph(:), p, m, T - 1);
The mapping H → B is system realization, i.e., the 2-norm (locally) optimal
realization B∗ is obtained by exact realization of the approximation H, computed
by the structured low rank approximation method.
109d  Ĥ → B̂ 109d≡  (108b)
sysh = h2ss(hh, n);
Uses h2ss 76d.
By construction, the approximation Ĥ of H has an exact realization in the model
class L^{m+p}_{m,l}.
Note 4.2 In the numerical solution of the structured low rank approximation prob-
lem, the kernel representation (rank R) on p. 81 of the rank constraint is used. The
parameter R, computed by the solver, gives a kernel representation of the optimal
approximate model B̂∗. The kernel representation can subsequently be converted
to an input/state/output representation of B̂∗.
Example
The following script verifies that the local optimization-based method h2ss_opt
improves the suboptimal approximation computed by Kung’s method h2ss. The
data are a noisy impulse response of a random stable linear time-invariant system.
The number of inputs m and outputs p, the order n of the system, the number of data
points T, and the noise standard deviation s are simulation parameters.
110a Test h2ss_opt 110a≡ 110b
initialize the random number generator 89e
sys0 = drss(n, p, m);
h0 = reshape(shiftdim(impulse(sys0, T), 1), p, m, T);
h = h0 + s * randn(size(h0));
Defines:
test_h2ss_opt, used in chunk 110c.
The solutions, obtained by the unstructured and Hankel structured low rank approx-
imation methods, are computed and the relative approximation errors are printed.
110b Test h2ss_opt 110a+≡ 110a
[sysh, hh] = h2ss(h, n); norm(h(:) - hh(:)) / norm(h(:))
[sysh_, hh_, info] = h2ss_opt(h, n);
norm(h(:) - hh_(:)) / norm(h(:))
Uses h2ss 76d and h2ss_opt 108b.
The optimization-based method improves the suboptimal results of the singular
value decomposition-based method at the price of extra computation, a process re-
ferred to as iterative refinement of the solution.
110c Compare h2ss and h2ss_opt 110c≡
m = 2; p = 3; n = 4; T = 15; s = 0.1; test_h2ss_opt
Uses test_h2ss_opt 110a.
The relative approximation errors are 0.1088 for h2ss and 0.0735 for h2ss_opt.
Model Reduction
Problem 4.3 (Finite time H2 model reduction) Given a linear time-invariant system
B_d ∈ L^q_{m,l} and a complexity specification l_red < l, find an optimal approximation
of B_d with bounded complexity (m, l_red), such that

   minimize over B̂   ‖B_d − B̂‖_{2,T}   subject to   B̂ ∈ L^q_{m,l_red}.
Exercise 4.4 Compare the results of mod_red with the ones obtained by using un-
structured low rank approximation (finite-time balanced model reduction) on simu-
lation examples.
Excluding the cases of multiple poles, the model class of autonomous linear time-
invariant systems L^p_{0,l} is equivalent to the sum-of-damped-exponentials model class,
i.e., signals y that can be represented in the form

   y(t) = Σ_{k=1}^{l} α_k e^{β_k t} e^{i(ω_k t + φ_k)},   i = √−1.
Problem 4.5 (Output only identification) Given a signal y_d ∈ (R^p)^T and a complex-
ity specification l, find an optimal approximate model for y_d of bounded complexity
(0, l), such that

   minimize over B̂ and ŷ   ‖y_d − ŷ‖_2
   subject to   ŷ ∈ B̂|_[1,T] and B̂ ∈ L^p_{0,l}.
Exercise 4.6 Compare the results of ident_aut with the ones obtained by using
unstructured low rank approximation on simulation examples.
Harmonic Retrieval
The aim of the harmonic retrieval problem is to approximate the data by a sum of
sinusoids. From a system theoretic point of view, harmonic retrieval aims to approx-
imate the data by a marginally stable linear time-invariant autonomous system.1
Problem 4.7 (Harmonic retrieval) Given a signal y_d ∈ (R^p)^T and a complexity
specification l, find an optimal approximate model for y_d of bounded complexity (0, l), such that

   minimize over B̂ and ŷ   ‖y_d − ŷ‖_2
   subject to   ŷ ∈ B̂|_[1,T], B̂ ∈ L^p_{0,l}, and B̂ is marginally stable.
Due to the stability constraint, Problem 4.7 is not a special case of Problem SLRA.
In the univariate case p = 1, however, a necessary condition for marginal stability is that
the kernel representation is palindromic or antipalindromic.
The antipalindromic case is nongeneric in the space of the marginally stable sys-
tems, so as a relaxation of the stability constraint, we can use the constraint that the
kernel representation is palindromic.

¹A linear time-invariant autonomous system is marginally stable if all its trajectories, except for
Problem 4.8 (Harmonic retrieval, relaxed version, scalar case) Given a signal y_d ∈
(R)^T and a complexity specification l, find an optimal approximate model for y_d
that is in the model class L^1_{0,l} and has a palindromic kernel representation, such
that

   minimize over B̂ and ŷ   ‖y_d − ŷ‖_2
   subject to   ŷ ∈ B̂|_[1,T], B̂ ∈ L^1_{0,l}, and ker(R) = B̂, with R palindromic,
where

$$
\overleftrightarrow{H}_{l+1}(y) :=
\begin{bmatrix}
y_{l+1} & y_{l+2} & \cdots & y_T \\
\vdots  & \vdots  &        & \vdots \\
y_2     & y_3     & \cdots & y_{T-l+1} \\
y_1     & y_2     & \cdots & y_{T-l}
\end{bmatrix},
$$

and
• rank reduction by one.
   ŷ ∈ B̂|_[1,T], B̂ ∈ L^1_{0,l}, and ker(R) = B̂ with R palindromic

   ⟺   rank [ H_{l+1}(ŷ)   $\overleftrightarrow{H}_{l+1}(\hat y)$ ] ≤ l.

In order to show it, let ker(R), with R(z) = Σ_{i=0}^{l} z^i R_i full row rank, be a kernel
representation of B̂ ∈ L^1_{0,l}. Then ŷ ∈ B̂|_[1,T] is equivalent to

   [ R_0  R_1  ···  R_l ] H_{l+1}(ŷ) = 0.

We have

   [ R_0  R_1  ···  R_l ] [ H_{l+1}(ŷ)   $\overleftrightarrow{H}_{l+1}(\hat y)$ ] = 0,   (∗)

which is equivalent to

   rank [ H_{l+1}(ŷ)   $\overleftrightarrow{H}_{l+1}(\hat y)$ ] ≤ l.
Example
The data are generated as a sum of random sinusoids with additive noise. The num-
ber hn of sinusoids, the number of samples T, and the noise standard deviation s
are simulation parameters.
114b Test harmonic_retrieval 114b≡
initialize the random number generator 89e
t = 1:T; f = 1 * pi * rand(hn, 1);
phi = 2 * pi * rand(hn, 1);
y0 = sum(sin(f * t + phi(:, ones(1, T))));
yt = randn(size(y0)); y = y0 + s * norm(y0) * yt / norm(yt);
[sysh, yh, xinih, info] = harmonic_retrieval(y, hn * 2);
plot(t, y0, 'k-', t, y, 'k:', t, vec(yh), 'b-'),
ax = axis; axis([1 T ax(3:4)])
print_fig('test_harmonic_retrieval')
Defines:
test_harmonic_retrieval, used in chunk 115a.
Uses harmonic_retrieval 113 and print_fig 25a.
Figure 4.1 shows the true signal, the noisy signal, and the estimate obtained with
harmonic_retrieval in the following simulation example:
115a Example of harmonic retrieval 115a≡
clear all, T = 50; hn = 2; s = 0.015;
test_harmonic_retrieval
Uses test_harmonic_retrieval 114b.
Errors-in-Variables Identification
Example
if and only if

   rank H_{l+1}(w) ≤ m(l + 1) + (q − m)l.   (∗∗)

Proof (⟹) Assume that (∗) holds and let ker(R), with

   R(z) = Σ_{i=0}^{l} z^i R_i ∈ R^{g×q}[z],
Example
Let FIR_{m,l} be the model class of finite impulse response linear time-invariant sys-
tems with m inputs and lag at most l, i.e.,

   FIR_{m,l} := { B ∈ L_{m,l} | B has finite impulse response and m inputs }.

Identification of a finite impulse response model in the output error setting leads to
the ordinary linear least squares problem

   [ Ĥ(0)  Ĥ(1)  ···  Ĥ(l) ] H_{l+1}(u_d) = [ y_d(1)  ···  y_d(T − l) ].
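For illustration only (this is not the book's ident_fir_oe), the least squares problem can be sketched in the scalar case m = p = 1, assuming ud and yd are 1 × T row vectors and l is the lag:

% minimal sketch of output error FIR identification, scalar case (m = p = 1)
U  = hankel(ud(1:l + 1), ud(l + 1:T));   % H_{l+1}(ud), size (l+1) x (T-l)
hh = yd(1:T - l) / U;                    % least squares estimate [H(0) ... H(l)]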
Equivalently, ŵ = (û, ŷ) is a trajectory of B̂ if and only if

   [ Ĥ(l)  ···  Ĥ(1)  Ĥ(0)   −I_p ] [ H_{l+1}(û) ; ŷ(1) ··· ŷ(T − l) ] = 0,

or

   rank [ H_{l+1}(û) ; ŷ(1) ··· ŷ(T − l) ] ≤ m(l + 1).
For exact data, i.e., assuming that

   y_d(t) = (H ⋆ u_d)(t) := Σ_{τ=0}^{l} H(τ) u_d(t − τ),
Example
Random data from a moving average finite impulse response system is generated
in the errors-in-variables setup. The number of inputs and outputs, the system lag,
number of observed data points, and the noise standard deviation are simulation
parameters.
121a Test ident_fir 121a≡ 121b
initialize the random number generator 89e,
h0 = ones(m, p, l + 1); u0 = rand(m, T); t = 1:(l + 1);
y0 = conv(h0(:), u0); y0 = y0(end - T + 1:end);
w0 = [u0; y0]; w = w0 + s * randn(m + p, T);
Defines:
test_ident_fir, used in chunk 121c.
The output error and errors-in-variables finite impulse response identification meth-
ods ident_fir_oe and ident_fir_eiv are applied on the data and the rela-
tive fitting errors ‖w_d − ŵ‖ / ‖w_d‖ are computed.
121b Test ident_fir 121a+≡ 121a
[hh_oe, wh_oe] = ident_fir_oe(w, m, l);
e_oe = norm(w(:) - wh_oe(:)) / norm(w(:))
[hh_eiv, wh_eiv, info] = ident_fir_eiv(w, m, l);
e_eiv = norm(w(:) - wh_eiv(:)) / norm(w(:))
Uses ident_fir_eiv 120a and ident_fir_oe 119b.
The obtained results in the following example
121c Example of finite impulse response identification 121c≡
m = 1; p = 1; l = 10; T = 50; s = 0.5; test_ident_fir
Uses test_ident_fir 121a.
are: relative error 0.2594 for ident_fir_oe and 0.2391 for ident_fir_eiv.
Distance to Uncontrollability
is a property of the pair of systems (B, B̂). In terms of the parameters P and Q,
the constraint B̂ ∈ L_ctrb is equivalent to rank deficiency of the Sylvester matrix
R(P̂, Q̂) (see (R) on p. 11). With respect to the distance measure (dist), the problem
of computing the distance from B to uncontrollability is equivalent to a Sylvester
structured low rank approximation problem

   minimize over P̂ and Q̂   ‖P − P̂‖²₂ + ‖Q − Q̂‖²₂
   subject to   rank R(P̂, Q̂) ≤ degree(P) + degree(Q) − 1,
for which numerical algorithms are developed in Sect. 3.2. The implementation de-
tails are left as an exercise for the reader (see Note 3.15 and Problem P.20).
Consider the single input single output feedback system shown in Fig. 4.2. The
polynomials P and Q define the transfer function Q/P of the plant and are given.
They are assumed to be relatively prime and the transfer function Q/P is assumed
to satisfy the constraint
deg(Q) ≤ deg(P ) =: lP ,
which ensures that the plant is a causal linear time-invariant system. The polynomi-
als Y and X parameterize the controller Bi/o (X, Y ) and are unknowns. The design
constraints are that the controller should be causal and have order bounded by a
specified integer lX . These specifications translate to the following constraints on
the polynomials Y and X
The pole placement problem is to determine X and Y , so that the poles of the closed-
loop system are as close as possible in some specified sense to desired locations,
given by the roots of a polynomial F , where deg(F ) = lX + lP . We consider a
modification of the pole placement problem that aims to assign exactly the poles of
a plant that is as close to the given plant as possible.
In what follows, we use the correspondence between (l_P + 1)-dimensional vectors
and l_P th degree polynomials and (with some abuse of notation) refer to P as both a
vector and a polynomial. A solution to the pole placement problem is then given by a
solution to the Diophantine equation

   P X + QY = F.
The Diophantine equation can be written as a Sylvester structured system of equa-
tions

$$
\begin{bmatrix}
P_0     &        &         & Q_0     &        &         \\
P_1     & \ddots &         & Q_1     & \ddots &         \\
\vdots  & \ddots & P_0     & \vdots  & \ddots & Q_0     \\
P_{l_P} &        & P_1     & Q_{l_P} &        & Q_1     \\
        & \ddots & \vdots  &         & \ddots & \vdots  \\
        &        & P_{l_P} &         &        & Q_{l_P}
\end{bmatrix}
\begin{bmatrix} X_0 \\ \vdots \\ X_{l_X} \\ Y_0 \\ \vdots \\ Y_{l_X} \end{bmatrix}
=
\begin{bmatrix} F_0 \\ \vdots \\ F_{l_P} \\ F_{l_P+1} \\ \vdots \\ F_{l_X+l_P} \end{bmatrix}.
$$
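A minimal sketch of this construction (in the spirit of Exercise 4.14 below), assuming P, Q, and F are given as coefficient vectors [P0 ... P_lP], [Q0 ... Q_lP], and [F0 ... F_{lX+lP}], is to build the two banded blocks column by column and solve the resulting linear system:

% sketch: solve the Diophantine equation P*X + Q*Y = F in coefficient form
lP = length(P) - 1; lX = length(F) - 1 - lP;     % deg(F) = lX + lP
SP = zeros(lX + lP + 1, lX + 1); SQ = SP;
for j = 1:lX + 1
    SP(j:j + lP, j) = P(:); SQ(j:j + lP, j) = Q(:);
end
XY = [SP SQ] \ F(:);                             % [X0 ... XlX Y0 ... YlX]
X = XY(1:lX + 1); Y = XY(lX + 2:end);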
Exercise 4.14 Implement the method for pole placement by low-order controller,
outlined in this section, using the function slra. Test it on simulation exam-
ples and compare the results with the ones obtained with the MATLAB function
place.
Model Reduction
The Markov parameters of the high-order system and the low-order approximations
are plotted for visual inspection of the quality of the approximations.
126a Test model reduction 125a+≡ 125e
plot(h, ':k', 'linewidth', 4), hold on,
plot(hh1, 'b-'), plot(hh2, 'r-.'), plot(hh3, 'r-')
legend('high order sys.', 'nucnrm', 'kung', 'slra')
The results of an approximation of a 50th-order system by a 4th-order system
with time horizon 25 time steps
126b Test structured low rank approximation methods on a model reduction problem 126b≡
n = 50; r = 4; T = 25; test_mod_red,
print_fig(’test_mod_red’)
Uses print_fig 25a and test_mod_red 125a.
are the singular values of the data matrix approximation (sv) and the cost function
values (cost) for the compared methods.
System Identification
Next we compare the nuclear norm heuristic with an alternative heuristic method,
based on the singular value decomposition, and a method based on local optimiza-
tion, on single input single output system identification problems. The data are gen-
erated according to the errors-in-variables model (EIV). A trajectory w0 of a linear
time-invariant system B0 of order n0 is corrupted by noise, where the noise w̃ is
zero mean, white, and Gaussian with covariance σ², i.e.,

   w_d = w0 + w̃.
The results of
128 Test structured low rank approximation methods on system identification 128≡
n0 = 2; T = 25; nl = 0.2; test_sysid,
print_fig(’test_sysid’)
Uses print_fig 25a and test_sysid 127a.
are, as before, the singular values (sv) and cost function values (cost) of the
compared methods.
4.6 Notes
System Theory
The errors-in-variables estimation problem is studied in Pintelon and Schoukens (2001) and Kukush et al. (2005). A survey pa-
per on errors-in-variables system identification is Söderström (2007). Most of the
work on the subject is presented in the classical input/output setting, i.e., the pro-
posed methods are defined in terms of transfer function, matrix fraction description,
or input/state/output representations. The salient feature of the errors-in-variables
problems, however, is that all variables are treated on an equal footing as noise
corrupted. Therefore, the input/output partitioning implied by the classical model
representations is irrelevant in the errors-in-variables problem.
Signal Processing
Computer Vision
Analysis Problems
The distance to uncontrollability with respect to the distance measure dist is natu-
rally defined as

   min over B̂ ∈ L_ctrb   dist(B, B̂),

where A, B and Â, B̂ are parameters of input/state/output representations
B = B_i/s/o(A, B, C, D) and B̂ = B_i/s/o(Â, B̂, Ĉ, D̂).
References
Antoulas A (2005) Approximation of large-scale dynamical systems. SIAM, Philadelphia
Aoki M, Yue P (1970) On a priori error estimates of some identification methods. IEEE Trans
Autom Control 15(5):541–548
Bresler Y, Macovski A (1986) Exact maximum likelihood parameter estimation of superimposed
exponential signals in noise. IEEE Trans Acoust Speech Signal Process 34:1081–1089
Cadzow J (1988) Signal enhancement—a composite property mapping algorithm. IEEE Trans Sig-
nal Process 36:49–62
De Moor B (1993) Structured total least squares and L2 approximation problems. Linear Algebra
Appl 188–189:163–207
De Moor B (1994) Total least squares for affinely structured matrices and the noisy realization
problem. IEEE Trans Signal Process 42(11):3104–3113
Fu H, Barlow J (2004) A regularized structured total least squares algorithm for high-resolution
image reconstruction. Linear Algebra Appl 391(1):75–98
Glover K (1984) All optimal Hankel-norm approximations of linear multivariable systems and
their l ∞ -error bounds. Int J Control 39(6):1115–1193
Golub G, Milanfar P, Varah J (1999) A stable numerical method for inverting shape from moments.
SIAM J Sci Comput 21:1222–1243
Gustafsson B, He C, Milanfar P, Putinar M (2000) Reconstructing planar domains from their mo-
ments. Inverse Probl 16:1053–1070
Ho BL, Kalman RE (1966) Effective construction of linear state-variable models from input/output
functions. Regelungstechnik 14(12):545–592
Kalman RE, Falb PL, Arbib MA (1969) Topics in mathematical system theory. McGraw-Hill, New
York
Kukush A, Markovsky I, Van Huffel S (2005) Consistency of the structured total least squares
estimator in a multivariate errors-in-variables model. J Stat Plan Inference 133(2):315–358
Kung S (1978) A new identification method and model reduction algorithm via singular value
decomposition. In: Proc 12th asilomar conf circuits, systems, computers, Pacific Grove, pp
705–714
Lemmerling P, De Moor B (2001) Misfit versus latency. Automatica 37:2057–2067
Lemmerling P, Mastronardi N, Van Huffel S (2003) Efficient implementation of a structured total
least squares based speech compression method. Linear Algebra Appl 366:295–315
Markovsky I (2008) Structured low-rank approximation and its applications. Automatica
44(4):891–909
Markovsky I, Willems JC, Van Huffel S, De Moor B, Pintelon R (2005) Application of structured
total least squares for system identification and model reduction. IEEE Trans Autom Control
50(10):1490–1500
Mastronardi N, Lemmerling P, Kalsi A, O’Leary D, Van Huffel S (2004) Implementation of the
regularized structured total least squares algorithms for blind image deblurring. Linear Algebra
Appl 391:203–221
References 131
Mastronardi N, Lemmerling P, Van Huffel S (2005) Fast regularized structured total least squares
algorithm for solving the basic deconvolution problem. Numer Linear Algebra Appl 12(2–
3):201–209
Milanfar P, Verghese G, Karl W, Willsky A (1995) Reconstructing polygons from moments with
connections to array processing. IEEE Trans Signal Process 43:432–443
Moore B (1981) Principal component analysis in linear systems: controllability, observability and
model reduction. IEEE Trans Autom Control 26(1):17–31
Mühlich M, Mester R (1998) The role of total least squares in motion analysis. In: Burkhardt H
(ed) Proc 5th European conf computer vision. Springer, Berlin, pp 305–321
Ng M, Plemmons R, Pimentel F (2000) A new approach to constrained total least squares image
restoration. Linear Algebra Appl 316(1–3):237–258
Ng M, Koo J, Bose N (2002) Constrained total least squares computations for high resolution
image reconstruction with multisensors. Int J Imaging Syst Technol 12:35–42
Paige CC (1981) Properties of numerical algorithms related to computing controllability. IEEE
Trans Autom Control 26:130–138
Pintelon R, Schoukens J (2001) System identification: a frequency domain approach. IEEE Press,
Piscataway
Pintelon R, Guillaume P, Vandersteen G, Rolain Y (1998) Analyses, development, and applica-
tions of TLS algorithms in frequency domain system identification. SIAM J Matrix Anal Appl
19(4):983–1004
Pruessner A, O’Leary D (2003) Blind deconvolution using a regularized structured total least norm
algorithm. SIAM J Matrix Anal Appl 24(4):1018–1037
Roorda B (1995) Algorithms for global total least squares modelling of finite multivariable time
series. Automatica 31(3):391–404
Roorda B, Heij C (1995) Global total least squares modeling of multivariate time series. IEEE
Trans Autom Control 40(1):50–63
Schuermans M, Lemmerling P, Lathauwer LD, Van Huffel S (2006) The use of total least squares
data fitting in the shape from moments problem. Signal Process 86:1109–1115
Söderström T (2007) Errors-in-variables methods in system identification. Automatica 43:939–
958
Younan N, Fan X (1998) Signal restoration via the regularized constrained total least squares.
Signal Process 71:85–93
Zeiger H, McEwen A (1974) Approximate linear realizations of given dimension via Ho’s algo-
rithm. IEEE Trans Autom Control 19:153
Part II
Miscellaneous Generalizations
Chapter 5
Missing Data, Centering, and Constraints
where

   ‖ΔD‖_Σ := ‖Σ ⊙ ΔD‖_F = √( Σ_{i=1}^{q} Σ_{j=1}^{N} (σ_ij Δd_ij)² ),   Σ ∈ R^{q×N} and Σ ≥ 0,
is a seminorm. (The σij ’s are weights; not noise standard deviations.) In the extreme
case of a zero weight, e.g., σij = 0, the corresponding element dij of D is not taken
into account in the approximation and therefore it is treated as a missing value. In
this case, the approximation problem is called a singular problem. The algorithms,
described in Chap. 3, for the regular weighted low rank approximation problem
fail in the singular case. In this section, the methods are extended to solve singular
problems and therefore account for missing data.
Exercise 5.1 (Missing rows and columns) Show that in the case of missing rows
and/or columns of the data matrix, the singular low rank approximation problem
reduces to a smaller dimensional regular problem.
which turns problem (EWLRA) into the parameter optimization problem (EWLRAP).
Unfortunately the problem is nonconvex and there are no efficient methods to solve
it. We present local optimization methods based on the alternating projections and
variable projection approaches. These methods are initialized by a suboptimal solu-
tion of (EWLRAP ), computed by a direct method.
Algorithms
Direct Method
The initial approximation for the iterative optimization methods is obtained by solv-
ing an unweighted low rank approximation problem in which all missing elements (en-
coded as NaN's) are filled in with zeros.
136a Low rank approximation with missing data 136a≡
function [p, l] = lra_md(d, m)
d(isnan(d)) = 0; [q, N] = size(d);
if nargout == 1, data compression 137, end
matrix approximation 136b
Defines:
lra_md, used in chunks 140a and 143e.
The problem is solved using the singular value decomposition; however, in view
of the large scale of the data, the function svds, which computes selected singular
values and corresponding singular vectors, is used instead of the function svd.
136b matrix approximation 136b≡ (136a)
[u, s, v] = svds(d, m); p = u(:, 1:m); l = p’ * d;
The model B̂ = image(P) depends only on the left singular vectors of the data ma-
trix D. Therefore, B̂ is an optimal model for the data DQ, where Q is any orthog-
onal matrix. Let

   D⊤ = Q [ R1 ; 0 ],   where R1 is upper triangular.   (QR)
Alternating Projections
The alternating projections method exploits the fact that problem (EWLRAP ) is a
linear least squares problem in either P or L. This suggests a method of alternatively
minimizing over P with L fixed to the value computed on the previous iteration step
and minimizing over L with P fixed to the value computed on the previous iteration
step. A summary of the alternating projections method is given in Algorithm 3.
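Since Algorithm 3 is only summarized here, the following hedged sketch shows one pass of the alternating projections update for the weighted problem, assuming the data matrix d, the weight matrix s (zero weights mark missing elements), and a current value of the factor p are given; the book's implementation adds the convergence test and bookkeeping around this core.

% sketch of one alternating projections iteration (not the book's chunk 140b)
[q, N] = size(d); m = size(p, 2);
l = zeros(m, N);
for j = 1:N                                   % update L, with P fixed
    wj = diag(s(:, j));
    l(:, j) = (wj * p) \ (wj * d(:, j));      % weighted least squares
end
for i = 1:q                                   % update P, with L fixed
    wi = diag(s(i, :));
    p(i, :) = ((wi * l') \ (wi * d(i, :)'))'; % weighted least squares
end
e = norm(s .* (d - p * l), 'fro') ^ 2;        % squared weighted error e^(k)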
M ATLAB-like notation is used for indexing a matrix. For a q × N matrix D and
subsets I and J of the sets of, respectively, row and column indices, DI ,J
denotes the submatrix of D with elements whose indices are in I and J . Either
of I and J can be replaced by “:” in which case all rows/columns are indexed.
On each iteration step of the alternating projections algorithm, the cost function
value is guaranteed to be non-increasing and is typically decreasing. It can be shown
that the iteration converges and that the local convergence rate is linear.
The quantity e^(k), computed on step 9 of the algorithm, is the squared approxima-
tion error

   e^(k) = ‖D − D̂^(k)‖²_Σ
on the kth iteration step. Convergence of the iteration is judged on the basis of the
relative decrease of the error e(k) after an update step. This corresponds to choosing
a tolerance on the relative decrease of the cost function value. More expensive al-
ternatives are to check the convergence of the approximation D (k) or the size of the
gradient of the cost function with respect to the model parameters.
Variable Projections
The inner minimization is a weighted least squares problem and therefore can be
solved in closed form. Using MATLAB set indexing notation, the solution is

   f(P) = Σ_{j=1}^{N} ‖ diag(Σ_{J,j}) ( D_{J,j} − P_{J,:} (P⊤_{J,:} diag²(Σ_{J,j}) P_{J,:})^{−1} P⊤_{J,:} diag²(Σ_{J,j}) D_{J,j} ) ‖²₂,   (5.2)

where J is the set of indices of the non-missing elements in the jth column of D.
Exercise 5.2 Derive the expression (5.2) for the function f in (EWLRAP ).
The outer minimization is a nonlinear least squares problem and can be solved by
general purpose local optimization methods. The inner minimization is a weighted
projection on the subspace spanned by the columns of P . Consequently, f (P ) has
the geometric interpretation of the sum of squared distances from the data points
to the subspace. Since the parameter P is modified by the outer minimization, the
projections are on a varying subspace.
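The role played by the book's mwlra can be sketched as follows (the helper name is hypothetical), assuming zero weights mark the missing elements; the function returns the sum of squared weighted distances of the data columns to the subspace spanned by the columns of p.

% hedged sketch of the variable projection cost f(P)
function f = wlra_cost_sketch(p, d, s)
N = size(d, 2); f = 0;
for j = 1:N
    J  = find(s(:, j) > 0);                % rows with given elements
    wj = diag(s(J, j));
    lj = (wj * p(J, :)) \ (wj * d(J, j));  % inner weighted least squares
    f  = f + norm(wj * (d(J, j) - p(J, :) * lj)) ^ 2;
end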
Implementation
Both the alternating projections and the variable projections methods for solving
weighted low rank approximation problems with missing data are callable through
the function wlra.
139 Weighted low rank approximation 139≡
function [p, l, info] = wlra(d, m, s, opt)
tic, default parameters opt 140a
switch lower(opt.Method)
case {'altpro', 'ap'}
alternating projections method 140b
case {'varpro', 'vp'}
variable projections method 141d
otherwise
error('Unknown method %s', opt.Method)
end
info.time = toc;
Defines:
wlra, used in chunks 143e and 230.
The output parameter info gives the
• approximation error ‖D − D̂‖²_Σ (info.err),
• number of iterations (info.iter), and
• execution time (info.time) for computing the local approximation D̂.
The optional parameter opt specifies the
• method (opt.Method) and, in the case of the variable projections method, al-
gorithm (opt.alg) to be used,
• initial approximation (opt.P),
Note 5.4 (Large scale, sparse data) In an application of (EWLRA) to building rec-
ommender systems, the data matrix D is large but only a small fraction of the ele-
ments are given. Such problems can be handled efficiently, encoding D and Σ as
sparse matrices. The convention in this case is that missing elements are zeros.
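A small sketch of this encoding, assuming the given elements are listed by row indices I, column indices J, and nonzero values val (as in the MovieLens ratings used below):

% sketch: sparse encoding of the data and weight matrices (missing = zero)
d = sparse(I, J, val, q, N);     % data matrix; zeros at the missing positions
s = spones(d);                   % binary weights: 1 = given, 0 = missing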
Alternating Projections
Variable Projections
case {'lsqnonlin'}
[p, rn, r, f, info] = ...
lsqnonlin(@(p)mwlra2(p, d, s), p, [], []);
otherwise
error('Unknown algorithm %s.', opt.alg)
end
[info.err, l] = mwlra(p, d, s); % obtain the L parameter
Uses mwlra 142a and mwlra2 142b.
The inner minimization in (EWLRAP ) has an analytic solution (5.2). The imple-
mentation of (5.2) is the chunk of code for computing the L parameter, given the P
parameter, already used in the alternating projections algorithm.
142a dist(D, B) (weighted low rank approximation) 142a≡
function [ep, l] = mwlra(p, d, s)
N = size(d, 2); m = size(p, 2); compute L, given P 140c
Defines:
mwlra, used in chunk 141d.
When a nonlinear least squares type algorithm is used, the function passed to the optimizer returns not
the sum of squares of the errors but the correction matrix ΔD (dd), whose elements are the residuals.
142b Weighted low rank approximation correction matrix 142b≡
function dd = mwlra2(p, d, s)
N = size(d, 2); m = size(p, 2); compute L, given P 140c
Defines:
mwlra2, used in chunk 141d.
Then the number of given elements of the data matrix per row is
143a Test missing data 2 142c+≡ 142d 143b
ner = round(ne / q);
The variables I and J contain the row and column indices of the given elements.
They are randomly chosen.
143b Test missing data 2 142c+≡ 143a 143c
I = []; J = [];
for i = 1:q
I = [I i*ones(1, ner)]; rp = randperm(N);
J = [J rp(1:ner)];
end
ne = length(I);
By construction there are ner given elements in each row of the data matrix,
however, there may be columns with a few (or even zero) given elements. Columns
with less than m given elements cannot be recovered from the given observations,
even when the data are noise-free. Therefore, we remove such columns from the
data matrix.
143c Test missing data 2 142c+≡ 143b 143d
tmp = (1:N)’;
J_del = find(sum(J(ones(N, 1),:) ...
== tmp(:, ones(1, ne)), 2) < m);
l0(:, J_del) = [];
tmp = sparse(I, J, ones(ne, 1), q, N); tmp(:, J_del) = [];
[I, J] = find(tmp); N = size(l0, 2);
Next, a noisy data matrix with missing elements is constructed by adding to the
true values of the given data elements independent, identically distributed, zero
mean, Gaussian noise with a specified standard deviation s. The weight matrix Σ
is binary: σij = 1 if dij is given and σij = 0 if dij is missing.
143d Test missing data 2 142c+≡ 143c 143e
d0 = p0 * l0;
Ie = I + q * (J - 1);
d = zeros(q * N, 1);
d(Ie) = d0(Ie) + sigma * randn(size(d0(Ie)));
d = reshape(d, q, N);
s = zeros(q, N); s(Ie) = 1;
The methods implemented in lra_md and wlra are applied on the noisy data ma-
trix D with missing elements and the results are validated against the complete true
matrix D0.
143e Test missing data 2 142c+≡ 143d 144a
tic, [p0, l0] = lra_md(d, m); t0 = toc;
err0 = norm(s .* (d - p0 * l0), ’fro’) ^ 2;
e0 = norm(d0 - p0 * l0, ’fro’) ^ 2;
[ph1, lh1, info1] = wlra(d, m, s);
e1 = norm(d0 - ph1 * lh1, ’fro’) ^ 2;
opt.Method = ’vp’; opt.alg = ’fminunc’;
[ph2, lh2, info2] = wlra(d, m, s, opt);
Table 5.1 Results for Experiment 1 (SVT: singular value thresholding, VP: variable projections)

                            lra_md   ap      VP + fminunc  VP + lsqnonlin  SVT
‖D − D̂‖²_Σ / ‖D‖²_Σ         0.02     0       0             0               0
‖D0 − D̂‖²_F / ‖D0‖²_F       0.03     0       0             0               0
Execution time (sec)        0.04     0.18    0.17          0.18            1.86

‖D − D̂‖²_Σ / ‖D‖²_Σ         0.049    0.0257  0.0257        0.0257          0.025
‖D0 − D̂‖²_F / ‖D0‖²_F       0.042    0.007   0.007         0.007           0.007
Execution time (sec)        0.04     0.11    0.11          0.11            1.51

‖D − D̂‖²_Σ / ‖D‖²_Σ         0.17     0.02    0.02          0.02            0.02
‖D0 − D̂‖²_F / ‖D0‖²_F       0.27     0.018   0.018         0.018           0.17
Execution time (sec)        0.04     0.21    0.21          0.21            1.87
The singular value thresholding method is further away from being (locally) optimal, but is still
much better than the solution of lra_md (1% vs. 25% relative prediction error).
The three methods based on local optimization (ap, VP + fminunc, VP +
lsqnonlin) need not compute the same solution even when started from the same
initial approximation. The reason for this is that the methods only guarantee con-
vergence to a locally optimal solution; the problem is nonconvex and may
have multiple local minima. Moreover, the trajectories of the three methods in the
parameter space are different because the update rules of the methods are different.
The computation times for the three methods are different. The number of float-
ing point operations per iteration can be estimated theoretically, which gives an indi-
cation of which of the methods may be faster per iteration. Note, however, that the
number of iterations needed for convergence is not easily predictable, unless the
methods are started "close" to a locally optimal solution. The alternating projections
method is the most efficient per iteration but needs the most iteration steps. In the current
implementation of the methods, the alternating projections method is still the win-
ner of the three for large scale data sets, because for q larger than a few
hundred, general purpose optimization methods are too computationally demanding. This situation
may be improved by analytically computing the gradient and Hessian.
The MovieLens data sets were collected and published by the GroupLens Research
Project at the University of Minnesota in 1998. Currently, they are recognized as a
benchmark for predicting missing data in recommender systems. The “100K data
set” consists of 100000 ratings of q = 943 users on N = 1682 movies and de-
mographic information for the users. (The ratings are encoded by integers in the
range from 1 to 5.) Here, we use only the ratings, which constitute a q × N matrix
with missing elements. The task of a recommender system is to fill in the missing
elements.
Assuming that the true complete data matrix is rank deficient, building a recom-
mender system is a problem of low rank approximation with missing elements. The
assumption that the true data matrix is low rank is reasonable in practice because
user ratings are influenced by a few factors. Thus, we can identify typical users (re-
lated to different combinations of factors) and reconstruct the ratings of any user as
a linear combination of the ratings of the typical users. As long as the typical users
are fewer than the number of users, the data matrix is low rank. In reality, the num-
ber of factors is not small but there are a few dominant ones, so that the true data
matrix is approximately low rank.
It turns out that two factors allow us to reconstruct the missing elements
with 7.1% average error. The reconstruction results are validated by cross valida-
tion with 80% identification data and 20% validation data. Five such partitionings
of the data are given on the MovieLens web site. The matrix Σ^(k)_idt ∈ {0, 1}^{q×N}
selects the identification data in the kth partitioning (and Σ^(k)_val the validation data).
The average identification and validation errors are
   e_idt := (1/5) Σ_{k=1}^{5} ‖D − D̂^(k)‖²_{Σ^(k)_idt} / ‖D‖²_{Σ^(k)_idt}

and

   e_val := (1/5) Σ_{k=1}^{5} ‖D − D̂^(k)‖²_{Σ^(k)_val} / ‖D‖²_{Σ^(k)_val},
where D (k) is the reconstructed matrix in the kth partitioning of the data. The singu-
lar value thresholding method issues a message “Divergence!”, which explains the
poor results obtained by this method.
Problem Formulation
Closely related to the linear model is the affine one. The observations
D = {d1 , . . . , dN }
Matrix Centering
The matrix centering operation is subtraction of the mean E(D) from all columns
of the data matrix D:

   C(D) := D − E(D) 1⊤_N = D ( I − (1/N) 1_N 1⊤_N ).
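In MATLAB the operation is a one-liner (a sketch, assuming d holds the q × N data matrix):

% matrix centering C(D): subtract the row means from all columns of d
N  = size(d, 2);
cd = d - mean(d, 2) * ones(1, N);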
The following proposition justifies the name “matrix centering” for C(·).
Proposition 5.5 (Matrix centering) The matrix C(D) is column centered, i.e., its
mean is zero:

   E(C(D)) = 0.
   subject to   D̂ = c 1⊤_N.
Note 5.7 (Intercept) Data fitting with an intercept is a special case of centering when
all but one of the row means are set to zero, i.e., centering of one row. Intercept is
appropriate when an input/output partition of the variables is imposed and there is a
single output that has an offset.
In this section, we consider the low rank approximation problem in the Frobenius
norm with centering:

   minimize over D̂ and c   ‖D − c1⊤_N − D̂‖_F                       (LRA_c)
   subject to   rank(D̂) ≤ m.
The following theorem shows that the two-stage procedure yields a solution
to (LRAc ).
Setting the partial derivatives of L to zero, we obtain the necessary optimality con-
ditions

   ∂L/∂D̂ = 0   ⟹   D − c1⊤_N − D̂ = R⊤Λ⊤,                          (L1)
   ∂L/∂c = 0   ⟹   Nc = (D − D̂)1_N,                                (L2)
   ∂L/∂R = 0   ⟹   D̂Λ = R⊤Ξ,                                       (L3)
   ∂L/∂Λ = 0   ⟹   RD̂ = 0,                                         (L4)
   ∂L/∂Ξ = 0   ⟹   RR⊤ = I.                                        (L5)
The theorem follows from the system (L1–L5). Next we list the derivation steps.
From (L3), (L4), and (L5), it follows that Ξ = 0, and from (L1) we obtain

   D − D̂ = c1⊤_N + R⊤Λ⊤.

Multiplying (L1) from the left by R and using (L4) and (L5), we have

   R(D − c1⊤_N) = Λ⊤.                                               (∗)

Now, multiplication of the last identity from the right by 1_N and use of Λ⊤1_N = 0
shows that c is the row mean of the data matrix D,

   R(D1_N − Nc) = 0   ⟹   c = (1/N) D1_N.

Next, we show that D̂ is an optimal, in the Frobenius norm, rank-m approximation
of D − c1⊤_N. Multiplying (L1) from the right by Λ and using D̂Λ = 0, we have

   (D − c1⊤_N)Λ = R⊤Λ⊤Λ.                                            (∗∗)
Defining

   Σ := √(Λ⊤Λ)   and   V := ΛΣ^{−1},

(∗) and (∗∗) become

   R(D − c1⊤_N) = ΣV⊤,   V⊤V = I,
   (D − c1⊤_N)V = R⊤Σ,   RR⊤ = I.

The above equations show that the rows of R and the columns of V span, respec-
tively, left and right m-dimensional singular subspaces of the centered data matrix
D − c1⊤_N. The optimization criterion is minimization of

   ‖D − D̂ − c1⊤_N‖²_F = ‖R⊤Λ⊤‖²_F = trace(ΛΛ⊤) = trace(Σ²).

Therefore, a minimum is achieved when the rows of R and the columns of V span,
respectively, the left and right m-dimensional singular subspaces of the centered
data matrix D − c1⊤_N, corresponding to the m smallest singular values. The solution
is unique if and only if the mth singular value is strictly bigger than the (m + 1)st
singular value. Therefore, D̂ is a Frobenius norm optimal rank-m approximation of
the centered data matrix D − c1⊤_N, where c = D1_N/N.
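The two-stage procedure suggested by the theorem can be sketched as follows (an illustration, not the book's implementation), assuming d is the data matrix and m the desired rank:

% sketch of the two-stage solution of (LRA_c): center, then truncate the SVD
N = size(d, 2);
c = mean(d, 2);                              % c = D 1_N / N
[u, sv, v] = svd(d - c * ones(1, N), 'econ');
dh = u(:, 1:m) * sv(1:m, 1:m) * v(:, 1:m)';  % optimal rank-m approximation
% the approximation of the data is c * ones(1, N) + dh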
Proof

   c1⊤_N + D̂ = c1⊤_N + PL
             = c1⊤_N + Pz1⊤_N + (PL − Pz1⊤_N)
             = (c + Pz)1⊤_N + P(L − z1⊤_N) =: c′1⊤_N + D̂′.

Therefore, if (c, D̂) is a solution, then (c′, D̂′) is also a solution. From Theorem 5.8,
it follows that c = E(D), D̂ = PL is a solution.
The same type of nonuniqueness appears in weighted and structured low rank
approximation problems with centering. This can cause problems in the optimiza-
tion algorithms and implies that solutions produced by different methods cannot be
compared directly.
where (c^m_∗, D̂^m_∗) is a solution to the unweighted low rank approximation problem
with centering

   minimize over D̂^m and c^m   ‖D^m − c^m 1⊤_N − D̂^m‖_F
   subject to   rank(D̂^m) ≤ m,

for the modified data matrix D^m := √W_l D √W_r,
where

   D^m = √W_l D √W_r,   D̂^m = √W_l D̂ √W_r,   and   c^m = √W_l c √λ.

Therefore, the considered problem is equivalent to the low rank approximation prob-
lem (LRA_c) for the modified data matrix D^m.
Since on each iteration step the cost function value is guaranteed to be non-increasing and
the cost function is bounded from below, the sequence of cost function values generated by the
algorithm converges. Moreover, it can be shown that the sequence of parameter approximations
(c^(k), P^(k), L^(k)) converges to a locally optimal solution of (WLRA_c,P).
Example 5.11 Implementations of the methods for weighted and structured low
rank approximation with centering, presented in this section, are available from the
book's web page. Figure 5.1 shows the sequence of the cost function values for a
randomly generated weighted rank-1 approximation problem with q = 3 variables
and N = 6 data points. The mean of the data matrix and the approximation of the
mean, produced by Algorithm 4, are, respectively,

   c^(0) = [0.5017; 0.7068; 0.3659]   and   ĉ = [0.4365; 0.6738; 0.2964].
and therefore can be solved analytically. This reduces the original problem to a
nonlinear least squares problem over P only. We have

   f(P) = vec⊤(D) ( W − W 𝒫 (𝒫⊤ W 𝒫)^{−1} 𝒫⊤ W ) vec(D),

where

   𝒫 := [ I_N ⊗ P   1_N ⊗ I_q ].
For the outer minimization any standard unconstrained nonlinear (least squares)
algorithm is used.
Example 5.12 For the same data, initial approximation, and convergence tolerance
as in Example 5.11, the variable projections algorithm, using numerical approxima-
tion of the derivatives in combination with a quasi-Newton method, converges to a lo-
cally optimal solution with approximation error 0.1477, the same as the one found
by the alternating projections algorithm. The optimal parameters found by the two
algorithms are equivalent up to the nonuniqueness of a solution (Theorem 5.9).
In the case of data centering we consider the following modified Hankel low rank
approximation problem:

   minimize over ŵ and c   ‖w − c − ŵ‖_2                            (HLRA_c)
   subject to   rank(H_{n+1}(ŵ)) ≤ r.
Algorithm
We have

   R H_{n+1}(ŵ) = 0   ⟺   T(R) ŵ = 0,

where

$$
T(R) = \begin{bmatrix}
R_0 & R_1 & \cdots & R_n &        &        &     \\
    & R_0 & R_1    & \cdots & R_n &        &     \\
    &     & \ddots &        &     & \ddots &     \\
    &     &        & R_0    & R_1 & \cdots & R_n
\end{bmatrix}
$$
(all missing elements are zeros). Let P be a full rank matrix, such that

   image(P) = ker(T(R)).

Then the constraint of (HLRA_c) can be replaced by

   there is ℓ, such that ŵ = Pℓ,

which leads to the following problem, equivalent to (HLRA_c):

   minimize over R   f(R),

where

   f(R) := min over c, ℓ   ‖ w − [ 1_N ⊗ I_q   P ] [ c ; ℓ ] ‖_2.

The latter is a standard least squares problem, so that the evaluation of f for a
given R can be done efficiently. Moreover, one can exploit the Toeplitz structure
of the matrix T(R) in the computation of P and in the solution of the least squares
problem.
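For the scalar case q = 1, the evaluation of f(R) can be sketched as follows (without exploiting the Toeplitz structure), assuming w is a T × 1 signal and R = [R0 ... Rn] a row vector:

% sketch: evaluate f(R) for a scalar signal w and kernel parameter R
Tw = length(w); n = length(R) - 1;
TR = zeros(Tw - n, Tw);
for t = 1:Tw - n
    TR(t, t:t + n) = R;                  % banded Toeplitz matrix T(R)
end
P  = null(TR);                           % image(P) = ker(T(R))
F  = [ones(Tw, 1) P];                    % [1_N kron I_q, P] with q = 1
ce = F \ w;                              % least squares in (c, ell)
f  = norm(w - F * ce);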
Problem (CLS) is a complex linear least squares problem with the constraint that all
elements of the solution have the same phase.
As formulated, (CLS) is a nonlinear optimization problem. General purpose local
optimization methods can be used for solving it; however, this approach has the
usual disadvantages of local optimization methods: need of an initial approximation, no
guarantee of global optimality, convergence issues, and no insight into the geometry
of the solution set. In Bydder (2010) the following closed-form solution of (CLS)
of the solutions set. In Bydder (2010) the following closed-form solution of (CLS)
is derived:
+
x = ' AH A ' AH be−iφ ,
(SOL1 x)
1
= ∠ AH b ' AH A + AH b ,
φ )
(SOL1 φ
2
where ℜ(A)/ℑ(A) is the real/imaginary part, ∠(A) is the angle, A^H is the com-
plex conjugate transpose, and A^+ is the pseudoinverse of A. Moreover, in the case
when a solution of (CLS) is not unique, (SOL1_x, SOL1_φ) is a least norm ele-
ment of the solution set, i.e., a solution (x̂, φ̂), such that ‖x̂‖_2 is minimized. Ex-
pression (SOL1_x) is the result of minimizing the cost function ‖Axe^{iφ} − b‖_2 with
respect to x, for a fixed φ. This is a linear least squares problem (with complex
valued data and real valued solution). Then minimization of the cost function with
respect to φ, for x fixed to its optimal value (SOL1_x), leads through a nontrivial
chain of steps to (SOL1_φ).
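The closed-form solution can be sketched in a few lines of MATLAB (an illustration of the formulas above, not the book's cls1 chunk; the helper name is hypothetical), assuming A is a complex m × n matrix and b a complex m-vector:

% sketch of (SOL1); the phase-constrained solution of (CLS) is x * exp(i * phi)
function cx = cls1_sketch(A, b)
Ab  = A' * b;                         % A^H b
M   = pinv(real(A' * A));             % Re(A^H A)^+
phi = angle(Ab.' * M * Ab) / 2;       % (SOL1_phi); .' is the plain transpose
x   = M * real(Ab * exp(-1i * phi));  % (SOL1_x), real valued
cx  = x * exp(1i * phi);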
Solution
1 Two optimization problems are equivalent if the solution of the first can be obtained from the
solution of the second by a one-to-one transformation. Of practical interest are equivalent problems
for which the transformation is “simple”.
we have

   [ ℜ(be^{iφ}) ; ℑ(be^{iφ}) ] = [ ℜ(b)  −ℑ(b) ; ℑ(b)  ℜ(b) ] [ y1 ; y2 ],

subject to ‖y‖_2 = 1, or, with

   C := [ ℜ(A)  ℜ(b)  −ℑ(b) ; ℑ(A)  ℑ(b)  ℜ(b) ] ∈ R^{2m×(n+2)}
   and   D := [ 0  0 ; 0  I_2 ] ∈ R^{(n+2)×(n+2)}.                  (C, D)
It is well known that a solution of problem (CLS'') can be obtained from the general-
ized eigenvalue decomposition of the pair of matrices (C⊤C, D). More specifically,
the smallest generalized eigenvalue λ_min of (C⊤C, D) is equal to the minimum
value of (CLS''), i.e.,

   λ_min = ‖A x̂ e^{iφ̂} − b‖²₂.

If λ_min is simple, a corresponding generalized eigenvector z_min is of the form

   z_min = α [ x̂ ; −cos(φ̂) ; sin(φ̂) ],
Theorem 5.15 Let λ_min be the smallest generalized eigenvalue of the pair of ma-
trices (C⊤C, D), defined in (C, D), and let z_min be a corresponding generalized
eigenvector. Assuming that λ_min is a simple eigenvalue, problem (CLS) has a unique
solution, given by

   x̂ = z1 / ‖z2‖_2,   φ̂ = ∠(−z_{2,1} + i z_{2,2}),
   where z_min =: [ z1 ; z2 ], z1 ∈ R^n, z2 ∈ R².                    (SOL2)
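A sketch of (SOL2), assuming C, D, and n are constructed as in (C, D); the book's cls2 function is based on the same decomposition:

% sketch: solve (CLS) via the generalized eigenvalue decomposition of (C'C, D)
[Z, L] = eig(C' * C, D, 'qz');
lam = diag(L); lam(~isfinite(lam)) = Inf; % discard the infinite eigenvalues
[~, k] = min(real(lam)); z = Z(:, k);     % smallest finite eigenvalue/vector
z1 = z(1:n); z2 = z(n + 1:n + 2);
x   = z1 / norm(z2);
phi = angle(-z2(1) + 1i * z2(2));
cx  = x * exp(1i * phi);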
Remarks
1. Generalized eigenvalue decomposition vs. generalized singular value decompo-
sition Since the original data are the matrix A and the vector b, the generalized
singular value decomposition of the pair (C, D) can be used instead of the gen-
eralized eigenvalue decomposition of the pair (C⊤C, D). This avoids “squaring”
the data and is recommended from a numerical point of view.
2. Link to low rank approximation and total least squares Problem (CLS'') is
   equivalent to the generalized low rank approximation problem

      minimize over Ĉ ∈ R^{2m×(n+2)}   ‖(C − Ĉ)D‖_F                 (GLRA)
      subject to   rank(Ĉ) ≤ n + 1 and ĈD⊥ = CD⊥,

   where

      D⊥ = [ I_n  0 ; 0  0 ] ∈ R^{(n+2)×(n+2)}

   and ‖·‖_F is the Frobenius norm. Indeed, the constraints of (GLRA) imply that

      ‖(Ĉ − C)D‖_F = ‖b − b̂‖_2,   where b̂ = Ax̂e^{iφ̂}.
   The normalization (SOL2) is reminiscent of the generic solution of the total least
   squares problem. The solution of total least squares problems, however, involves
   a normalization by scaling with the last element of a vector z_min in the approxi-
   mate kernel of the data matrix C, while the solution of (CLS) involves normal-
   ization by scaling with the norm of the last two elements of the vector z_min.
3. Uniqueness of the solution and minimum norm solutions A solution x of (CLS)
is nonunique when A has nontrivial null space. This source of nonuniqueness is
fixed in Bydder (2010) by choosing from the solutions set a least norm solution.
A least norm solution of (CLS), however, may also be nonunique due to possible
   nonuniqueness of φ. Consider the following example,

      A = [ 1  i ; −i  1 ],   b = [ 1 ; −i ],

      x e^{iφ} = −x e^{i(φ±π)},

   with both φ and one of the angles φ ± π in the interval (−π, π].
Computational Algorithms
and

   cls3:  O( m³ + (n + 2)²m² + (n + 2)²m + (n + 2)³ ).

Note, however, that cls2 and cls3 compute the full generalized eigenvalue
decomposition and generalized singular value decomposition, respectively, while
only the smallest generalized eigenvalue/eigenvector or singular value/singular vec-
tor pair is needed for solving (CLS). This suggests a way of reducing the computa-
tional complexity by an order of magnitude.
The equivalence between problem (CLS) and the generalized low rank approxi-
mation problem (GLRA), noted in remark 2 above, allows us to use the algorithm
from Golub et al. (1987) for solving problem (CLS). The resulting Algorithm 5 is
implemented in the function cls4.
161 Complex least squares, solution by Algorithm 5 161≡
function cx = cls4(A, b)
define C, D, and n 160c
R = triu(qr(C, 0));
[u, s, v] = svd(R((n + 1):(n + 2), (n + 1):end));
phi = angle(v(1, 2) - i * v(2, 2));
x = R(1:n, 1:n)
\ (R(1:n, (n + 1):end) * [v(1, 2); v(2, 2)]);
cx = x * exp(i * phi);
Defines:
cls4, used in chunk 163b.
Its computational cost is

   cls4:  O( (n + 2)²m ).
A summary of the computational costs for the methods, implemented in the func-
tions cls1-4, is given in Table 5.5.
Table 5.5 Summary of methods for solving the complex least squares problem (CLS)

Function  Method                                     Computational cost
cls1      (SOL1_x, SOL1_φ)                           O(n²m + n³)
cls2      full generalized eigenvalue decomp.        O((n + 2)²m + (n + 2)³)
cls3      full generalized singular value decomp.    O(m³ + (n + 2)²m² + (n + 2)²m + (n + 2)³)
cls4      Algorithm 5                                O((n + 2)²m)
Fig. 5.2 Computation time for the four methods, implemented in the functions cls1, . . . , cls4
Numerical Examples
Problem Formulation
Rank Estimation
   D := D0 + D̃

is “close” to a rank-m0 matrix in the sense that the distance of D to the manifold of
rank-m0 matrices

   dist(D, m0) := min over D̂  ‖D − D̂‖_F   subject to   rank(D̂) = m0        (5.12)

is less than the perturbation size ε. Therefore, provided that the size ε of the pertur-
bation D̃ is known, the distance measure dist(D, m), for m = 1, 2, ..., can be used
to estimate the rank of the unperturbed matrix as follows:

   m̂ = arg min { m | dist(D, m) < ε }.
The answer to the above question depends on the type of the perturbation D̃.
If D̃ is a random matrix with zero mean elements that are normally distributed,
independent, and with equal variances, then the estimate D̂, defined by (5.12), is
a maximum likelihood estimator of D0, i.e., it is statistically optimal. If, however,
one or more of the above assumptions are not satisfied, D̂ is not optimal and can
be improved by modifying problem (5.12). Our objective is to justify this statement
in a particular case when there is prior information about the true matrix D0 in the
form of structure in a normalized rank-revealing factorization and the elements of
the perturbation D̃ are independent but possibly with different variances.
Next, we present an algorithm for approximate low rank factorization with struc-
tured factors and test its performance on synthetic data. We use the alternating pro-
jections approach, because it is easier to modify for constrained optimization prob-
lems. Certain constrained problems can be treated also using a modification of the
variable projections.
The true data matrix D0 has rank equal to m and the measurement errors d̃ij are zero
mean, normal, and uncorrelated, with covariance σ 2 vi+q(j −1) . The vector v ∈ RqN
specifies the element-wise variances of the measurement error matrix D̃ up to an
unknown factor σ 2 .
In order to make the parameters P0 and L0 unique, we impose the normalization
constraint (or assumption on the “true” parameter values)

   P0 = [ I_m ; P0′ ].                                               (A1)
   L0 = 1⊤_l ⊗ L̄0,                                                  (A3)

nonnegative

   L0 ≥ 0,                                                           (A4)

and with smooth rows in the sense that

   ‖L0 D2‖_F ≤ δ,                                                    (A5)
The maximum likelihood estimator for the parameters P0 and L0 in (EIV0) under
assumptions (A1–A5), with known parameters m, v, S, and δ, is given by the fol-
lowing optimization problem:

   minimize over P, L, and D̂   ‖D − D̂‖²_Σ   (cost function)         (C0)
The rank and measurement errors assumptions in the model (EIV0 ) imply the
weighted low rank approximation nature of the estimation problem (C0 –C5 ) with
weight matrix given by (Σ). Furthermore, the assumptions (A1 –A5 ) about the true
data matrix D0 correspond to the constraints (C1 –C5 ) in the estimation problem.
Computational Algorithm
of the rows of D on the span of the rows of L, and the problem on step 2 is the orthogonal
projection

   D̂ = P(P⊤P)^{−1}P⊤D = Π_P D

of the columns of D on the span of the columns of P. The algorithm iterates the two
projections.
Note 5.16 (Rank deficient factors P and L) If the factor L is rank deficient, the in-
dicated inverse in the computation of the projected matrix P∗ does not exist. (This
happens when the rank of the approximation D̂ is less than m.) The projection P∗,
however, is still well defined by the optimization problem on step 1 of the algo-
rithm and can be computed in closed form by replacing the inverse with the pseudo-
inverse. The same is true when the factor P is rank deficient.
Assuming that there exists a solution to the problem (C0–C5) and any (locally op-
timal) solution is unique (i.e., it is a strict minimum), the sequences D̂^(k), P^(k),
and L^(k) converge element-wise, i.e.,

   D̂^(k) → D̂^∗,   P^(k) → P^∗,   and   L^(k) → L^∗,   as k → ∞,     (D̂^(k) → D̂^∗)

where D̂^∗ := P^∗ L^∗ is a (locally optimal) solution of (C0–C5).
Simulation Results
In this section, we show empirically that exploiting prior knowledge ((Σ ) and as-
sumptions (A1 –A5 )) improves the performance of the estimator. The data matrix D
is generated according to the errors-in-variables model (EIV0 ) with parameters
N = 100, q = 6, and m = 2. The true low rank matrix D0 = P0 L0 is random and the
parameters P0 and L0 are normalized according to assumption (A1 ) (so that they
are unique). For the purpose of validating the algorithm, the element p0,qN is set to
zero but this prior knowledge is not used in the parameter estimation.
The estimation algorithm is applied on M = 100 independent noise realizations
of the data D. The estimated parameters on the ith repetition are denoted by P (i) ,
L(i) and D(i) := P (i) L(i) . The performance of the estimator is measured by the
following average relative estimation errors:
    eD = (1/M) ∑_{i=1..M} ‖D0 − D̂(i)‖²_F / ‖D0‖²_F ,    eP = (1/M) ∑_{i=1..M} ‖P0 − P̂(i)‖²_F / ‖P0‖²_F ,

    eL = (1/M) ∑_{i=1..M} ‖L0 − L̂(i)‖²_F / ‖L0‖²_F ,    and    ez = (1/M) ∑_{i=1..M} |p̂(i)_{qN}| .
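As an illustration, the error measures can be accumulated over the Monte Carlo repetitions as in the following sketch; the cell arrays Ph, Lh, Dh holding the M estimates and the index of the monitored element of P̂ are assumed names, not the book's code.

% Sketch: average relative estimation errors over M noise realizations.
eD = 0; eP = 0; eL = 0; ez = 0;
for i = 1:M
    eD = eD + norm(D0 - Dh{i}, 'fro')^2 / norm(D0, 'fro')^2;
    eP = eP + norm(P0 - Ph{i}, 'fro')^2 / norm(P0, 'fro')^2;
    eL = eL + norm(L0 - Lh{i}, 'fro')^2 / norm(L0, 'fro')^2;
    ez = ez + abs(Ph{i}(end, 1));   % assumed index of the element of P0 that is set to zero
end
eD = eD / M; eP = eP / M; eL = eL / M; ez = ez / M;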
For comparison, the estimation errors are reported for the low rank approximation algorithm, using only the normalization constraint (A1), as well as for the proposed algorithm, exploiting the available prior knowledge. The difference between the two estimation errors is an indication of how important the prior knowledge is in the estimation.
Lack of prior knowledge is reflected by a specific choice of the simulation parameters, as follows:
    Σ             l    S = [ ]    nonneg
    rand(q, N)    1    yes        0
    ones(q, N)    3    yes        0
    ones(q, N)    1    no         0
    ones(q, N)    1    yes        1
    rand(q, N)    3    no         1
which test individually the effect of (Σ), assumptions (A2), (A3), (A4), and their combined effect on the estimation error. Figures 5.3–5.7 show the average relative estimation errors of the estimator that exploits prior knowledge and of the estimator that does not exploit prior knowledge, versus the measurement noise standard deviation σ, for the five experiments. The vertical bars on the plots visualize the standard deviation of the estimates. The results indicate that the main factors for the improved performance of the estimator are:
1. assumption (A2)—known zeros in P0, and
2. (Σ)—known covariance structure of the measurement noise.
Files reproducing the numerical results and figures presented are available from the
book’s web page.
Implementation of Algorithm 6
Initial Approximation
For initial approximation (P(0), L(0)) we choose the normalized factors of a rank revealing factorization of the solution D̂ of (5.12). Let

    D̂ = U Σ V⊤

and partition

    U =: [U1  U2],    Σ =: [Σ1  0; 0  Σ2],    V =: [V1  V2],

where U1 has m columns, Σ1 is m × m, and V1 has m columns.
Furthermore, let

    col(U11, U21) := U1,  with U11 ∈ R^{m×m}.

Then

    P(0) := U21 U11⁻¹   and   L(0) := U11 Σ1 V1⊤

define the Frobenius-norm optimal unweighted and unconstrained low rank approximation

    D̂(0) := col(Im, P(0)) L(0).
More sophisticated choices for the initial approximation that take into account the
weight matrix Σ are described in Sect. 2.4.
In the weighted case, the projection on step 1 of the algorithm is computed separately for each row p^i of P̂. Let d^i be the ith row of D and w^i be the ith row of Σ. The problem

    minimize over P̂   ‖D − P̂ L̂‖²_Σ   subject to (C1–C2)

is equivalent to the problems

    minimize over p^i   ‖(d^i − p^i L̂) diag(w^i)‖²_2   subject to (C1–C2),        (∗)

for i = 1, . . . , q.
The projection on step 2 of the algorithm is not separable due to constraint (C5 ).
Since the first m rows of P are fixed, we do not solve (∗) for i = 1, . . . , m, but define
p i := ei , for i = 1, . . . , m,
where ei is the ith unit vector (the ith column of the identity matrix Im ).
Fig. 5.7 Effect of weighting, periodicity, and nonnegativity of L, and zero elements in P (estimates with and without exploiting prior knowledge; vertical bars show standard deviations)
    S vec(P̂) = 0   ⇐⇒   p^i Si = 0,  for i = m + 1, . . . , q.

(If there are no zeros in the ith row, then Si is skipped.) The ith problem in (∗) becomes

    minimize over p^i   ‖(d^i − p^i L̂) diag(w^i)‖²_2   subject to p^i Si = 0.        (∗∗)
Let the rows of the matrix Ni form a basis for the left null space of Si . Then p i Si = 0
if and only if p i = zi Ni , for certain zi , and problem (∗∗) becomes
    minimize over zi   ‖(d^i − zi Ni L̂) diag(w^i)‖²_2 .
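A compact way to carry out this reduction in code is sketched below; d_i, w_i, L, and Si (the selector matrix for the zero entries of the ith row) are assumed to be given, and the sketch is not the book's implementation.

% Sketch: constrained weighted least squares for one row of P via the
% null space parameterization p_i = z_i * N_i.
Ni  = null(Si')';                              % rows of Ni span the left null space of Si
A   = (Ni * L) .* repmat(w_i, size(Ni, 1), 1); % maps z_i to (z_i Ni L) diag(w_i)
z_i = (d_i .* w_i) / A;                        % weighted least squares solution for z_i
p_i = z_i * Ni;                                % recovered row, satisfying p_i * Si = 0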
Note 5.18 It is not necessary to explicitly construct the matrices Si and compute a basis Ni for their left null spaces. Since Si is a selector matrix, it is a submatrix of
the identity matrix Im . The rows of the complementary submatrix of Im form a basis
for the left null space of Si . This particular matrix Ni is also a selector matrix, so
that the product Ni L need not be computed explicitly.
We have

    D − P̂ L̂ = D − P̂ (1l⊤ ⊗ L̂′) = [D1 · · · Dl] − P̂ [L̂′ · · · L̂′]

and, stacking the blocks,

    col(D1, . . . , Dl) − col(P̂, . . . , P̂) L̂′ =: D′ − (1l ⊗ P̂) L̂′ = D′ − P̂′ L̂′.
Let

    Σ′ := col(Σ1, . . . , Σl),   where   Σ =: [Σ1 · · · Σl].
The problem for L̂′ is then

    minimize over L̂′   ‖D′ − P̂′ L̂′‖²_Σ′   subject to (C4–C5).
Adding the nonnegativity constraint changes the least squares problem to a nonneg-
ative least squares problem, which is a standard convex optimization problem for
which robust and efficient methods and software exist.
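A possible realization of this step is sketched below for the stacked (periodic) formulation, treating the columns independently and ignoring, for simplicity, the smoothness constraint (C5) that couples them; Dp, Pp (the stacked factor 1l ⊗ P̂), and Sp (the stacked weights) are assumed names, not the book's code.

% Sketch: nonnegative least squares update of the factor L', column by column.
[~, Np] = size(Dp); m = size(Pp, 2); Lp = zeros(m, Np);
for j = 1:Np
    w = Sp(:, j);                                      % weights for the jth column
    Lp(:, j) = lsqnonneg(Pp .* repmat(w, 1, m), Dp(:, j) .* w);
end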
The smoothness constraint (C5) is accounted for by a regularization term, for certain regularization parameter γ. The regularized problem is equivalent to the standard least squares problem

    minimize over L̂   ‖ col( diag(vec Σ) vec D,  0 ) − col( diag(vec Σ)(I ⊗ P̂),  √γ (D ⊗ I) ) vec(L̂) ‖²_2 .
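The stacked least squares problem can be assembled as in the following sketch; here Dd stands for the smoothing matrix appearing in assumption (A5) (denoted D there), the transpose in the second Kronecker product follows from vec(L Dd) = (Dd⊤ ⊗ I) vec(L), and all names are assumptions rather than the book's code.

% Sketch: regularized least squares problem for vec(L).
[q, N] = size(D); m = size(P, 2); Nd = size(Dd, 2);
W  = diag(Sigma(:));                                   % diag(vec(Sigma)), dense for simplicity
A  = [W * kron(eye(N), P); sqrt(gamma) * kron(Dd', eye(m))];
b  = [W * D(:); zeros(m * Nd, 1)];
vl = A \ b;                                            % standard least squares solution
L  = reshape(vl, m, N);                                % recover L from vec(L)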
Stopping Criteria
The iteration is terminated when the following stopping criteria are satisfied:

    ‖P̂(k+1) L̂(k+1) − P̂(k) L̂(k)‖_Σ / ‖P̂(k+1) L̂(k+1)‖_Σ < εD ,
    ‖(P̂(k+1) − P̂(k)) L̂(k+1)‖_Σ / ‖P̂(k+1) L̂(k+1)‖_Σ < εP ,   and
    ‖P̂(k+1) (L̂(k+1) − L̂(k))‖_Σ / ‖P̂(k+1) L̂(k+1)‖_Σ < εL .
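In code, the three tests can be evaluated as in the sketch below, where Pk, Lk and Pk1, Lk1 denote consecutive iterates, Sigma holds the element-wise weights, and the weighted norm is taken element-wise; the names and the exact form of the Σ-weighted norm are assumptions.

% Sketch: convergence test for the alternating projections iteration.
wnorm = @(X) norm(Sigma .* X, 'fro');        % assumed element-wise weighted norm
nrm   = wnorm(Pk1 * Lk1);
converged = wnorm(Pk1 * Lk1 - Pk * Lk) / nrm < eD && ...
            wnorm((Pk1 - Pk) * Lk1)    / nrm < eP && ...
            wnorm(Pk1 * (Lk1 - Lk))    / nrm < eL;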
5.5 Notes
Missing Data
Optimization methods for solving weighted low rank approximation problems with
nonsingular weight matrix have been considered in the literature under different
names:
• criss-cross multiple regression (Gabriel and Zamir 1979),
• Riemannian singular value decomposition (De Moor 1993),
• maximum likelihood principal component analysis (Wentzell et al. 1997),
• weighted low rank approximation (Manton et al. 2003), and
• weighted low rank approximation (Markovsky et al. 2005).
Gabriel and Zamir (1979) consider an element-wise weighted low rank approxi-
mation problem with diagonal weight matrix W , and propose an iterative solution
method. Their method, however, does not necessarily converge to a minimum point,
see the discussion in Gabriel and Zamir (1979, Sect. 6, p. 491). Gabriel and Zamir
(1979) proposed an alternating projections algorithm for the case of unweighted ap-
proximation with missing values, i.e., wij ∈ {0, 1}. Their method was further gener-
alized by Srebro (2004) for arbitrary weights.
The Riemannian singular value decomposition framework of De Moor (1993)
includes the weighted low rank approximation problem with rank specification
r = min(m, n) − 1 and a diagonal weight matrix W as a special case. In De Moor
(1993), an algorithm resembling the inverse power iteration algorithm is proposed.
The method, however, has no proven convergence properties.
Manton et al. (2003) treat the problem as an optimization over a Grassmann manifold and propose steepest descent and Newton type algorithms. The least squares nature of the problem is not exploited in this work and the proposed algorithms are not globally convergent.
The maximum likelihood principal component analysis method of Wentzell et al. (1997) is developed for applications in chemometrics, see also Schuermans et al. (2005). This method is an alternating projections algorithm. It applies to the general weighted low rank approximation problems and is globally convergent. The convergence rate, however, is linear and the method could be rather slow when the (r + 1)st and the rth singular values of the data matrix D are close to each other. In the un-
weighted case this situation corresponds to lack of uniqueness of the solution, cf.,
Theorem 2.23. The convergence properties of alternating projections algorithms are
studied in Krijnen (2006), Kiers (2002).
An implementation of the singular value thresholding method in M ATLAB is
available at http://svt.caltech.edu/. Practical methods for solving the recommender
system problem are given in Segaran (2007). The MovieLens data set is available
from GroupLens (2009).
The notation D ≥ 0 is used for a matrix D ∈ Rq×N whose elements are nonnegative.
A low rank approximation problem with element-wise nonnegativity constraint
    minimize over D̂   ‖D − D̂‖   subject to rank(D̂) ≤ m and D̂ ≥ 0        (NNLRA)
arises in Markov chains (Vanluyten et al. 2006) and image mining (Lee and Seung
1999). Using the image representation D̂ = P̂ L̂ with elementwise nonnegative factors P̂ and L̂, the problem is solved by alternating over the two factors: on each iteration, nonnegative least squares problems (convex optimization problems) are solved. The resulting alternating least squares algorithm is Algorithm 7.
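The following is a minimal sketch, in the spirit of Algorithm 7 but not a copy of it, of an alternating nonnegative least squares method; imposing elementwise nonnegativity on both factors guarantees D̂ = P̂L̂ ≥ 0.

function [P, L] = nnlra_als(D, m, maxiter)
% Sketch of alternating nonnegative least squares for (NNLRA).
[q, N] = size(D);
P = rand(q, m);                              % random nonnegative initial value
L = zeros(m, N);
for k = 1:maxiter
    for j = 1:N                              % solve for the columns of L, P fixed
        L(:, j) = lsqnonneg(P, D(:, j));
    end
    for i = 1:q                              % solve for the rows of P, L fixed
        P(i, :) = lsqnonneg(L', D(i, :)')';
    end
end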
References
Berman A, Shaked-Monderer N (2003) Completely positive matrices. World Scientific, Singapore
Bydder M (2010) Solution of a complex least squares problem with constrained phase. Linear
Algebra Appl 433(11–12):1719–1721
Cai JF, Candés E, Shen Z (2009) A singular value thresholding algorithm for matrix completion.
www-stat.stanford.edu/~candes/papers/SVT.pdf
Candés E, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math
9:717–772
De Moor B (1993) Structured total least squares and L2 approximation problems. Linear Algebra
Appl 188–189:163–207
Gabriel K, Zamir S (1979) Lower rank approximation of matrices by least squares with any choice
of weights. Technometrics 21:489–498
Golub G, Hoffman A, Stewart G (1987) A generalization of the Eckart–Young–Mirsky matrix
approximation theorem. Linear Algebra Appl 88/89:317–327
GroupLens (2009) Movielens data sets. www.grouplens.org/node/73
Kiers H (2002) Setting up alternating least squares and iterative majorization algorithms for solving
various matrix optimization problems. Comput Stat Data Anal 41:157–170
Krijnen W (2006) Convergence of the sequence of parameters generated by alternating least
squares algorithms. Comput Stat Data Anal 51:481–489
Lee D, Seung H (1999) Learning the parts of objects by non-negative matrix factorization. Nature
401:788–791
Manton J, Mahony R, Hua Y (2003) The geometry of weighted low-rank approximations. IEEE
Trans Signal Process 51(2):500–514
Markovsky I, Rastello ML, Premoli A, Kukush A, Van Huffel S (2005) The element-wise weighted
total least squares problem. Comput Stat Data Anal 50(1):181–209
Schuermans M, Markovsky I, Wentzell P, Van Huffel S (2005) On the equivalence between total
least squares and maximum likelihood PCA. Anal Chim Acta 544:254–267
Segaran T (2007) Programming collective intelligence: building smart Web 2.0 applications.
O’Reilly Media
Srebro N (2004) Learning with matrix factorizations. PhD thesis, MIT
Vanluyten B, Willems JC, De Moor B (2006) Matrix factorization and stochastic state representa-
tions. In: Proc 45th IEEE conf on dec and control, San Diego, California, pp 4188–4193
Wentzell P, Andrews D, Hamilton D, Faber K, Kowalski B (1997) Maximum likelihood principal
component analysis. J Chemom 11:339–366
Chapter 6
Nonlinear Static Data Modeling
Introduction
Identifying a curve in a set of curves that best fits given data points is a common
problem in computer vision, statistics, and coordinate metrology. More abstractly,
approximation by Fourier series, wavelets, splines, and sum-of-exponentials are also
curve-fitting problems. In the applications, the fitted curve is a model for the data
and, correspondingly, the set of candidate curves is a model class.
Data modeling problems are specified by choosing a model class and a fitting cri-
terion. The fitting criterion is maximization of a measure for fit between the data and
a model. Equivalently, the criterion can be formulated as minimization of a measure
for lack of fit (misfit) between the data and a model. Data modeling problems can be
classified according to the type of model and the type of fitting criterion as follows:
• linear/affine vs. nonlinear model class,
• algebraic vs. geometric fitting criterion.
A model is a subset of the data space. The model is linear/affine if it is a sub-
space/affine set. Otherwise, it is nonlinear. A geometric fitting criterion minimizes
the sum-of-squares of the Euclidean distances from the data points to a model. An
algebraic fitting criterion minimizes an equation error (residual) in a representation
of the model. In general, the algebraic fitting criterion has no simple geometric in-
terpretation. Problems using linear model classes and algebraic criteria are easier
to solve numerically than problems using nonlinear model classes and geometric
criteria.
In this chapter, a nonlinear model class of bounded complexity, consisting of
affine varieties, i.e., kernels of systems of multivariable polynomials is considered.
The complexity of an affine variety is defined as the pair of the variety’s dimen-
sion and the degree of its polynomial representation. In Sect. 6.2, an equivalence
is established between the data modeling problem and low rank approximation of
a polynomially structured matrix constructed from the data. Algorithms for solving
nonlinearly structured low rank approximation problems are presented in Sect. 6.3.
As illustrated in Sect. 6.4, the low rank approximation setting makes it possible to
use a single algorithm and a piece of software for solving a wide variety of curve-
fitting problems.
    D = {d1, . . . , dN} ⊂ R^q.

The observations d1, . . . , dN are real q-dimensional vectors. A model for the data D is a subset of the data space R^q and a model class M^q for D is a set of subsets of the data space R^q, i.e., M^q is an element of the powerset 2^(R^q). For example, the
that this relation defines. The input/output representation y = f (u) implies that the
variables u are inputs and the variables y are outputs of the model B.
Input-output representations are appealing because they are explicit functions,
mapping some variables (inputs) to other variables (outputs) and thus display a
causal relation among the variables (the inputs cause the outputs). The alternative
kernel representation
    B = ker(R) := { d ∈ R^q | R(d) = 0 }        (KER)
defines the model via an implicit function R(d) = 0, which does not a priori bound
one set of variables as a cause and another set of variables as an effect.
An a priori fixed causal relation, imposed on the model by an input/output represen-
tation, is restrictive. Consider, for example, data fitting by a model that is a conic
section. Only parabolas and lines can be represented by functions. Hyperbolas, el-
lipses, and the vertical line {(u, y) | u = 0} are not graphs of a function y = f (u)
and therefore cannot be modeled by an input/output representation.
Special Cases
The model class P^q_{m,d} and the related exact and approximate modeling problems (EM) and (AM) have as an important special case the linear model class and linear data modeling problems.
1. Linear/affine model class of bounded complexity. An affine model B (i.e., an affine set in R^q) is an affine variety, defined by a first order polynomial through a kernel or image representation. The dimension of the affine variety coincides with the dimension of the affine set. Therefore, P^q_{m,1} is an affine model class in R^q with complexity bounded by m. The linear model class in R^q, with dimension bounded by m, is a subset L^q_{m,0} of P^q_{m,1}.
2. Geometric fitting by a linear model. Approximate data modeling using the linear model class L^q_m and the geometric fitting criterion (dist) is a low rank approximation problem

       minimize over D̂   ‖Φ(D) − Φ(D̂)‖_F   subject to rank(Φ(D̂)) ≤ m,        (LRA)

   where

       Φ(D) := [d1 · · · dN].
The rank constraint in (LRA) is equivalent to the constraint that the data D are
exact for a linear model of dimension bounded by m. This justifies the statement
that exact modeling is an ingredient of approximate modeling.
3. Algebraic curves. In the special case of a curve in the plane, we use the notation
Examples of 4th order algebraic curves, see Fig. 6.2, are the eight curve

    B = {col(x, y) | y² − x² + x⁴ = 0}
uniquely defines the vector of monomials φ. The degrees matrix D depends only on
the number of variables q and the degree d. For example, with q = 2 and d = 2,
    D = [ 2  1  1  0  0  0
          0  1  0  2  1  0 ].
The function monomials generates an implicit function phi that evaluates the
2-variate vector of monomials φ, with degree d.
184a Monomials constructor 184a≡ 184b
function [Deg, phi] = monomials(deg)
Defines:
monomials, used in chunks 190a and 192e.
First an extended degrees matrix Dext ∈ {0, 1, . . . , d}^((d+1)² × 2), corresponding to all pairs of degrees of the two variables, is generated.
Then the rows of Dext are scanned and those with degree less than or equal to d are
selected to form a matrix D.
185 Monomials constructor 184a+≡ 184b
str = []; Deg = []; q = 2;
for i = 1:size(Deg_ext, 1)
    if (sum(Deg_ext(i, :)) <= deg)
        for k = q:-1:1,
            str = sprintf('.* d(%d,:) .^ %d %s', ...
                          k, Deg_ext(i, k), str);
        end
        str = sprintf('; %s', str(4:end));
        Deg = [Deg_ext(i, :); Deg];
    end
end
eval(sprintf('phi = @(d) [%s];', str(2:end)))
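A small usage example of the constructor (the call below assumes the complete definition of monomials, part of which is omitted in this excerpt):

[Deg, phi] = monomials(2);     % q = 2 variables, degree d = 2
d   = [1 -1  0;                % three data points, given as the columns of a 2 x N matrix
       2  0  1];
Phi = phi(d);                  % qext x N matrix: row i is the ith monomial evaluated at the data
% Deg lists, row by row, the degrees of each monomial in the two variables
% (cf. the degrees matrix shown above, up to transposition and row ordering).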
Minimality of the kernel representation is equivalent to the condition that the pa-
rameter Θ is full row rank. The nonuniqueness of RΘ corresponds to a nonunique-
ness of Θ. The parameters Θ and QΘ, where Q is a nonsingular matrix, define the
same model. Therefore, without loss of generality, we can assume that the represen-
tation is minimal and normalize it, so that
    ΘΘ⊤ = Ip.
Note that a p × qext full row rank matrix Θ defines via (RΘ) a polynomial matrix RΘ, which defines a minimal kernel representation (KER) of a model BΘ in P^q_{m,d}. Therefore, Θ defines a function

    BΘ : R^(p×qext) → P^q_{m,d}.

Vice versa, a model B in P^q_{m,d} corresponds to a (nonunique) p × qext full row rank matrix Θ, such that B = BΘ. For a given q, there are one-to-one mappings

    qext ↔ d   and   p ↔ m,
Main Results
We show a relation of the approximate modeling problems (AM) and (EM) for the
model class Pm,d to low rank approximation problems.
Proposition 6.1 The algebraic fitting problem for the model class of affine varieties with bounded complexity P_{m,d} is

    minimize over Θ ∈ R^(p×qext)   ∑_{j=1..N} ‖RΘ(dj)‖²_F   subject to ΘΘ⊤ = Ip.        (AMΘ)
Proof Using the polynomial representation (RΘ), the squared cost function of (AMΘ) can be rewritten as a quadratic form:

    ∑_{j=1..N} ‖RΘ(dj)‖²_F = ‖Θ Φd(D)‖²_F = trace( Θ Φd(D) Φd⊤(D) Θ⊤ ) = trace( Θ Ψd(D) Θ⊤ ).
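The quadratic form suggests the standard computational recipe for the algebraic fit: take the rows of Θ̂ to be eigenvectors of Ψd(D) = Φd(D)Φd⊤(D) associated with its p smallest eigenvalues. A minimal sketch, assuming the 2 × N data matrix D, the degree deg, and p are given, and reusing the monomials constructor defined earlier in this chapter:

% Sketch: algebraic fitting by eigenvalue decomposition of Psi.
[Deg, phi] = monomials(deg);
Phi = phi(D);                          % qext x N matrix of monomials of the data
Psi = Phi * Phi'; Psi = (Psi + Psi') / 2;   % symmetrize to guard against round-off
[V, E] = eig(Psi);
[~, ind] = sort(diag(E), 'ascend');
Theta = V(:, ind(1:p))';               % p eigenvectors with smallest eigenvalues, as rows

Since eig returns orthonormal eigenvectors for a symmetric matrix, the normalization ΘΘ⊤ = Ip is satisfied up to rounding errors.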
Proposition 6.2 (Geometric fit ⇐⇒ polynomial structured low rank approx) The geometric fitting problem for the model class of affine varieties with bounded complexity P_{m,d}

    minimize over B̂ ∈ P_{m,d}   dist(D, B̂)        (AM)

is equivalent to the polynomially structured low rank approximation problem

    minimize over D̂ ∈ R^(q×N)   ‖D − D̂‖_F   subject to rank( Φd(D̂) ) ≤ qext − p.        (PSLRA)
    subject to D̂ ⊂ B̂ ∈ P_{m,d}.        (∗)

Replacing the constraint of (∗) with a rank constraint for the structured matrix Φd(D̂), this latter problem becomes a polynomially structured low rank approximation problem (PSLRA).
Propositions 6.1 and 6.2 show a relation between the algebraic and geometric
fitting problems.
Corollary 6.3 The algebraic fitting problem (AMΘ ) is a relaxation of the geometric
fitting problem (AM), obtained by removing the structure constraint of the approxi-
mating matrix Φd (D).
6.3 Algorithms
In the linear case, the misfit computation problem is a linear least norm problem.
This fact is effectively used in the variable projections method. In the nonlinear
case, the misfit computation problem is a nonconvex optimization problem. Thus the
elimination step of the variable projections approach is not possible in the nonlinear
case. This requires the data approximation D = {d1 , . . . , dN } to be treated as an
extra optimization variable together with the model parameter Θ. As a result, the
computational complexity and sensitivity to local minima increases in the nonlinear
case.
The above consideration makes critical the choice of the initial approximation.
The default initial approximation is obtained from a direct method such as the alge-
braic fitting method. Next, we present a modification of the algebraic fitting method
that is motivated by the objective of obtaining an unbiased estimate in the errors-in-
variables setup.
Here B0 is the to-be-estimated true model. The estimate B̂ obtained by the algebraic fitting method (AMΘ) is biased, i.e., E(B̂) ≠ B0. In this section, we derive a bias correction procedure. The correction depends on the noise variance σ², however, the noise variance can be estimated from the data D together with the model parameter Θ. The resulting bias corrected estimate B̂c is invariant to rigid transformations. Simulation results show that B̂c has smaller orthogonal distance to the data than alternative direct methods.
The algebraic fitting method computes the rows of parameter estimate Θ as eigen-
vectors related to the p smallest eigenvalues of Ψ . We construct a “corrected” matrix
Ψc , such that
E(Ψc ) = Ψ0 . (∗)
This property ensures that the corrected estimate Θc , obtained from the eigenvectors
related to the p smallest eigenvalues of Ψc , is unbiased.
188a Bias corrected low rank approximation 188a≡
function [th, sh] = bclra(D, deg)
[q, N] = size(D); qext = nchoosek(q + deg, deg);
construct the corrected matrix Ψc 190a
estimate σ 2 and θ 190c
Defines:
bclra, used in chunk 192e.
The key tool to achieve bias correction is the sequence of the Hermite polynomi-
als, defined by the recursion
(See Table 6.1 for explicit expressions of h2, . . . , h10.) The Hermite polynomials have the deconvolution property

    E hk(x0 + x̃) = x0^k,   where x̃ ∼ N(0, 1).        (∗∗)
The following code generates a cell array h of functions that evaluate the sequence of Hermite polynomials: h{k+1}(d) = hk(d). (The difference in the indices of h and hk is due to the MATLAB convention that indices are positive integers.)
188b define the Hermite polynomials 188b≡ (190a)
h{1} = @(x) 1; h{2} = @(x) x;
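The remaining elements of h are defined in the omitted part of this chunk. In the representation that chunk 189 below relies on (each h{k+1}(x) returning a row vector of coefficients in ascending powers of σ², so that products of Hermite polynomials become convolutions of coefficient vectors), the next few entries would plausibly read as follows; this is a sketch consistent with the deconvolution property (∗∗), not the book's own code.

h{3} = @(x) [x.^2, -1];             % h2(x) = x^2 - sigma^2
h{4} = @(x) [x.^3, -3*x];           % h3(x) = x^3 - 3 sigma^2 x
h{5} = @(x) [x.^4, -6*x.^2, 3];     % h4(x) = x^4 - 6 sigma^2 x^2 + 3 sigma^4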
By the data generating assumption (EIV), the perturbations d̃kℓ are independent, zero mean, normally distributed. Then, using the deconvolution property (∗∗) of the Hermite polynomials, the corrected matrix with elements

    ψc,ij := ∑_{ℓ=1..N} ∏_{k=1..q} h_(dik+djk)(dkℓ)        (ψij)

has the unbiasedness property (∗), i.e.,

    E(ψc,ij) = ∑_{ℓ=1..N} ∏_{k=1..q} d0,kℓ^(dik+djk) =: ψ0,ij.
The elements ψc,ij of the corrected matrix are even polynomials of σ of degree less than or equal to

    dψ = ⌈ (qd + 1) / 2 ⌉.
The following code constructs a 1 × (dψ + 1) vector of the coefficients of ψc,ij
as a polynomial of σ 2 . Note that the product of Hermite polynomials in (ψij ) is a
convolution of their coefficients.
189 construct ψc,ij 189≡ (190a)
Deg_ij = Deg(i, :) + Deg(j, :);
for l = 1:N
psi_ijl = 1;
for k = 1:q
psi_ijl = conv(psi_ijl, h{Deg_ij(k) + 1}(D(k, l)));
end
psi_ijl = [psi_ijl zeros(1, dpsi - length(psi_ijl))];
psi(i, j, :) = psi(i, j, :) + ...
reshape(psi_ijl(1:dpsi), 1, 1, dpsi);
end
The nonlinearly structured low rank approximation problem (PSLRA) is solved nu-
merically using Optimization Toolbox.
6.4 Examples
In this section, we apply the algebraic and geometric fitting methods on a range of
algebraic curve fitting problems. In all examples, except for the last one, the data D
are simulated in the errors-in-variables setup, see (EIV) on p. 187. The perturbations d̃j, j = 1, . . . , N are independent, zero mean, normally distributed 2 × 1 vectors with covariance matrix σ²I2. The true model B0 = ker(r0), the number of data points N, and the perturbation standard deviation σ are simulation parameters. The true model is plotted by a solid line, the data points by circles, and the algebraic, bias corrected, and geometric fits by different line styles.
Test Function
Plotting a model in a region, defined by the vector rect, is done with the function plot_model.
193a Plot the model 193a≡
function H = plot_model(r, rect, varargin)
H = ezplot(r, rect);
if nargin > 2, for h = H’, set(h, varargin{:}); end, end
Defines:
plot_model, used in chunk 192.
The function th2poly converts a vector of polynomial coefficients to a function
that evaluates that polynomial.
193b Θ → RΘ 193b≡
function r = th2poly(th, phi),
r = @(x, y) th' * phi([x y]');
Defines:
th2poly, used in chunk 192f.
Simulation 1: Parabola
B = {col(x, y) | y = x² + 1}
193c Curve fitting examples 193c≡ 194a
clear all
name = 'parabola';
N = 20; s = 0.1;
deg = 2; syms x y;
r = x^2 - y + 1;
ax = [-1 1 1 2.2];
test_curve_fitting
Uses test_curve_fitting 192a.
Simulation 2: Hyperbola
B = {col(x, y) | x² − y² − 1 = 0}
194a Curve fitting examples 193c+≡ 193c 194b
name = 'hyperbola';
N = 20; s = 0.3;
deg = 2; syms x y;
r = x^2 - y^2 - 1;
ax = [-2 2 -2 2];
test_curve_fitting
Uses test_curve_fitting 192a.
Simulation 3: Cissoid
B = {col(x, y) | y²(1 + x) = (1 − x)³}
194b Curve fitting examples 193c+≡ 194a 194c
name = 'cissoid';
N = 25; s = 0.02;
deg = 3; syms x y;
r = y^2 * (1 + x) ...
- (1 - x)^3;
ax = [-1 0 -10 10];
test_curve_fitting
Defines:
examples_curve_fitting,
never used.
Uses test_curve_fitting 192a.
Simulation 4: Folium
of Descartes
B = {col(x, y) | x³ + y³ − 3xy = 0}
194c Curve fitting examples 193c+≡ 194b 195a
name = 'folium';
N = 25; s = 0.1;
deg = 3; syms x y;
r = x^3 + y^3
- 3 * x * y;
ax = [-2 2 -2 2];
test_curve_fitting
Uses test_curve_fitting 192a.
6.5 Notes
Fitting curves to data is a basic problem in coordinate metrology, see Van Huffel
(1997, Part IV). In the computer vision literature, there is a large body of work
on ellipsoid fitting (see, e.g., Bookstein 1979; Gander et al. 1994; Kanatani 1994;
Fitzgibbon et al. 1999; Markovsky et al. 2004), which is a special case of the data fitting problem considered in this chapter when the degree of the polynomial is equal to two.
In the systems and control literature, the geometric distance is called misfit and the algebraic distance is called latency, see Lemmerling and De Moor (2001). Identification of linear time-invariant dynamical systems using the latency criterion leads to the autoregressive moving average exogenous setting, see Ljung (1999) and Söderström and Stoica (1989). Identification of linear time-invariant dynamical systems using the misfit criterion leads to the errors-in-variables setting, see Söderström (2007).
State-of-the art image segmentation methods are based on the level set approach
(Sethian 1999). Level set methods use implicit equations to represent a contour in
the same way we use kernel representations to represent a model in this chapter.
The methods used for parameter estimation in the level set literature, however, are
based on solution of partial differential equations while we use classical parameter
estimation/optimization methods.
Relaxation of the nonlinearly structured low rank approximation problem, based
on ignoring the nonlinear structure and thus solving the problem as unstructured
low rank approximation (i.e., the algebraic fitting method) is known in the machine
learning literature as kernel principal component analysis (Schölkopf et al. 1999).
The principal curves, introduced in Hastie and Stuetzle (1989), lead to a problem
of minimizing the sum of squares of the distances from data points to a curve. This
is a polynomially structured low rank approximation problem. More generally, di-
mensionality reduction by manifold learning, see, e.g., Zhang and Zha (2005) is
related to the problem of fitting an affine variety to data, which is also polynomially
structured low rank approximation.
Nonlinear (Vandermonde) structured total least squares problems are discussed
in Lemmerling et al. (2002) and Rosen et al. (1998) and are applied to fitting a sum
of damped exponentials model to data. Fitting a sum of damped exponentials to data,
however, can be solved as a linear (Hankel) structured approximation problem. In
contrast, the geometric data fitting problem considered in this chapter in general
cannot be reduced to a linearly structured problem and is therefore a genuine appli-
cation of nonlinearly structured low rank approximation.
The problem of passing from image to kernel representation of the model is known as the implicitization problem (Cox et al. 2004, p. 96) in computer algebra.
The reverse transformation—passing from a kernel to an image representation of
the model, is a problem of solving a system of multivariable polynomial equations.
References
Bookstein FL (1979) Fitting conic sections to scattered data. Comput Graph Image Process 9:59–
71
Cox D, Little J, O’Shea D (2004) Ideals, varieties, and algorithms. Springer, Berlin
Fitzgibbon A, Pilu M, Fisher R (1999) Direct least-squares fitting of ellipses. IEEE Trans Pattern
Anal Mach Intell 21(5):476–480
Gander W, Golub G, Strebel R (1994) Fitting of circles and ellipses: least squares solution. BIT
Numer Math 34:558–578
Hastie T, Stuetzle W (1989) Principal curves. J Am Stat Assoc 84:502–516
Kanatani K (1994) Statistical bias of conic fitting and renormalization. IEEE Trans Pattern Anal
Mach Intell 16(3):320–326
Lemmerling P, De Moor B (2001) Misfit versus latency. Automatica 37:2057–2067
Lemmerling P, Van Huffel S, De Moor B (2002) The structured total least squares approach for
nonlinearly structured matrices. Numer Linear Algebra Appl 9(1–4):321–332
Ljung L (1999) System identification: theory for the user. Prentice-Hall, Upper Saddle River
Markovsky I, Kukush A, Van Huffel S (2004) Consistent least squares fitting of ellipsoids. Numer
Math 98(1):177–194
Rosen J, Park H, Glick J (1998) Structured total least norm for nonlinear problems. SIAM J Matrix
Anal Appl 20(1):14–30
Schölkopf B, Smola A, Müller K (1999) Kernel principal component analysis. MIT Press, Cam-
bridge, pp 327–352
Sethian J (1999) Level set methods and fast marching methods. Cambridge University Press, Cam-
bridge
Söderström T (2007) Errors-in-variables methods in system identification. Automatica 43:939–
958
Söderström T, Stoica P (1989) System identification. Prentice Hall, New York
Van Huffel S (ed) (1997) Recent advances in total least squares techniques and errors-in-variables
modeling. SIAM, Philadelphia
Zhang Z, Zha H (2005) Principal manifolds and nonlinear dimension reduction via local tangent
space alignment. SIAM J Sci Comput 26:313–338
Chapter 7
Fast Measurements of Slow Processes
7.1 Introduction
The core idea developed in this chapter is expressed in the following problem from
Luenberger (1979, p. 53):
Problem 7.1 A thermometer reading 21°C, which has been inside a house for a
long time, is taken outside. After one minute the thermometer reads 15°C; after two
minutes it reads 11°C. What is the outside temperature? (According to Newton’s
law of cooling, an object of higher temperature than its environment cools at a rate
that is proportional to the difference in temperature.)
The solution of the problem shows that measurement of a signal from a “slow” process can be sped up by data processing. The solution applies to the special
case of exact data from a first order single input single output linear time-invariant
system. Our purpose is to generalize Problem 7.1 and its solution to multi input multi
output processes with higher order linear time-invariant dynamics and to make the
solution a practical tool for improvement of the speed-accuracy characteristics of
measurement devices by signal processing.
A method for speeding up of measurement devices is of generic interest. Specific
applications can be found in process industry where the process, the measurement
device, or both have slow dynamics, e.g., processes in biotechnology that involve
chemical reactions and convection. Of course, the notion of “slow process” is rel-
ative. There are applications, e.g., weight measurement, where the dynamics may
be fast according to the human perception but slow according to the state-of-the-art
technological requirements.
The dynamics of the process is assumed to be linear time-invariant but otherwise
it need not be known. Knowledge of the process dynamics simplifies considerably
the problem. In some applications, however, such knowledge is not available a pri-
ori. For example, in Problem 7.1 the heat exchange coefficient (cooling time con-
stant) depends on unknown environmental factors, such as pressure and humidity.
As another example, consider the dynamic weighing problem, where the unknown
mass of the measured object affects the dynamics of the weighing process. The lin-
earity assumption is justifiable on the basis that nonlinear dynamical processes can
be approximated by linear models, and existence of an approximate linear model is
often sufficient for solving the problem to a desirable accuracy. The time-invariance
assumption can be relaxed in the recursive estimation algorithms proposed by us-
ing windowing of the processed signal and forgetting factor in the recursive update
formula.
Although only the steady-state measurement value is of interest, the solution of
Problem 7.1 identifies explicitly the process dynamics as a byproduct. The pro-
posed methods for dynamic weighing (see the notes and references section) are also
model-based and involve estimation of model parameters. Similarly, the generalized
problem considered reduces to a system identification question—find a system from
step response data. Off-line and recursive methods for solving this latter problem,
under different noise assumptions, exist in the literature, so that these methods can
be readily used for input estimation.
In this chapter, a method for estimation of the parameter of interest—the input
step value—without identifying a model of the measurement process is described.
The key idea that makes model-free solution possible comes from work on data-
driven estimation and control. The method proposed requires only a solution of a
system of linear equations and is computationally more efficient and less expensive
to implement in practice than the methods based on system identification. Modifi-
cations of the model-free algorithm compute approximate solutions that are optimal
for different noise assumptions. The modifications are obtained by replacing ex-
act solution of a system of linear equations with approximate solution by the least
squares method or one of its variations. On-line versions of the data-driven method,
using ordinary or weighted least squares approximation criterion, are obtained by
using a standard recursive least squares algorithms. Windowing of the processed
signal and forgetting factor in the recursive update allow us to make the algorithm
adaptive to time-variation of the process dynamics.
The possibility of using theoretical results, algorithms, and software for differ-
ent standard identification methods in the model based approach and approximation
methods in the data-driven approach leads to a range of new methods for speedup
of measurement devices. These new methods inherit desirable properties (such as
consistency, statistical and computational efficiency) from the identification or ap-
proximation methods being used.
The generalization of Problem 7.1 considered is defined in this section. In
Sect. 7.2, the simpler version of the problem when the measurement process dy-
namics is known is solved by reducing the problem to an equivalent state estima-
tion problem for an augmented autonomous system. Section 7.3 reduces the general
problem without known model to standard identification problems—identification
from step response data as well as identification in a model class of autonomous
system (sum-of-damped exponentials modeling). Then the data-driven method is
derived as a modification of the method based on autonomous system identifica-
tion. Examples and numerical results showing the performance of the methods are
presented in Sect. 7.4. The proofs of the statements are almost identical for the continuous and discrete-time cases, so that only the proof for the discrete-time case is given.
Problem Formulation
Problem 7.2 A step input u = ūs, where ū ∈ R, is applied to a first order stable
linear time-invariant system with unit dc gain.1 Find the input step level ū from the
output values y(0) = 21, y(1) = 15, and y(2) = 11.
The heat exchange dynamics in Problem 7.1 is indeed a first order stable linear
time-invariant system with unit dc gain and step input. The input step level ū is
therefore equal to the steady-state value of the output and is the quantity of interest.
Stating Problem 7.1 in system theoretic terms opens a way to generalizations and
makes clear the link of the problem to system identification. The general problem
considered is defined as follows.
1 The dc (or steady-state) gain of a linear time-invariant system, defined by an input/output repre-
sentation with transfer function H , is G = H (0) in the continuous-time case and G = H (1) in the
discrete-time case. As the name suggests, the dc gain G is the input-output amplification factor in a
steady-state regime, i.e., constant input and output.
Note 7.5 (Exact vs. noisy observations) Problem 7.3 aims to determine the exact
input value ū. Correspondingly, the given observations y of the sensor are assumed
to be exact. Apart from the problem of finding the exact value of ū from exact
data y, the practically relevant estimation and approximation problems are consid-
ered, where the observations are subject to measurement noise
    y = y0 + ỹ,   where y0 ∈ B and ỹ is a white process with ỹ(t) ∼ N(0, V)        (OE)

with

    Aaut := [ A  B ; 0  Im ]   and   Caut := [ C  D ].        (∗)
Then

    (ūs, y) ∈ B ∈ L^{m+p}_{m,n}   ⇐⇒   y ∈ Baut ∈ L^p_{0,n+m}  and  Baut has m poles at 0 in the continuous-time case, or at 1 in the discrete-time case.
The state vector x of Bi/s/o(A, B, C, D) and the state vector xaut of Bi/s/o(Aaut, Caut) are related by xaut = (x, ū).
Proof
Proposition 7.7 shows that Problem 7.6 can be solved as a state estimation problem for the augmented system Bi/s/o(Aaut, Caut). In the case of discrete-time ex-
act data y, a dead-beat observer computes the exact state vector in a finite (less
than n + m) time steps. In the case of noisy observations (OE), the maximum-
likelihood state estimator is the Kalman filter.
The algorithm for input estimation, resulting from Proposition 7.7 is
203a Algorithm for sensor speedup in the case of known dynamics 203a≡
function uh = stepid_kf(y, a, b, c, d, v, x0, p0)
[p, m] = size(d); n = size(a, 1);
if nargin == 6,
x0 = zeros(n + m, 1); p0 = 1e8 * eye(n + m);
end
model augmentation: B → Baut 203b
state estimation: (y, Baut ) → xaut = (x, ū) 203c
Defines:
stepid_kf, used in chunks 213 and 222a.
The obligatory inputs to the function stepid_kf are a T × p matrix y of uni-
formly sampled outputs (y(t) = y(t,:)’), parameters a, b, c, d of a state space
representation Bi/s/o (A, B, C, D) of the measurement process, and the output noise
variance v. The output uh is a T × m matrix of the sequence of parameter ū esti-
mates u(t) = uh(t,:)’. The first step B → Baut of the algorithm is implemented
as follows, see (∗).
203b model augmentation: B → Baut 203b≡ (203a)
a_aut = [a b; zeros(m, n) eye(m)]; c_aut = [c d];
Using the function tvkf_oe, which implements the time-varying Kalman filter
for an autonomous system with output noise, the second step
Note 7.8 (Comparison of other methods with stepid_kf) In the case of noisy
data (OE), stepid_kf is a statistically optimal estimator for the parameter ū.
Therefore, the performance of stepid_kf is an upper bound for the achievable
performance with the methods described in the next section.
Proposition 7.9 Let P ∈ R^{m×m} be a nonsingular matrix and define the mappings

    (P, B) ↦ B′,   by   B′ := { (Pu, y) | (u, y) ∈ B }

and

    (P, B′) ↦ B,   by   B := { (P⁻¹u′, y) | (u′, y) ∈ B′ }.
Then, under Assumption 7.4, we have

    (ūs, y) ∈ B ∈ L^{m+p}_{m,n} and dcgain(B) = G   ⇐⇒   (1m s, y) ∈ B′ ∈ L^{m+p}_{m,n} and dcgain(B′) = G′,        (∗)

where ū = P 1m and GP = G′.
Proof Obviously, B ∈ L^{m+p}_{m,n} and dcgain(B) = G implies B′ ∈ L^{m+p}_{m,n} and dcgain(B′) = GP =: G′.
Vice versa, if B′ ∈ L^{m+p}_{m,n} and dcgain(B′) = G′, then B ∈ L^{m+p}_{m,n} and dcgain(B) = P⁻¹G′ = G.
Note 7.10 The input value 1m in (∗) is arbitrary. The equivalence holds for any nonzero vector ū′, in which case ū = P ū′.
The importance of Proposition 7.9 stems from the fact that while in the left-hand side of (∗) the input ū is unknown and the gain G is known, in the right-hand side of (∗) the input 1m is known and the gain G′ is unknown. Therefore, for the case p = m, the standard identification problem of finding B′ ∈ L^{m+p}_{m,n} from the data (1m s, y) is equivalent to Problem 7.3, i.e., find ū ∈ R^m and B ∈ L^{m+p}_{m,n}, such that (ūs, y) ∈ B and dcgain(B) = G. (The p = m condition is required in order to ensure that the system GP = G′ has a unique solution P for any p × m full column rank matrices G and G′.)
Next, we present an algorithm for solving Problem 7.3 using Proposition 7.9.
205 Algorithm for sensor speedup based on reduction to step response system identification 205≡
function [uh, sysh] = stepid_si(y, g, n)
system identification: (1m s, y) → B̂′ 206a
ū := G⁻¹ G′ 1m, where G′ := dcgain(B̂′) 206b
Defines:
stepid_si, never used.
We have

    (ūs, y) ∈ B ∈ L^{m+p}_{m,n} and dcgain(B) = G   ⇐⇒   y ∈ Baut ∈ L^p_{0,n+1} and Baut has a pole at 0 in the continuous-time case or at 1 in the discrete-time case.        (∗)
{z1 , . . . , zn } := λ(B).
Using the prior knowledge about the dc gain of the system, we have
The significance of Proposition 7.13 is that Problem 7.3 can be solved equiva-
lently as an autonomous system identification problem with a fixed pole at 0 in the
continuous-time case or at 1 in the discrete-time case. The following proposition
shows how a preprocessing step makes possible standard methods for autonomous
system identification (or equivalently sum-of-damped exponential modeling) to be
used for identification of a system with a fixed pole at 0 or 1.
Proposition 7.14

    y ∈ Bi/s/o( [ A  b ; 0  0 ], [ C  d ] )   ⇐⇒   Δy := (d/dt) y ∈ ΔB := Bi/s/o(A, C).

The initial conditions (xini, ū) of Bi/s/o(Ae, Ce) and Δxini of Bi/s/o(A, C) are related by

    (I − A) xini = Δxini.
Once the model parameters A and C are determined via autonomous system
identification from Δy, the parameter of interest ū can be computed from the equa-
tion
Using the fact that the columns of the extended observability matrix OT (A, C) form
a basis for ΔB|[1,T ] , we obtain the following system of linear equations for the
estimation of ū:
    [ 1T ⊗ G   OT(A, C) ] col(ū, xini) = col( y(ts), . . . , y(T ts) ).        (SYS AUT)
Propositions 7.13 and 7.14, together with (SYS AUT), lead to the following al-
gorithm for solving Problem 7.3.
209a Algorithm for sensor speedup based on reduction to autonomous system identification 209a≡
function [uh, sysh] = stepid_as(y, g, n)
preprocessing by finite difference filter Δy := (1 − σ −1 )y 209b
autonomous system identification: Δy → ΔB 209c
computation of ū by solving (SYS AUT) 209d
Defines:
stepid_as, never used.
where
209b preprocessing by finite difference filter Δy := (1 − σ −1 )y 209b≡ (209a 210)
dy = diff(y);
and, using the function ident_aut
209c autonomous system identification: Δy → ΔB 209c≡ (209a)
sysh = ident_aut(dy, n);
Uses ident_aut 112a.
Notes 7.11 and 7.12 apply for stepid_as as well. Alternatives to the prediction
error method in the second step are methods for sum-of-damped exponential model-
ing (e.g., the Prony, Yule-Walker, or forward-backward linear prediction methods)
and approximate system realization methods (e.g., Kung’s method, implemented
in the function h2ss). Theoretical results and numerical studies justify different
methods as being effective for different noise assumptions. This allows us to pick
the “right” identification method, to be used in the algorithm for solving Problem 7.3
under additional assumptions or prior knowledge about the measurement noise.
Finally, the third step of stepid_as is implemented as follows.
209d computation of ū by solving (SYS AUT) 209d≡ (209a)
T = size(y, 1); [p, m] = size(g); yt = y’; O = sysh.c;
for t = 2:T
O = [O; O(end - p + 1:end, :) * sysh.a];
end
xe = [kron(ones(T, 1), g) O] \ yt(:); uh = xe(1:m);
Data-Driven Solution
A signal w is persistently exciting of order l if the Hankel matrix Hl(w) has full row rank. By Lemma 4.11, a persistently exciting signal of order l cannot be fitted exactly by a system in the model class L0,l. Persistency of excitation of the input is a necessary identifiability condition in exact system identification.
Assuming that Δy is persistently exciting of order n,
Since

    Bmpum(Δy) = span{ σ^τ Δy | τ ∈ R },

we have

    Bmpum(Δy)|[1,T−n] = span( HT−n(Δy) ).

Then, from (AUT), we obtain the system of linear equations for ū

    H col(ū, ℓ) := [ 1T−n ⊗ G   HT−n(Δy) ] col(ū, ℓ) = col( y((n + 1)ts), . . . , y(T ts) ),        (SYS DD)
which depends only on the given output data y and gain matrix G. The resulting
model-free algorithm is
210 Data-driven algorithm for sensor speedup 210≡
function uh = stepid_dd(y, g, n, ff)
if nargin == 3, ff = 1; end
preprocessing by finite difference filter Δy := (1 − σ −1 )y 209b
computation of ū by solving (SYS DD) 211
Defines:
stepid_dd, used in chunks 213 and 221.
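The body of the chunk ⟨computation of ū by solving (SYS DD) 211⟩ is not shown in this excerpt; the book's version uses the recursive least squares routine rls. A plausible batch least squares sketch, under one indexing convention for the Hankel matrix HT−n(Δy), is the following.

% Sketch: batch least squares solution of (SYS DD); dy = diff(y) as above.
T = size(y, 1); [p, m] = size(g);
Hdy = zeros((T - n) * p, n);                 % block Hankel matrix built from dy, n columns
for j = 1:n
    blk = dy(j:j + T - n - 1, :)';           % p x (T - n) block of shifted dy samples
    Hdy(:, j) = blk(:);
end
yt = y(n + 1:T, :)';                         % right-hand side y((n+1) ts), ..., y(T ts)
xe = [kron(ones(T - n, 1), g), Hdy] \ yt(:); % least squares solution of (SYS DD)
uh = xe(1:m);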
As proved next, stepid_dd computes the correct parameter value ū under a less restrictive condition than identifiability of ΔB from Δy, i.e., persistency of excitation of Δy of order n is not required.
Proof The derivation of (SYS DD) and the exact data assumption (∗) imply that there exists ℓ̄ ∈ R^n, such that (ū, ℓ̄) is a solution of (SYS DD). Our goal is to show that all solutions of (SYS DD) are of this form.
By Assumption 7.4, B is a stable system, so that 1 ∉ λ(B). It follows that for any ȳ ∈ R^p,

    col(ȳ, . . . , ȳ)  (T times)  ∉ Baut|[1,T] = ΔB|[1,T].
Therefore,
span(1T ⊗ G) ∩ Baut |[1,T ] = { 0 }.
By the assumption T ≥ 2n + m, the matrix H in (SYS DD) has at least as many
rows as columns. Then using the full row rank property of G (Assumption 7.4), it
follows that a solution ū of (SYS DD) is unique.
It follows from Note 7.16 that if the order n is not specified a priori but is es-
timated from the data by, e.g., computing the numerical rank of HT −n (Δy), the
system of equations (SYS DD) has a unique solution.
In rls, the first ⌈(n + m)/p⌉ − 1 samples are used for initialization, so that in order to match the index of uh with the actual discrete time when the estimate is computed, uh is padded with n additional rows.
Using the recursive least squares algorithm rls, the computational cost of
stepid_dd is O((m + n)2 p). Therefore, its computational complexity compares
favorably with the one of stepid_kf with precomputed filter gain (see Sect. 7.2).
The fact that Problem 7.3 can be solved with the same order of computations with
and without knowledge of the process dynamics is surprising and remarkable. We
consider this fact as our main result.
In the next section, the performance of stepid_dd and stepid_kf is com-
pared on test examples, where the data are generated according to the output error
noise model (OE).
Note 7.18 (Mixed least squares Hankel structured total least squares approximation method) In case of noisy data (OE), the ordinary least squares approximate solution of (SYS DD) is not maximum likelihood. The reason for this is that the matrix in the left-hand side of (SYS DD) depends on y, which is perturbed by the noise. A statistically better approach is to use the mixed least squares total least squares approximation method, which accounts for the fact that the block 1T ⊗ G in H is exact but the block HT−n(Δy) of H as well as the right-hand side of (SYS DD) are noisy. The mixed least squares total least squares method, however, requires the more expensive singular value decomposition of the matrix [H y] and is harder to implement as a recursive on-line method. In addition, although the mixed least squares total least squares approximation method improves on the standard least squares method, it is not maximum likelihood either, because it does not take into account the Hankel structure of the perturbations. A maximum-likelihood data-driven method requires an algorithm for structured total least squares.
In the simulations, we use the output error model (OE). The exact data y0 in the esti-
mation problems is a uniformly sampled output trajectory y0 = (y0 (ts ), . . . , y0 (T ts ))
of a continuous-time system B = Bi/s/o (A, B, C, D), obtained with input u0 and
initial condition xini .
212a Test sensor speedup 212a≡ 212b
initialize the random number generator 89e
sys = c2d(ss(A, B, C, D), ts); G = dcgain(sys);
[p, m] = size(G); n = size(sys, 'order');
y0 = lsim(sys, u0, [], xini);
Defines:
test_sensor, used in chunks 215–20.
According to (OE), the exact trajectory y0 is perturbed with additive noise ỹ, which is modeled as a zero mean, white, stationary, Gaussian process with standard deviation s.
212b Test sensor speedup 212a+≡ 212a 213a
y = y0 + randn(T, p) * s;
After the data y is simulated, the estimation methods stepid_kf and stepid_dd
are applied
213a Test sensor speedup 212a+≡ 212b 213b
uh_dd = stepid_dd(y, G, n, ff);
uh_kf = stepid_kf(y, sys.a, sys.b, sys.c, sys.d, ...
s^2 * eye(size(D, 1)));
Uses stepid_dd 210 and stepid_kf 203a.
and the corresponding estimates are plotted as functions of time, together with the
“naive estimator”
    û := G⁺ y,   where G⁺ = (G⊤G)⁻¹ G⊤.
    e = (1/N) ∑_{i=1..N} ‖ū − û(i)‖_1 ,   where ‖x‖_1 := ∑_{i=1..n} |xi| .
Dynamic Cooling
The first example is the temperature measurement problem from the introduction.
The heat transfer between the thermometer and its environment is governed by
Newton’s law of cooling, i.e., the change in the thermometer’s temperature y is
proportional to the difference between the thermometer’s temperature and the en-
vironment’s temperature ū. We assume that the heat capacity of the environment
is much larger than the heat capacity of the thermometer, so that the heat transfer
between the thermometer and the environment does not change the environment’s
temperature. Under this assumption, the dynamics of the measurement process is
given by the differential equation
    (d/dt) y = a (ūs − y),
where a is a positive constant that depends on the thermometer and the environment.
The differential equation defines a first order linear time-invariant dynamical system
B = Bi/s/o (−a, a, 1, 0) with input u = ūs.
214b cooling process 214b≡ (215 220)
A = -a; B = a; C = 1; D = 0;
The dc gain of the system is equal to 1, so that it can be assumed known, inde-
pendent of the process’s parameter a. This matches the setup of Problem 7.3, where
the dc gain is assumed a priori known but the process dynamics is not.
The average error for both stepid_kf and stepid_dd is zero (up to errors
incurred by the numerical computation). The purpose of showing simulation results
of an experiment with exact data is verification of the theoretical results stating that
the methods solve Problem 7.3.
In the case of output noise (OE), stepid_kf is a statistically optimal estimator, while stepid_dd, implemented with the (recursive) least squares approximation method, is not statistically optimal (see Note 7.18). In the next simulation example we show how far from optimal stepid_dd is in the dynamic cooling example with the simulation parameters given below.
Temperature-Pressure Measurement
Consider ideal gas in a closed container with a fixed volume. We measure the tem-
perature (as described in the previous section) by a slow but accurate thermometer,
and the pressure by fast but inaccurate pressure sensor. By Gay-Lussac’s law, the
temperature (measured in Kelvin) is proportional to the pressure, so by proper cal-
ibration, we can measure the temperature also with the pressure sensor. Since the
pressure sensor is much faster than the thermometer, we model it as a static system.
The measurement process in this example is a multivariable (one input, two outputs)
system Bi/s/o (A, B, C, D), where
    A = −a,   B = a,   C = [ 1 ; 0 ],   and   D = [ 0 ; 1 ].
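The corresponding model-definition chunk is not shown in the excerpt; in the style of the cooling process chunk 214b, it would plausibly read:

A = -a; B = a; C = [1; 0]; D = [0; 1];   % slow thermometer channel plus static pressure channel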
the beginning when the estimate of the slow but accurate sensor is still far off the
true value. This intuitive explanation is confirmed by the results of an experiment in
which only the pressure sensor is used.
Dynamic Weighing
The third example is the dynamic weighing problem. An object with mass M is
placed on a weighing platform with mass m that is modeled as a mass, spring,
damper system, see Fig. 7.1.
At the time of placing the object, the platform is in a specified (in general
nonzero) initial condition. The object placement has the effect of a step input as
well as a step change of the total mass of the system—platform and object. The goal
is to measure the object’s mass while the platform is still in vibration.
We choose the origin of the coordinate system at the equilibrium position of the
platform when there is no object placed on it with positive direction being upwards,
perpendicular to the ground. With y(t) being the platform’s position at time t, the
measurement process B is described by the differential equation
    (M + m) (d²/dt²) y = −k y − d (d/dt) y − M g,
where g is the gravitational constant
217b define the gravitational constant 217b≡ (218a)
g = 9.81;
k is the elasticity constant of the spring, and d is the damping constant of the damper.
Defining the state vector x = (y, (d/dt) y) and taking as an input u0 = Ms, we obtain
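The chunk ⟨weighting process 218a⟩ used by the simulations below is not shown in this excerpt. A plausible sketch, obtained directly from the differential equation above with state x = (y, dy/dt), is the following; with this parameterization the dc gain from u0 = Ms to y equals −g/k, independent of the unknown mass M, which is consistent with the assumption of a known gain and unknown dynamics.

A = [0, 1; -k / (M + m), -d / (M + m)];  % mass-spring-damper dynamics with total mass M + m
B = [0; -g / (M + m)];                   % the object enters as a constant force -M g
C = [1, 0]; D = 0;                       % the measured output is the platform position y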
Simulation 5: M = 1
218b Sensor speedup examples 214a+≡ 217a 219a
m = 1; M = 1; k = 1; d = 1; T = 12; weighting process 218a
ts = 1; s = 0.02; ff = 1; xini = 0.1 * [1; 1]; test_sensor
Uses test_sensor 212a.
Simulation 6: M = 10
219a Sensor speedup examples 214a+≡ 218b 219b
m = 1; M = 10; k = 1; d = 1; T = 15; weighting process 218a
ts = 1; s = 0.05; ff = 1; xini = 0.1 * [1; 1]; test_sensor
Uses test_sensor 212a.
Simulation 7: M = 100
219b Sensor speedup examples 214a+≡ 219a 220
m = 1; M = 100; k = 1; d = 1; T = 70; weighting process 218a
ts = 1; s = 0.5; ff = 1; xini = 0.1 * [1; 1]; test_sensor
Uses test_sensor 212a.
Time-Varying Parameter
Finally, we show an example with a time-varying measured parameter ū. The mea-
surement setup is the cooling process with the parameter a changing from 1 to 2 at
time t = 25. The performance of the estimates in the interval [1, 25] (estimation of
the initial value 1) was already observed in Simulations 1 and 2 for the exact and
noisy case, respectively. The performance of the estimates in the interval [26, 50] (estimation of the new value ū = 2) is a new characteristic, reflecting the adaptive properties of the algorithms.
The result shows that by choosing the forgetting factor f “properly” (f = 0.5 in Simulation 8), stepid_dd tracks the changing parameter value. In contrast, stepid_kf, which assumes a constant parameter value, is much slower in correcting the old parameter estimate û ≈ 1.
Currently the choice of the forgetting factor f is based on the heuristic rule that a “slowly” varying parameter requires a value of f “close” to 1 and a “quickly” changing parameter requires a value of f close to 0. A suitable value is fine tuned by trial and error.
Another possibility for making stepid_dd adaptive is to include windowing
of the data y by down-dating of the recursive least squares solution. In this case the
tunable parameter (similar to the forgetting factor) is the window length. Again there
is an obvious heuristic for choosing the window length but no systematic procedure.
Windowing and exponential weighting can be combined, resulting in a method with
two tunable parameters.
Real-Life Testing
The data-driven algorithms for input estimation are tested also on real-life data of
the “dynamic cooling” application. The experimental setup for the data collection is
based on the Lego NXT mindstorms digital signal processor and digital temperature
sensor, see Fig. 7.2.
Fig. 7.2 Experimental setup: Lego NXT mindstorms brick (left) and temperature sensor (right)
Fig. 7.3 Left: model fit to the data (solid blue—measured data, dashed red—model fit). Right:
parameter estimates (solid black—naive estimator, dashed blue—stepid_dd, dashed dotted
red—stepid_kf)
Then the data are modeled by removing the steady-state value and fitting an expo-
nential to the residual. The obtained model is a first order dynamical system, which
is used for the model based input estimation method.
222a Test sensor speedup methods on measured data 221+≡ 221 222b
yc = y - ub; f = yc(1:end - 1) \ yc(2:end);
yh = f .^ (0:(T - 1))’ * yc(1) + ub;
sys = ss(f, 1 - f, 1, 0, t(2) - t(1));
uh_kf = stepid_kf(y, sys.a, sys.b, sys.c, sys.d, s^2);
Uses stepid_kf 203a.
Finally, the estimation results for the naive, model-based, and data-driven methods are plotted for comparison.
222b Test sensor speedup methods on measured data 221+≡ 222a
figure(1), hold on, plot(t, y, 'b-', t, yh, 'r-')
axis([1 t(end) y(1) y(end)]), print_fig('lego-test-fit')
figure(2), hold on,
plot(t, abs(ub - y / G'), 'k-'),
plot(t, abs(ub - uh_dd), '-b'),
plot(t, abs(ub - uh_kf), '-.r'),
axis([t(10) t(end) 0 5]), print_fig('lego-test-est')
Uses print_fig 25a.
The function tvkf_oe implements the time-varying Kalman filter for the discrete-
time autonomous stochastic system, described by the state space representation
σ x = Ax, y = Cx + v,
223a Time-varying Kalman filter for autonomous output error model 222c+≡ 222c
x = zeros(size(a, 1), T); x(:, 1) = x0;
for t = 1:(T-1)
  k = (a * p * c') / (v + c * p * c');
  x(:, t + 1) = a * x(:, t) + k * (y(:, t) - c * x(:, t));
  p = a * p * a' - k * (v + c * p * c') * k';
end
The algorithm is initialized with the solution of the system formed by the first n
equations

    x(0) := (A_{1:n,:})⁻¹ b_{1:n},    P(0) := (A_{1:n,:}⊤ A_{1:n,:})⁻¹.
224 initialization 224≡ (223c)
ai = a(1:n, 1:n); x = zeros(n, m);
x(:, n) = ai \ b(1:n); p = inv(ai' * ai);
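Assembled from the recursion and the initialization above, a self-contained version of the filter could look as follows; the function name, signature, and output are assumptions of this sketch, not the actual interface of tvkf_oe.

% Sketch: time-varying Kalman filter for the autonomous output error model
% sigma x = A x, y = C x + v, started from a given x0 and state covariance p0.
function xh = tvkf_sketch(y, a, c, v, x0, p0)
  T = size(y, 2);
  xh = zeros(size(a, 1), T); xh(:, 1) = x0; p = p0;
  for t = 1:(T - 1)
    k = (a * p * c') / (v + c * p * c');                      % filter gain
    xh(:, t + 1) = a * xh(:, t) + k * (y(:, t) - c * xh(:, t));
    p = a * p * a' - k * (v + c * p * c') * k';               % covariance update
  end
end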
7.6 Notes
In metrology, the problem considered in this chapter is called dynamic measurement.
The methods proposed in the literature, see the survey (Eichstädt et al. 2010) and the
references therein, pose and solve the problem as a compensator design problem,
i.e., the input estimation problem is solved by:
1. designing off-line a dynamical system, called compensator, such that the series
connection of the measurement process with the compensator is an identity, and
2. processing on-line the measurements by the compensator.
Most authors aim at a linear time-invariant compensator and assume that a model
of the measurement process is given a priori. This is presumably because the linear
time-invariant assumption simplifies the compensator design and reduces the
computational cost of the on-line implementation, compared with alternative
nonlinear compensators. In the case of a known model,
step 1 of the dynamic measurement problem reduces to the classical problem of
designing an inverse system (Sain and Massey 1969). In the presence of noise, how-
ever, compensators that take into account the noise are needed. To the best of our
knowledge, there is no theoretically sound solution of the dynamic measurement
problem in the noisy case available in the literature although, as shown in Sect. 7.2,
the problem reduces to a state estimation problem for a suitably defined autonomous
linear time-invariant system. As a consequence, under standard assumptions about
the measurement and process noises, the maximum-likelihood solution is given by
the Kalman filter, designed for the autonomous system.
More flexible is the approach of Shu (1993), where the compensator is tuned
on-line by a parameter estimation algorithm. In this case, the compensator is a non-
linear system and an a priori given model of the process is no longer required. The
solutions proposed in Shu (1993) and Jafaripanah et al. (2005), however, are tai-
lored to the dynamic weighing problem, where the measurement process dynamics
is a specific second order system.
Compared with the existing results on the dynamic measurement problem in the
literature, the methods described in this chapter have the following advantages.
• The considered measurement process dynamics is a general linear multivariable
system. This is a significant generalization of the previously considered dynamic
measurement problems (single input single output, first and second order sys-
tems).
References
Eichstädt S, Elster C, Esward T, Hessling J (2010) Deconvolution filters for the analysis of dynamic
measurement processes: a tutorial. Metrologia 47:522–533
Friedland B (1969) Treatment of bias in recursive filtering. IEEE Trans Autom Control 14(4):359–
367
Jafaripanah M, Al-Hashimi B, White N (2005) Application of analog adaptive filters for dynamic
sensor compensation. IEEE Trans Instrum Meas 54:245–251
Kailath T, Sayed AH, Hassibi B (2000) Linear estimation. Prentice Hall, New York
Luenberger DG (1979) Introduction to dynamical systems: theory, models and applications. Wiley,
New York
Markovsky I (2010) Closed-loop data-driven simulation. Int J Control 83:2134–2139
Markovsky I, Rapisarda P (2008) Data-driven simulation and control. Int J Control 81(12):1946–
1959
Markovsky I, Willems JC, Van Huffel S, De Moor B (2006) Exact and approximate modeling of
linear systems: a behavioral approach. SIAM, Philadelphia
Sain M, Massey J (1969) Invertibility of linear time-invariant dynamical systems. IEEE Trans
Autom Control 14:141–149
Shu W (1993) Dynamic weighing under nonzero initial conditions. IEEE Trans Instrum Meas
42(4):806–811
Stoica P, Selén Y (2004) Model-order selection: a review of information criterion rules. IEEE
Signal Process Mag 21:36–47
Willems JC, Rapisarda P, Markovsky I, De Moor B (2005) A note on persistency of excitation.
Syst Control Lett 54(4):325–329
Willman W (1969) On the linear smoothing problem. IEEE Trans Autom Control 14(1):116–117
Appendix A
Approximate Solution of an Overdetermined
System of Equations
where the matrix B is modified as little as possible in the sense of minimizing the
correction size ‖B − B̂‖_F, so that the modified system of equations AX = B̂ is com-
patible. The classical least squares problem has an analytic solution: assuming that
the matrix A is full column rank, the unique least squares approximate solution is

    X̂ls = (A⊤A)⁻¹A⊤B   and   B̂ls = A(A⊤A)⁻¹A⊤B.
In the case when A is rank deficient, the solution is either nonunique or does not
exist. Such least squares problems are solved numerically by regularization tech-
niques, see, e.g., Björck (1996, Sect. 2.7).
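For illustration, assuming A has full column rank, the formulas above can be evaluated as follows (the data are made up for the example; the backslash operator, which uses a QR factorization, is numerically preferable to forming A⊤A explicitly).

% Least squares solution of A X ≈ B and the corresponding correction.
A = [1 0; 1 1; 1 2]; B = [1; 2; 2];    % example data (an assumption)
X_ls = A \ B;                          % equals inv(A' * A) * A' * B
B_ls = A * X_ls;                       % modified right-hand side
correction = norm(B - B_ls, 'fro');    % size of the correction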
There are many variations and generalizations of the least squares method for
solving approximately an overdetermined system of equations. Well known ones are
methods for recursive least squares approximation (Kailath et al. 2000, Sect. 2.6),
regularized least squares (Hansen 1997), linear and quadratically constrained least
squares problems (Golub and Van Loan 1996, Sect. 12.1).
Next, we list generalizations related to the class of the total least squares methods
because of their close connection to corresponding low rank approximation prob-
lems. Total least squares methods are low rank approximation methods using an
input/output representation of the rank constraint. In all these problems the basic
idea is to modify the given data as little as possible, so that the modified data define
a consistent system of equations. In the different methods, however, the correction is
done and its size is measured in different ways. This results in different properties of
the methods in a stochastic estimation setting and motivates the use of the methods
in different practical setups.
• The data least squares (Degroat and Dowling 1991) method is the “reverse” of the
least squares method in the sense that the matrix A is modified and the matrix B
is not:
    minimize   over Â and X   ‖A − Â‖_F   subject to ÂX = B.                  (DLS)
As in the least squares problem, the solution of the data least squares is com-
putable in closed form.
• The classical total least squares (Golub and Reinsch 1970; Golub 1973; Golub
and Van Loan 1980) method modifies symmetrically the matrices A and B:
    minimize   over Â, B̂, and X   ‖[A B] − [Â B̂]‖_F
    subject to ÂX = B̂.                                                        (TLS)
Conditions for existence and uniqueness of a total least squares approximate solu-
tion are given in terms of the singular value decomposition of the augmented data
matrix [A B]. In the generic case when a unique solution exists, that solution is
given in terms of the right singular vectors of [A B] corresponding to the smallest
singular values. In this case, the optimal total least squares approximation [Â B̂]
of the data matrix [A B] coincides with the Frobenius norm optimal low rank
approximation of [A B], i.e., in the generic case, the model obtained by the total
least squares method coincides with the model obtained by the unstructured low
rank approximation in the Frobenius norm.
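A minimal sketch of this computation for a single right-hand side, under the genericity assumption discussed above (the data are invented for the example):

% Total least squares solution of A x ≈ b from the SVD of [A b].
A = randn(50, 2); x0 = [1; -2];
b = A * x0 + 0.01 * randn(50, 1); A = A + 0.01 * randn(50, 2);
[~, ~, V] = svd([A b], 'econ');
v = V(:, end);                     % right singular vector for the smallest singular value
x_tls = -v(1:end - 1) / v(end);    % assumes v(end) is nonzero (generic case)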
and let B̂∗ = image(D̂∗) be the corresponding optimal linear static model. The
parameter X∗ of an input/output representation B̂∗ = B_i/o(X∗) of the optimal
model is a solution to the total least squares problem (TLS) with data matrices
A ∈ R^{m×N} and B ∈ R^{(q−m)×N}, defined by

    [A; B] = D ∈ R^{q×N}.
The theorem makes explicit the link between low rank approximation and total
least squares. From a data modeling point of view,
Here the matrices Wl and Wr are positive semidefinite weight matrices: Wl
corresponds to weighting of the rows and Wr to weighting of the columns of the
correction [ΔA ΔB]. Similarly to the classical total least squares problem, the
existence and uniqueness of a generalized total least squares approximate solution
is determined from the singular value decomposition. The data least squares and
total least squares problems are special cases of the generalized total least squares
problem.
• The restricted total least squares (Van Huffel and Zha 1991) method constrains
the correction to be in the form
    [ΔA ΔB] = Pe E Le,   for some E,
i.e., the row and column span of the correction matrix are constrained to be within
the given subspaces image(Pe ) and image(L e ), respectively. The restricted total
least squares problem is:
    minimize   over Â, B̂, E, and X   ‖E‖_F
    subject to [A B] − [Â B̂] = Pe E Le  and  ÂX = B̂.                         (RTLS)
Global and efficient solution methods for solving regularized total least squares
problems are derived in Beck and Ben-Tal (2006).
• The structured total least squares (De Moor 1993; Abatzoglou et al. 1991)
method is a total least squares method with the additional constraint that the
correction should have a certain specified structure:

    minimize   over Â, B̂, and X   ‖[A B] − [Â B̂]‖_F
    subject to ÂX = B̂  and  [Â B̂] has a specified structure.                 (STLS)
Hankel and Toeplitz structured total least squares problems are the most often
studied ones due to their application in signal processing and system theory.
• The structured total least norm method (Rosen et al. 1996) is the same as the
structured total least squares method with a general matrix norm in the approxi-
mation criterion instead of the Frobenius norm.
For generalizations and applications of the total least squares problem in the
periods 1990–1996, 1996–2001, and 2001–2006, see, respectively, the edited books
(Van Huffel 1997; Van Huffel and Lemmerling 2002), and the special issues (Van
Huffel et al. 2007a, 2007b). An overview of total least squares problems is given
in Van Huffel and Zha (1993), Markovsky and Van Huffel (2007b) and Markovsky
et al. (2010).
References
Abatzoglou T, Mendel J, Harada G (1991) The constrained total least squares technique and its
application to harmonic superresolution. IEEE Trans Signal Process 39:1070–1087
Beck A, Ben-Tal A (2006) On the solution of the Tikhonov regularization of the total least squares.
SIAM J Optim 17(1):98–118
Björck Å (1996) Numerical methods for least squares problems. SIAM, Philadelphia
De Moor B (1993) Structured total least squares and L2 approximation problems. Linear Algebra
Appl 188–189:163–207
Degroat R, Dowling E (1991) The data least squares problem and channel equalization. IEEE Trans
Signal Process 41:407–411
Fierro R, Golub G, Hansen P, O’Leary D (1997) Regularization by truncated total least squares.
SIAM J Sci Comput 18(1):1223–1241
Golub G (1973) Some modified matrix eigenvalue problems. SIAM Rev 15:318–344
Golub G, Reinsch C (1970) Singular value decomposition and least squares solutions. Numer Math
14:403–420
Golub G, Van Loan C (1980) An analysis of the total least squares problem. SIAM J Numer Anal
17:883–893
Golub G, Van Loan C (1996) Matrix computations, 3rd edn. Johns Hopkins University Press
Golub G, Hansen P, O’Leary D (1999) Tikhonov regularization and total least squares. SIAM J
Matrix Anal Appl 21(1):185–194
Hansen PC (1997) Rank-deficient and discrete ill-posed problems: numerical aspects of linear
inversion. SIAM, Philadelphia
Kailath T, Sayed AH, Hassibi B (2000) Linear estimation. Prentice Hall, New York
Manton J, Mahony R, Hua Y (2003) The geometry of weighted low-rank approximations. IEEE
Trans Signal Process 51(2):500–514
Markovsky I, Van Huffel S (2007a) Left vs right representations for solving weighted low rank
approximation problems. Linear Algebra Appl 422:540–552
Markovsky I, Van Huffel S (2007b) Overview of total least squares methods. Signal Process
87:2283–2302
Markovsky I, Rastello ML, Premoli A, Kukush A, Van Huffel S (2005) The element-wise weighted
total least squares problem. Comput Stat Data Anal 50(1):181–209
Markovsky I, Sima D, Van Huffel S (2010) Total least squares methods. Wiley Interdiscip Rev:
Comput Stat 2(2):212–217
Meyer C (2000) Matrix analysis and applied linear algebra. SIAM, Philadelphia
Paige C, Strakos Z (2005) Core problems in linear algebraic systems. SIAM J Matrix Anal Appl
27:861–875
Rosen J, Park H, Glick J (1996) Total least norm formulation and solution of structured problems.
SIAM J Matrix Anal Appl 17:110–126
Sima D (2006) Regularization techniques in model fitting and parameter estimation. PhD thesis,
ESAT, KU Leuven
Sima D, Van Huffel S, Golub G (2004) Regularized total least squares based on quadratic eigen-
value problem solvers. BIT 44:793–812
Strang G (1976) Linear algebra and its applications. Academic Press, San Diego
Trefethen L, Bau D (1997) Numerical linear algebra. SIAM, Philadelphia
Van Huffel S (ed) (1997) Recent advances in total least squares techniques and errors-in-variables
modeling. SIAM, Philadelphia
Van Huffel S, Lemmerling P (eds) (2002) Total least squares and errors-in-variables modeling:
analysis, algorithms and applications. Kluwer, Amsterdam
Van Huffel S, Vandewalle J (1988) Analysis and solution of the nongeneric total least squares
problem. SIAM J Matrix Anal Appl 9:360–372
Van Huffel S, Vandewalle J (1989) Analysis and properties of the generalized total least squares
problem AX ≈ B when some or all columns in A are subject to error. SIAM J Matrix Anal
10(3):294–315
Van Huffel S, Vandewalle J (1991) The total least squares problem: computational aspects and
analysis. SIAM, Philadelphia
Van Huffel S, Zha H (1991) The restricted total least squares problem: formulation, algorithm and
properties. SIAM J Matrix Anal Appl 12(2):292–309
Van Huffel S, Zha H (1993) The total least squares problem. In: Rao C (ed) Handbook of statistics:
comput stat, vol 9. Elsevier, Amsterdam, pp 377–408
Van Huffel S, Cheng CL, Mastronardi N, Paige C, Kukush A (2007a) Editorial: Total least squares
and errors-in-variables modeling. Comput Stat Data Anal 52:1076–1079
Van Huffel S, Markovsky I, Vaccaro RJ, Söderström T (2007b) Guest editorial: Total least squares
and errors-in-variables modeling. Signal Process 87:2281–2282
Wentzell P, Andrews D, Hamilton D, Faber K, Kowalski B (1997) Maximum likelihood principal
component analysis. J Chemometr 11:339–366
Appendix B
Proofs
. . . the ideas and the arguments with which the mathematician is concerned have physical,
intuitive or geometrical reality long before they are recorded in the symbolism.
The proof is meaningful when it answers the student's doubts, when it proves what is not
obvious. Intuition may fly the student to a conclusion but where doubt remains he may then
be asked to call upon plodding logic to show the overland route to the same goal.
Kline (1974)
subject to rank(D̂) ≤ m                                                        (LRA)

and let

    D̂∗ := U∗ Σ∗ (V∗)⊤

be a singular value decomposition of D̂∗. By the unitary invariance of the Frobenius
norm, we have

    ‖D − D̂∗‖_F = ‖(U∗)⊤(D − D̂∗)V∗‖_F = ‖(U∗)⊤ D V∗ − Σ∗‖_F,

which shows that Σ∗ is an optimal approximation of D̃ := (U∗)⊤ D V∗. Partition

    D̃ =: [D̃11 D̃12; D̃21 D̃22]

conformably with Σ∗ = [Σ1∗ 0; 0 0]. By the optimality of Σ∗, D̃12 = 0 and, similarly,
D̃21 = 0. Observe also that

    rank([D̃11 0; 0 0]) ≤ m  and  D̃11 ≠ Σ1∗
        ⟹  ‖D̃ − [D̃11 0; 0 0]‖_F < ‖D̃ − [Σ1∗ 0; 0 0]‖_F,

so that D̃11 = Σ1∗. Therefore,

    D̃ = [Σ1∗ 0; 0 D̃22].

Let D̃22 = U22 Σ22 (V22)⊤ be a singular value decomposition of D̃22. Therefore,

    D = U∗ [I 0; 0 U22] [Σ1∗ 0; 0 Σ22] [I 0; 0 V22]⊤ (V∗)⊤

is a singular value decomposition of D. Then, if σm > σm+1, the rank-m truncated
singular value decomposition

    D̂∗ = U∗ [Σ1∗ 0; 0 0] (V∗)⊤ = U∗ [I 0; 0 U22] [Σ1∗ 0; 0 0] [I 0; 0 V22]⊤ (V∗)⊤
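The optimality of the truncated singular value decomposition asserted above can be checked numerically, for instance by comparing its error with the error of randomly generated rank-m approximations (a sketch with made-up data):

% Numerical check: the truncated SVD is the best rank-m approximation.
D = randn(8, 6); m = 2;
[U, S, V] = svd(D);
Dm = U(:, 1:m) * S(1:m, 1:m) * V(:, 1:m)';      % rank-m truncated SVD
err_svd = norm(D - Dm, 'fro');
err_rnd = inf;
for k = 1:100                                    % random rank-m competitors
  P = randn(8, m); L = P \ D;                    % optimal L for this P
  err_rnd = min(err_rnd, norm(D - P * L, 'fro'));
end
disp([err_svd err_rnd])                          % err_svd is never larger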
where “const” is a term that does not depend on D̂ and B̂. The log-likelihood func-
tion is equal to

    const − (1/(2σ²)) ‖vec(D) − vec(D̂)‖²_{V⁻¹},   if image(D̂) ⊂ B̂ and dim(B̂) ≤ m,

and to −∞ otherwise, and the maximum likelihood estimation problem is

    minimize   over B̂ and D̂   (1/(2σ²)) ‖vec(D) − vec(D̂)‖²_{V⁻¹}
    subject to image(D̂) ⊂ B̂ and dim(B̂) ≤ m,
Note B.1 (Weight matrix in the norm specification) The weight matrix W in the
norm specification is the inverse of the measurement noise covariance matrix V. In
the case of a singular covariance matrix (e.g., missing data), the method needs
modification.
The polynomial equations (GCD) are equivalent to the following systems of alge-
braic equations

    col(p̂0, p̂1, . . . , p̂n) = T_{d+1}(u) col(c0, c1, . . . , cd),
    col(q̂0, q̂1, . . . , q̂n) = T_{d+1}(v) col(c0, c1, . . . , cd),

and c is a common divisor of p̂ and q̂ with degree(c) ≤ d if and only if the system
of equations

    [p̂0 q̂0; p̂1 q̂1; . . . ; p̂n q̂n] = T_{n−d+1}(c) [u0 v0; u1 v1; . . . ; u_{n−d} v_{n−d}]

has a solution.
The condition degree(c) = d implies that the highest power coefficient cd of c
is different from 0. Since c is determined up to a scaling factor, we can impose the
normalization cd = 1. Conversely, imposing the constraint cd = 1 in the optimiza-
tion problem to be solved ensures that degree(c) = d. Therefore, Problem 3.13 is
equivalent to
Substituting [p̂ q̂] in the cost function and minimizing with respect to [u v] by
solving a least squares problem gives the equivalent problem (AGCD’).
only if the union of the first order optimality conditions for the problems on steps 1
and 2 is satisfied. Then

    P^(k−1) = P^(k) =: P∗   and   L^(k−1) = L^(k) =: L∗.
From the above conditions for a stationary point and the Lagrangians of the prob-
lems of steps 1 and 2 and (C0 –C5 ), it is easy to see that the union of the first order
optimality conditions for the problems on steps 1 and 2 coincides with the first order
optimality conditions of (C0 –C5 ).
References
Kiers H (2002) Setting up alternating least squares and iterative majorization algorithms for solving
various matrix optimization problems. Comput Stat Data Anal 41:157–170
Kline M (1974) Why Johnny can’t add: the failure of the new math. Random House Inc
Vanluyten B, Willems JC, De Moor B (2006) Matrix factorization and stochastic state representa-
tions. In: Proc 45th IEEE conf on dec and control, San Diego, California, pp 4188–4193
Appendix P
Problems
Problem P.1 (Least squares data fitting) Verify that the least squares fits, shown in
Fig. 1.1 on p. 4, minimize the sums of squares of horizontal and vertical distances.
The data points are:
    d1 = col(−2, 1),  d2 = col(−1, 4),  d3 = col(0, 6),  d4 = col(1, 4),  d5 = col(2, 1),
    d6 = col(2, −1),  d7 = col(1, −4),  d8 = col(0, −6),  d9 = col(−1, −4),  d10 = col(−2, −1).
Problem P.2 (Distance from a data point to a linear model) The 2-norm distance
from a point d ∈ Rq to a linear static model B ⊂ Rq is defined as
    dist(d, B) := min_{d̂ ∈ B} ‖d − d̂‖_2,                                      (dist)

i.e., dist(d, B) is the shortest distance from d to a point d̂ in B. A vector d̂∗ that
achieves the minimum of (dist) is a point in B that is closest to d.
Next we consider the special case when B is a linear static model.
1. Let
B = image(a) = {αa | α ∈ R}.
Explain how to find dist(d, image(a)). Find
    dist(col(1, 0), image(col(1, 1))).
Note that the best approximation d∗ of d in image(a) is the orthogonal projection
of d onto image(a).
2. Let B = image(P ), where P is a given full column rank matrix. Explain how to
find dist(d, B).
3. Let B = ker(R), where R is a given full row rank matrix. Explain how to find
dist(d, B).
4. Prove that in the linear static case, a solution d̂∗ of (dist) is always unique.
5. Prove that in the linear static case, the approximation error Δd∗ := d − d̂∗ is
orthogonal to B. Is the converse true, i.e., is it true that if for some d̂, d − d̂ is
orthogonal to B, then d̂ = d̂∗?
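A numerical illustration of parts 2 and 3 (a sketch; the data and names are chosen for the example): the distance is obtained from an orthogonal projection.

% Distance from d to B = image(P) and to B = ker(R) via projections.
d = [1; 0];
P = [1; 1];                                      % B = image(P), full column rank
dhat = P * ((P' * P) \ (P' * d));                % projection of d onto image(P)
dist_image = norm(d - dhat);
R = [1 1];                                       % B = ker(R), full row rank
dist_kernel = norm(R' * ((R * R') \ (R * d)));   % norm of the projection onto image(R')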
Problem P.3 (Distance from a data point to an affine model) Consider again the
distance dist(d, B) defined in (dist). In this problem, B is an affine static model,
i.e.,
B = B + a,
where B is a linear static model and a is a fixed vector.
1. Explain how to reduce the problem of computing the distance from a point to an
affine static model to an equivalent problem of computing the distance from a
point to a linear static model (Problem P.2).
2. Find

    dist(col(0, 0), ker([1 1]) + col(1, 2)).
Problem P.4 (Geometric interpretation of the total least squares problem) Show
that the total least squares problem
    minimize   over x ∈ R, â ∈ R^N, and b̂ ∈ R^N   ∑_{j=1}^N ‖dj − col(âj, b̂j)‖²_2        (tls)
    subject to âj x = b̂j,  for j = 1, . . . , N
minimizes the sum of the squared orthogonal distances from the data points
d1 , . . . , dN to the fitting line
    B = {col(a, b) | xa = b}
over all lines passing through the origin, except for the vertical line.
Problem P.5 (Unconstrained problem, equivalent to the total least squares problem)
A total least squares approximate solution xtls of the linear system of equations
Ax ≈ b is a solution to the following optimization problem:

    minimize   over x, Â, and b̂   ‖[A b] − [Â b̂]‖²_F   subject to Âx = b̂.    (TLS)

Show that (TLS) is equivalent to the unconstrained problem

    minimize   ftls(x),   where   ftls(x) := ‖Ax − b‖²_2 / (‖x‖²_2 + 1).      (TLS’)
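The equivalence can be checked numerically, for example by minimizing ftls with a general-purpose optimizer and comparing the result with the SVD-based total least squares solution (a sketch with made-up scalar data):

% Compare the minimizer of ftls with the SVD-based solution of a x ≈ b.
A = randn(30, 1); b = 2 * A + 0.1 * randn(30, 1);
ftls = @(x) norm(A * x - b)^2 / (x^2 + 1);       % (TLS') for scalar x
x_unc = fminsearch(ftls, 0);                     % unconstrained minimizer
[~, ~, V] = svd([A b]); x_svd = -V(1, end) / V(2, end);
disp([x_unc x_svd])                              % the two values should agree closely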
Problem P.6 (Lack of total least squares solution) Using the formulation (TLS’),
derived in Problem P.5, show that the total least squares line fitting problem (tls) has
no solution for the data in Problem P.1.
Problem P.9 (Line fitting by rank-1 approximation) Plot the cost function flra (P )
for the data in Problem P.1 over all P such that P⊤P = 1. Find from the graph
of flra the minimum points. Using the link between (lraP) and (lraP), established
in Problem P.7, interpret the minimum points of flra in terms of the line fitting prob-
lem for the data in Problem P.1. Compare and contrast with the total least squares
approach, used in Problem P.6.
Problem P.10 (Analytic solution of a rank-1 approximation problem) Show that for
the data in Problem P.1,
    flra(P) = P⊤ [140 0; 0 20] P.

Using geometric or analytic arguments, conclude that the minimum of flra for a P
on the unit circle is 20 and is achieved for

    P∗,1 = col(0, 1)   and   P∗,2 = col(0, −1).
Compare the results with those obtained in Problem P.9.
Problem P.12 (Analytic solution of scalar total least squares) Find an analytic ex-
pression for the total least squares solution of the system ax ≈ b, where a, b ∈ Rm .
Problem P.14 (Two-sided weighted low rank approximation) Prove Theorem 2.29
on p. 65.
Problem P.15 (Most powerful unfalsified model for autonomous models) Given a
trajectory
y = y(1), y(2), . . . , y(T )
of an autonomous linear time-invariant system B of order n, find a state space
representation Bi/s/o (A, C) of B. Modify your procedure, so that it does not require
prior knowledge of the system order n but only an upper bound nmax for it.
Problem P.16 (Algorithm for exact system identification) Develop an algorithm for
exact system identification that computes a kernel representation of the model, i.e.,
implement the mapping
wd → R(z), where B := ker R(z) is the identified model.
Consider separately the cases of known and unknown model order. You can assume
that the system is single input single output and its order is known.
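A minimal sketch of such a mapping for the single input single output case with known order n, assuming exact data and persistency of excitation (the function name and the Hankel matrix construction are assumptions of the sketch, not the book's implementation):

% Sketch: kernel representation R = [R0 R1 ... Rn] from an exact trajectory.
% wd is a T x 2 matrix with rows [ud(t) yd(t)].
function R = ident_kernel_sketch(wd, n)
  [T, q] = size(wd);                  % q = 2 for a single input single output system
  L = n + 1;                          % number of block rows
  H = zeros(q * L, T - L + 1);        % block Hankel matrix of the data
  for i = 1:L
    H((i - 1) * q + (1:q), :) = wd(i:i + T - L, :)';
  end
  r = null(H')';                      % basis of the left null space of the Hankel matrix
  R = r(1, :);                        % one kernel parameter, blocks R0, R1, ..., Rn
end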
* Problem P.18 (When is Bmpum (wd ) equal to the data generating system?)
Choose a (random) linear time-invariant system B0 (the “true data generating sys-
tem”) and a trajectory wd = (ud , yd ) of B0 . The aim is to recover the data generating
system B0 back from the data wd . Conjecture that this can be done by computing
the most powerful unfalsified model Bmpum (wd ). Verify whether and when in sim-
ulation Bmpum (wd ) coincides with B0 . Find counterexamples when the conjecture
is not true and, based on this experience, revise the conjecture. Find sufficient condi-
tions for Bmpum (wd ) = B0 .
Problem P.20 (Computing approximate common divisor with slra) Given poly-
nomials p and q of degree n or less and an integer d < n, use slra to solve the
Sylvester structured low rank approximation problem
    minimize   over p̂, q̂ ∈ R^{n+1}   ‖[p q] − [p̂ q̂]‖_F
    subject to rank(R_d(p̂, q̂)) ≤ 2n − 2d + 1
in order to compute an approximate common divisor c of p and q with degree at
least d. Verify the answer with the alternative approach developed in Sect. 3.2.
Problem P.23 (Nonnegative low rank approximation) Implement and test the algo-
rithm for nonnegative low rank approximation (Algorithm 7 on p. 177).
Problem P.24 (Luenberger 1979, p. 53) A thermometer reading 21°C, which has
been inside a house for a long time, is taken outside. After one minute the thermome-
ter reads 15°C; after two minutes it reads 11°C. What is the outside temperature?
(According to Newton’s law of cooling, an object of higher temperature than its
environment cools at a rate that is proportional to the difference in temperature.)
Problem P.25 Solve first Problem P.24. Consider the system of equations
ū
1T −n ⊗ G HT −n (Δy) = col y (n + 1)ts , · · · , y T ts , (SYS DD)
(the data-driven algorithm for input estimation on p. 210) in the case of a first order
single input single output system and three data points. Show that the solution of the
system (SYS DD) coincides with the one obtained in Problem P.24.
Problem P.26 Consider the system of equations (SYS DD) in the case of a first
order single input single output system and N data points. Derive an explicit for-
mula for the least squares approximate solution of (SYS DD). Propose a recursive
algorithm that updates the current solution when a new data point is obtained.
Problem P.27 Solve first Problem P.26. Implement the solution obtained in Prob-
lem P.26 and validate it against the function stepid_dd.
References
De Moor B (1999) DaISy: database for the identification of systems. www.esat.kuleuven.be/sista/
daisy/
Luenberger DG (1979) Introduction to dynamical systems: theory, models and applications. Wiley,
New York
Notation
Symbolism can serve three purposes. It can communicate ideas effectively; it can conceal
ideas; and it can conceal the absence of ideas.
M. Kline, Why Johnny Can’t Add: The Failure of the New Math
Sets of numbers
R the set of real numbers
Z, Z+ the sets of integers and positive integers (natural numbers)
Matrix operations
A+, A⊤ pseudoinverse, transpose
vec(A) column-wise vectorization
vec⁻¹ the operator reconstructing the matrix A back from vec(A)
col(a, b) the column vector obtained by stacking a over b
col dim(A) the number of block columns of A
row dim(A) the number of block rows of A
image(A) the span of the columns of A (the image or range of A)
ker(A) the null space of A (kernel of the function defined by A)
diag(v), v ∈ Rn the diagonal matrix diag(v1 , . . . , vn )
⊗ Kronecker product A ⊗ B := [aij B]
⊙ element-wise (Hadamard) product A ⊙ B := [aij bij ]
Fixed symbols
B, M model, model class
S structure specification
Hi (w) Hankel matrix with i block rows, see (Hi ) on p. 10
Ti (c) upper triangular Toeplitz matrix with i block rows, see (T ) on p. 86
R(p, q) Sylvester matrix for the pair of polynomials p and q, see (R) on p. 11
Oi (A, C) extended observability matrix with i block-rows, see (O) on p. 51
Ci (A, B) extended controllability matrix with i block-columns, see (C ) on p. 51
L^{q,n}_{m,l} := {B ⊂ (R^q)^Z | B is linear time-invariant with
m(B) ≤ m, l(B) ≤ l, and n(B) ≤ n}
Miscellaneous
:= / =: left (right) hand side is defined by the right (left) hand side
: ⇐⇒ left-hand side is defined by the right-hand side
⇐⇒ : right-hand side is defined by the left-hand side
σ^τ the shift operator (σ^τ f)(t) = f(t + τ)
i imaginary unit
δ Kronecker delta, δ0 = 1 and δt = 0 for all t ≠ 0
1n := col(1, . . . , 1) vector with n elements that are all ones
W ≻ 0 W is positive definite
⌈a⌉ rounding to the nearest integer greater than or equal to a
With some abuse of notation, the discrete-time signal, vector, and polynomial
w(1), . . . , w(T) ↔ col(w(1), . . . , w(T)) ↔ z w(1) + · · · + z^T w(T)
are all denoted by w. The intended meaning is understood from the context.
List of Code Chunks
(S0, S, p̂) → D̂ = S(p̂) 83a
(S0, S, p) → D = S(p) 82
ū := G⁻¹ G 1m, where G := dcgain(B) 206b
H → B 109d
p → H 109c
S → S 83d
S → (m, n, np) 83b
S → S 83c
π → Π 37a
P → R 89c
P → (TF) 88d
R → (TF) 89b
R → P 89a
2-norm optimal approximate realization 108b
(R, Π) → X 42a
(X, Π) → P 40a
(X, Π) → R 39
dist(D, B) (weighted low rank approximation) 142a
dist(wd, B) 87b
(Γ, Δ) → (A, B, C) 74a
Θ → RΘ 193b
H → Bi/s/o(A, B, C, D) 76d
P → R 40c
R → minimal R 41
R → P 40b
R → Π 43a
w → H 80b
(TF) → P 88e
(TF) → R 89d
Algorithm for sensor speedup based on reduction to autonomous system identification 209a
Algorithm for sensor speedup based on reduction to step response system identification 205
Algorithm for sensor speedup in the case of known dynamics 203a
alternating projections method 140b
approximate realization structure 109a
autonomous system identification: Δy → ΔB 209c
Bias corrected low rank approximation 188a
bisection on γ 99b
call cls1-4 163b
call optimization solver 85b
check n < min(pi − 1, mj) 75b
check exit condition 141b
Compare h2ss and h2ss_opt 110c
Compare ident_siso and ident_eiv 90b
Compare w2h2ss and ident_eiv 116d
Complex least squares, solution by (SOL1 x, SOL1 φ) 160a
Complex least squares, solution by Algorithm 5 161
Here is a list of the defined functions and where they appear. Underlined entries
indicate the place of definition. This index is generated automatically by noweb.
Index

D
Data clustering, 28
Data fusion, 216
Data modeling
  behavioral paradigm, 1
  classical paradigm, 1
Data-driven methods, 78, 200
Dead-beat observer, 203
Deterministic identification, 10
Dimensionality reduction, vi
Diophantine equation, 124
Direction of arrival, 11, 28
Distance
  algebraic, 56
  geometric, 55
  horizontal, 3
  orthogonal, 4
  problem, 28
  to uncontrollability, 90, 122
  vertical, 3
Dynamic
  measurement, 224
  weighing, 199, 217

E
Eckart–Young–Mirsky theorem, 23
Epipolar constraint, 20
Errors-in-variables, 29, 57, 115
ESPRIT, 73
Exact identification, 9
Exact model, 54
Expectation maximization, 24
Explicit representation, 180

H
Hadamard product, 60
Halmos, P., 14
Hankel matrix, 10
Hankel structured low rank approximation, see low rank approximation
Harmonic retrieval, 112
Hermite polynomials, 188
Horizontal distance, 3

I
Identifiability, 54
Identification, 27, 126
  autonomous system, 111
  errors-in-variables, 115
  finite impulse response, 119
  output error, 116
  output only, 111
Ill-posed problem, 2
Image mining, 176
Image representation, 2
image(P), 35
Implicialization problem, 197
Implicit representation, 180
Infinite Hankel matrix, 8
Information retrieval, vi
Input/output partition, 1
Intercept, 149
Inverse system, 224

K
Kalman filter, 203, 222
Kalman smoothing, 87