Optimization for Learning and Control
Anders Hansson
Linköping University
Linköping
Sweden
Martin Andersen
Technical University of Denmark
Kongens Lyngby
Denmark
Copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section
107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or
authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc.,
222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com.
Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons,
Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/
go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or
its affiliates in the United States and other countries and may not be used without written permission. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product
or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing
this book, they make no representations or warranties with respect to the accuracy or completeness of the contents
of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose.
No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies
contained herein may not be suitable for your situation. You should consult with a professional where appropriate.
Further, readers should be aware that websites listed in this work may have changed or disappeared between when
this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or
any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer
Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317)
572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Contents
Preface xvii
Acknowledgments xix
Glossary xxi
Acronyms xxv
About the Companion Website xxvii
1 Introduction 3
1.1 Optimization 3
1.2 Unsupervised Learning 3
1.3 Supervised Learning 4
1.4 System Identification 4
1.5 Control 5
1.6 Reinforcement Learning 5
1.7 Outline 5
2 Linear Algebra 7
2.1 Vectors and Matrices 7
2.2 Linear Maps and Subspaces 10
2.2.1 Four Fundamental Subspaces 10
2.2.2 Square Matrices 12
2.2.3 Affine Sets 13
2.3 Norms 13
2.4 Algorithm Complexity 15
2.5 Matrices with Structure 16
2.5.1 Diagonal Matrices 16
2.5.2 Orthogonal Matrices 17
2.5.3 Triangular Matrices 18
2.5.4 Symmetric and Skew-Symmetric Matrices 19
2.5.5 Toeplitz and Hankel Matrices 19
2.5.6 Sparse Matrices 20
3 Probability Theory 40
3.1 Probability Spaces 40
3.1.1 Probability Measure 41
3.1.2 Probability Function 41
3.1.3 Probability Density Function 42
3.2 Conditional Probability 42
3.3 Independence 44
3.4 Random Variables 44
3.4.1 Vector-Valued Random Variable 45
3.4.2 Marginal Distribution 45
3.4.3 Independence of Random Variables 46
3.4.4 Function of Random Variable 46
3.5 Conditional Distributions 47
3.5.1 Conditional Probability Function 47
3.5.2 Conditional Probability Density Function 47
3.6 Expectations 48
3.6.1 Moments 49
3.6.2 Expected Value of Function of Random Variable 49
3.6.3 Covariance 50
3.7 Conditional Expectations 50
3.8 Convergence of Random Variables 51
3.9 Random Processes 51
3.10 Markov Processes 53
Part II Optimization 61
4 Optimization Theory 63
4.1 Basic Concepts and Terminology 63
4.1.1 Optimization Problems 64
4.1.2 Equivalent Problems 65
4.2 Convex Sets 66
4.2.1 Convexity-Preserving Operations 67
4.2.2 Examples of Convex Sets 68
4.2.3 Generalized Inequalities 71
4.3 Convex Functions 72
4.3.1 First- and Second-Order Conditions for Convexity 73
4.3.2 Convexity-Preserving Operations 75
4.3.3 Examples of Convex Functions 78
4.3.4 Conjugation 78
4.3.5 Dual Norms 79
4.4 Subdifferentiability 80
4.4.1 Subdifferential Calculus 82
4.5 Convex Optimization Problems 84
4.5.1 Optimality Condition 84
4.5.2 Equality Constrained Convex Problems 85
4.6 Duality 86
4.6.1 Lagrangian Duality 86
4.6.2 Lagrange Dual Problem 87
4.6.3 Fenchel Duality 88
4.7 Optimality Conditions 90
4.7.1 Convex Optimization Problems 90
4.7.2 Nonconvex Optimization Problems 91
Exercises 91
5 Optimization Problems 94
5.1 Least-Squares Problems 94
5.2 Quadratic Programs 96
5.3 Conic Optimization 97
5.3.1 Conic Duality 99
5.3.2 Epigraphical Cones 100
5.4 Rank Optimization 103
5.5 Partial Separability 106
5.5.1 Minimization of Partially Separable Functions 106
5.5.2 Principle of Optimality 108
Appendix A 373
A.1 Notation and Basic Definitions 373
A.2 Software 374
A.2.1 Modeling Software 374
References 379
Index 387
Preface
This is a book about optimization for learning and control. The literature and the techniques for
learning are vast, and we do not attempt to cover all possible learning methods here. Instead, we
discuss a selection of them, especially the ones that result in optimization problems. We also discuss
which optimization methods are relevant for these optimization problems. The book is primarily
intended for graduate students with a background in science or engineering who want to learn
more about what optimization methods are suitable for learning problems. It is also useful for
those who want to study optimal control. Very limited background knowledge of optimization,
control, or learning is needed. The book is accompanied by a large number of exercises, many of
which involve computer tools so that students can obtain hands-on experience.
The topics in learning span a wide range from classical statistical learning problems like
regression and maximum likelihood estimation to more recent problems like deep learning using,
e.g. recurrent neural networks. Regarding optimization methods, we cover methods from simple
gradient methods to more advanced interior-point methods for conic optimization. A special
emphasis is on stochastic methods applicable to the training of neural networks. We also put a
special emphasis on nondifferentiable problems for which we discuss subgradient methods and
proximal methods. We cover second-order methods, variable-metric methods, and augmented
Lagrangian methods. Regarding applications to system identification, we discuss identification
both for input–output models as well as for state-space models. Recurrent neural networks
and temporal convolutional networks are naturally introduced as ways of modeling nonlinear
dynamical systems. We also cover calculus of variations and dynamic programming in detail, as
well as the generalization of the latter to reinforcement learning.
The book can be used to teach several different courses. One could be an introductory
course in optimization based on Chapters 4–6. Another course could be on optimal control
covering Chapters 7–8, and possibly also Chapter 11. Another course could be on learning
covering Chapters 9–10 and perhaps Chapter 12. There is of course also the possibility to combine
more chapters, and a course that has been taught at Linköping University for PhD students covers
all but the material for optimal control.
Acknowledgments
We would like to thank Andrea Garulli at University of Siena who invited Anders Hansson to
give a course on optimization for learning in the spring of 2019. The experience from teaching
that course provided most valuable inspiration for writing this book. Daniel Cederberg, Markus
Fritzsche, and Magnus Malmström are gratefully acknowledged for having proofread some of the
chapters.
Glossary
Sets
ℕ set of natural numbers
ℕk set {1, 2, … , k}
ℤ set of integers
ℤk set {0, 1, … , k}
ℤ+ set of nonnegative integer numbers
ℝ set of real numbers
ℝ+ set of nonnegative real numbers
ℝ++ set of positive real numbers
ℝ̄ set of extended real numbers
ℝ̄+ set of nonnegative extended real numbers
ℝ̄++ set of positive extended real numbers
ℂ set of complex numbers
𝕊n set of symmetric real-valued matrices of order n
𝕊n+ set of positive semidefinite real-valued matrices of order n
𝕊n++ set of positive definite real-valued matrices of order n
ℚn quadratic cone of dimension n
Δn probability simplex of dimension n − 1
∅ empty set
Elementary Functions
exp natural exponential function
log logarithm function
ln natural logarithm function
log2 logarithm function, base 2
Probability Spaces
𝔼 expectation functional
𝒩 normal probability density function
ℙ probability measure
Var variance functional
Symbols
∑ summation
∏ product
∫ integral
∮ contour integral
∞ infinity
∈ belongs to
∉ does not belong to
⊂ proper subset of
⊆ subset of
⊃ proper superset of
⊇ superset of
⊄ not proper subset of
⊈ not subset of
⊅ not proper superset of
⊉ not superset of
∪ set union
∩ set intersection
∖ set difference
+ plus
− minus
± plus or minus
× multiplied by
⊗ Kronecker product
⚬ Hadamard product or composition of functions
= is equal to
< is less than
≤ is less than or equal to
> is greater than
≥ is greater than or equal to
≠ is not equal to
≮ is not less than
≰ is neither less than nor equal to
≯ is not greater than
≱ is neither greater than nor equal to
≪ much smaller than
≫ much greater than
≈ is approximately equal to
∼ asymptotically equivalent to
∝ proportional to
≺ precedes
⪯ precedes or equals
≻ succeeds
⪰ succeeds or equals
⊀ does not precede
⋠ neither precedes nor equals
⊁ does not succeed
⋡ neither succeeds nor equals
∃ there exists
∄ there is no
∀ for all
¬ logical not
∧ logical and
∨ logical or
⟹ implies
⟸ is implied by
⟺ is equivalent to
→ to or tends toward
↔ corresponds to
↘ tends toward from above
↦ maps to
⟂ is perpendicular to
| such that or given
: such that
Acronyms
MA moving average
MAP maximum a posteriori
MDP Markov decision process
ML maximum likelihood
MPC model predictive control
m.s. mean square
MSE mean square error
NP nondeterministic polynomial time
ODE ordinary differential equation
OE output error
PCA principal component analysis
pdf probability density function
pf probability function
PI policy iteration
PMP Pontryagin maximum principle
QP quadratic program
RBM restricted Boltzmann machine
RMSprop root mean square propagation
RNN recurrent neural network
ReLU rectified linear unit
SA stochastic approximation
SAA stochastic average approximation
SARSA state-action-reward-state-action
SG stochastic gradient
SGD stochastic gradient descent
SMW Sherman–Morrison–Woodbury
SNR signal-to-noise ratio
SR1 symmetric rank-1
SVD singular value decomposition
SVM support vector machine
SVRG stochastic variance-reduced gradient
TCM temporal convolutional network
TPBVP two-point boundary value problem
VI value iteration
w.p.1 with probability one
About the Companion Website
www.wiley.com/go/opt4lc
Part I
Introductory Part
1 Introduction
This book will take you on a journey through the fascinating field of optimization, where we explore
techniques for designing algorithms that can learn and adapt to complex systems. Whether you are
an engineer, a scientist, or simply curious about the world of optimization, this book is for you. We
will start with the basics of optimization and gradually build up to advanced techniques for learning
and control. By the end of this book, you will have a solid foundation in optimization theory and
practical tools to apply to real-world problems. In this opening, we informally introduce problems
and concepts, and we will explain their close interplay with simple formulations and examples.
Chapters 2–13 will explore the topic with more rigor, and we end this chapter with an outline of
the remaining content of the book.
1.1 Optimization
Optimization is about choosing a best option from a set of available alternatives based on a specific
criterion. This concept applies to a range of disciplines, including computer science, engineer-
ing, operations research, and economics, and has a long history of conceptual and methodological
development. One of the most common optimization problems is of the form
minimize ∑_{k=1}^{m} f_k(x)²,    (1.1)
with variable x ∈ ℝ n . This is called a nonlinear least-squares problem, since we are minimizing the
squares of the possibly nonlinear functions fk . We will see that the least-squares problem and its
generalizations have many applications to learning and control. It is also the backbone of several
optimization methods for solving more complicated optimization problems. In Chapter 4, we set
the foundations for optimization theory, in Chapter 5, we cover different classes of optimization
problems that are relevant for learning and control, and in Chapter 6, we discuss different methods
for solving optimization problems numerically.
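As a concrete illustration (not taken from the book), a problem of the form (1.1) can be solved numerically once the functions f_k are specified. The sketch below fits the parameters of a hypothetical exponential model to synthetic data with scipy.optimize.least_squares; the model, the data, and all variable names are illustrative assumptions.

```python
# A minimal sketch of solving a nonlinear least-squares problem of the form (1.1),
# assuming f_k(x) = model(t_k, x) - y_k for synthetic data (t_k, y_k).
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 50)
x_true = np.array([2.0, 0.7])                      # hypothetical "true" parameters
y = x_true[0] * np.exp(-x_true[1] * t) + 0.05 * rng.standard_normal(t.size)

def residuals(x):
    # f_k(x) = x0 * exp(-x1 * t_k) - y_k, k = 1, ..., m
    return x[0] * np.exp(-x[1] * t) - y

sol = least_squares(residuals, x0=np.array([1.0, 1.0]))
print(sol.x)   # estimate of (x0, x1), close to x_true
```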
with variable x. This is an example of a least-squares problem for which fk (x) = aTk x − yk is an affine
function. Hence, this is often called a linear least-squares problem. Typically, m is much larger
than n, and therefore, the optimal solution x⋆ is of lower dimension than the measurements. If we
later get a new value of a, we may predict the value of the corresponding measurement as aT x⋆
without performing the measurement. For our application, this means that we can estimate the
distance to the city by just checking how long we have been traveling. We do not have to wait for a
new sign to appear. This is a so-called supervised learning problem, since for each ak , we know the
corresponding yk . For learning the length of the piece of wood the data did not come in pairs, but
instead, we just had one stream of data yk . We learned the mean of the data. That is the reason for
the name unsupervised learning for such a learning problem. We will discuss supervised learning
in more detail in Chapter 10.
studying dynamical systems, the pairs (uk , yk ) are called stimuli and response or input and output,
respectively. Sometimes the word signal is added at the end for each of the four words. Often,
the above equation does not hold exactly due to measurement errors ek , and hence, it is more
appropriate to consider
yk+1 = ayk + buk + ek , k = 1, … , m.
When we do not know the parameters (a, b) ∈ ℝ × ℝ, we can use supervised learning to learn the
values assuming that we are given pairs of data (uk , yk ) for 1 ≤ k ≤ m + 1. The following linear
least-squares problem
minimize ∑_{k=1}^{m} (y_{k+1} − a y_k − b u_k)²,
with variables (a, b) will provide an estimate of the parameters. Learning for dynamical systems is
called system identification, and it will be discussed in more detail in Chapter 12.
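To make this concrete, the following sketch (not from the book) simulates data from a first-order system with assumed parameter values and recovers (a, b) with numpy.linalg.lstsq; all numerical values are illustrative.

```python
# Estimate (a, b) in y_{k+1} = a*y_k + b*u_k + e_k from data, as a linear least-squares problem.
import numpy as np

rng = np.random.default_rng(1)
a_true, b_true, m = 0.9, 0.5, 200                  # hypothetical true parameters
u = rng.standard_normal(m)
y = np.zeros(m + 1)
for k in range(m):
    y[k + 1] = a_true * y[k] + b_true * u[k] + 0.01 * rng.standard_normal()

# Regressor matrix with rows (y_k, u_k); target is y_{k+1}.
Phi = np.column_stack([y[:m], u])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
print(theta)   # approximately (a_true, b_true)
```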
1.5 Control
Control is about making dynamical systems behave in a way we find desirable. Let us again consider
the dynamical system in (1.2), where we are going to influence the behavior by manipulating the
input signal uk . In the context of control, we also often call it the control signal. We assume that the
initial value y0 and the parameters (a, b) are known, and our objective is to make yk for 1 ≤ k ≤ m
small. We can make y1 equal to zero by taking u0 = −ay0 ∕b, and then we can make all future values
of yk equal to zero by taking all future values of uk equal to zero. However, it could be that the value
of u0 is large, and in applications, this could be costly. Hence, we are interested in finding a trade-off
between how large the values of uk are in comparison to the values of yk . This can be accomplished
by solving
minimize ∑_{k=1}^{m} (y_k² + 𝜌 u_k²)
subject to y_{k+1} = a y_k + b u_k,  k = 1, … , m − 1,    (1.3)
with variables (u1, y2, … , um, ym). This is an equality-constrained linear least-squares problem. The
parameter 𝜌 > 0 can be used to trade off how small |yk| should be compared to |uk|. We will cover
control of dynamical systems in Chapter 7 for continuous time, and in Chapter 8 for discrete time.
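One way to solve a problem like (1.3) numerically is to eliminate the equality constraints by expressing each y_k as a linear function of the inputs and then solving an unconstrained least-squares problem. The sketch below (not from the book) does this for assumed values of a, b, 𝜌, and y0; the indexing convention (inputs u_0, …, u_{m−1}) is a choice made for the illustration.

```python
# Sketch: solve min sum_k y_k^2 + rho*u_k^2 s.t. y_{k+1} = a*y_k + b*u_k
# by writing y = G u + h and solving the normal equations (G^T G + rho*I) u = -G^T h.
import numpy as np

a, b, rho, y0, m = 0.9, 0.5, 0.1, 1.0, 20          # illustrative parameter values

# y_k = a^k y0 + sum_{j<k} a^(k-1-j) b u_j for k = 1, ..., m
G = np.zeros((m, m))
for k in range(1, m + 1):
    for j in range(k):
        G[k - 1, j] = a ** (k - 1 - j) * b
h = np.array([a ** k * y0 for k in range(1, m + 1)])

u = np.linalg.solve(G.T @ G + rho * np.eye(m), -G.T @ h)
y = G @ u + h
print(u[:3], y[:3])   # trade-off between small inputs and small outputs
```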
1.7 Outline
The book is organized as follows: first, we give a background on linear algebra and probabilities
in Chapters 2 and 3. Background on optimization is given in Chapter 4. We will cover both convex
and nonconvex optimization. Chapter 5 introduces different classes of optimization problems that
we will encounter in the learning chapters later on. In Chapter 6, we discuss different optimization
methods that are suitable for solving learning problems. After this, we discuss calculus of variations
in Chapter 7 and dynamic programming in Chapter 8. We then cover unsupervised learning in
Chapter 9, supervised learning in Chapter 10, and reinforcement learning in Chapter 11. Finally,
we discuss system identification in Chapter 12. For information about notation, basic definitions,
and software useful for optimization and for the applications we consider, see the Appendix.
2 Linear Algebra
Linear algebra is the study of vector spaces and linear maps on such spaces. It constitutes a funda-
mental building block in optimization and is used extensively for theoretical analysis and deriva-
tions as well as numerical computations.
A procedure for solving systems of simultaneous linear equations appeared already in an
ancient Chinese mathematical text. Systems of linear equations were introduced in Europe
in the seventeenth century by René Descartes in order to represent lines and planes by linear
equations and to compute their intersections. Gauss developed the method of elimination. Further
important developments were done by Gottfried Wilhelm von Leibniz, Gabriel Cramer, Hermann
Grassmann, and James Joseph Sylvester, the latter introducing the term “matrix.”
The purpose of this chapter is to review key concepts from linear algebra and calculus in
finite-dimensional vector spaces as well as a number of useful identities that will be used
throughout the book. We also discuss some computational aspects, including a number of matrix
factorizations and their application to solving systems of linear equations.
A matrix A of size m-by-n, also written as m × n, is an ordered rectangular array that consists of
mn elements arranged in m rows and n columns, i.e.
A = [a11 a12 ⋯ a1n ; a21 a22 ⋯ a2n ; ⋮ ⋮ ⋱ ⋮ ; am1 am2 ⋯ amn],
where aij denotes the element of A in its ith row and jth column. The set of m-by-n matrices with
real-valued elements is denoted by ℝm×n . The transpose of A is the n × m matrix defined as
AT = [a11 a21 ⋯ am1 ; a12 a22 ⋯ am2 ; ⋮ ⋮ ⋱ ⋮ ; a1n a2n ⋯ amn],
i.e. the (i, j)th element of AT is the (j, i)th element of A.
It is often convenient to think of a vector as a matrix with a single row or column. For example,
when interpreted as a matrix of size 1 × n, the vector a ∈ ℝn is referred to as a row vector, and
similarly, when interpreted as a matrix of size n × 1, a is referred to as a column vector. In this
book, we use the convention that all vectors are column vectors. Thus, a vector x ∈ ℝn is always
interpreted as the column vector
x = [x1 ; x2 ; ⋮ ; xn],
and hence, xT is interpreted as the row vector [x1 x2 ⋯ xn]. Similarly, to refer to the columns of a
matrix A ∈ ℝm×n , we will sometimes use the notation
A = [a1 a2 ⋯ an],
where a1 , a2 , … , an ∈ ℝm . When referring to the rows of A, we will define
AT = [a1 a2 ⋯ am],
where a1 , a2 , … , am ∈ ℝn such that A is the matrix with rows aT1 , aT2 , … , aTm . The notation is some-
what ambiguous because ai may refer to the ith element of a vector a or the ith column of either A or
AT , but the meaning will be clear from the context and follows from our convention that vectors
are column vectors.
Given two vectors x, y ∈ ℝn , the inner product ⟨x, y⟩ can also be expressed as the product
xT y = [x1 x2 ⋯ xn][y1 ; y2 ; ⋮ ; yn] = ∑_{i=1}^{n} xi yi .
In contrast, the outer product of two vectors u ∈ ℝm and 𝑣 ∈ ℝn , not necessarily of the same length,
is defined as the m × n matrix
u𝑣T = [u1𝑣1 u1𝑣2 … u1𝑣n ; u2𝑣1 u2𝑣2 … u2𝑣n ; ⋮ ⋮ ⋱ ⋮ ; um𝑣1 um𝑣2 … um𝑣n].
The product of a matrix A ∈ ℝm×n , with columns a1 , … , an ∈ ℝm , and a vector x ∈ ℝn is the linear
combination
y = a1 x1 + a2 x2 + · · · + an xn .
Equivalently, the ith element of the vector y = Ax is the inner product of x and the ith row of A.
The vector inner and outer products and matrix–vector multiplication are special cases of matrix
multiplication. Two matrices A and B are said to be conformable for multiplication if the number of
columns in A is equal to the number of rows in B. Given two such matrices A ∈ ℝm×p and B ∈ ℝp×n ,
the product C = AB is the m × n matrix with elements
cij = ∑_{k=1}^{p} aik bkj ,  i ∈ ℕm , j ∈ ℕn .
Note that cij is the inner product of the ith row of A and the jth column of B. As a result, C = AB
may be expressed as
C = [a1T ; a2T ; ⋮ ; amT][b1 b2 ⋯ bn] = [a1T b1 a1T b2 … a1T bn ; a2T b1 a2T b2 … a2T bn ; ⋮ ⋮ ⋱ ⋮ ; amT b1 amT b2 … amT bn],
where aT1 , … , aTm are the rows of A and b1 , … , bn are the columns of B. Equivalently, by expressing
A in terms of its columns and B in terms of its rows, C may also be written as the sum of p outer
products
C = [a1 a2 ⋯ ap][b1T ; b2T ; ⋮ ; bpT] = ∑_{i=1}^{p} ai biT ,
where a1 , … , ap are the columns of A and bT1 , … , bTp are the rows of B.
It is important to note that matrix multiplication is associative, but unlike scalar multiplication,
it is not commutative. In other words, the associative property (AB)C = A(BC) holds, provided that
A, B, and C are conformable for multiplication, but the identity AB = BA does not hold in general.
The Frobenius inner product of two matrices A, B ∈ ℝm×n is defined as
⟨A, B⟩ = ∑_{i=1}^{m} ∑_{j=1}^{n} aij bij .    (2.2)
The inner product ⟨A, B⟩ may also be written as vec(A)T vec(B), where vec(A) maps a matrix
A ∈ ℝm×n to a vector of length mn by stacking the columns of A, i.e.
vec(A) = vec([a1 ⋯ an]) = [a1 ; ⋮ ; an] ∈ ℝmn .
The span of n vectors 𝑣1 , … , 𝑣n ∈ ℝm is the set of all linear combinations of these vectors, i.e.
span(𝑣1 , … , 𝑣n ) = {𝛼1 𝑣1 + · · · + 𝛼n 𝑣n ∣ 𝛼1 , … , 𝛼n ∈ ℝ}.
The set 𝑣1 , … , 𝑣n is said to be linearly independent if
𝛼1 𝑣1 + · · · + 𝛼n 𝑣n = 0 ⟺ 𝛼1 = · · · = 𝛼n = 0,
and otherwise, the set is said to be linearly dependent. Equivalently, the set 𝑣1 , … , 𝑣n is linearly
independent if and only if all vectors in span(𝑣1 , … , 𝑣n ) have a unique representation as a linear
combination of 𝑣1 , … , 𝑣n . The span of a set of k linearly independent vectors 𝑣1 , … , 𝑣k ∈ ℝm is
a k-dimensional subspace of ℝm , and 𝑣1 , … , 𝑣k is a basis for the subspace. In other words, the
dimension of a subspace is equal to the number of linearly independent vectors that span the sub-
space. The vectors 𝑣1 , … , 𝑣n are said to be orthonormal if they are mutually orthogonal and of unit
length, i.e. 𝑣Ti 𝑣j = 0 if i ≠ j and 𝑣Ti 𝑣i = 1 for i ∈ ℕn . The standard basis or natural basis for ℝm is the
orthonormal basis e1 , … , em , where ei ∈ ℝm is the unit vector whose ith element is equal to 1, and
the rest are zero.
The range of a matrix A ∈ ℝm×n , denoted ℛ(A), is the span of its columns. This is also referred
to as the column space of A, whereas ℛ(AT ) is referred to as the row space of A. The dimension of
ℛ(A) is called the rank of A, denoted rank(A). The null space of A ∈ ℝm×n , denoted 𝒩(A), consists
of all vectors 𝑣 such that A𝑣 = 0, i.e.
𝒩(A) = {𝑣 ∈ ℝn ∣ A𝑣 = 0}.
The dimension of 𝒩(A) is called the nullity of A and is denoted nullity(A). The null space of A is
said to be trivial if 𝒩(A) = {0}, in which case nullity(A) = 0.
To see that ℛ(A) = 𝒩(AT )⟂ , note that for every y ∈ 𝒩(AT ), we have that
(AT y)T x = yT Ax = 0, ∀x ∈ ℝn ,
or equivalently, if we let u = Ax, we see that yT u = 0 for all u ∈ ℛ(A). This shows that ℛ(A)
and 𝒩(AT ), which are both subspaces of ℝm , are orthogonal complements. Similarly, for every
x ∈ 𝒩(A), it immediately follows that yT Ax = 0 for all y ∈ ℝm , and hence, ℛ(AT ) is the orthogonal
complement of 𝒩(A).
The result (2.3) can be used to derive the so-called rank-nullity theorem, which states that
rank(A) + nullity(A) = n.
To see this, note that the identity ℛ(AT ) = 𝒩(A)⟂ combined with the fact that the dimensions of a
subspace of ℝn and its orthogonal complement sum to n implies that rank(AT ) = n − nullity(A).
The rank-nullity theorem follows from the identity
rank(A) = rank(AT ),
Figure 2.1 The four subspaces associated with an m × n matrix A with rank r.
which we will now derive. First, suppose that rank(AT ) = r, and let 𝑣1 , … , 𝑣r ∈ ℝn be a linearly
independent set of vectors that span ℛ(AT ). It follows that the set of vectors A𝑣1 , … , A𝑣r is linearly
independent since
𝛼1 A𝑣1 + · · · + 𝛼r A𝑣r = A(𝛼1 𝑣1 + · · · + 𝛼r 𝑣r ) = 0
implies that 𝛼1 𝑣1 + · · · + 𝛼r 𝑣r ∈ 𝒩(A) = ℛ(AT )⟂ , which is only possible if 𝛼1 = · · · = 𝛼r = 0.
As a result, we have that rank(A) ≥ rank(AT ). Applying the same inequality to B = AT implies that
rank(A) = rank(AT ). A direct consequence of this identity is that rank(A) ≤ min (m, n). We say that
A has full rank if rank(A) = min (m, n), and we will use the term full row rank when r = m and the
term full column rank when r = n. The four subspaces ℛ(A), ℛ(AT ), 𝒩(AT ), and 𝒩(A) and their
dimensions are illustrated in Figure 2.1.
If two matrices A and B are conformable for multiplication, then
rank(AB) ≤ min (rank(A), rank(B)).
This follows from the fact that ℛ(AB) ⊆ ℛ(A) and ℛ(BT AT ) ⊆ ℛ(BT ), which implies that
rank(AB) ≤ rank(A) and rank(BT AT ) ≤ rank(BT ) = rank(B). Furthermore, we have that ℛ(AB) =
ℛ(A) when B has full row rank, in which case rank(AB) = rank(A).
If A and B are conformable for addition, then
rank(A + B) ≤ rank(A) + rank(B).
This means that rank is subadditive, and it follows from the fact that ℛ(A + B) ⊆ ℛ(A) + ℛ(B),
where
ℛ(A) + ℛ(B) = {u + 𝑣 ∣ u ∈ ℛ(A), 𝑣 ∈ ℛ(B)}
is the Minkowski sum of ℛ(A) and ℛ(B). Subadditivity implies that a rank k matrix can be decomposed
as the sum of k or more rank 1 matrices. Thus, a rank k matrix A ∈ ℝm×n can be factorized as
A = CRT = ∑_{i=1}^{k} ci riT ,
where c1 , … , ck and r1 , … , rk are the columns of C ∈ ℝm×k and R ∈ ℝn×k , respectively, and ℛ(C) =
ℛ(A) and ℛ(R) = ℛ(AT ).
The determinant of a square matrix A ∈ ℝn×n can be expressed as
det (A) = ∑_{i=1}^{n} (−1)^{i+j} aij Mij ,    (2.8)
where Mij denotes the minor of aij , which is the determinant of the (n − 1) × (n − 1) matrix that
is obtained by removing the ith row and jth column of A. This expression is a so-called Laplace
expansion of the determinant along the jth column of A, and it holds for every j ∈ ℕn . As a special
case, the determinant of a 2 × 2 matrix A may be expressed as
det (A) = a11 a22 − a12 a21 ,
and its absolute value may be interpreted as the area of a parallelogram defined by the columns
of A, as illustrated in Figure 2.2. More generally, the absolute value of the determinant of an n × n
matrix A is the volume of the parallelotope defined by the columns of A, i.e. the set
{Ax ∈ ℝn | 0 ≤ xi ≤ 1 ∀i ∈ ℕn }.
As a result, det (A) ≠ 0 if and only if A has full rank.
The term (−1)i+j Mij is called the cofactor of aij . The n × n matrix composed of all the cofactors,
i.e. the matrix C with elements
cij = (−1)i+j Mij , i, j ∈ ℕn ,
is called the cofactor matrix. Expressed in terms of the cofactors, the Laplace expansion (2.8) may
be written as the inner product of the jth column of A and C, i.e.
det (A) = ∑_{k=1}^{n} akj ckj .
Furthermore, since the Laplace expansion is valid for any j ∈ ℕn , the diagonal elements of the
matrix CT A are all equal to det (A). In fact, it can be shown that
CT A = det (A)I,
where 𝜃1 + · · · + 𝜃k = 1. A set 𝒜 ⊆ ℝn is an affine set if and only if it contains all affine combinations
of its points. Equivalently, 𝒜 is affine if for every pair of points x, y ∈ 𝒜,
{𝜃x + (1 − 𝜃)y | 𝜃 ∈ ℝ} ⊆ 𝒜 .
An example of an affine set is the set of solutions to the system of equations Ax = b with A ∈ ℝm×n
and b ∈ ℝm , i.e.
𝒜 = {x ∈ ℝn | Ax = b}.
In fact, every affine subset of ℝn may be expressed in this form. The affine hull of a set 𝒞 ⊆ ℝn ,
which we denote aff 𝒞, is the smallest possible affine set that contains 𝒞.
2.3 Norms
Equipped with the Euclidean inner product (2.1), the set ℝn is a Euclidean vector space of dimen-
sion n with the induced norm
||a||2 = √⟨a, a⟩ = (a1² + a2² + · · · + an²)^{1∕2} .    (2.10)
More generally, a norm of a vector x ∈ ℝn , denoted ||x||, is a function with the following defining
properties:
1. Subadditive:
||u + 𝑣|| ≤ ||u|| + ||𝑣||, ∀u, 𝑣 ∈ ℝn .
2. Absolutely homogeneous:
||𝛼x|| = |𝛼|||x||, ∀x ∈ ℝn , ∀𝛼 ∈ ℝ.
3. Positive definite:
||x|| = 0 if and only if x = 0.
4. Nonnegative:
||x|| ≥ 0, ∀x ∈ ℝn .
Such a function is not unique, so to distinguish between different norms, a subscript is typically
used to denote specific norms. For example, the 2-norm or Euclidean norm of a vector x, defined
in (2.10), is denoted ||x||2 . Other examples are the 1-norm and the infinity norm, defined as
||x||1 = ∑_{i=1}^{n} |xi |  and  ||x||∞ = max {|x1 |, … , |xn |}.
This is sometimes referred to as an entrywise norm. In contrast, given both vector norms on ℝm and
ℝn , the operator norm or induced norm on ℝm×n may be defined as
||A|| = inf_{t≥0} {t | ||Ax|| ≤ t||x|| ∀x ∈ ℝn }.
It follows directly from this definition that ||Ax|| ≤ ||A||||x|| for all x ∈ ℝn and that ||I|| = 1. The
induced norm may also be expressed as
||A|| = sup_{x≠0} { ||Ax|| ∕ ||x|| },    (2.14)
from which the submultiplicative property ||AB|| ≤ ||A|| ||B|| follows:
||AB|| = sup_{x≠0} { ||ABx|| ∕ ||x|| } = sup_{Bx≠0} { ||ABx|| ∕ ||x|| } = sup_{Bx≠0} { (||ABx|| ∕ ||Bx||)(||Bx|| ∕ ||x||) } ≤ sup_{y≠0} { ||Ay|| ∕ ||y|| } sup_{x≠0} { ||Bx|| ∕ ||x|| } = ||A|| ||B||.
The matrix p-norm of A is the norm induced by the vector p-norm, and it is denoted ||A||p .
The next example considers the matrix 1-norm, and the matrix infinity norm is treated in
Exercise 2.3.
Example 2.1 The matrix 1-norm on ℝm×n is the induced norm defined as
||A||1 = sup_{x≠0} { ||Ax||1 ∕ ||x||1 } = sup_{||x||1 =1} { ||Ax||1 } = max_{i∈ℕn} { ||ai ||1 },
where a1 , … , an denote the n columns of A. To verify the third equality, first note that
sup_{||x||1 =1} { ||Ax||1 } ≥ ||Aej ||1 = ||aj ||1 , j ∈ ℕn ,
where e1 , … , en are the columns of the identity matrix of order n. Moreover, subadditivity, i.e. the
triangle inequality, implies that
||A||1 = sup_{||x||1 =1} ‖ ∑_{j=1}^{n} xj aj ‖1 ≤ sup_{||x||1 =1} ∑_{j=1}^{n} |xj | ||aj ||1 = max_{j∈ℕn} ||aj ||1 .
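A quick numerical check of Example 2.1 (an illustration, not from the book): numpy's matrix 1-norm should coincide with the largest column sum of absolute values.

```python
# Verify that ||A||_1 equals the maximum column sum of absolute values.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))

max_col_sum = np.max(np.abs(A).sum(axis=0))             # max_j ||a_j||_1
print(np.isclose(np.linalg.norm(A, 1), max_col_sum))    # True
```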
Table 2.1 FLOP counts for basic matrix and vector operations: 𝛼 denotes a scalar, x and y are n-vectors, and A ∈ ℝm×n and B ∈ ℝn×p are matrices.
Operation | FLOPs
Scaling 𝛼x | n
Vector addition/subtraction x ± y | n
Inner product xT y | 2n − 1
Matrix–vector multiplication Ax | m(2n − 1)
Matrix–matrix multiplication AB | mp(2n − 1)
The function g provides an upper bound on the growth rate of f as the parameters m
and/or n increase. For example, the FLOP count for the matrix–vector product Ax is O(mn)
since m(2n − 1) ≤ 2mn for all m ≥ 1, n ≥ 1. Similarly, the function f (m, n) = m² + 10m + n³
satisfies f (m, n) = O(m² + n³) since m² + 10m + n³ ≤ 2(m² + n³) for m ≥ 10, n ≥ 1.
Example 2.2 Recall that matrix multiplication is associative, i.e. given three matrices A, B, and C
that are conformable for multiplication, we have that A(BC) = (AB)C. This implies that the prod-
uct of three or more matrices A1 A2 … Ak can be evaluated in several ways, each of which may have
a different FLOP count. For example, the product ABC with A ∈ ℝm×n , B ∈ ℝn×p , and C ∈ ℝp×q
may be evaluated left-to-right by first computing L = AB and then LC, or right-to-left by comput-
ing R = BC and then AR. The first approach requires O(mp(n + q)) FLOPs, whereas the second
approach requires O(nq(m + p)) FLOPs. A special case is the product abT x with a ∈ ℝm , b ∈ ℝp ,
and x ∈ ℝp which requires O(mp) FLOPs when evaluated left-to-right, but only O(m + p) FLOPs
when evaluated right-to-left. More generally, the product A1 A2 … Ak of k matrices requires k − 1
matrix–matrix multiplications which can be carried out in any order since matrix multiplication is
associative. The problem of finding the order that yields the lowest FLOP count is a combinatorial
optimization problem, which is known as the matrix chain ordering problem, see Exercise 5.7. This
problem can be solved by means of dynamic programming, which we introduce in Section 5.5 and
discuss in more detail in Chapter 8.
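The following small experiment (illustrative, not from the book) shows the effect of evaluation order for the product a bᵀ x discussed in Example 2.2: evaluating right-to-left avoids forming the m × p outer product.

```python
# Compare left-to-right and right-to-left evaluation of a b^T x (a: m, b: p, x: p).
import numpy as np
import time

rng = np.random.default_rng(3)
m, p = 2000, 2000
a, b, x = rng.standard_normal(m), rng.standard_normal(p), rng.standard_normal(p)

t0 = time.perf_counter()
y1 = np.outer(a, b) @ x            # O(m*p) memory and FLOPs: forms the outer product
t1 = time.perf_counter()
y2 = a * (b @ x)                   # O(m + p) FLOPs: inner product first, then scaling
t2 = time.perf_counter()

print(np.allclose(y1, y2), t1 - t0, t2 - t1)
```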
Given a vector d ∈ ℝn , we write diag(d) when referring to the diagonal matrix with the elements of d on its diagonal. Diagonal matrices
are also attractive from a computational point of view. For example, a matrix–vector product of the
form diag(d)x = (d1 x1 , … , dn xn ) involves only n FLOPs, whereas a matrix–vector product Ax with
a general square matrix A requires O(n2 ) FLOPs. Similarly, the inverse of a nonsingular diagonal
matrix diag(d) is also diagonal, i.e. diag(d)−1 = diag(1∕d1 , … , 1∕dn ).
Yet another family of orthogonal matrices is the permutation matrices. A permutation
of the elements of a vector x ∈ ℝn is a reordering of the elements. This can be expressed as a map
of the form
(x1 , … , xn ) → (x𝜋(1) , … , x𝜋(n) ), (2.19)
where the function 𝜋 ∶ ℕn → ℕn defines the permutation and satisfies 𝜋(i) ≠ 𝜋(j) ⟺ i ≠ j. This
map can be expressed as a matrix–vector product Px, where P ∈ ℝn×n is the permutation matrix
defined as
P = [e_{𝜋(1)}T ; ⋮ ; e_{𝜋(n)}T].    (2.20)
It is easy to verify that PPT = I since 𝜋(i) ≠ 𝜋(j) whenever i ≠ j, which implies that P is orthogonal.
It follows that the map (2.19) is invertible, and P−1 = PT is the permutation matrix that corresponds
to the inverse permutation 𝜋 −1 . The special case where 𝜋(i) = n + 1 − i corresponds to reversing the
order of n elements and leads to the permutation matrix
J = [enT ; ⋮ ; e1T], i.e. the matrix with ones on its anti-diagonal and zeros elsewhere.
The determinant of a triangular matrix T of order n is the product of its diagonal entries, i.e.
det (T) = t11 t22 · · · tnn , which follows from the Laplace expansion (2.8). A direct consequence is that T is nonsingular if
and only if all of the diagonal entries of T are nonzero. The inverse of a nonsingular lower (upper)
triangular matrix T is another lower (upper) triangular matrix. This follows from the identity (2.9)
by noting that the cofactor matrix associated with T is itself triangular. To see this, first recall that
the minor of tij is the determinant of the (n − 1) × (n − 1) matrix obtained by deleting the ith row
and the jth column of T. This implies that if, say, T is lower triangular and i > j, then the effect of
deleting the ith row and the jth column of T is a triangular matrix of order n − 1. This matrix will
have at least one diagonal entry that is equal to zero, and hence, it is singular, which implies that
2.5 Matrices with Structure 19
the minor of tij is zero. We note that the product of two lower (upper) triangular matrices is another
lower (upper) triangular matrix.
Given a nonsingular triangular matrix T, it is possible to compute matrix–vector products of the
form y = T −1 x and y = T −T x without computing T −1 first. For example, if T is lower triangular,
then y = T −1 x can be computed by means of forward substitution, i.e.
y1 ← T11^{−1} x1 ,  yk ← Tkk^{−1} ( xk − ∑_{i=1}^{k−1} Tki yi ),  k = 2, … , n.
This requires approximately n² FLOPs. Similarly, if T is an upper triangular matrix, then y = T −1 x
can be computed using backward substitution, i.e.
yn ← Tnn^{−1} xn ,  yk ← Tkk^{−1} ( xk − ∑_{i=k+1}^{n} Tki yi ),  k = n − 1, … , 1.
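A direct implementation of forward substitution (a sketch, not taken from the book) and a comparison against scipy.linalg.solve_triangular:

```python
# Forward substitution for a nonsingular lower triangular T: solve T y = x.
import numpy as np
from scipy.linalg import solve_triangular

def forward_substitution(T, x):
    n = x.size
    y = np.zeros(n)
    for k in range(n):
        y[k] = (x[k] - T[k, :k] @ y[:k]) / T[k, k]
    return y

rng = np.random.default_rng(4)
T = np.tril(rng.standard_normal((5, 5))) + 5.0 * np.eye(5)   # well-conditioned lower triangular
x = rng.standard_normal(5)
print(np.allclose(forward_substitution(T, x), solve_triangular(T, x, lower=True)))  # True
```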
i.e. svec stacks the lower triangular part of the columns of A and scales the off-diagonal entries by
√2. It is straightforward to verify that this definition implies that for all X, Y ∈ 𝕊n ,
x = svec(X), y = svec(Y ) ⟹ xT y = tr(X T Y ),
which means that the svec transformation preserves inner products.
A lower triangular Toeplitz matrix whose first column is given by a0 , … , an−1 can be expressed as
L = ∑_{k=0}^{n−1} ak S^k .    (2.26)
It then follows from the fact that det (L) = a0^n that L is nonsingular if and only if a0 ≠ 0. Moreover,
a matrix of the form (2.26) commutes with S, which follows from the fact that S commutes with
Sk , i.e. SSk = Sk S. It can also be shown that the inverse of a nonsingular lower triangular Toeplitz
matrix is another lower triangular Toeplitz matrix, see Exercise 2.6.
Toeplitz matrices are closely related to another class of matrices called Hankel matrices. Like a
Toeplitz matrix, a Hankel matrix is a square matrix that is uniquely determined by its first row
and column. However, unlike a Toeplitz matrix, a Hankel matrix has constant entries along the
anti-diagonal, the anti-subdiagonals, and anti-superdiagonals. As a consequence, the exchange
matrix J maps a Toeplitz matrix to a Hankel matrix and vice versa, i.e. if T is a Toeplitz matrix,
then JT and TJ are both Hankel matrices. We note that Hankel matrices are symmetric, whereas
Toeplitz matrices are persymmetric, i.e. symmetric about the anti-diagonal. Finally, we note
that the notion of Toeplitz (Hankel) matrices can be extended to include nonsquare matrices
with diagonal-constant (anti-diagonal-constant) structure, and such matrices may be viewed as
submatrices of square Toeplitz (Hankel) matrices.
A band matrix B is a square matrix whose nonzero entries are confined to a band around the main diagonal, i.e. entries sufficiently far from the diagonal are zero. The lower bandwidth of B is a nonnegative integer l such that bij = 0 if j < i − l, and similarly,
the upper bandwidth of B is a nonnegative integer u such that bij = 0 if j > i + u. A bandwidth of
0 corresponds to a diagonal matrix, whereas a bandwidth of 1 is a tridiagonal matrix or, if l = 0 or
u = 0, an upper or a lower bidiagonal matrix.
For any A ∈ ℝn×n and x ∈ ℝn , the quadratic form xT Ax satisfies
xT Ax = xT AT x = (1∕2)xT (A + AT )x,
which implies that only the symmetric part of A contributes to the value of xT Ax. We therefore limit
our attention to the case where A ∈ 𝕊n .
A matrix A ∈ 𝕊n is positive semidefinite if and only if xT Ax ≥ 0 for all x ∈ ℝn , and it is positive
definite if and only if xT Ax > 0 for all x ≠ 0. Similarly, A is negative (semi)definite if −A is positive
(semi)definite, and it is indefinite if it is neither positive semidefinite nor negative semidefinite. We
will use the notation 𝕊n+ for the set of positive semidefinite matrices in 𝕊n , the interior of which is
the set of positive definite matrices, denoted by 𝕊n++ .
Given two matrices A, B ∈ 𝕊n , the generalized inequality A ⪰_{𝕊n+} B, which is a partial ordering on
𝕊n , is defined as
A ⪰_{𝕊n+} B ⟺ A − B ∈ 𝕊n+ .
To simplify notation, we will omit the subscript 𝕊n+ and simply write A ⪰ B and A ≻ B when there
is no danger of ambiguity. We return to generalized inequalities in Section 4.2.
We end this section by deriving some useful properties of positive semidefinite matrices. To this
end, we consider a matrix X ∈ 𝕊n+ , which we partition as
X = [A B ; BT C],    (2.29)
where A ∈ 𝕊n1 , B ∈ ℝn1 ×n2 , and C ∈ 𝕊n2 with n1 + n2 = n. Positive semidefiniteness implies that
zT Xz ≥ 0 for all z ∈ ℝn or, equivalently, for all z = (u, 𝑣) ∈ ℝn1 × ℝn2 ,
f (u, 𝑣) = uT Au + 𝑣T C𝑣 + 2uT B𝑣 ≥ 0.
Thus, f (u, 0) = uT Au ≥ 0 for all u and f (0, 𝑣) = 𝑣T C𝑣 ≥ 0 for all 𝑣, so both A and C must be positive
semidefinite. This holds for any partition of the form (2.29) and any symmetric permutation of X
which, in turn, implies that every principal submatrix of X must be positive semidefinite.
Positive semidefiniteness of X also implies that
ℛ(B) ⊆ ℛ(A),  ℛ(BT ) ⊆ ℛ(C),    (2.30)
which is easily proven by contradiction: if ℛ(B) ⊈ ℛ(A), then there exists a vector 𝑣 such that B𝑣 ≠ 0
and B𝑣 ∉ ℛ(A), and hence, f (−tB𝑣, 𝑣) = 𝑣T C𝑣 − 2t||B𝑣||2² tends to −∞ as t → ∞. This is a contradiction
since X is positive semidefinite. The condition ℛ(BT ) ⊆ ℛ(C) can be proven in a similar
manner. An immediate consequence of the range conditions (2.30) is that the ith row and column
of X must be zero if Xii = 0.
Every symmetric matrix A ∈ 𝕊n has a spectral decomposition
A = QΛQT ,
where Q is an orthogonal matrix whose columns are eigenvectors of A, and Λ = diag(λ1 , … , λn ) contains the corresponding eigenvalues. The matrix A is positive semidefinite if and only if λmin (A) ≥ 0.
Analogously, A is positive definite if and only if λmin (A) > 0, which implies that A has full rank, and
it is indefinite if it has both positive and negative eigenvalues. A positive definite matrix A of order
n defines a weighted inner product ⟨y, x⟩A = ⟨y, Ax⟩ = yT Ax, which induces the quadratic norm
||x||A = √⟨x, x⟩A = √(xT Ax) .
The symmetric square root of a matrix A ∈ 𝕊n+ is the unique symmetric positive semidefinite
matrix A1∕2 that satisfies A = A1∕2 A1∕2 . Given a spectral decomposition A = QΛQT , the symmetric
square root of A may be expressed as
A^{1∕2} = QΛ^{1∕2} QT = Q diag(λ1^{1∕2} , … , λn^{1∕2} ) QT .
This implies that transformations of the form A → F T AF with F ∈ ℝn×k preserve positive semidefiniteness, i.e. we have that
A ∈ 𝕊n+ ⟹ F T AF ∈ 𝕊k+ .
Moreover, if A is positive definite and rank(F) = k, then F T AF is also positive definite. We note that
A and B = F T AF are said to be congruent if F is square and nonsingular.
The eigenvalues and eigenvectors of a symmetric matrix are related to the so-called Rayleigh
quotient which, for a given matrix A ∈ 𝕊n and a nonzero vector x ∈ ℝn , is defined as
RA (x) = (xT Ax) ∕ (xT x),  x ≠ 0.    (2.31)
A stationary point of RA must satisfy
∇RA (x) = 0 ⟺ Ax = RA (x) x,
i.e. the stationary points of RA are eigenvectors of A, and the corresponding stationary values are eigenvalues of A. In particular,
λmax (A) = max_{x≠0} {RA (x)},  λmin (A) = min_{x≠0} {RA (x)}.
A singular value decomposition (SVD) of a matrix A ∈ ℝm×n is a factorization of the form A = UΣV T ,
where U ∈ ℝm×m and V ∈ ℝn×n are orthogonal matrices, and Σ ∈ ℝm×n is a matrix with the singular
values of A on its main diagonal and zeros elsewhere, i.e. the diagonal entries of Σ are Σii = 𝜎i ,
i ∈ ℕmin (m,n) where we use the convention that 𝜎1 ≥ 𝜎2 ≥ · · · ≥ 0. If we let r denote the rank of A,
then 𝜎i = 0 for i > r, and hence, we can partition an SVD of A as
A = [U1 U2] [S 0 ; 0 0] [V1T ; V2T] = U1 SV1T ,    (2.32)
where U1 ∈ ℝm×r , V1 ∈ ℝn×r , and S = diag(𝜎1 , … , 𝜎r ) is the square matrix with the nonzero singu-
lar values of A on its diagonal. This shows that an SVD is a so-called rank-revealing factorization,
and A = U1 SV1T is commonly referred to as a thin or reduced SVD of A. We note that the largest
integer k such that 𝜎k > 𝜖 for a given 𝜖 > 0 is referred to as the numerical rank or the 𝜖-rank of A.
The 𝜖-rank of A may also be defined as
min { rank(B) ∣ B ∈ ℝm×n , ||A − B||2 ≤ 𝜖 },
which allows us to interpret the numerical rank of A as the smallest attainable rank for matrices
within a neighborhood of A.
The partition (2.32) can be linked to the four subspaces introduced in Section 2.2 and illustrated
in Figure 2.1. Specifically, we have that
ℛ(A) = ℛ(U1 ),  𝒩(AT ) = ℛ(U2 ),  ℛ(AT ) = ℛ(V1 ),  𝒩(A) = ℛ(V2 ).
The matrix P = U1 U1T is a projection matrix, and it is also an idempotent matrix since P² = P. Moreover,
I − P is also a projection matrix, which follows from the fact that I − P = U2 U2T , and it defines
a projection onto 𝒩(AT ) = ℛ(A)⟂ .
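The sketch below (illustrative, not from the book) uses numpy's SVD to compute the rank of a matrix, orthonormal bases for ℛ(A) and ℛ(AT), and the projection matrix P = U1 U1ᵀ onto ℛ(A).

```python
# Rank, bases of R(A) and R(A^T), and the projector onto R(A) from an SVD.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))   # rank 3 by construction

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10 * s[0])          # numerical rank (epsilon-rank with a relative tolerance)
U1, V1 = U[:, :r], Vt[:r, :].T        # orthonormal bases for R(A) and R(A^T)

P = U1 @ U1.T                          # projection onto R(A)
print(r, np.allclose(P @ P, P), np.allclose(P @ A, A))   # 3 True True
```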
An SVD can also be used to construct useful upper and lower bounds on the trace inner product
of two matrices. A notable example is von Neumann’s trace inequality, which states that
|tr(AT B)| ≤ tr(ΣT Γ) = ∑_{i=1}^{min (m,n)} Σii Γii ,    (2.33)
where A, B ∈ ℝm×n are matrices with SVDs A = UΣV T and B = PΓQT , see, e.g. [76] for a proof.
The singular values of an m × n matrix A can be used to define a family of matrix norms on ℝm×n
that are known as Schatten norms. For p ∈ [1, ∞), the Schatten p-norm is defined as
||A||_(p) = ( ∑_{i=1}^{min (m,n)} 𝜎i (A)^p )^{1∕p} ,    (2.34)
i.e. it may be viewed as the p-norm of a vector that contains the singular values of A. The parentheses
in the subscript are not standard notation, but we include them here to avoid confusion with the
induced matrix p-norm defined in (2.14). The Schatten 1-norm, which we will denote by ||A||∗ , is
also known as the nuclear norm or the trace norm. It is straightforward to verify that the matrix
norms ||A||F and ||A||2 are both special cases of the Schatten p-norm, see Exercise 2.5.
The Moore–Penrose pseudoinverse provides a convenient way to express projections onto the
four subspaces (A), (AT ), (AT ), and (A). This follows from (2.32) and (2.36) by noting that
AA† and A† A are projection matrices, i.e.
AA† = U1 U1T , A† A = V1 V1T .
Thus, projections onto the four subspaces can be expressed in terms of the projection matrices
included in Table 2.2.
Table 2.2 The four subspaces and the corresponding projection matrices.
Subspace | Projection matrix
ℛ(A) | AA† = U1 U1T
𝒩(AT ) | I − AA†
ℛ(AT ) | A† A = V1 V1T
𝒩(A) | I − A† A
2.11.1 LU Factorization
A by-product of Gaussian elimination is a factorization of the form
A = PLU,
where A is nonsingular and square, P is a permutation matrix, L is unit lower triangular, and U
is upper triangular. This is known as a PLU factorization or simply an LU factorization of A. The
factorization requires roughly (2∕3)n3 FLOPs, and it only requires additional storage for a permuta-
tion vector if A is overwritten by the factors L and U. The factorization allows us to find the solution
to Ax = b by solving the three simpler systems of equations,
Pz = b, Ly = z, Ux = y.
The first system, Pz = b, has the solution z = PT b, which is simply a permutation of b, and the
two triangular systems can be solved by means of forward and backward substitution in roughly
2n2 FLOPs. Thus, the total cost of the factorization step and the solve step is roughly (2∕3)n3 +
2n2 FLOPs. Note that the solve step is significantly cheaper than the factorization step, so it is
generally advantageous to reuse the factorization of A if several systems with this coefficient matrix
must be solved. We note that A−1 can be computed by solving the matrix equation AX = I, which
is equivalent to solving the n systems Axi = ei , i ∈ ℕn . This costs approximately (2∕3)n3 + 2nn2 =
(8∕3)n3 FLOPs if a single PLU factorization is computed and reused. Thus, the cost of solving Ax = b
by explicitly computing A−1 followed by the matrix–vector product A−1 b is several times higher
than that of the factor-solve approach.
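The factor-solve approach can be sketched with scipy's LU routines (an illustration, not from the book); the factorization is computed once and reused for several right-hand sides.

```python
# Factor once with a PLU factorization, then solve for several right-hand sides.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(6)
n = 200
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, 3))        # three right-hand sides

lu, piv = lu_factor(A)                 # O(n^3) factorization, done once
X = lu_solve((lu, piv), B)             # O(n^2) per right-hand side
print(np.allclose(A @ X, B))           # True
```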
A positive definite matrix A ∈ 𝕊n++ can be factorized as A = LLT , where L is lower triangular with positive diagonal entries. This is known as a Cholesky factorization, and it may equivalently be written as A = L̄DL̄T ,
where D = diag(L11² , … , Lnn² ) and L̄ = LD^{−1∕2} is unit lower triangular. The cost of computing the
Cholesky factorization is roughly (1∕3)n³ FLOPs.
is invariant under symmetric permutations. One approach to overcoming this issue, which is due to
[25], is to allow 1-by-1 and 2-by-2 pivot blocks. The resulting factorization is of the form
PAPT = LDLT , (2.46)
where D is a block diagonal matrix that contains the pivot blocks, and the cost is roughly (1∕3)n3
FLOPs.
2.11.4 QR Factorization
A matrix A ∈ ℝm×n with linearly independent columns can be decomposed into the product of an
orthogonal matrix Q ∈ ℝm×m and a matrix R ∈ ℝm×n with zeros below its diagonal, i.e.
A = QR = [Q1 Q2] [R1 ; 0] = Q1 R1 ,    (2.47)
where Q1 ∈ ℝm×n is the first n columns of Q, and R1 ∈ ℝn×n is upper triangular with nonzero diago-
nal elements. Such a factorization is called a QR factorization and can be computed in several ways,
e.g. by applying a sequence of n − 1 Householder transformations to A. The cost is approximately
2mn2 − (2∕3)n3 FLOPs or simply O(mn2 ), and it requires very little additional storage if A is over-
written by R and the vectors that define the n − 1 Householder transformations. Another benefit
of storing Q implicitly in this way is that matrix–vector products with Q and QT can be computed
in O(mn) FLOPs rather than O(m2 ) FLOPs if Q is formed and stored explicitly. We note that Q1 R1
is referred to as a thin or reduced QR factorization of A, and the matrix R1 is unique if we require
the diagonal of R1 to be positive.
A factorization of the form (2.47) yields an orthogonal basis for both the range of A and the
nullspace of AT . Specifically, the columns of Q1 form an orthogonal basis for the range of A,
whereas the columns of Q2 form an orthogonal basis for the nullspace of AT . This observation
allows us to characterize the solution set to a system of underdetermined equations. Specifically,
if A ∈ ℝm×n with rank(A) = m, then the set of solutions to Ax = b may be expressed in terms of a
QR factorization AT = QR as
{Q1 R1^{−T} b + Q2 z ∣ z ∈ ℝ^{n−m} }.
This follows directly from (2.40) by noting that
A† = AT (AAT )^{−1} = Q1 R1 (R1T R1 )^{−1} = Q1 R1^{−T} .
The QR factorization can also be used to compute a Cholesky factorization of a matrix of the form
AT A, where A ∈ ℝm×n and rank(A) = n. Indeed, if A = Q1 R1 , then AT A = RT1 R1 is the Cholesky
factorization of AT A, provided that R1 is chosen such that its diagonal is positive.
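As a sketch (not from the book), a reduced QR factorization can be used to solve a full-column-rank least-squares problem minimize ||Ax − b||₂ by back substitution on R1 x = Q1ᵀ b:

```python
# Solve a least-squares problem with a reduced QR factorization A = Q1 R1.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)

Q1, R1 = np.linalg.qr(A, mode='reduced')          # Q1: 50x4 with orthonormal columns, R1: 4x4
x = solve_triangular(R1, Q1.T @ b, lower=False)   # back substitution on R1 x = Q1^T b

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```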
Finally, by combining the QR factorization with column pivoting, it can be applied to general
matrices A ∈ ℝm×n with linearly dependent columns. The result is a factorization of the form
A = QRP,
where Q ∈ ℝm×m is orthogonal, R ∈ ℝm×n is a matrix with zeros below the diagonal, and P is a per-
mutation matrix, which is commonly chosen so that the diagonal elements of R are nonincreasing.
The QR factorization with column pivoting is sometimes called a QRP factorization, and it is usu-
ally the first step of a so-called rank-revealing QR factorization, which can be used to compute the
numerical rank of a matrix.
where the permutation matrices P1 and P2 affect both the stability of the factorization and the spar-
sity of the triangular matrices L and U. To illustrate the basic principle, we consider the somewhat
simpler case where A is symmetric and positive definite. This implies that we can use a sparse
Cholesky factorization
PT AP = LLT , (2.48)
where P is a permutation matrix that determines the elimination order and affects the sparsity of
L. Since A ∈ 𝕊n++ is strongly factorizable, P can be chosen without taking numerical stability into
account, and hence, it can be chosen based on the sparsity pattern of A. Figure 2.5 shows an example
of a sparsity pattern and the corresponding sparsity graph, which is a graph with n nodes and an
edge for every off-diagonal nonzero element.
The sparsity pattern of L can be determined by means of a symbolic factorization of the sparsity
pattern of PT AP. Only the location of the nonzero entries of PT AP are needed for this step. Nonzero
entries in L that are not present in PT AP are referred to as fill-in. The example in Figure 2.6 illus-
trates this for two different symmetric permutations of an “arrow” sparsity pattern. It is clear from
the figure that the elimination order has a significant effect on the amount of fill-in. A large amount
of fill-in is undesirable, since additional nonzero entries in L lead to additional FLOPs. Unfortu-
nately, the problem of computing the minimum fill-in is NP-complete, but several fill-in reducing
heuristics exist that often work well in practice. We note that there exists a zero fill-in elimination
order if and only if the sparsity graph is a so-called chordal graph [106]. An elimination order with
zero fill-in is referred to as a perfect elimination order, and the corresponding symmetric permuta-
tion PT AP has the same sparsity pattern as that of L + LT .
Figure 2.6 Symbolic Cholesky factorizations of two different symmetric permutations of A. The entries in L
that are marked by ⊠ are fill-in.
Consider a block-partitioned matrix M = [A B ; C D], where A ∈ ℝ^{n1×n1} and D ∈ ℝ^{n2×n2} are square. If A is nonsingular, then M may be expressed as
M = [I 0 ; CA^{−1} I] [A 0 ; 0 D − CA^{−1}B] [I A^{−1}B ; 0 I].    (2.49)
This is a block LDU factorization of M, i.e. M is expressed as a product of a block unit lower
triangular matrix, a block diagonal matrix, and a block unit upper triangular matrix. Similarly, if
D is nonsingular, then M may be expressed as
M = [A B ; C D] = [I BD^{−1} ; 0 I] [A − BD^{−1}C 0 ; 0 D] [I 0 ; D^{−1}C I],    (2.50)
which is a block UDL factorization.
The determinant of M can be expressed in terms of the block factorizations (2.49) and (2.50) if A
or D is nonsingular. Using the fact that the determinant of a triangular matrix with a unit diagonal
is equal to 1 and the fact that det (bdiag(A1 , A2 )) = det (A1 ) det (A2 ) if A1 and A2 are square matrices,
we have that
A nonsingular ⟹ det (M) = det (A) det (D − CA−1 B), (2.51a)
D nonsingular ⟹ det (M) = det (D) det (A − BD−1 C). (2.51b)
It follows directly from (2.51a) that if A is nonsingular, then D − CA−1 B is nonsingular if and
only if M is nonsingular. Similarly, if D is nonsingular, then the Schur complement of D in M,
A − BD−1 C, is nonsingular if and only if M is nonsingular. These observations can be used to derive
the Weinstein–Aronszajn identity, also known as Sylvester’s determinant identity, which states that
det (I + BC) = det (I + CB). (2.52)
Indeed, this identity follows directly from (2.51) by letting A = I and D = −I such that
det (M) = det (A) det (D − CA^{−1} B) = (−1)^{n2} det (I + CB)
and
det (M) = det (D) det (A − BD^{−1} C) = (−1)^{n2} det (I + BC).
The identity is particularly useful if n1 ≫ n2 or n2 ≫ n1 so that either I + CB or I + BC is much
smaller than the other. For example, in the special case, where BC = u𝑣T is a rank-1 matrix, the
identity reduces to det (I + u𝑣T ) = 1 + 𝑣T u.
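A quick numerical sanity check of (2.52) for random matrices (illustrative only):

```python
# Verify the Weinstein-Aronszajn identity det(I + BC) = det(I + CB) numerically.
import numpy as np

rng = np.random.default_rng(8)
n1, n2 = 100, 3
B = rng.standard_normal((n1, n2))
C = rng.standard_normal((n2, n1))

lhs = np.linalg.det(np.eye(n1) + B @ C)           # determinant of a 100 x 100 matrix
rhs = np.linalg.det(np.eye(n2) + C @ B)           # determinant of a 3 x 3 matrix
print(np.isclose(lhs, rhs))                       # True
```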
The block factorizations (2.49) and (2.50) can also be used to derive explicit expressions for the
blocks of the inverse of M. It follows from (2.49) that if M and A are nonsingular, then M −1 is
given by
[A B ; C D]^{−1} = [I A^{−1}B ; 0 I]^{−1} [A 0 ; 0 D − CA^{−1}B]^{−1} [I 0 ; CA^{−1} I]^{−1}
= [I −A^{−1}B ; 0 I] [A^{−1} 0 ; 0 (D − CA^{−1}B)^{−1}] [I 0 ; −CA^{−1} I]
= [A^{−1} + A^{−1}B(D − CA^{−1}B)^{−1}CA^{−1}   −A^{−1}B(D − CA^{−1}B)^{−1} ; −(D − CA^{−1}B)^{−1}CA^{−1}   (D − CA^{−1}B)^{−1}].    (2.53)
Similarly, if M and D are nonsingular, then (2.50) implies that M −1 can be expressed as
[A B ; C D]^{−1} = [I 0 ; D^{−1}C I]^{−1} [A − BD^{−1}C 0 ; 0 D]^{−1} [I BD^{−1} ; 0 I]^{−1}
= [I 0 ; −D^{−1}C I] [(A − BD^{−1}C)^{−1} 0 ; 0 D^{−1}] [I −BD^{−1} ; 0 I]
= [(A − BD^{−1}C)^{−1}   −(A − BD^{−1}C)^{−1}BD^{−1} ; −D^{−1}C(A − BD^{−1}C)^{−1}   D^{−1} + D^{−1}C(A − BD^{−1}C)^{−1}BD^{−1}].    (2.54)
Now, if both A and D are nonsingular, then the (1,1) block of (2.53) and that of (2.54) must be equal,
i.e.
(A − BD−1 C)−1 = A−1 + A−1 B(D − CA−1 B)−1 CA−1 . (2.55)
Substituting (W, −U, V T , Z^{−1} ) for (A, B, C, D), where W and Z are nonsingular, yields the so-called
Sherman–Morrison–Woodbury (SMW) identity. This is also known as the matrix inversion lemma:
(W + UZV T )−1 = W −1 − W −1 U(Z −1 + V T W −1 U)−1 V T W −1 . (2.56)
This identity is often useful when solving a system of equations of the form (2.44), where the
coefficient matrix is a Schur complement. For example, using (2.56), the solution to the system
(A − BD−1 C)x = r can be expressed as
x = A−1 r + A−1 B(D − CA−1 B)−1 CA−1 r,
and can be computed as follows:
1. Compute u and Y by solving Au = r and AY = B.
2. Form S = D − CY and compute 𝑣 by solving S𝑣 = Cu.
3. Compute x = u + Y 𝑣.
This approach is particularly advantageous when the order of S is much smaller than that of A and
A is simple or cheap to factorize, e.g. a diagonal matrix.
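The three-step procedure above can be sketched in a few lines (an illustration, not from the book), here with a diagonal A so that solves with A are cheap:

```python
# Solve (A - B D^{-1} C) x = r via the Sherman-Morrison-Woodbury identity (2.55),
# factorizing only A (diagonal here) and the small matrix S = D - C A^{-1} B.
import numpy as np

rng = np.random.default_rng(9)
n, p = 500, 5
d = 1.0 + rng.random(n)                           # A = diag(d), cheap to "invert"
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, n))
D = 10.0 * np.eye(p)
r = rng.standard_normal(n)

u = r / d                                         # step 1: A u = r
Y = B / d[:, None]                                #         A Y = B
S = D - C @ Y                                     # step 2: small p x p Schur complement
v = np.linalg.solve(S, C @ u)
x = u + Y @ v                                     # step 3

print(np.allclose((np.diag(d) - B @ np.linalg.solve(D, C)) @ x, r))   # True
```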
The approach that we have just outlined relies on the assumption that H is positive definite, and
hence, another approach is needed if H is only positive semidefinite.
Now, given a matrix-valued function F ∶ ℝ → ℝm×n that is differentiable, we define the matrix
of first-order derivatives
dF(x)∕dx = dF∕dx = [dF11∕dx ⋯ dF1n∕dx ; ⋮ ⋱ ⋮ ; dFm1∕dx ⋯ dFmn∕dx],    (2.67)
where the function Fij (x) is the (i, j)th element of F(x). Similarly, given a differentiable function
f ∶ ℝm×n → ℝ, we define the matrix of first-order partial derivatives
𝜕f (X)∕𝜕X = 𝜕f∕𝜕X = [𝜕f∕𝜕X11 ⋯ 𝜕f∕𝜕X1n ; ⋮ ⋱ ⋮ ; 𝜕f∕𝜕Xm1 ⋯ 𝜕f∕𝜕Xmn].    (2.68)
The partial derivatives of a composite function f = g ∘ h, where g ∶ ℝp → ℝm and h ∶ ℝn → ℝp
are differentiable functions, can be expressed in terms of the chain rule as
𝜕f∕𝜕xT = 𝜕g(h(x))∕𝜕xT = (𝜕g(y)∕𝜕yT)|_{y=h(x)} 𝜕h(x)∕𝜕xT .    (2.69)
For differentiable functions f ∶ ℝn → ℝ and g ∶ ℝn → ℝ, the product rule takes the form
𝜕(f (x)g(x))∕𝜕x = (𝜕f∕𝜕x) g(x) + f (x) (𝜕g∕𝜕x) = ∇f (x)g(x) + f (x)∇g(x),    (2.70)
whereas for f ∶ ℝn → ℝm and g ∶ ℝn → ℝ, we have that
𝜕(f (x)g(x))∕𝜕xT = (𝜕f∕𝜕xT) g(x) + f (x) (𝜕g∕𝜕xT).    (2.71)
Example 2.3 We now illustrate the use of the chain rule for the special case where f = g ∘ h is
a composition of a differentiable function g ∶ ℝp → ℝm and a linear function h(x) = Ax, where
A ∈ ℝp×n . We have that 𝜕h∕𝜕xT = A, and hence, the chain rule yields
𝜕f∕𝜕xT = (𝜕g(y)∕𝜕yT)|_{y=Ax} A.
If f is real-valued, i.e. m = 1, then this is equivalent to the identity ∇f (x) = AT ∇g(Ax), and if g is
also twice differentiable, then the Hessian of f is
∇2 f (x) = AT ∇2 g(Ax)A.
Example 2.4 Consider the function f ∶ ℝn → ℝ defined as f (x) = ln(e^{x1} + · · · + e^{xn} ),
which is known as the log-sum-exp function. To derive the partial derivatives of f using the chain
rule, we start by expressing f as f = g ∘ h, where g(y) = ln(𝟙T y) and h(x) = (e^{x1} , … , e^{xn} ). We have that
𝜕g∕𝜕yT = (1∕(𝟙T y)) 𝟙T and 𝜕h∕𝜕xT = diag(h(x)), and hence, it follows that
𝜕f∕𝜕xT = (1∕(𝟙T h(x))) 𝟙T diag(h(x)) = (1∕(𝟙T h(x))) h(x)T .
The Hessian of f now follows from the product rule (2.71), i.e.
∇²f (x) = (1∕(𝟙T h(x))) diag(h(x)) − (1∕(𝟙T h(x))²) h(x)h(x)T = diag(∇f (x)) − ∇f (x)∇f (x)T .
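The expressions in Example 2.4 can be checked numerically (an illustration, not from the book) by comparing the closed-form gradient with a finite-difference approximation:

```python
# Gradient and Hessian of the log-sum-exp function, checked by finite differences.
import numpy as np

def f(x):
    return np.log(np.sum(np.exp(x)))

def grad(x):
    h = np.exp(x)
    return h / np.sum(h)                          # grad f(x) = h(x) / (1^T h(x))

def hess(x):
    g = grad(x)
    return np.diag(g) - np.outer(g, g)            # diag(grad f) - grad f grad f^T

x = np.array([0.3, -1.2, 0.7])
eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad(x), fd_grad, atol=1e-6), np.allclose(hess(x), hess(x).T))
```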
Example 2.5 Recall the Laplace expansion (2.8) of the determinant of a square matrix A ∈ ℝn×n
along the jth column, i.e.
\[
\det(A) = \sum_{i=1}^{n} c_{ij}\, a_{ij},
\]
where cij = (−1)i+j Mij is the cofactor of the (i, j)th entry of A. None of the cofactors c1j , … , cnj are a
function of aij , and hence, it follows that
\[
\frac{\partial}{\partial a_{ij}}\det(A) = \frac{\partial}{\partial a_{ij}}\left(\sum_{k=1}^{n} c_{kj}\, a_{kj}\right) = c_{ij}.
\]
This result can be used to derive an expression for the derivative of det (A(t)), where A ∶ ℝ → ℝn×n ,
i.e.
\[
\frac{d}{dt}\det(A(t)) = \sum_{i=1}^{n}\sum_{j=1}^{n} \frac{\partial \det(A(t))}{\partial a_{ij}(t)}\, \frac{d a_{ij}(t)}{dt} = \sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}\, \frac{d a_{ij}(t)}{dt}.
\]
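As a hedged numerical illustration, the sketch below checks this derivative formula on a made-up matrix function $A(t) = A_0 + tA_1$, using that the cofactor matrix of a nonsingular matrix equals $\det(A)$ times the inverse transpose:

```python
import numpy as np

# Sketch: check d/dt det(A(t)) = sum_ij c_ij * da_ij/dt for A(t) = A0 + t*A1.
A0 = np.array([[2.0, 1.0], [0.5, 3.0]])
A1 = np.array([[0.1, -0.3], [0.7, 0.2]])
A = lambda t: A0 + t * A1                        # so dA/dt = A1

t = 0.4
At = A(t)
cof = np.linalg.det(At) * np.linalg.inv(At).T    # cofactor matrix of A(t)
analytic = np.sum(cof * A1)

eps = 1e-6
numeric = (np.linalg.det(A(t + eps)) - np.linalg.det(A(t - eps))) / (2 * eps)
print(np.isclose(analytic, numeric))             # True
```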
Exercises
2.2 Show that the operator norm and the Frobenius norm on ℝm×n are orthogonally invariant,
i.e. if Q1 ∈ ℝm×m and Q2 ∈ ℝn×n are orthogonal matrices, then it holds that
2.3 Show that the infinity norm of a matrix A ∈ ℝm×n may be expressed as
2.4 Let x ∈ ℝn be a nonzero vector. Find a vector 𝑣 ∈ ℝn such that the Householder matrix
\[
H = I - 2\,\frac{vv^T}{v^Tv}
\]
maps x to ||x||2 e1 , i.e.
\[
Hx = x - 2\,\frac{v^Tx}{v^Tv}\, v = \|x\|_2 e_1.
\]
2.6 Show that the inverse of a nonsingular, lower-triangular Toeplitz matrix whose first column
is given by a0 , … , an−1 is another lower-triangular Toeplitz matrix.
2.7 Show that a lower-triangular Toeplitz matrix T of order n commutes with the lower shift
matrix S of order n, i.e. ST = TS.
2.8 Show that the trace of a symmetric matrix A ∈ 𝕊n is equal to the sum of its eigenvalues, i.e.
\[
\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i,
\]
where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of A.
2.9 Show that the smallest eigenvalue of X ∈ 𝕊n is greater than or equal to t ∈ ℝ if and only if
X − tI ∈ 𝕊n+ .
2.11 Show that a matrix Z ∈ ℝm×n has rank at most r if and only if there exist matrices X ∈ 𝕊m
and Y ∈ 𝕊n such that
\[
\mathrm{rank}\, X + \mathrm{rank}\, Y \le 2r, \qquad \begin{bmatrix} X & Z \\ Z^T & Y \end{bmatrix} \succeq 0.
\]
(a) Show that the product can be expressed as the Hankel matrix
\[
\begin{bmatrix}
h_1 & h_2 & h_3 & \dots & h_{N-1} & h_N \\
h_2 & h_3 & h_4 & \dots & h_N & h_{N+1} \\
h_3 & h_4 & h_5 & \dots & h_{N+1} & h_{N+2} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
h_{N-1} & h_N & h_{N+1} & \dots & h_{2N-3} & h_{2N-2} \\
h_N & h_{N+1} & h_{N+2} & \dots & h_{2N-2} & h_{2N-1}
\end{bmatrix},
\]
where hk = CAk−1 B for k ∈ ℕ2N−1 are the first 2N − 1 so-called impulse response coeffi-
cients or Markov-parameters of the dynamical system.
(b) Show that the Markov parameters are invariant under a state-transformation as in
Exercise 2.12
(c) Suppose has full column rank, in which case, the dynamical system is said to be
observable. Moreover, assume that has full row rank, which means that the dynamical
system is controllable. What is then the rank of ?
(d) Suppose that instead of the matrices (A, B, C), we are given the Markov parameters
(h1 , h2 , … , h2N−1 ) for some N > n. Describe a numerical procedure for finding matri-
ces (A, B, C) such that hk = CAk−1 B for k ∈ ℕ2N−1 . How can you find (A, B, C) such that
the corresponding dynamical system is both observable and controllable?
Y = X + U,
Y Π⟂ = XΠ⟂ .
[ ]
X
(b) Assume that has full row rank and that has full column rank. Show that
U
( )
rank Y Π⟂ = n,
( )
Y Π⟂ = .
where L11 ∈ ℝq×q , L22 ∈ ℝp×p , and Q1 Q2 T = 0. Such a factorization can be obtained from
[ ]
a QR-factorization of U T Y T . Show that Y Π⟂ = L22 Q2 .
2.18 Let X ∈ ℝm×m and assume that det X > 0. Show that
\[
\frac{\partial}{\partial X} \ln \det X = X^{-T}.
\]
Hint: Use the chain rule and Jacobi’s formula.
2.22 Given a matrix A ∈ ℝm×n , the projection of a point x ∈ ℝm onto (AT ) can be expressed as
Px, where P = I − AA† is a projection matrix. If we assume that rank(A) = n, then P can be expressed as
\[
P = I - A(A^TA)^{-1}A^T.
\]
Now, suppose that A is a function of a scalar parameter t.
(a) Show that
\[
\frac{dP}{dt} = -P\frac{dA}{dt}A^{\dagger} - (A^{\dagger})^T\frac{dA^T}{dt}P.
\]
(b) Suppose rank(A) = n and let A = Q1 R1 be a reduced QR decomposition of A, i.e.
Q1 ∈ ℝm×n and R1 ∈ ℝn×n . Show that the projection Px and the derivative dP∕dt can be
evaluated efficiently using such a QR decomposition of A without explicitly computing
A† = (AT A)−1 AT .
3 Probability Theory
In this chapter, we will discuss the basics of probability theory. It is a branch of mathematics where
uncertain events are given a number between zero and one to describe how likely they are. Loosely
speaking this number should be close to the relative frequency with which the event occurs when
it is repeated many times. As an example, we may consider throwing a fair dice one hundred
times and recording how many times a one occurs. If it occurs 18 times, the relative frequency
is 18∕100 = 0.18. This is close to the theoretical value of the probability which is 1/6. The reason
we know it should be 1/6 is that all the possible six outcomes of the experiment should have the
same probability if the dice is fair. In case a probability is one, we are almost sure the event will
occur, and if it is zero, we are almost sure it will not occur.
The roots of probability theory go back to the Arab mathematician Al-Khalil who studied
cryptography. Initially, probability theory only considered combinatorial problems. The theory is
much easier for this case as compared to the case when the number of events is not countable.
Mathematicians struggled for many years to provide a solid foundation, and it was not until in
1933 when Andrey Nikolaevich Kolmogorov made an axiomatic definition of probabilities that
the problem was resolved, and modern probability theory was born. We are however not going
to provide the details of the measure theoretic foundations of probability theory in this chapter;
the interested reader is referred to, e.g. [98]. The presentation given here is more in line with [48].
Probability theory is the foundation for statistics and learning and used in many other branches of
science.
A probability space is defined by a triplet (Ω, , ℙ). Here Ω is called the sample space, and it is
a set that contains all possible outcomes of an experiment. When throwing a dice, we can take
Ω = {1, 2, 3, 4, 5, 6}. However, if we are only interested in if the number is odd or even, we could
instead take the sample space to be Ω = {odd, even}. For other experiments, it may be more appro-
priate to have Ω = ℝ. An example of this is when the experiment is the error with which we measure
something. The sample space could also contain vectors. If we throw two dice, it is appropri-
ate to consider Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}. We may sometimes have infinite-dimensional
vectors, e.g. Ω = ℝℤ , i.e. the set of real-valued functions defined on the integers, cf . the nota-
tion section. Here the dimensions are countable, and ℤ might be interpreted as a set of discrete
time indices. It could also be the case that Ω = ℝℝ , in which case, the sample space contains all
real-valued functions of a real variable. Sample spaces containing functions have applications in
control and signal processing. We also encounter examples where $\Omega = \mathbb{R}^{\mathbb{R}^n}$, i.e. the sample space
contains all real-valued functions defined on ℝn , which has applications in so-called “Gaussian
processes” which we will discuss in more detail in Chapter 9.
The second element of a probability space should contain all events that we are interested in
assigning probabilities to. This is a set of subsets of Ω, and it should be a so-called 𝜎-algebra, i.e. the
following properties must hold
1. Ω ∈
2. Ω∖A ∈ , ∀A ∈
3. ∀A1 , A2 , … ∈ ⇒ $\bigcup_{i=1}^{\infty} A_i$ ∈ .
The latter two conditions say that is closed under complement and under countable unions.
The difference between algebra and 𝜎-algebra is that for an algebra, only finite unions are consid-
ered in the last condition. For finite sample spaces, there is no difference since is then also a finite
set. It can then at most contain all subsets of Ω. The smallest possible 𝜎-algebra is = {Ω, ∅}.
Example 3.1 Let us again consider the example of throwing a fair dice, and we take
Ω = {1, 2, 3, 4, 5, 6}. We are then interested in the events odd and even. Hence, we define
the sets Aodd = {1, 3, 5}, and Aeven = {2, 4, 6}, which are the sets containing the odd and even num-
bers, respectively. We may then take the 𝜎-algebra to be = {Aodd , Aeven , Ω, ∅}. It is straightforward
to verify that this is a 𝜎-algebra.
When we carry out an experiment, like throwing a fair dice, we say that we observe an outcome
of the experiment, and this will be an element of the 𝜎-algebra, e.g. the number is either odd or even
as in the example above. We then say that this is the outcome of the experiment or the observation
of the experiment.
The first condition is that ℙ is normalized, and the second condition is that ℙ is what is called
𝜎-additive. Well-known properties such as ℙ[A ∪ B ] = ℙ[A] + ℙ[B ] − ℙ[A ∩ B ] all follow from
the above axioms, see Exercise 3.1. Notice that when Ω contains uncountably many elements, say
Ω = ℝ, it is not possible to take to contain all subsets of Ω and define ℙ to satisfy the second
condition above. It means that it is not possible to consider any subsets of Ω as the events of the
experiment in a meaningful way. This puzzled the mathematicians for many years. One usually
restricts oneself to the so-called Borel 𝜎-algebra, which is the smallest 𝜎-algebra that contains all
intervals of ℝ. All sets in the Borel 𝜎-algebra can be formed from open sets, or equivalently, from
closed sets, through the operations of countable union, countable intersection, and relative com-
plement. It then follows that it is possible to satisfy the second condition in the definition of the
probability measure. We will not discuss how to generalize this to more complicated sample spaces
like Ω = ℝℝ .
Example 3.2 We consider the case of Ω = ℕn , and we define the probability function p ∶ ℕn →
[0, 1] as
\[
p(k) = \begin{cases} \dfrac{e^{z_k}}{1 + \sum_{l=1}^{n-1} e^{z_l}}, & k \in \mathbb{N}_{n-1}, \\[2ex] \dfrac{1}{1 + \sum_{l=1}^{n-1} e^{z_l}}, & k = n, \end{cases}
\]
for given zk ∈ ℝ, k ∈ ℕn−1 , which is called the categorical probability function. We will see that this
is used in what is called logistic regression analysis. It is straightforward to verify that this is a valid
probability function for any values of zk .
The categorical probability function is an example of a finite probability function since the set ℕn
is finite. When p is a function of an integer, we often write pk instead of p(k). Then we may also use
a vector p = (p1 , … , pn ) ∈ [0, 1]n instead of a function p to describe the probability function.
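A short Python sketch of this probability function, with made-up parameter values z, could be:

```python
import numpy as np

# Sketch of the categorical probability function from Example 3.2: the first
# n-1 probabilities are parameterized by z_1, ..., z_{n-1}, and p_n takes the rest.
def categorical_probs(z):
    denom = 1.0 + np.sum(np.exp(z))
    return np.append(np.exp(z), 1.0) / denom

p = categorical_probs(np.array([0.5, -1.0, 2.0]))   # here n = 4
print(p, p.sum())                                    # entries in [0, 1], summing to 1
```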
\[
\mathbb{P}[A] = \int_A dF(\omega).
\]
When F is differentiable,¹ the derivative f ∶ ℝ → ℝ+ of F exists, and we may write
\[
\mathbb{P}[A] = \int_A f(\omega)\, d\omega.
\]
The function f is called the probability density function (pdf). It is straightforward to generalize the
results to Ω = ℝn .
We defer giving examples of pdfs until we have introduced random variables.
We are now interested in the probability of an event A if we know that another event B has
happened. As an example, we may be interested in the probability that we obtain a one when
we throw a fair dice if we already know that the outcome of the experiment was an odd number,
i.e. either one, three, or five. In this example, A = {1} and B = {1, 3, 5}. The whole sample space is
Ω = {1, 2, 3, 4, 5, 6}. Clearly, we can look at a smaller sample space defined by B, i.e. what we know
1 It is actually enough to assume that F is absolutely continuous, and then the derivative exists almost everywhere.
has happened, and then we investigate how frequent A is in this sample space, and we obtain the
probability 1∕3, i.e. A contains one of the three elements in B, which are all equally likely.
Another way of computing this value is to go back to the original sample space Ω and compute
\[
\frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]} = \frac{1/6}{3/6} = \frac{1/6}{1/2} = \frac{1}{3},
\]
i.e. we normalize the probability of both events occurring with the probability of the event that we
know has occurred. If ℙ[B ] > 0, we define the conditional probability that A occurs given that B
has occurred as
\[
\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}.
\]
From this it immediately follows that
\[
\mathbb{P}[A \cap B] = \mathbb{P}[A \mid B]\,\mathbb{P}[B] = \mathbb{P}[B \mid A]\,\mathbb{P}[A], \tag{3.1}
\]
and hence, it follows that
\[
\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A]\,\mathbb{P}[B \mid A]}{\mathbb{P}[B]}, \tag{3.2}
\]
for any A, B ∈ such that ℙ[A] > 0 and ℙ[B ] > 0. This is called Bayes’ theorem.
Let Ai ∈ , i ∈ ℕn be pairwise disjoint sets such that Ω = ∪ni=1 Ai . Then for any X ∈ , it holds
that
\[
\mathbb{P}[X] = \sum_{i=1}^{n} \mathbb{P}[A_i]\,\mathbb{P}[X \mid A_i], \tag{3.3}
\]
which is called the formula of total probability. This is proven in Exercise 3.3.
Moreover, by (3.1), we have $\mathbb{P}[A_i \cap X] = \mathbb{P}[A_i]\,\mathbb{P}[X \mid A_i]$. From what has been said above and
from (3.2) it follows that
\[
\mathbb{P}[A_i \mid X] = \frac{\mathbb{P}[A_i]\,\mathbb{P}[X \mid A_i]}{\sum_{j=1}^{n} \mathbb{P}[A_j]\,\mathbb{P}[X \mid A_j]}, \quad i \in \mathbb{N}_n.
\]
Here we have tacitly assumed that all involved events have nonzero probability.
Example 3.3 In a factory, the same items are manufactured at three different machines with a
proportion given by 15% for machine 1, 45% for machine 2, and 40% for machine 3. The different machines produce defective items with probabilities 0.05, 0.04, and 0.03, respectively. Customers
obtain a perfect mix of items from the different machines. We denote by A1 , A2 , and A3 the events
that an item is manufactured by machine 1, 2, and 3, respectively, and we denote by X the event
that an item in the mix sent to customers is defective. Then, by the formula of total probability, we have that
\[
\mathbb{P}[X] = \sum_{i=1}^{3} \mathbb{P}[A_i]\,\mathbb{P}[X \mid A_i] = 0.15 \times 0.05 + 0.45 \times 0.04 + 0.40 \times 0.03 = 0.0375.
\]
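This computation is easy to reproduce numerically; the last line of the sketch below also anticipates the posterior probabilities given by Bayes' theorem (cf. Exercise 3.4):

```python
import numpy as np

# The numbers from Example 3.3: machine proportions P[A_i] and defect rates P[X | A_i].
prior = np.array([0.15, 0.45, 0.40])
p_defect = np.array([0.05, 0.04, 0.03])

p_x = np.sum(prior * p_defect)          # formula of total probability
posterior = prior * p_defect / p_x      # P[A_i | X] by Bayes' theorem
print(p_x)                              # 0.0375
print(posterior)                        # [0.2, 0.48, 0.32]
```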
3.3 Independence
If it had not been for the concept of independence, probability theory would not have been a separate
branch of mathematics, but instead just an example of measure theory. We have that the occurrence
of the event B ∈ changes the probability of the event A ∈ to occur from ℙ[A] to ℙ[A ∣ B ].
However, if ℙ[A ∣ B ] = ℙ[A], then this is not the case. An equivalent condition to this is by (3.1)
that
ℙ[A ∩ B ] = ℙ[A] ℙ[B ] ,
and this is what we take as definition of independence of A and B. Notice that this also implies that
ℙ[B ∣ A] = ℙ[B ]. The definition of independence is valid also when ℙ[A] and/or ℙ[B ] are zero.
The relation to conditional probabilities requires that they are positive.
We can also consider a family of events, i.e. = {A1 , … , An } ⊆ . We say that this family is
independent if
\[
\mathbb{P}\left[\bigcap_{i \in J} A_i\right] = \prod_{i \in J} \mathbb{P}[A_i],
\]
for all J ⊆ ℕn . If the family has the property that
\[
\mathbb{P}[A_i \cap A_j] = \mathbb{P}[A_i]\,\mathbb{P}[A_j], \quad \forall\, i \ne j,
\]
we say that the family is pairwise independent, or that the events in the family are pairwise
independent. Independence of the family implies pairwise independence, but the converse is not
necessarily true.
For a family of independent events, the probability that at least one of them happens is given by
\[
\mathbb{P}\left[\bigcup_{i=1}^{n} A_i\right] = 1 - \prod_{i=1}^{n}\left(1 - \mathbb{P}[A_i]\right), \tag{3.4}
\]
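A small Monte Carlo sketch, with made-up probabilities, illustrates (3.4):

```python
import numpy as np

# Sketch: probability that at least one of n independent events occurs, cf. (3.4).
p = np.array([0.1, 0.2, 0.05, 0.3])
exact = 1.0 - np.prod(1.0 - p)

rng = np.random.default_rng(1)
occurred = rng.random((200_000, p.size)) < p     # independent Bernoulli draws
mc = np.mean(occurred.any(axis=1))
print(exact, mc)                                 # the estimate is close to 0.5212
```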
as the distribution function associated with (Ω, , ℙ). Since the sample space can be countable, we
realize that we have now actually defined distribution functions for countable sample spaces as
well. Notice that we never did this explicitly when we talked about countable sample spaces above
in this chapter. The reason was that we had not made any assumption on partial ordering of Ω,2
cf . (4.23a).
For the special case, where = Ω and X(𝜔) = 𝜔, it is in some sense unnecessary to consider X
to be a function, and then we will often just write X ∈ . We will then most often not make any
specific reference to the 𝜎-algebra or the probability measure ℙ either. Instead, we just specify
the distribution function F, or the probability function p or probability density function f defined
on . We tacitly assume that there is a well defined underlying probability measure and 𝜎-algebra.
This will in most cases be sufficient for our purposes.
Example 3.4 We say that a random variable X ∶ ℝn → ℝn defined as X(𝜔) = 𝜔 with pdf
f ∶ ℝn → ℝ+ given by
\[
f(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}}\, \exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right), \tag{3.5}
\]
with 𝜇 ∈ ℝn and Σ ∈ 𝕊n++ has a Gaussian or normal distribution.3 When we want to empha-
size the dependence on the parameters 𝜇 and Σ we use ∶ ℝn × ℝn × 𝕊n++ → ℝ+ defined as
(x, 𝜇, Σ) = f (x).
where F is the distribution function for X = (X1 , X2 ), sometimes called the joint distribution func-
tion. This trivially generalizes to higher dimensions. When F is differentiable, it can be shown that
the marginal probability density functions satisfy
2 For any countable sample space, we can always introduce a partial ordering by associating each element of Ω with
an element in ℤn+ for some n.
3 We remark that it is the random variable that has a Gaussian distribution and that f is not a Gaussian distribution
but a Gaussian pdf.
where f is the joint probability density function. For discrete-valued random variables, similar
formulas hold, but then involving summations instead of integrals. For n-dimensional random
variables, it is possible to look at marginal pdfs of dimension n1 < n by integrating or summing
over the remaining n2 = n − n1 variables.
Example 3.5 Consider a Gaussian random variable Z = (X, Y ) with pdf (z, 𝜇, Σ) for which
\[
\mu = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix} \in \mathbb{R}^{m+n}, \qquad
\Sigma = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix} \in \mathbb{S}_{++}^{m+n}.
\]
It is straightforward to show by integration that X and Y are also Gaussian random variables with
marginal pdfs (x, 𝜇x , Σx ) and (y, 𝜇y , Σy ), respectively.
Here F is the joint distribution function for (X, Y ), and FX and FY the marginal distribution
functions. Similarly, f is the joint pdf, and fX and fY are the marginal pdfs. This result also
trivially generalizes to discrete random variables. When we discuss independence of n > 2 random
variables, we realize that we define this as independence of n events, and that the criteria in terms
of distribution functions and probability density functions are that we can factorize them in n
factors, where these factors are the marginals.
When g is not invertible, obtaining the distribution function for g(X) is more cumbersome.
For continuous random variables, e.g. when = = ℝ, it holds that
\[
F_Y(y) = \mathbb{P}\left[g(X) \le y\right] = \int_{\{x \,\mid\, g(x) \le y\}} f(x)\, dx,
\]
for a small dx > 0, where fX,Y ∶ × → ℝ+ is the joint pdf for (X, Y ) and where fX ∶ → ℝ+ is
the marginal pdf for X. As dx goes to zero, we obtain ℙ[Y ≤ y ∣ X = x ], and hence, we define the
conditional distribution function FY |X ∶ → [0, 1] as
\[
F_{Y|X}(y \mid x) = \int_{-\infty}^{y} \frac{f_{X,Y}(x, v)}{f_X(x)}\, dv.
\]
The conditional probability density function fY |X ∶ → ℝ+ is given by
\[
f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}. \tag{3.7}
\]
Hence, we also obtain the same formula for continuous random variables.
Example 3.6 Consider the case when (X, Y ) is jointly normal, i.e. the pdf is given by
\[
f_{X,Y}(z) = \frac{1}{\sqrt{(2\pi)^{m+n}\det(\Sigma)}}\, \exp\left(-\frac{1}{2}(z - \mu)^T\Sigma^{-1}(z - \mu)\right),
\]
where z = (x, y), 𝜇 = (𝜇x , 𝜇y ), and where
\[
\Sigma = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix}.
\]
From (2.49) we have that
\[
\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix} =
\begin{bmatrix} I & 0 \\ \Sigma_{xy}^T \Sigma_x^{-1} & I \end{bmatrix}
\begin{bmatrix} \Sigma_x & 0 \\ 0 & \Sigma_y - \Sigma_{xy}^T \Sigma_x^{-1}\Sigma_{xy} \end{bmatrix}
\begin{bmatrix} I & \Sigma_x^{-1}\Sigma_{xy} \\ 0 & I \end{bmatrix}.
\]
Notice that
\[
\begin{bmatrix} I & \Sigma_x^{-1}\Sigma_{xy} \\ 0 & I \end{bmatrix}^{-1} =
\begin{bmatrix} I & -\Sigma_x^{-1}\Sigma_{xy} \\ 0 & I \end{bmatrix}.
\]
This factorizes the above pdf as fX,Y (z) = fX (x)fY |X (y|x), where
\[
f_X(x) = \frac{1}{\sqrt{(2\pi)^m \det \Sigma_x}}\, e^{-\frac{1}{2}(x - \mu_x)^T\Sigma_x^{-1}(x - \mu_x)},
\]
and where
\[
f_{Y|X}(y \mid x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma_{y|x}}}\, e^{-\frac{1}{2}(y - \mu_{y|x})^T\Sigma_{y|x}^{-1}(y - \mu_{y|x})},
\]
where
\[
\mu_{y|x} = \mu_y + \Sigma_{xy}^T\Sigma_x^{-1}(x - \mu_x), \qquad \Sigma_{y|x} = \Sigma_y - \Sigma_{xy}^T\Sigma_x^{-1}\Sigma_{xy}.
\]
From Example 3.5, we see that fX is the marginal pdf for X. Hence, it holds by (3.7) that fY |X (y|x) is
the conditional pdf for Y , given X = x. Notice that Σy|x is the Schur complement of Σx in Σ.
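The conditioning formulas translate into a few lines of NumPy; the covariance blocks below are arbitrary stand-ins:

```python
import numpy as np

# Sketch: conditional mean and covariance of Y given X = x_obs for a jointly
# Gaussian (X, Y), following Example 3.6.
mu_x, mu_y = np.array([1.0, 0.0]), np.array([2.0])
Sigma_x = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_xy = np.array([[0.5], [-0.2]])
Sigma_y = np.array([[1.5]])

x_obs = np.array([1.5, -0.5])
mu_cond = mu_y + Sigma_xy.T @ np.linalg.solve(Sigma_x, x_obs - mu_x)
Sigma_cond = Sigma_y - Sigma_xy.T @ np.linalg.solve(Sigma_x, Sigma_xy)  # Schur complement
print(mu_cond, Sigma_cond)
```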
3.6 Expectations
Let us assume that we are interested in estimating how frequent a certain event A ⊆ Ω is when
the experiment is repeated. Hence, it is enough to define = {Ω, ∅, A, Ac }, where Ac = Ω∖A. Let
us assume that ℙ[A] = p and that ℙ[Ac ] = 1 − p. We then define the random variable X ∶ Ω →
{0, 1} as X(𝜔) = 1 when 𝜔 ∈ A and X(𝜔) = 0 when 𝜔 ∉ A. If we repeat the experiment N times,
it is reasonable to estimate the relative frequency of A with the sample average approximation
(SAA)
\[
\frac{1}{N}\sum_{i=1}^{N} X(\omega_i), \tag{3.8}
\]
N i=1
where 𝜔i is the outcome of the ith experiment. We realize that this quantity is very close to
[ ]
0 × ℙ[{𝜔 ∶ X(𝜔) = 0}] + 1 × ℙ[{𝜔 ∶ X(𝜔) = 1}] = 0 × ℙ Ac + 1 × ℙ[A] = p.
Inspired by this, we define the expected value of any discrete random variable X ∶ Ω → ⊆ ℝ as
\[
\mathbb{E}[X] = \sum_{x} x\, p(x),
\]
where p(x) = ℙ[{𝜔 ∶ X(𝜔) = x}].4 The expected value is a functional that maps random variables defined on Ω to the real numbers.
We understand that the expected value is close to the sample average of the random variable for
large values of N. For this reason, we sometimes call the expected value of a random variable the
mean of the random variable. Sometimes, we write 𝔼X instead of 𝔼[X ] to ease the notation.
For continuous random variables X ∶ Ω → ℝ, we define the expected value as
\[
\mathbb{E}[X] = \int_{\mathbb{R}} x f(x)\, dx,
\]
where f is the pdf of the random variable. The generalization to vector-valued random variables is
straightforward. We remark that expected values might be infinite.
3.6.1 Moments
For any scalar-valued random variable X, we define the kth moment of X as
\[
m_k = \mathbb{E}\left[X^k\right],
\]
and the kth central moment of X as $\mu_k = \mathbb{E}\left[(X - m_1)^k\right]$. The moment m1 is the expectation, also called the mean, of X, and 𝜇2 is called the variance of X.
The variance measures the amount by which X tends to deviate from its average. It is often also
denoted by 𝜎 2 or Var[X]. We sometimes write VarX to ease the notation. The positive square root
of the variance, 𝜎, is called the standard deviation. It is straightforward to show that 𝜇2 = m2 − m21 .
4 We need that the sum in the definition of the expectation is absolutely convergent, since we do not want the result
to depend on the order in which we carry out the summation.
For two scalar-valued random variables X and Y with joint pdf fX,Y , we may consider the function
g ∶ ℝ2 → ℝ defined as g(x, y) = x. It then holds that
\[
\mathbb{E}[g(X, Y)] = \int_{\mathbb{R}^2} x f_{X,Y}(x, y)\, dx\, dy = \int_{\mathbb{R}} x f_X(x)\, dx = \mathbb{E}[X],
\]
with fX being the marginal pdf for X. The result trivially generalizes to several random variables.
This shows that we do not need to define the marginal pdf to carry out the expectation computation.
This also means that we never have to distinguish between different definitions of the expectation
functional. It is sufficient to consider the joint pdf for all relevant random variables involved when
defining the expectation functional, independent of how many random variables there are.
3.6.3 Covariance
For two scalar-valued random variables X and Y , the product XY is a special case of a function g of
the two-dimensional random variable (X, Y ). Hence, the expected value of XY is given by
Example 3.7 For a random vector with a Gaussian pdf as in Example 3.4, it holds that the
expected value is 𝜇 and that the covariance is Σ. This result is shown in Exercise 3.8.
\[
\mathbb{E}[\Psi(X)] = \sum_{x}\left(\sum_{y} y\, p_{Y|X}(y \mid x)\right) p_X(x) = \sum_{y} y \sum_{x} p_{X,Y}(x, y) = \sum_{y} y\, p_Y(y) = \mathbb{E}[Y],
\]
where pX,Y ∶ × → [0, 1] is the joint probability function for (X, Y ), and where pX ∶ → [0, 1]
is the marginal probability function for X. The same result holds for continuous random variables.
We often write Ψ(X) = 𝔼[Y | X], and hence, the result can be summarized as
𝔼[𝔼[Y | X]] = 𝔼[Y ] .
This formula is sometimes useful, when 𝔼[Y ] is difficult to compute directly.
Given a probability space (Ω, , ℙ), we consider random variables X1 , X2 , … and X defined on this
probability space and are interested in investigating convergence of Xk to X as k → ∞. To this end,
we will define four different modes of convergence:
(a) $X_k \to X$ almost surely ($X_k \xrightarrow{\text{a.s.}} X$) if $\mathbb{P}\left[\omega \in \Omega \mid X_k(\omega) \to X(\omega),\ k \to \infty\right] = 1$.⁵
(b) $X_k \to X$ in $r$th mean ($X_k \xrightarrow{r} X$) if $\mathbb{E}\left[|X_k - X|^r\right] \to 0$ as $k \to \infty$.⁶
(c) $X_k \to X$ in probability ($X_k \xrightarrow{P} X$) if $\mathbb{P}\left[|X_k - X| > \epsilon\right] \to 0$ as $k \to \infty$ for all $\epsilon > 0$.
(d) $X_k \to X$ in distribution ($X_k \xrightarrow{D} X$) if $\mathbb{P}\left[X_k \le x\right] \to \mathbb{P}[X \le x]$ as $k \to \infty$ for all points $x$ at which $F_X(x) = \mathbb{P}[X \le x]$ is continuous.
The following implications hold for the different modes of convergence:
1. (a) ⇒ (c)
2. (b) ⇒ (c)
3. (c) ⇒ (d)
4. If $r > s \ge 1$, then $X_k \xrightarrow{r} X \Rightarrow X_k \xrightarrow{s} X$.
Without any further assumptions, no other implications hold.
Let X1 , X2 , … be independent identically distributed (i.i.d.) random variables with mean m. Then
\[
\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} m, \quad n \to \infty,
\]
if and only if $\mathbb{E}\left[|X_1|\right] < \infty$. This result is known as the strong law of large numbers. If we assume that
$\mathbb{E}\left[X_1^2\right] < \infty$, then convergence holds both almost surely and in mean square. This assumption is a
sufficient condition for the strong law of large numbers. There is also a weak law of large numbers,
which is related to convergence in probability; we refer the reader to [48] for details.
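A quick simulation sketch of the strong law of large numbers, with an arbitrary i.i.d. distribution, can be helpful to build intuition:

```python
import numpy as np

# Sketch: the running sample average of i.i.d. variables approaches their mean.
rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100_000)          # i.i.d. with mean 2
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[99, 9_999, 99_999]])              # drifts toward 2 as n grows
```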
5 This mode of convergence has two other notations, which are $X_k \to X$ almost everywhere ($X_k \xrightarrow{\text{a.e.}} X$) and $X_k \to X$ with probability one (w.p. 1).
6 When $r = 1$, we say that $X_k \to X$ in mean, and when $r = 2$, we say that $X_k \to X$ in mean square ($X_k \xrightarrow{\text{m.s.}} X$).
X = (X(𝜔, 0), X(𝜔, 1), …) and use the notation Xk (𝜔) = X(k, 𝜔) to ease the notation. We may inter-
pret the random variable X as an infinite sequence of random variables Xk ∶ Ω → , k ∈ ℤ+ , and
we may interpret k ∈ ℤ+ as a discrete time-index. Such a random variable X is often called a
discrete-time random process. Sometimes one says stochastic process instead of random process.
The set could be a finite set like ℕn , a countable set like ℤ or an uncountable set like ℝ. It may
also be vector-valued.
We may observe the evolution of a random process in two different ways. For each fixed outcome
𝜔 ∈ Ω, we obtain a realization or sample path X(𝜔) of X at 𝜔. We can study the properties of this
sample path. Another way of viewing the random process is to investigate a finite subset of compo-
nents of the infinite-dimensional vector X, say K = {k1 , k2 , … , kn } ⊂ ℤ+ . We then look at the joint
distribution function FK ∶ n → [0, 1] defined as
\[
F_K(x) = \mathbb{P}\left[X_{k_1} \le x_1, \dots, X_{k_n} \le x_n\right].
\]
The collection {FK } where K ranges over all finite-dimensional K ⊂ ℤ+ is called the collection of
finite-dimensional distributions (fdds) of X or the name of X. This contains all the information that
is available about X from finitely many components Xk . We mention that knowing the fdds does
not in general provide a complete information about the sample paths. However, we will be content
by studying only properties of the sample path that can be deduced from the fdds.
If we define $X : \Omega \to \mathbb{R}^{\mathbb{R}_+}$, we obtain a continuous-time random process. We then often write X(t)
when it is convenient to make the dependence on t ∈ ℝ+ explicit. The fdds are often denoted FT ,
where T is now a finite subset of ℝ+ . An even more general concept is a random field, which is
obtained when $X : \Omega \to \mathbb{R}^{\mathbb{R}^n}$.
We say that a discrete-time random process is strongly stationary if {Xk1 , … , Xkn } and
{Xk1 +l , … , Xkn +l } have the same joint distribution for all k1 , … , kn and l > 0. We say that a
discrete-time random process is weakly stationary if $\mathbb{E}[X_{k_1}] = \mathbb{E}[X_{k_2}]$ and $\mathrm{Cov}[X_{k_1}, X_{k_2}] =$
Cov[Xk1 +l , Xk2 +l ] for all k1 , k2 and l > 0. In other words, a random process is weakly stationary if
and only if it has a constant mean and the autocovariance function c ∶ ℤ2+ → ℝ given by
c(k, k + l) = Cov[Xk , Xk+l ],
satisfies
c(k, k + l) = c(0, l),
for all k and l ≥ 0. Thus, for weakly stationary processes, we may define the autocovariance function
as a function of only l and write c ∶ ℤ+ → ℝ.
Strong stationarity implies weak stationarity, but the converse is not true in general. One example
where strong stationarity is equivalent to weak stationarity is when the fdds are all Gaussian.
The definitions of weak and strong stationarity for a continuous-time random process are similar
as for a discrete-time random process, and this also goes for a random field.
We will now discuss a generalization of the law of large numbers where the sequence of random
variables Xk are a stationary process, not necessarily i.i.d. If Xk , k ≥ 1, is a strongly stationary process
with 𝔼|X1 | < ∞, then there exists a random variable Y with the same mean as X1 such that
\[
\frac{1}{n}\sum_{k=1}^{n} X_k \to Y \quad \text{a.s. and in mean.}
\]
If instead Xk , k ≥ 1, is a weakly stationary process with 𝔼|X1 | < ∞, then there exists a random
variable Y with the same mean as X1 such that
\[
\frac{1}{n}\sum_{k=1}^{n} X_k \to Y \quad \text{in mean square.}
\]
These results are called the strong ergodic theorem and the weak ergodic theorem, respectively.
a filtration problem or state estimation problem. We will do the derivation for = ℝn and = ℝp .
The derivation for finite sets is similar and obtained by considering probability functions instead
of pdfs and by replacing integrals with summations.
Let X̄ k = (X0 , … , Xk ) and let pX̄ k ,Ȳ k ∶ k+1 × k+1 → ℝ+ be the joint pdf for (X̄ k , Ȳ k ). We also need
the conditional pdf for Ȳ k given X̄ k : pȲ k |X̄ k ∶ k+1 × k+1 → ℝ+ . Using the conditional indepen-
dence assumption, this can be expressed as
\[
p_{\bar{Y}_k \mid \bar{X}_k}(\bar{y} \mid \bar{x}) = \prod_{i=0}^{k} p_{Y_i \mid X_i}(y_i \mid x_i),
\]
where pYi |Xi ∶ × → ℝ+ are the conditional pdfs for Yk given Xk . We also define the marginal
pdf for X̄ as pX̄ ∶ N+1 → ℝ+ .
We start by obtaining an expression for pX0 |Y0 (x0 |y0 ), i.e.
\[
p_{X_0 \mid Y_0}(x_0 \mid y_0) = \frac{p_{Y_0 \mid X_0}(y_0 \mid x_0)\, p_{X_0}(x_0)}{p_{Y_0}(y_0)},
\]
where
\[
p_{Y_0}(y_0) = \int p_{X_0, Y_0}(x_0, y_0)\, dx_0 = \int p_{Y_0 \mid X_0}(y_0 \mid x_0)\, p_{X_0}(x_0)\, dx_0.
\]
In this section, we do not write out the set over which we integrate when it is the whole domain of the functions involved. Now, assume that we know pXk−1 |Ȳ k−1 (xk−1 |̄yk−1 ), where ȳ k = (y0 , … , yk ).
This assumption is true for k = 1. It then follows that
pXk , Xk−1 |Ȳ k−1 (xk , xk−1 |̄yk−1 ) = pXk |Xk−1 ,Ȳ k−1 (xk |xk−1 , ȳ k−1 )pXk−1 |Ȳ k−1 (xk−1 |̄yk−1 ),
= pXk |Xk−1 (xk |xk−1 )pXk−1 |Ȳ k−1 (xk−1 |̄yk−1 ),
where we have made use of the conditional independence property
pXk |Xk−1 ,Ȳ k−1 (xk |xk−1 , ȳ k−1 ) = pXk |Xk−1 (xk |xk−1 ).
This is proven in Exercise 3.16. Integrating over xk−1 results in the following Chapman–Kolmogorov
equation:
\[
p_{Y_k \mid \bar{Y}_{k-1}}(y_k \mid \bar{y}_{k-1}) = \int p_{Y_k, X_k \mid \bar{Y}_{k-1}}(y_k, x_k \mid \bar{y}_{k-1})\, dx_k = \int \frac{p_{\bar{Y}_k, X_k}(\bar{y}_k, x_k)}{p_{\bar{Y}_{k-1}}(\bar{y}_{k-1})}\, dx_k,
\]
\[
= \int p_{Y_k \mid X_k, \bar{Y}_{k-1}}(y_k \mid x_k, \bar{y}_{k-1})\, p_{X_k \mid \bar{Y}_{k-1}}(x_k \mid \bar{y}_{k-1})\, dx_k = \int p_{Y_k \mid X_k}(y_k \mid x_k)\, p_{X_k \mid \bar{Y}_{k-1}}(x_k \mid \bar{y}_{k-1})\, dx_k.
\]
The last equality follows from the conditional independence property. Thus, we may summarize the optimal filtering equations as
\[
p_{X_k \mid \bar{Y}_{k-1}}(x_k \mid \bar{y}_{k-1}) = \int p_{X_k \mid X_{k-1}}(x_k \mid \xi)\, p_{X_{k-1} \mid \bar{Y}_{k-1}}(\xi \mid \bar{y}_{k-1})\, d\xi, \tag{3.11}
\]
where $\bar{H}_k = R_1 + A\Sigma_k A^T$ and $\bar{K}_k = \Sigma_k A^T \bar{H}_k^{-1}$. From the last equality, we obtain
\[
\int p_{X_k \mid X_{k-1}}(x_k \mid x_{k-1})\, p_{X_{k-1} \mid \bar{Y}_{k-1}}(x_{k-1} \mid \bar{y}_{k-1})\, dx_{k-1} = (x_k,\, A x^a_{k-1},\, \bar{H}_{k-1}),
\]
and we obtain from (3.13) and (3.11), the update formula
\[
x^f_k = A x^a_{k-1}, \qquad \Sigma^f_k = \bar{H}_{k-1} = R_1 + A\Sigma_{k-1}A^T. \tag{3.15}
\]
We can, if we like, eliminate $x_k$ and $\Sigma_k$ from (3.14) and (3.15) to obtain
\[
x^f_{k+1} = (A - \tilde{K}_k C)\, x^f_k + \tilde{K}_k y_k,
\]
\[
\Sigma^f_{k+1} = A\Sigma^f_k A^T + R_1 - A\Sigma^f_k C^T\left(R_2 + C\Sigma^f_k C^T\right)^{-1} C\Sigma^f_k A^T,
\]
where $\tilde{K}_k = A\Sigma^f_k C^T\left(R_2 + C\Sigma^f_k C^T\right)^{-1}$. The initial values are $x^f_0 = \bar{x}_0$ and $\Sigma^f_0 = R_0$. The Kalman filter
is summarized in Algorithm 3.1. We see that the algorithm consists of two main steps. In the first
one, the measurement $y_k$, the matrix C, and the covariance $R_2$ are used to update the old predicted estimate $x^f_k$ and its covariance. In the second step, the matrix A and the covariance matrix $R_1$ are used to predict the state and its covariance for the next value of k.
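A minimal Python sketch of the predicted-state recursion above follows; the scalar model at the end is a made-up stand-in for A, C, R1, R2, x̄0, and R0:

```python
import numpy as np

# Sketch of the one-step-ahead Kalman recursion derived above.
def kalman_predict(A, C, R1, R2, x0bar, R0, ys):
    x, Sigma = x0bar.copy(), R0.copy()
    xs = []
    for y in ys:
        K = A @ Sigma @ C.T @ np.linalg.inv(R2 + C @ Sigma @ C.T)
        x = (A - K @ C) @ x + K @ y                          # predicted state update
        Sigma = A @ Sigma @ A.T + R1 - K @ C @ Sigma @ A.T   # predicted covariance update
        xs.append(x)
    return np.array(xs)

# Example usage: a scalar random walk observed in noise.
rng = np.random.default_rng(3)
A, C = np.array([[1.0]]), np.array([[1.0]])
R1, R2 = np.array([[0.01]]), np.array([[1.0]])
x_true, ys = 0.0, []
for _ in range(200):
    x_true += 0.1 * rng.standard_normal()
    ys.append(np.array([x_true]) + rng.standard_normal(1))
x_pred = kalman_predict(A, C, R1, R2, np.zeros(1), np.eye(1), ys)
```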
A prime example of a Gaussian process is the Wiener process W ∶ Ω → ℝℝ+ with W(0) = 0,
m(T) = 0, and Σ(T) having entries Σi,j (T) = 𝜎 2 min (ti , tj ), for some 𝜎 2 > 0.
A Gaussian process is weakly and strongly stationary if and only if 𝔼[X(t)] is constant for all t
and Σ(T + h) = Σ(T), where T + h = {t1 + h, … , tn + h}, for all T and h > 0. A Gaussian process is
a Markov process if and only if
[ ] [ ]
𝔼 X(tn ) | X(t1 ), … , X(tn−1 ) = 𝔼 X(tn ) | X(tn−1 ) ,
for all t1 < · · · < tn . An example of a stationary Gaussian Markov process is the Ornstein–Uhlenbeck
process which has zero mean and autocovariance function c(t) = c(0)e−𝛼|t| for some 𝛼 > 0 and
c(0) > 0. It is also possible to define Gaussian processes $X : \Omega \to \mathbb{R}^{\mathbb{R}^n}$, and we will return to them in Chapter 9.
Exercises
3.1 We are given a probability space (Ω, , ℙ).
(a) Show that ℙ[Ac ] = 1 − ℙ[A] for any A ∈ , where Ac = Ω∖A.
(b) Show that ℙ[∅] = 0.
(c) Show that ℙ[A ∪ B ] = ℙ[A] + ℙ[B ] − ℙ[A ∩ B ] for any A, B ∈ .
3.2 In a collection of 100 items delivered by a company, there are five defective items. We first pick one item at random, and then, out of the remaining 99, we pick another item at random. What is the probability that both items are defective?
3.4 Consider Example 3.3. What is the probability that, if a customer has found a defective item, it was manufactured at machine 1?
3.6 A person is crossing the street 300 000 times in his/her lifetime. The probability of being hit
by a car is 1∕300 000. We consider the different crossings to be independent events. What is
the probability of being hit by a car at least once in a lifetime?
3.7 Consider throwing a fair dice repeatedly many times. We define a random variable
Xi ∶ ℕ6 → ℕ6 with value equal to the value of the dice for the ith experiment. Since the dice
is fair, we have that the probability function pXi ∶ ℕ6 → [0, 1] is defined as pXi (k) = 1∕6.
Define the random variable X ∶ ℕN6 → , where = {1, 1 + 1∕N, 1 + 2∕N, … , 6} via
\[
X = \frac{1}{N}\sum_{i=1}^{N} X_i.
\]
(a) Compute the expected value and the variance of the random variable Xi .
(b) Compute the probability function for (X1 , X2 ).
(c) Compute the probability function for X when N = 2.
(d) Compute the expected value of the random variable X for N = 2 directly using the prob-
ability function above and indirectly by using the probability function for (X1 , X2 ) and
the formula for expected values of functions of random variables in Section 3.6.
(e) Compute the variance of X when N = 2.
3.8 Show that for a random vector X ∶ ℝn → ℝn with Gaussian distribution given by the pdf
\[
f(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}}\, e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)},
\]
it holds that the expected value is 𝜇 and that the covariance is Σ.
3.9 Consider a random variable X with Gaussian distribution with zero mean and variance
I ∈ 𝕊n++ . Let the random variable Y be defined as
Y = AX + b.
Show that this random variable is also Gaussian with mean b and variance AAT .
3.10 Consider a scalar-valued random variable X with a Gaussian distribution with zero mean
and variance 𝜎 2 . Show that
\[
\mathbb{E}\left[\exp\left(\frac{p}{2}(X + m)^2\right)\right] = \frac{1}{\sigma\sqrt{\beta}}\,\exp\left(\frac{\gamma m^2}{2}\right),
\]
3.11 Consider two random variables X ∶ Ω → ℝ and Y ∶ Ω → ℝ. Let Z = (X, Y ). We know the
expected value and the variance of Z, i.e.
\[
\mathbb{E}[Z] = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \qquad \mathrm{Var}[Z] = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}
\]
are known. Consider the random variable defined as M = X + c(Y − 𝜇y ) for an arbitrary
c ∈ ℝ.
(a) Show that 𝔼[M ] = 𝜇x .
(b) Show that Var[M] = Var[X] + c2 Var[Y ] + 2cCov[X, Y ] = 𝜎x2 + c2 𝜎y2 + 2c𝜎xy .
(c) Show that Var[M] is minimized for c⋆ = −𝜎xy ∕𝜎y2 and that the minimal value is given
by (1 − 𝜌2 )𝜎x2 , where
\[
\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}.
\]
3.12 Consider two scalar-valued random variables (X, Y ) with joint pdf
\[
f_{X,Y}(x, y) = \begin{cases} 2, & \text{if } 0 < y \le x < 1, \\ 0, & \text{otherwise.} \end{cases}
\]
Compute the conditional expectation for X given Y = y.
3.13 Let X be a random variable taking the values 0 and 1 with equal probability 1∕2. Also, define
the random variables Xk = X for all k ∈ ℤ+ . Clearly, $X_k \xrightarrow{D} X$, since all random variables
defined have the same distribution. Also define the random variable Y = 1 − X.
(a) Show that $X_k \xrightarrow{D} Y$.
(b) Show that Xk cannot converge to Y in any other mode.
3.14 Consider the Markov process defined in Example 3.8. Assume that Ek are zero mean with Var(Ek ) = Σ. Furthermore, assume that X0 has mean m0 and variance P0 . Show that mk = 𝔼Xk and Pk = VarXk for k ≥ 1 satisfy the recursions
\[
m_{k+1} = A m_k, \qquad P_{k+1} = A^T P_k A + \Sigma.
\]
(y, Xx, R) (x, 𝜇, P) = (y, X𝜇, H) (x, 𝜇 + Y (y − X𝜇), P − XHX T ).
where is defined as in Example 3.4.
Hint: The formula in Exercise 2.10 is useful.
3.16 Consider a HMM (X, Y ) as defined in Section 3.11. Show that the conditional independence
assumption
\[
p_{\bar{Y}_k \mid \bar{X}_k}(\bar{y} \mid \bar{x}) = \prod_{i=0}^{k} p_{Y_i \mid X_i}(y_i \mid x_i),
\]
implies that
(a)
pXk |Xk−1 ,Ȳ k−1 (xk |xk−1 , ȳ k−1 ) = pXk |Xk−1 (xk |xk−1 ),
(b)
pYk |Xk ,Ȳ k−1 (yk |xk , ȳ k−1 ) = pYk |Xk (yk |xk ).
Part II
Optimization
4 Optimization Theory
Mathematical optimization is an indispensable tool in learning and control. Its history goes back
to the early seventeenth century with the work by Pierre de Fermat who obtained calculus-based
formulae for identifying optima. This work was later developed further by Joseph-Louis Lagrange.
In this chapter, we will present the foundations of optimization theory. We will start by defining
what constitutes an optimization problem and introduce some basic concepts and terminology.
We will also introduce the notion of convexity, which allows us to distinguish between convex and
general nonlinear optimization problems. The motivation behind this distinction is that convex
problems are, roughly speaking, easier to solve than general nonlinear ones. We will pay special
attention to properties that are useful for recognizing convexity. Finally, we will also discuss
the concept of duality and see how it is used to derive optimality conditions for optimization
problems. In Chapter 6, we will see how duality also plays an important role in some optimization
methods.
and the feasible set or feasible region is the subset of points in that satisfy all constraints, i.e.
= {x ∈ | f (x) ⪯ 0, h(x) = 0}. (4.6)
A point x is called feasible if it belongs to the feasible set , and x is strictly feasible if it is both
feasible and all inequality constraints are inactive at x, i.e. if fi (x) < 0, i ∈ ℕm . The optimization
problem is said to be feasible if ≠ ∅, and otherwise, the problem is infeasible.
The optimal value or the minimum value of the optimization problem (4.4) is defined as
p⋆ = inf f0 (x). (4.7)
x∈
otherwise, the optimal value is unattained. A point x⋆ is called an optimal point, a minimizer, or
a solution if it is feasible and p⋆ = f0 (x⋆ ), and the set of all optimal points is called the optimal set.
A feasible point x that satisfies f0 (x) ≤ p⋆ + 𝜖 for some 𝜖 > 0 is called 𝜖-suboptimal.
A feasible point x is said to be locally optimal if there exists a constant r > 0 such that
\[
f_0(x) = \inf_z \left\{ f_0(z) \mid z \in \cap B_2(x, r) \right\}, \tag{4.8}
\]
where $B_2(c, r) = \{x \in \mathbb{R}^n \mid \|x - c\|_2 \le r\}$ is the Euclidean ball with center c and radius r. Thus, an optimal point is also locally optimal, but
the converse is not true in general. To emphasize the difference between optimal points and locally
optimal points, we sometimes say that an optimal point is globally optimal. Similarly, we refer to
f0 (x) as a local minimum value if x is locally optimal. The notion of local and global optimality is
illustrated in Figure 4.2 for an unconstrained problem with a continuously differentiable objec-
tive function f0 ∶ ℝ → ℝ. The local extrema of such a function are stationary points, but not all
stationary points are local extrema.
The special case of the optimization problem (4.4), where f0 (x) = 0 for all x ∈ is called a
feasibility problem since solving such a problem amounts to finding any feasible point. The optimal
value is p⋆ = 0 if the problem is feasible, and otherwise, p⋆ = ∞.
A point 𝜃x + (1 − 𝜃)y for some 𝜃 ∈ [0, 1] is called a convex combination of x ∈ ℝn and y ∈ ℝn , and
the set of all convex combinations of x and y is the line segment between x and y. In other words,
the definition of convexity requires that the line segment between every x, y ∈ C is contained in C,
as illustrated in Figure 4.3. We note that a linear combination of the form
\[
\theta_1 x_1 + \dots + \theta_k x_k, \quad k \ge 2, \tag{4.11}
\]
is called
● an affine combination if 𝟙T 𝜃 = 1,
● a conic combination if 𝜃 ∈ ℝk+ ,
● a convex combination if 𝜃 ∈ ℝk+ and 𝟙T 𝜃 = 1.
which is the standard simplex in ℝk . Figure 4.4 illustrates the set of conic and convex combinations
of two points in ℝ2 . A direct consequence of (4.10) is that all convex combinations of any k points
Figure 4.3 A set is convex if the line segment between two points x and y is contained in the set for every
x and y in the set.
Figure 4.4 Convex and conic combinations of two points x1 and x2 . (a) Conic combinations: 𝜃 ∈ ℝ2+ .
(b) Convex combinations: 𝜃 ∈ Δ2 .
where we define
\[
z = \sum_{i=1}^{k} \tilde{\theta}_i x_i, \qquad \tilde{\theta}_i = \frac{\bar{\theta}_i}{\sum_{i=1}^{k} \bar{\theta}_i} = \frac{\bar{\theta}_i}{1 - \bar{\theta}_{k+1}}, \quad i \in \mathbb{N}_k.
\]
This shows that z is a convex combination of x1 , … , xk , and hence, z belongs to C by assumption.
It follows that y ∈ C since it is a convex combination of z and xk+1 .
The dimension of a convex set C ⊆ ℝn is the dimension of its affine hull, i.e.
dim C = dim (aff C). (4.13)
The relative interior of C is the interior of C within the affine hull of C,
relint C = {x ∈ C | ∃ 𝜖 > 0 such that B2 (x, 𝜖) ∩ aff C ⊆ C}, (4.14)
where B2 (x, 𝜖) is the Euclidean ball centered at x and with radius 𝜖.
The convex hull of a set C ⊆ ℝn is the smallest convex set that contains C. Equivalently, it is the
intersection of all convex sets that contain C, i.e.
conv C = ∩{D ⊆ ℝn | D is convex and C ⊆ D}. (4.15)
We note that it can be difficult to identify the convex hull of a set by using this definition.
Carathéodory’s theorem provides a characterization that is often more useful in practice.
It states that every point in conv C can be expressed as a convex combination of at most n + 1
points in C, i.e.
\[
\mathrm{conv}\, C = \left\{ \sum_{i=1}^{n+1} \theta_i x_i \;\middle|\; x_1, \dots, x_{n+1} \in C,\ \theta \in \Delta_{n+1} \right\}. \tag{4.16}
\]
4.2.1.1 Intersection
A fundamental property of convex sets is that the intersection of any number of convex sets is itself
a convex set, i.e. if C𝜏 is a convex set for every 𝜏 ∈ T, then
\[
C = \bigcap_{\tau \in T} C_\tau
\]
is convex. This follows directly from the definition of a convex set by noting that any two points x
and y in C also belong to C𝜏 for all 𝜏 ∈ T, and moreover,
{𝜃x + (1 − 𝜃)y | 𝜃 ∈ [0, 1]} ∈ C𝜏 ,
since C𝜏 is convex.
Figure 4.5 A hyperplane and a halfspace in ℝ2 : (a) hyperplane and (b) halfspace.
Figure 4.7 Examples of norm balls in ℝ2 . The 1-norm ball and the ∞-norm ball are polyhedral sets,
whereas the 2-norm ball and the quadratic norm ball are ellipsoids.
B2 (0, 1) = {x ∈ ℝn | ||x||2 ≤ 1}, but it is only a norm ball if it is an injective linear transformation
of B2 (0, 1). To see this, suppose f ∶ ℝn → ℝn is an injective affine transformation, i.e. f (x) = Cx + d
for some nonsingular C ∈ ℝn×n and d ∈ ℝn . We then have that
f (B2 (0, 1)) = {Cx + d | ||x||2 ≤ 1} = {y | ||C−1 (y − d)||2 ≤ 1}
= {y | ||y − d||A ≤ 1},
which shows that f (B2 (0, 1)) is a norm ball induced by the quadratic norm || ⋅ ||A with A = C−T C−1 .
Figure 4.7 shows some examples of norm balls in ℝ2 .
names in the literature, e.g. the Lorentz cone, the quadratic cone, and the more casual name, the
ice-cream cone. Figure 4.9 shows the second-order cone in ℝ3 .
The cone of symmetric, positive semidefinite matrices of order n is the set
S+n = {X ∈ 𝕊n | uT Xu ≥ 0, ∀ u ∈ ℝn }. (4.21)
It is easy to verify that it is indeed a cone, and convexity follows by noting that for a given u ∈ ℝn ,
the set
{X ∈ 𝕊n | uT Xu ≥ 0}
is a closed halfspace, and hence, S+n can be expressed as the intersection of infinitely many
halfspaces.
The dual cone of a convex cone K ⊆ ℝn is the set
K ∗ = {y ∈ ℝn | xT y ≥ 0, ∀ x ∈ K}. (4.22)
In other words, K ∗ is the set of vectors that form a nonnegative inner product with all vectors in K.
This is illustrated in Figure 4.10. We note that it can be shown that the dual of the dual cone K ∗ is
the closure of K, i.e. K ∗∗ = cl K. Thus, we have that K ∗∗ = K if K is a proper convex cone. A convex
cone K is called self-dual if K = K ∗ . The nonnegative orthant, the second-order cone, and the cone
of positive semidefinite matrices are examples of self-dual cones. Finally, we note that −(K ∗ ) is
called the polar cone of K.
x ≻K y ⟺ x − y ∈ int K. (4.23b)
We note that the notion of convexity can be generalized to vector-valued function by replacing
the scalar inequality in (4.24) by a generalized inequality. Specifically, we say that f ∶ ℝn → ℝm is
convex with respect to a proper convex cone K ⊂ ℝm , or simply K-convex, if dom f is a convex set
and for all x, y ∈ dom f ,
f (𝜃x + (1 − 𝜃)y) ⪯K 𝜃f (x) + (1 − 𝜃)f (y), ∀ 𝜃 ∈ [0, 1]. (4.26)
The inequality (4.24) can also be expressed in terms of the epigraph of f . Indeed, using the defi-
nition of the epigraph, we may express (4.24) as
\[
\theta \begin{bmatrix} x \\ f(x) \end{bmatrix} + (1 - \theta) \begin{bmatrix} y \\ f(y) \end{bmatrix} \in \mathrm{epi}\, f, \quad \forall\, \theta \in [0, 1].
\]
A direct consequence of this is that the function f is convex if and only if its epigraph is a convex
set, as illustrated in Figure 4.12. This implies that all sublevel sets of a convex function are convex,
but the converse is not true in general.
f (y) ≥ f (x) for all y ∈ dom f whenever ∇f (x) = 0. Moreover, if f is strictly convex, then f has at
most one stationary point since ∇f (x) = 0 implies that f (y) > f (x) for all y ≠ x.
To prove (4.27), we start by rewriting (4.24) as
𝜃f (y) ≥ 𝜃f (x) + f (x + 𝜃(y − x)) − f (x).
The inequality (4.27) then follows by dividing both sides by 𝜃 ≠ 0, i.e.
\[
f(y) \ge f(x) + \frac{f(x + \theta(y - x)) - f(x)}{\theta},
\]
and taking the limit as 𝜃 goes to 0, which yields the directional derivative
\[
\lim_{\theta \to 0} \frac{f(x + \theta(y - x)) - f(x)}{\theta} = \nabla f(x)^T(y - x).
\]
The function f is 𝜇-strongly convex if and only if
\[
f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{\mu}{2}\|y - x\|_2^2, \quad \forall\, x, y \in \mathrm{dom}\, f, \tag{4.28}
\]
for some 𝜇 > 0. To see this, recall that f is 𝜇-strongly convex if and only if g(x) = f (x) − (𝜇∕2)||x||22 is
convex, and note that (4.28) is equivalent to g(y) ≥ g(x) + ∇g(x)T (y − x) for all x, y ∈ dom f , which is
the convexity condition (4.27). The condition (4.28) implies that the sublevel sets of f are bounded,
which follows by noting that
St = {y | f (y) ≤ t} ⊆ {y | f (x) + ∇f (x)T (y − x) + (𝜇∕2)||y − x||22 ≤ t}, ∀ x ∈ dom f ,
where the set on the right-hand side of the inclusion operator is either a Euclidean ball or the empty
set. Strong convexity also allows us to bound the distance from any x to the optimal point x⋆ , if it
exists, in terms of ∇f (x),
\[
\|x - x^\star\|_2 \le \frac{2}{\mu}\,\|\nabla f(x)\|_2. \tag{4.29}
\]
This inequality can be derived from (4.28) by substituting x⋆ for y, i.e.
\[
f(x^\star) \ge f(x) + \nabla f(x)^T(x^\star - x) + \frac{\mu}{2}\|x^\star - x\|_2^2 \ge f(x) - \|\nabla f(x)\|_2\,\|x^\star - x\|_2 + \frac{\mu}{2}\|x^\star - x\|_2^2,
\]
and since f (x⋆ ) ≤ f (x) for all x, we see that
\[
0 \ge -\|\nabla f(x)\|_2\,\|x^\star - x\|_2 + \frac{\mu}{2}\|x^\star - x\|_2^2.
\]
Moreover, substituting x⋆ for y on the left-hand side of (4.28) and minimizing the right-hand side
with respect to y, we find that for all x ∈ dom f ,
\[
f(x) - p^\star \le \frac{1}{2\mu}\,\|\nabla f(x)\|_2^2. \tag{4.30}
\]
which shows that epi f is a convex set, and hence, f is convex. This is illustrated in Figure 4.14,
which shows the pointwise maximum of four affine functions and its epigraph. The result can
be extended to the pointwise supremum of uncountably many convex functions. Specifically, if
f ∶ ℝn × ℝp → (−∞, +∞] and f (x, y) is convex in x for every y ∈ Y ⊆ ℝp , then h ∶ ℝn → (−∞, +∞]
defined as
h(x) = sup f (x, y) (4.36)
y∈Y
is a convex function.
This is a convex function if f is a convex function. This result follows by noting that the strict
epigraph of h, defined as
{(x, t) | h(x) < t},
is the image of the strict epigraph of f
{(x, y, t) | f (x, y) < t}
under the linear transformation (x, y, t) → (x, t).
where A − BC−1 BT
is the Schur complement of C in X. Convexity of h implies that this Schur
complement is positive semidefinite.
4.3.3.1 Norms
All norms are convex functions, which is an immediate consequence of the triangle inequality and (4.24). Moreover, the epigraph of a norm is a norm cone, which is a convex set.
We note that this is always a convex function since it is the supremum of affine functions.
4.3.4 Conjugation
The Legendre–Fenchel transformation or conjugate of a function $f : \mathbb{R}^n \to \bar{\mathbb{R}}$ is the function $f^* : \mathbb{R}^n \to \bar{\mathbb{R}}$ defined as
\[
f^*(y) = \sup_x \left\{ y^T x - f(x) \right\}. \tag{4.42}
\]
The conjugate function f ∗ is always a convex function, which follows by noting that it is the point-
wise supremum of affine functions. Furthermore, epi f ∗ is the intersection of closed halfspaces,
and hence, it is a closed set, i.e. f ∗ is closed. Evaluating f ∗ at 0 yields
\[
f^*(0) = \sup_x \{-f(x)\} = -\inf_x f(x),
\]
and, generally speaking, this means that evaluating the conjugate function is as hard as finding a
global minimum of f . For a fixed y, the conjugate function f ∗ (y) may be interpreted graphically as
the largest signed vertical distance from f to the linear function yT x. This is illustrated in Figure 4.15
for a univariate function.
The definition of the conjugate function implies that if f ∶ ℝn → (−∞, +∞] is a proper function,
then for every x, y ∈ ℝn ,
f (x) + f ∗ (y) ≥ xT y.
This is known as the Fenchel–Young inequality. Equality holds if the supremum of yT x − f (x) is
attained at x.
The conjugate of f ∗ is called the biconjugate of f and is denoted f ∗∗ = (f ∗ )∗ . An immediate con-
sequence of the Fenchel–Young inequality is that f ∗∗ ≤ f , i.e.
Moreover, the Fenchel–Moreau theorem states that if f is proper, then f ∗∗ = f if and only if f is
convex and closed. More generally, the biconjugate f ∗∗ is the lower convex envelope of f , which is
the supremum of all closed, convex functions that lie below f . In other words, epi f ∗∗ is the closed convex hull of epi f . To see this, we first note that if f and g are functions from $\mathbb{R}^n$ to $\bar{\mathbb{R}}$ and f ≤ g, then
\[
f^*(y) = \sup_x \{y^T x - f(x)\} \ge \sup_x \{y^T x - g(x)\} = g^*(y),
\]
which implies that f ∗ ≥ g∗ . Using the same argument, we also see that f ∗∗ ≤ g∗∗ . Thus, if f is closed
and convex, then f = f ∗∗ ≤ g∗∗ ≤ g. Taking the supremum of all closed, convex functions f ≤ g,
we conclude that g∗∗ is the lower convex envelope of g.
i.e. it is the support function of the set C. In the special case, where C is a nonempty convex cone
K ⊂ ℝn , the conjugate function is
where B = {x ∈ ℝn | ||x|| ≤ 1} denotes the unit norm ball for the norm || ⋅ ||. Thus, the dual norm || ⋅
||∗ is the conjugate function of the indicator function of the norm ball B. We invite the reader to ver-
ify that the dual norm is indeed a norm, see Exercise 4.11. Since IB is convex and closed, we conclude
that the conjugate of the dual norm || ⋅ ||∗ is IB∗∗ = IB . The definition of the dual norm can also be
expressed as
\[
\|y\|_* = \sup_{x \ne 0} \frac{x^T y}{\|x\|}.
\]
This readily implies that for all x, y ∈ ℝn ,
\[
x^T y \le \|x\|\,\|y\|_*,
\]
which may be viewed as a generalization of the Cauchy–Schwarz inequality. We note that the dual
norm of || ⋅ ||∗ is the norm || ⋅ ||∗∗ = || ⋅ ||.
which follows from the fact that x = y∕||y||2 achieves the supremum if y ≠ 0.
Similarly, the dual norm of || ⋅ ||1 is || ⋅ ||∞ since || ⋅ ||∗∗ = || ⋅ ||. More generally, the dual norm of
|| ⋅ ||p with p ≥ 1 is || ⋅ ||q , where 1∕p + 1∕q = 1.
Example 4.5 The dual norm of the matrix 2-norm on ℝm×n may be expressed as
\[
\mathrm{tr}(Y^T X) \le \sum_{i=1}^{k} \sigma_i = \mathrm{tr}(S).
\]
This upper bound is attained at X = U1 V1T if Y ≠ 0, and hence, the dual norm of || ⋅ ||2 is the nuclear
norm.
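As a hedged numerical illustration of this example, the sketch below checks that $\mathrm{tr}(Y^TX)$ never exceeds the nuclear norm of Y over feasible X, and that equality is attained at $X = U_1V_1^T$ (the test matrix Y is made up):

```python
import numpy as np

# Sketch: the dual of the matrix 2-norm is the nuclear norm (sum of singular values).
rng = np.random.default_rng(4)
Y = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
nuclear = s.sum()

X_star = U @ Vt                                  # ||X_star||_2 = 1, attains the bound
print(np.isclose(np.trace(Y.T @ X_star), nuclear))

for _ in range(1000):                            # random feasible X stay below the bound
    X = rng.standard_normal((5, 3))
    X /= max(np.linalg.norm(X, 2), 1.0)          # enforce ||X||_2 <= 1
    assert np.trace(Y.T @ X) <= nuclear + 1e-9
```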
4.4 Subdifferentiability
Such a vector g is called a subgradient of f at x, and the set of all subgradients at x, i.e.
is the subdifferential of f at x. Note that the subdifferential is a set-valued map, and it may be an
empty set. We use the convention that 𝜕f (x) = ∅ if x ∉ dom f , and hence, its effective domain is
the set
which shows that 𝜕f (x) is the intersection of closed halfspaces, and hence, 𝜕f (x) is closed and con-
vex. Figure 4.16 illustrates the definition of the subdifferential. We note that global information is
necessary to determine if a nonconvex function is subdifferentiable at a point x. In other words, the
plot in Figure 4.16b alone is insufficient to determine where f is subdifferentiable. For example,
if the function approaches some constant as |x| tends to infinity, then f is only subdifferentiable at
its minima.
The set of global minimizers of a proper function f ∶ → (−∞, +∞] may be characterized in
terms of the subdifferential of f using Fermat’s rule, which states that
This property is easily verified using the definition of the subdifferential (4.44), i.e.
However, this property is primarily useful when f is a convex function, because it is generally
difficult to characterize the subdifferential of a nonconvex function. We note that if f is convex
and continuously differentiable at x ∈ dom f , then 𝜕f (x) = {∇f (x)}. Conversely, if f is convex and
𝜕f (x) is the singleton {g}, then f is differentiable at x and ∇f (x) = g. This is not true for nonconvex
functions, as is evident from the example in Figure 4.16b.
4.4.1.2 Summation
Given two closed proper convex functions f1 ∶ ℝn → (−∞, +∞] and f2 ∶ ℝn → (−∞, +∞], it holds
that
𝜕f (x) ⊇ 𝜕f1 (x) + 𝜕f2 (x) = {u + 𝑣 | u ∈ 𝜕f1 (x), 𝑣 ∈ 𝜕f2 (x)}. (4.47)
The right-hand side of the inclusion is the Minkowski sum of 𝜕f1 (x) and 𝜕f2 (x). It is easy to verify
(4.48) by noting that if u ∈ 𝜕f1 (x) and 𝑣 ∈ 𝜕f2 (x), then
f1 (y) + f2 (y) ≥ f1 (x) + f2 (x) + (u + 𝑣)T (y − x), ∀ y ∈ dom f ,
which implies that u + 𝑣 ∈ 𝜕f (x). Moreover, if f1 and f2 satisfy
relint dom f1 ∩ relint dom f2 ≠ ∅,
then it can be shown that
𝜕f (x) = 𝜕f1 (x) + 𝜕f2 (x), ∀ x ∈ ℝn . (4.48)
The result is readily extended to sums of more than two functions by means of induction. We
note that the Minkowski sum of a set and the empty set is itself empty, which implies that dom
𝜕f = dom 𝜕f1 ∩ dom 𝜕f2 .
where (x) = {i ∈ ℕk | fi (x) = f (x)} is the set of active indices. We note that this can also be gener-
alized to the supremum of uncountably many proper convex functions.
Now, suppose that f is also closed. This means that f ∗∗ = f , and hence,
f ∗∗ (x) + f ∗ (y) = xT y, x ∈ 𝜕f ∗ (y).
Thus, we may conclude that
x ∈ 𝜕f ∗ (y) ⟺ y ∈ 𝜕f (x) (4.51)
if f is proper convex and closed.
Example 4.6 The indicator function of a nonempty closed convex set C ⊆ ℝn is a closed proper convex function. This means that
y ∈ 𝜕IC (x) ⟺ x ∈ 𝜕SC (y),
where SC = IC∗ is the support function of C. We note that the subdifferential 𝜕IC (x) is also referred
to as the normal cone of C at x, and it may also be expressed as
NC (x) = 𝜕IC (x) = {g ∈ ℝn | 0 ≥ gT (y − x), ∀ y ∈ C}. (4.52)
It is easy to verify that NC (x) = {0} if x ∈ int C, and furthermore, if g ∈ NC (x) and g ≠ 0, then C is
contained in the halfspace {y | gT y ≤ gT x}.
If x is an optimal point in the interior of the feasible set, there exists a ball B2 (x, 𝜖) ⊆ with
radius 𝜖 > 0. The optimality condition (4.54) requires that
∇f0 (x)T (y − x) ≥ 0, ∀ y ∈ B2 (x, r),
which can also be expressed as ∇f0 (x)T u ≥ 0 for all u such that ||u||2 ≤ 𝜖. Clearly, this is only possible
if ∇f0 (x) = 0, which means that x is a stationary point of f0 .
Next, we consider the more general case, where f0 is convex but not necessarily continuously
differentiable. First, we note that (4.53) is equivalent to the problem
minimize f0 (x) + I (x), (4.55)
where I is the indicator function for the feasible set . Fermat’s rule then implies that x is optimal
if and only if
0 ∈ 𝜕f0 (x) + N (x),
where N = 𝜕I is the normal cone of . In other words, x is optimal if and only if there exists a
subgradient g ∈ 𝜕f0 (x) such that
−g ∈ N (x) ⟺ gT (y − x) ≥ 0, ∀ y ∈ . (4.56)
Note that if f0 is continuously differentiable at x, then 𝜕f0 (x) = {∇f0 (x)} and the optimality condition
(4.56) reduces to (4.54).
This optimality condition can also be expressed in terms of the so-called Lagrangian L ∶ ℝn × ℝp → ℝ defined as
∇x L(x, 𝜇) = 0 (4.58a)
h(x) = 0. (4.58b)
As we will see in Section 4.7, it turns out that these conditions are necessary conditions for opti-
mality even when h is not an affine function.
4.6 Duality
We now return to the general optimization problem (4.4) without making any assumptions about
convexity, and we are interested in computing nontrivial lower bounds on the optimal value p⋆ .
We will do this by constructing a so-called dual problem, and such a problem can be constructed
in several ways. We start out with Lagrangian duality, which emerged in the 1940s but is based on
techniques pioneered by Lagrange in the 1780s. We then look at Fenchel duality, which is closely
related to Lagrangian duality but offers a different perspective.
with dom L = × ℝm × ℝp . The auxiliary variables λ and 𝜇 are called Lagrange multipliers or dual
variables. The variable λi is associated with the inequality constraint fi (x) ⪯ 0, and 𝜇i is associated
with the equality constraint hi (x) = 0.
The Lagrangian function can be used to construct lower bounds on p⋆ by noting that if x is feasible
and λ ∈ ℝm + , then
L(x, λ, 𝜇) ≤ f0 (x).
Now, taking the infimum over x ∈ on the left-hand side of the inequality and the infimum over
x ∈ ⊆ on the right-hand side, we see that
inf L(x, λ, 𝜇) ≤ p⋆ , ∀ λ ⪰ 0.
x∈
The left-hand side is called the Lagrange dual function or simply the dual function. It is a function g : ℝm × ℝp → ℝ̄ of λ and 𝜇, i.e.
g(λ, 𝜇) = inf_{x∈𝒟} L(x, λ, 𝜇).
Now, noting that L is an affine function of λ and 𝜇, we see that the dual function g is the pointwise
infimum of a family of affine functions, and hence, g is a concave function. This is true regardless
of whether or not (4.4) is a convex optimization problem. We remark that if L is unbounded from
below for some (λ, 𝜇), then g(λ, 𝜇) = −∞. We will say that (λ, 𝜇) is a dual feasible point if λ ⪰ 0 and
(λ, 𝜇) ∈ dom(−g), where
dom(−g) = {(λ, 𝜇) | g(λ, 𝜇) > −∞}
is the effective domain of −g.
Example 4.7 We now derive the Lagrange dual function for the special case of (4.4), where both
the inequality and equality constraints are affine, i.e. the problem takes the form
minimize f0 (x)
subject to Ax ⪯ b (4.61)
Cx = d
where A ∈ ℝm×n , b ∈ ℝm , C ∈ ℝp×n , and d ∈ ℝp are given. Using the definition of the Lagrangian,
we find that
L(x, λ, 𝜇) = f0 (x) + λT (Ax − b) + 𝜇 T (Cx − d),
and the dual function may be expressed as
g(λ, 𝜇) = inf_{x∈𝒟} L(x, λ, 𝜇)
= −λᵀb − 𝜇ᵀd + inf_{x∈𝒟} { f0(x) + (Aᵀλ + Cᵀ𝜇)ᵀx }
= −λᵀb − 𝜇ᵀd − sup_{x∈𝒟} { (−Aᵀλ − Cᵀ𝜇)ᵀx − f0(x) }
= −λᵀb − 𝜇ᵀd − f0∗(−Aᵀλ − Cᵀ𝜇),
where the last step follows from the definition of the conjugate function. Recalling that conjugate
functions are convex by construction, we immediately see that g is indeed a concave function, and
moreover, dom(−g) = {(λ, 𝜇) | − AT λ − CT 𝜇 ∈ dom f0∗ }.
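To make this lower bound concrete, the following Python sketch evaluates g(λ, 𝜇) for the simple choice f0(x) = (1∕2)||x||₂², whose conjugate is f0∗(y) = (1∕2)||y||₂², and checks weak duality against a feasible point. The specific data, the choice of f0, and the random multipliers are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, m, p = 5, 4, 2
A = rng.standard_normal((m, n))
C = rng.standard_normal((p, n))
x0 = rng.standard_normal(n)          # a point we force to be feasible
b = A @ x0 + rng.random(m)           # A x0 <= b with positive slack
d = C @ x0                           # C x0 = d

f0 = lambda x: 0.5 * x @ x           # f0(x) = (1/2)||x||^2
f0_conj = lambda y: 0.5 * y @ y      # its conjugate f0*(y) = (1/2)||y||^2

lam = rng.random(m)                  # lambda >= 0
mu = rng.standard_normal(p)

# Dual function for affine constraints: g = -b'lam - d'mu - f0*(-A'lam - C'mu)
g = -b @ lam - d @ mu - f0_conj(-A.T @ lam - C.T @ mu)
assert g <= f0(x0)                   # weak duality: g(lam, mu) <= f0(x) for feasible x
print(g, f0(x0))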
Moreover, for any dual feasible (λ, 𝜇) and primal feasible x, it holds that g(λ, 𝜇) ≤ d⋆ ≤ p⋆ ≤ f0 (x),
and as a consequence,
p⋆ − d⋆ ≤ f0 (x) − g(λ, 𝜇).
The difference f0 (x) − g(λ, 𝜇) is called the duality gap at (x, λ, 𝜇), whereas p⋆ − d⋆ is the optimal
duality gap.
We say that strong duality holds if d⋆ = p⋆ , i.e. the optimal duality gap is zero. Strong duality
does not hold in general, but it can be shown to hold for a convex optimization problem of the form
(4.53) if it satisfies certain constraint qualifications. One example is Slater’s constraint qualification
or Slater’s condition, which states that strong duality holds if there exists a point x ∈ relint 𝒟 that
is strictly feasible in the sense that
fi (x) < 0, i ∈ ℕm , Ax = b.
Slater’s condition also implies that the dual optimal value is attained whenever it is finite, i.e. there
exists a dual point (λ⋆ , 𝜇 ⋆ ) such that g(λ⋆ , 𝜇 ⋆ ) = d⋆ . We note that the duality gap can be used as a
stopping criterion for optimization methods when strong duality holds. Indeed, if (x, λ, 𝜇) is primal
and dual feasible, then
f0 (x) − p⋆ ≤ f0 (x) − g(λ, 𝜇), d⋆ − g(λ, 𝜇) ≤ f0 (x) − g(λ, 𝜇). (4.64)
Thus, if the duality gap is 𝜖 = f0 (x) − g(λ, 𝜇), then x is 𝜖-suboptimal for the primal problem and
(λ, 𝜇) is 𝜖-suboptimal for the dual problem.
we immediately see that strong duality holds if 𝜈 is a closed convex function since this implies that
𝜈 ∗∗ = 𝜈. The conjugate of 𝜈(y) = inf x h(x, y) can also be expressed in terms of h∗ by noting that
𝜈∗(z) = sup_y { zᵀy − inf_x h(x, y) } = sup_{x,y} { zᵀy − h(x, y) } = h∗(0, z).
Thus, the dual problem (4.66) depends on the choice of perturbation function.
Example 4.8 Let 𝜙 in (4.65) be defined as 𝜙(x) = f0(x) + I_{ℝ₊ᵐ}(b − Ax) with A ∈ ℝm×n and b ∈ ℝm. This corresponds to an optimization problem of the form (4.4) with the objective function f0(x) and affine inequality constraints Ax ⪯ b. Now, suppose that we define the perturbation function h : ℝn × ℝm → ℝ̄ as
h(x, y) = f0(x) + I_{ℝ₊ᵐ}(b − Ax − y). (4.67)
The conjugate of h is then
h∗(𝑣, z) = sup_{x,y} { 𝑣ᵀx + zᵀy − f0(x) − I_{ℝ₊ᵐ}(b − Ax − y) }
We end this section by outlining Fenchel’s duality theorem. Specifically, we will consider the opti-
mization problem
minimize f (x) + g(Ax) (4.69)
where f ∶ ℝn → ℝ ̄ and g ∶ ℝm → ℝ ̄ are proper convex functions, and A ∈ ℝm×n is given. The con-
jugate of the perturbation function h(x, y) = f (x) + g(Ax + y) can be expressed as
h∗(𝑣, z) = sup_{x,y} { 𝑣ᵀx + zᵀy − f(x) − g(Ax + y) }
= sup_{x,ỹ} { 𝑣ᵀx + zᵀ(ỹ − Ax) − f(x) − g(ỹ) }
= sup_x { (𝑣 − Aᵀz)ᵀx − f(x) } + sup_{ỹ} { zᵀỹ − g(ỹ) }
= f∗(𝑣 − Aᵀz) + g∗(z),
and hence, the dual problem is
maximize −f ∗ (−AT z) − g∗ (z) (4.70)
with variable z ∈ ℝm . We note that this is equivalent to the Lagrange dual problem associated with
the problem
minimize f (x) + g(y)
subject to Ax = y
with variables x ∈ ℝn and y ∈ ℝm and which is equivalent to (4.69). From weak duality, we have
that
p⋆ = inf_x { f(x) + g(Ax) } ≥ sup_z { −f∗(−Aᵀz) − g∗(z) } = d⋆,
and it can be shown that strong duality holds if f and g satisfy certain conditions. For example, the
condition relint (dom f ) ∩ relint (dom g) ≠ ∅ is a sufficient condition for strong duality.
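The following Python sketch verifies Fenchel duality numerically for one particular instance in which both conjugates are available in closed form, namely f(x) = (1∕2)||x − a||₂² and g(y) = (1∕2)||y − b||₂²; the quadratic choices and the random data are assumptions made for this illustration only, and both the primal and dual optima then have closed forms.

import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
A = rng.standard_normal((m, n))
a, b = rng.standard_normal(n), rng.standard_normal(m)

# Primal: minimize (1/2)||x - a||^2 + (1/2)||Ax - b||^2
x_opt = np.linalg.solve(np.eye(n) + A.T @ A, a + A.T @ b)
p_star = 0.5 * np.sum((x_opt - a) ** 2) + 0.5 * np.sum((A @ x_opt - b) ** 2)

# Dual: maximize -f*(-A'z) - g*(z), with f*(u) = (1/2)||u||^2 + a'u and
# g*(z) = (1/2)||z||^2 + b'z; the maximizer solves (I + AA')z = Aa - b.
z_opt = np.linalg.solve(np.eye(m) + A @ A.T, A @ a - b)
d_star = -(0.5 * np.sum((A.T @ z_opt) ** 2) - a @ (A.T @ z_opt)) \
         - (0.5 * np.sum(z_opt ** 2) + b @ z_opt)

print(p_star, d_star)                 # strong duality: the two values agree
assert np.isclose(p_star, d_star)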
We note that for convex optimization problems, Slater’s condition implies the existence of primal
and dual optimal points x⋆ and (λ⋆ , 𝜇 ⋆ ) with zero duality gap.
Exercises
4.1 Consider the optimization problem
minimize f (x, y)
where f ∶ ℝm × ℝn → ℝ. Show that the minimal value of f can be obtained by first mini-
mizing over x and then minimizing over y. Specifically, let g ∶ ℝn → ℝ be defined as
g(y) = min f (x, y)
x
and let x̄ (y) ∈ argminx f (x, y) denote a minimizer of f (x, y) for a given y. Similarly, we define
ȳ ∈ argminy g(y). Show that (̄x(̄y), ȳ ) is a minimizer of f .
4.5 Let g ∶ ℝ → ℝ be defined as g(t) = f (x + 𝑣t), where f ∶ ℝn → ℝ and 𝑣 ∈ ℝn are given, and
dom g = {t | x + t𝑣 ∈ dom f }.
(a) Show that f is a convex function if and only if g is a convex function for all x ∈ dom f
and 𝑣 ∈ ℝn .
(b) Show that f ∶ 𝕊n → ℝ defined as f (X) = − ln det X with dom f = 𝕊n++ is a convex
function.
Hint: Show that g(t) = − ln det (X + Vt) is convex for all X ∈ 𝕊n++ and V ∈ 𝕊n .
4.6 Let f ∶ ℝn → (−∞, +∞] be a convex function. Show that the perspective function Pf ∶ ℝn ×
ℝ → (−∞, +∞] defined as
Pf (x, t) = { t f(x∕t), t > 0; ∞, otherwise }
is a convex function.
4.9 Let f ∶ ℝm × ℝn → ℝ be a convex function, and let C ⊆ ℝn be a convex set. Show that the
function g ∶ ℝm → ℝ defined as
g(x) = inf f (x, y)
y∈C
is a convex function.
4.11 Let || ⋅ || be a norm on ℝn . Show that the dual norm, which is defined as
||y||∗ = sup {yT x | ||x|| ≤ 1},
x
is indeed a norm.
4.12 Derive the dual norm of the quadratic norm || ⋅ ||P : ℝn → ℝ+ defined as ||x||P = √(xᵀPx) for some P ∈ 𝕊n++.
4.14 [22, Exercise 2.31] Let C∗ be the dual cone of a convex cone C ⊆ ℝn . Prove the following
statements:
(a) C∗ is a convex cone.
(b) C∗∗ is closed.
(c) C1 ⊆ C2 implies C2∗ ⊆ C1∗ .
(d) The interior of C∗ is given by int C∗ = {y | yT x > 0 ∀ x ∈ C}.
(e) If C has a nonempty interior, then C⋆ is pointed.
(f) C∗∗ is the closure of C.
(g) If the closure of C is pointed, then C∗ has a nonempty interior.
Optimization Problems
Applications in learning and control give rise to a wide range of optimization problems, and we
will now discuss different classes of such optimization problems. Our starting point will be the
classes of linear and nonlinear least-squares problems, instances of which occur frequently in, e.g. supervised learning. The method of least squares was studied already by Carl Friedrich Gauss in the eighteenth century in order to calculate the orbits of celestial bodies. We then discuss quadratic programs, which are often
encountered as local surrogate models of general optimization problems and are a component of
many optimization methods. Another important class of problems is the class of conic optimization
problems, and we will see that any convex optimization problem can, in principle, be cast as a conic
optimization problem.
Problems that involve the rank of some matrix variable as part of the objective function or a
constraint function are called rank optimization problems. We will see that some special cases of
these can be solved to global optimality using techniques from linear algebra. However, in general,
rank optimization problems are difficult nonlinear optimization problems, and we will introduce
some heuristics that often produce good approximate solutions. We will also discuss partially sep-
arable optimization problems, which are problems with a special kind of structure that can be
exploited computationally. Several examples of such problems appear later in the book, e.g. optimal
control problems as well as hidden Markov processes. We also consider multiparametric optimiza-
tion, which is about finding a parametric solution to a family of parameterized optimization prob-
lems, and finally, we discuss stochastic optimization problems, which arise, e.g. when the objective
function involves a random variable in addition to the optimization variable.
5.1 Least-Squares Problems

A nonlinear least-squares (LS) problem is an optimization problem of the form

minimize (1∕2) Σ_{k=1}^{m} fk(x)², (5.1)

with variable x ∈ ℝn and where fk : ℝn → ℝ, k ∈ ℕm. It may also be written more compactly as

minimize (1∕2)||f(x)||₂²,
where we define f (x) = (f1 (x), … , fm (x)). If f is continuously differentiable, then the necessary
optimality condition associated with the nonlinear LS problem may be expressed as
𝜕∕𝜕x [ (1∕2) Σ_{k=1}^{m} fk(x)² ] = Σ_{k=1}^{m} fk(x)∇fk(x) = 0. (5.2)
This is a set of n nonlinear equations in x, which are not easy to solve in general. We discuss
optimization methods for this type of problem in Chapter 6.
A linear LS problem is the special case where f is an affine function, i.e. f (x) = Ax − b for some
A ∈ ℝm×n and b ∈ ℝm . The resulting LS problem is a convex optimization problem. This follows
by noting that the Hessian of (1∕2)||Ax − b||22 is AT A, which is positive semidefinite. The necessary
optimality condition is therefore also sufficient, and it can be expressed as
AT Ax = AT b. (5.3)
This system of equations is called the normal equations. We will encounter several LS problems,
e.g. linear regression, see Section 10.1.
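As a small illustration, the following Python snippet solves a randomly generated linear LS problem via the normal equations (5.3) and compares the result with a library solver; the random data are purely illustrative, and in practice QR- or SVD-based solvers are usually preferred since forming AᵀA squares the condition number.

import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations (5.3)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # SVD-based solver
print(np.linalg.norm(x_normal - x_lstsq))          # should be tiny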
It is sometimes useful to augment the LS problem by adding constraints. For example, if we add
affine equality constraints to a linear LS problem, we obtain a problem of the form
1
minimize ||Ax − b||22
2 (5.4)
subject to Cx = d,
with C ∈ ℝp×n and d ∈ ℝp . Like the linear LS problem, this is clearly also a convex optimization
problem. The Karush–Kuhn–Tucker (KKT) conditions associated with (5.4) may be expressed as
[ AᵀA  Cᵀ ] [ x ]   [ Aᵀb ]
[  C   0  ] [ 𝜇 ] = [  d  ] ,   (5.5)
where 𝜇 ∈ ℝp is the vector of Lagrange multipliers associated with the equality constraints. This
is an indefinite system of equations, and it is sometimes referred to as the KKT equations since
it represents the KKT conditions. We note that the linear independence constraint qualification
(LICQ) is independent of x and holds if rank(C) = p. Moreover, as we saw in Section 2.11, the system
has a unique solution if and only if rank(C) = p and AT A is positive definite on the nullspace of
C, i.e.
rank(C) = p, 𝒩(A) ∩ 𝒩(C) = {0}.
Interested readers may find more information about LS problems in [21], which is a comprehensive
reference on the topic.
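As an illustration of the KKT equations (5.5), the following Python sketch assembles the indefinite system and solves it for a randomly generated instance; the data are assumptions made for illustration, and the rank and nullspace conditions above are taken to hold.

import numpy as np

rng = np.random.default_rng(3)
m, n, p = 15, 6, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
C = rng.standard_normal((p, n))
d = rng.standard_normal(p)

K = np.block([[A.T @ A, C.T],
              [C, np.zeros((p, p))]])     # KKT matrix of (5.5)
rhs = np.concatenate([A.T @ b, d])
sol = np.linalg.solve(K, rhs)
x, mu = sol[:n], sol[n:]
print(np.linalg.norm(C @ x - d))          # feasibility of the equality constraints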
with variables xi ∈ ℝD , i
∈ ℕn . If the measurement errors are independent and normally distributed
with zero mean and the same variance, then the above problem is equivalent to a maximum like-
lihood problem for estimating the positions; cf . Section 10.1.
The linear LS problem and its linearly constrained variant are instances of a more general class of
optimization problems, namely quadratic programs (QP). We define a QP as a problem of the form
minimize (1∕2)xᵀPx + qᵀx
subject to Ax ⪯ b (5.6)
           Cx = d,
with variable x ∈ ℝn and problem data P ∈ 𝕊n , q ∈ ℝn , A ∈ ℝm×n , b ∈ ℝm , C ∈ ℝp×n , and d ∈ ℝp .
This is a convex optimization problem according to our definition if and only if P is positive
semidefinite. We note that the problem is equivalent to a convex optimization problem in the
event that P is indefinite, but positive semidefinite on the nullspace of C. The special case of (5.6)
where P = 0 is called a linear program (LP). We will see an example of an LP in Section 10.5.
QPs of the form (5.6) generally do not have a closed-form solution. The KKT conditions may be
expressed as
Px + AT λ + CT 𝜇 = −q
Ax ⪯ b, Cx = d
λ⪰0
diag(λ)(Ax − b) = 0,
where λ ∈ ℝm and 𝜇 ∈ ℝp are the Lagrange multipliers associated with the inequality and equality constraints, respectively.
To derive the Lagrange dual of (5.6), we introduce the Lagrangian
1 T
L(x, λ, 𝜇) = x Px + qT x + λT (Ax − b) + 𝜇 T (Cx − d).
2
We immediately see that L is unbounded from below if P ⋡ 0, in which case, the dual function is
g(λ, 𝜇) = −∞, and hence, d⋆ = −∞. A more useful lower bound can be obtained if P ⪰ 0, in which
case, we find that
g(λ, 𝜇) = { −bᵀλ − dᵀ𝜇 − 𝜓(λ, 𝜇),   q + Aᵀλ + Cᵀ𝜇 ∈ ℛ(P),
          { −∞,                      otherwise,                (5.7)
where
𝜓(λ, 𝜇) = (1∕2)(q + Aᵀλ + Cᵀ𝜇)ᵀP†(q + Aᵀλ + Cᵀ𝜇).
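The dual bound (5.7) is easy to evaluate numerically. The following Python sketch does so for a convex QP and checks weak duality against a primal feasible point; P is taken positive definite (an assumption made here so that the range condition holds trivially and P† = P⁻¹), and the remaining data are random and purely illustrative.

import numpy as np

rng = np.random.default_rng(4)
n, m, p = 5, 4, 2
M = rng.standard_normal((n, n))
P = M.T @ M + np.eye(n)              # positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
C = rng.standard_normal((p, n))
x0 = rng.standard_normal(n)
b = A @ x0 + rng.random(m)           # x0 is feasible: A x0 <= b, C x0 = d
d = C @ x0

lam = rng.random(m)                  # lambda >= 0
mu = rng.standard_normal(p)
v = q + A.T @ lam + C.T @ mu
g = -b @ lam - d @ mu - 0.5 * v @ np.linalg.solve(P, v)   # dual function (5.7)
f0 = 0.5 * x0 @ P @ x0 + q @ x0
assert g <= f0                        # weak duality
print(g, f0)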
We end this section by noting that the QP in (5.6) can be generalized by allowing quadratic
constraints. The resulting optimization problem is a so-called quadratically constrained quadratic
program (QCQP), which can be expressed as
minimize (1∕2)xᵀPx + qᵀx
subject to (1∕2)xᵀAix + biᵀx + ci ≤ 0, i ∈ ℕm,          (5.11)
with variable x ∈ ℝn and where P ∈ 𝕊n , q ∈ ℝn , and (Ai , bi , ci ) ∈ 𝕊n × ℝn × ℝ, i ∈ ℕm . This is a
convex problem if and only if P and A1 , … , Am are positive semidefinite.
which follows by squaring both sides of (5.15). Thus, to express (5.14) as (5.15), we need to find A,
b, c, and d such that
P = AᵀA − ccᵀ ⪰ 0, q = 2(Aᵀb − cd), s = bᵀb − d².
We start by considering the case where q ∈ ℛ(P). This implies that there exists a full-rank matrix A such that P = AᵀA and q ∈ ℛ(Aᵀ), and hence, we can choose c = 0 and solve for b and d, i.e.

b = (1∕2)(A†)ᵀq,   d = |(1∕4)bᵀb − s|^{1∕2}.
Now, suppose q ∉ ℛ(P) and let P = BᵀB, where B ∈ ℝr×n has rank r < n. We may then decompose q as q = q̄ + c, where q̄ = B†Bq ∈ ℛ(Bᵀ) and c = (I − B†B)q ∈ 𝒩(B), and hence, we have that P = AᵀA − ccᵀ ⪰ 0 if we let Aᵀ = [ Bᵀ  c ]. Moreover, q = 2(Aᵀb − cd) is satisfied if we take

b = (1∕2) [ (B†)ᵀq̄ ; 1 + 2d ],

and finally, we find d by solving

s = bᵀb − d² ⟺ d = s − (1∕4)( ||(B†)ᵀq̄||₂² − 1 ).
Following the approach in Section 4.7, we may derive the KKT conditions for optimality, which
can be expressed as
Ax = b,        x ∈ K,
Aᵀ𝜇 + λ = c,   λ ∈ K∗,
diag(λ)x = 0.
These conditions are both necessary and sufficient for optimality if either the primal problem or
the dual problem is strictly feasible, i.e. if there exists a vector x ∈ int K such that Ax = b, or if there
exists a dual point (λ, 𝜇) such that λ ∈ int K ∗ and AT 𝜇 + λ = c.
Figure 5.2 The exponential cone Kexp and examples of the power cone K𝛼 : (a) exponential cone, (b) power
cone: 𝛼 = 2∕3, and (c) power cone: 𝛼 = 4∕5.
Yet another example of a cone of the form (5.18) is a so-called power cone. Letting h(x) = |x|1∕𝛼
for some 𝛼 ∈ (0,1), we obtain the three-dimensional cone
K𝛼 = cl{(x, y, z) ∈ ℝ3 | y > 0, y|x∕y|1∕𝛼 ≤ z}
= {(x, y, z) ∈ ℝ3 | y ≥ 0, z ≥ 0, |x| ≤ y1−𝛼 z𝛼 }. (5.21)
The reader may verify that this is a proper cone. The dual cone may be expressed as
K𝛼∗ = { (u, 𝑣, 𝑤) ∈ ℝ³ | 𝑣 ≥ 0, 𝑤 ≥ 0, |u| ≤ (𝑣∕(1 − 𝛼))^{1−𝛼} (𝑤∕𝛼)^{𝛼} },   (5.22)
which is the image of K𝛼 under the linear transformation (x, y, z) → (x, (1 − 𝛼)y, 𝛼z). It is also possi-
ble to define a power cone in ℝn+1 . Specifically, given 𝛼 ∈ int Δn , we define the (n + 1)-dimensional
power cone as
K𝛼 = { (x, y) ∈ ℝ × ℝⁿ₊ | |x| ≤ y1^{𝛼1} · · · yn^{𝛼n} },
and its dual may be expressed as
K𝛼∗ = { (u, 𝑣) ∈ ℝ × ℝⁿ₊ | |u| ≤ (𝑣1∕𝛼1)^{𝛼1} · · · (𝑣n∕𝛼n)^{𝛼n} }.
Figure 5.2 shows the exponential cone and examples of the power cone in ℝ3 .
We end this section with some examples of constraints that can be expressed as conic constraints.
Example 5.4 From the definition of the exponential cone Kexp and the power cone K𝛼 , we imme-
diately see that
ex ≤ t ⟺ x ≤ ln(t) ⟺ (x, 1, t) ∈ Kexp ,
and for p ≥ 1,
|x|p ≤ t ⟺ |x| ≤ t1∕p ⟺ (x, 1, t) ∈ K1∕p .
This observation allows us to transform a number of constraints that involve exponential functions,
logarithms, and/or powers into conic constraints. For example, the epigraph of the log-sum-exp
function f (x) = ln(ex1 + · · · + exn ) can be expressed as (x, t) such that
ln(ex1 + · · · + exn ) ≤ t ⟺ ex1 + · · · + exn ≤ et
⟺ ex1 −t + · · · + exn −t ≤ 1
⟺ u1 + · · · + un ≤ 1, exi −t ≤ ui , i ∈ ℕn
⟺ 𝟙T u ≤ 1, (xi − t, 1, ui ) ∈ Kexp , i ∈ ℕn .
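The equivalence above can be checked numerically. In the following Python sketch (the particular vector x and offsets are assumptions made only for illustration), the natural certificate ui = exp(xi − t) is the smallest ui allowed by the exponential-cone constraints, so the conic reformulation is feasible exactly when the original constraint holds.

import numpy as np

def lse(x):
    return np.log(np.sum(np.exp(x)))

x = np.array([0.3, -1.2, 0.5])
for t in (lse(x) + 0.1, lse(x) - 0.1):        # one feasible, one infeasible t
    u = np.exp(x - t)                          # smallest u_i allowed by K_exp
    conic_feasible = np.sum(u) <= 1.0
    print(lse(x) <= t, conic_feasible)         # the two tests always agree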
Another example of a function whose epigraph can be expressed in terms of the exponential cone
is the negative entropy function, i.e. f (x) = x ln(x) with dom f = ℝ+ , where we use the convention
that f (0) = 0. Specifically, the constraint f (x) ≤ t can be expressed as (−t, x, 1) ∈ Kexp . Indeed, for
x > 0, it holds that
f (x) ≤ t ⟺ x ≤ et∕x ⟺ (−t, x, 1) ∈ Kexp ,
and for x = 0, we have that 0 ≤ t ⟺ (−t, 0,1) ∈ Kexp .
The geometric mean of x1 , … , xn is f (x) = (x1 · · · xn )1∕n with dom f = ℝn++ . This is a concave
function, and the constraint |t| ≤ f (x) can be expressed in terms the (n + 1)-dimensional power
cone defined by 𝛼 = (1∕n)𝟙, i.e.
|t| ≤ (x1 x2 · · · xn )1∕n ⟺ (t, x) ∈ K𝛼 , 𝛼 = (1∕n)𝟙.
It can also be expressed in terms of n − 1 power cones in ℝ3 . To see this, first note that |t| ≤ f (x)
can be expressed as
|t| ≤ xn^{1∕n} u_{n−1}^{(n−1)∕n},   |u_{n−1}| ≤ (x1 · · · x_{n−1})^{1∕(n−1)},   x ⪰ 0,  u_{n−1} ≥ 0.
Using this observation recursively, we see that the constraint |t| ≤ f (x) is equivalent to
t = un , x1 = u1 , (ui , ui−1 , xi ) ∈ K1∕i , i ∈ {2, … , n}.
Example 5.5 Let (Ω, ℱ, P) be a probability space, and suppose that X : Ω → ℝ is a random variable with moments mk = 𝔼[Xᵏ], k ∈ ℕ2n, cf. Section 3.6. We define the zeroth moment as m0 = 1 for convenience, and we let H : ℝ^{2n+1} → 𝕊^{n+1} be the function that maps the moment sequence to the Hankel matrix
the Hankel matrix
H(m0, … , m2n) =
⎡ m0      m1      m2      …  m_{n−1}  m_n      ⎤
⎢ m1      m2      m3      …  m_n      m_{n+1}  ⎥
⎢ m2      m3      m4      …  m_{n+1}  m_{n+2}  ⎥
⎢ ⋮       ⋮       ⋮       ⋱  ⋮        ⋮        ⎥
⎢ m_{n−1} m_n     m_{n+1} …  m_{2n−2} m_{2n−1} ⎥
⎣ m_n     m_{n+1} m_{n+2} …  m_{2n−1} m_{2n}   ⎦ ,
i.e. the (n + 1) × (n + 1) matrix whose (i, j) entry is m_{i+j−2}.
The matrix H(m0, … , m2n) is positive semidefinite since for all x ∈ ℝ^{n+1},

xᵀH(m0, … , m2n)x = Σ_{i=0}^{n} Σ_{j=0}^{n} xi xj 𝔼[X^{i+j}] = 𝔼[ ( Σ_{k=0}^{n} xk Xᵏ )² ] ≥ 0.
There is a partial converse of these results which states that if m0 = 1 and (m1 , … , m2n ) are such
that H(m0 , … , m2n ) ≻ 0, then there exists a probability space and a random variable X such that
mk = 𝔼[X k ] for k ∈ ℕ2n ; see [22, Section 4.6.3 and Exercise 2.37]. More generally, if m0 = 1 and
(m1 , … , m2n ) are such that H(m0 , … , m2n ) ⪰ 0, then there is a sequence of random variables that
converges to a random variable with the given moments. This allows us to pose certain moment
constraints as conic constraints involving the positive semidefinite cone. For example, suppose
that we are given upper and lower bounds on the moments of a random variable X, i.e. we know that
lk ≤ 𝔼[X k ] ≤ uk , k ∈ ℕ2n .
The problem of finding a random variable X that satisfies these constraints and minimizes 𝔼[p(X)], where p : ℝ → ℝ is a polynomial defined as p(x) = Σ_{k=0}^{2n} ck xᵏ, can then be expressed as the problem

minimize   Σ_{k=0}^{2n} ck mk
subject to lk ≤ mk ≤ uk, k ∈ ℕ2n
           H(1, m1, … , m2n) ∈ 𝕊₊^{n+1},
with variables m1 , … , m2n . Note that the pdf of the random variable does not appear in this
problem formulation.
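The moment matrix itself is straightforward to build. The following Python sketch constructs H(m0, … , m2n) for the moments of a uniform random variable on [0, 1] (an assumption made only to have explicit moments, mk = 1∕(k + 1)) and confirms that all eigenvalues are nonnegative.

import numpy as np

n = 3
m = np.array([1.0 / (k + 1) for k in range(2 * n + 1)])     # m_0, ..., m_{2n}
H = np.array([[m[i + j] for j in range(n + 1)] for i in range(n + 1)])
print(np.linalg.eigvalsh(H))        # all eigenvalues are nonnegative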
is a minimizer of (5.23) for both the Frobenius norm and the 2-norm. This follows from the Eckart–Young–Mirsky theorem, which states that if k < min(m, n), then

||A − Ak||_F = min_Z { ||A − Z||_F | rank(Z) ≤ k } = ( Σ_{i=k+1}^{min(m,n)} 𝜎i² )^{1∕2}, (5.24a)

||A − Ak||_2 = min_Z { ||A − Z||_2 | rank(Z) ≤ k } = 𝜎_{k+1}. (5.24b)
To prove (5.24a), suppose rank(Z) ≤ k and let Z = PΓQᵀ be an SVD of Z, where Γ is the matrix with the singular values of Z on its diagonal in decreasing order. Moreover, we decompose Σ as Σ = Σk + Σ̃, where Σk is the matrix with the singular values of Ak. It then follows from von Neumann's trace inequality in (2.33) that |tr(AᵀZ)| ≤ tr(Σkᵀ Γ). As a result, we have that

||A − Z||²_F = ||A||²_F + ||Z||²_F − 2tr(AᵀZ)
 ≥ ||Σ||²_F + ||Γ||²_F − 2tr(Σkᵀ Γ)
 = ||Σk||²_F + ||Σ̃||²_F + ||Γ||²_F − 2tr(Σkᵀ Γ)
 = ||Σ̃||²_F + ||Σk − Γ||²_F
 ≥ ||Σ̃||²_F = Σ_{i=k+1}^{min(m,n)} 𝜎i²,
which proves (5.24a). To verify (5.24b), we first assume that (5.24b) is false, i.e. there exists a matrix Z such that ||A − Z||₂ < 𝜎_{k+1} and rank(Z) ≤ k. An immediate consequence is that

u ∈ 𝒩(Z) ⟹ ||Au||₂ = ||(A − Z)u||₂ ≤ ||A − Z||₂||u||₂ < 𝜎_{k+1}||u||₂.

On the other hand, if 𝑣1, … , 𝑣_{k+1} are the leading k + 1 right-singular vectors of A and 𝑣 ∈ span(𝑣1, … , 𝑣_{k+1}), then ||A𝑣||₂ ≥ 𝜎_{k+1}||𝑣||₂. Noting that rank(Z) ≤ k ⟺ dim 𝒩(Z) ≥ n − k, we see that 𝒩(Z) ∩ span(𝑣1, … , 𝑣_{k+1}) ≠ {0}, and hence, we have a contradiction.
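The best rank-k approximation is easy to compute from an SVD. The following Python sketch checks both parts of the theorem on a random matrix; the matrix size and rank are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 5))
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k truncation

err_F = np.linalg.norm(A - Ak, 'fro')
err_2 = np.linalg.norm(A - Ak, 2)
print(np.isclose(err_F, np.sqrt(np.sum(s[k:] ** 2))))   # (5.24a)
print(np.isclose(err_2, s[k]))                            # (5.24b): sigma_{k+1}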
Figure 5.3 The optimal value associated with the rank-constrained problem (5.23) and the
rank-minimization problem (5.25) for the 2-norm and a matrix A ∈ ℝ10×5 with the singular values
(6,5, 2,1, 0). (a) Rank-constrained minimization and (b) rank minimization.
Example 5.6 Given a partial impulse response h ∈ ℝn , the matrices (A, B, C) ∈ ℝr×r × ℝr×1 ×
ℝ1×r are said to be a minimal state-space realization of h if r is the smallest possible natural number
such that hk = CAk−1 B, k ∈ {1, … , n}. A well-known property of linear systems is that r is given by
r = min_{h̃} { rank H(h̃) | h̃i = hi, i ∈ ℕn },
We end this section by briefly considering a more general formulation of the rank-constrained
problem (5.23), namely, the problem
minimize f (Z)
(5.31)
subject to rank(Z) ≤ k,
with variable Z ∈ ℝm×n, and where f : ℝm×n → ℝ̄. This problem can be reformulated by introducing two new variables U ∈ ℝm×k and V ∈ ℝn×k and eliminating the rank constraint by substituting UVᵀ for Z. However, the reformulated problem is generally still a nonconvex one, but it is often a convenient form for the use of local optimization methods.
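A minimal Python sketch of this factorized approach is given below for the specific choice f(Z) = (1∕2)||Z − A||²_F, which is an assumption made here for the sake of the example (for this f the global optimum is known from the Eckart–Young–Mirsky theorem, which gives a convenient reference value). Substituting Z = UVᵀ and taking gradient steps in (U, V) is a simple local method.

import numpy as np

rng = np.random.default_rng(6)
m, n, k = 10, 5, 2
A = rng.standard_normal((m, n))
U = 0.1 * rng.standard_normal((m, k))
V = 0.1 * rng.standard_normal((n, k))
t = 0.05                                     # step size (chosen heuristically)
for _ in range(2000):
    R = U @ V.T - A                          # residual Z - A
    U, V = U - t * R @ V, V - t * R.T @ U    # gradient steps for U and V
print(0.5 * np.linalg.norm(U @ V.T - A, 'fro') ** 2)   # should approach ...
s = np.linalg.svd(A, compute_uv=False)
print(0.5 * np.sum(s[k:] ** 2))                          # ... the truncated-SVD error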
where Ak ∈ ℝnk ×n , k ∈ ℕm , are matrices with a high-dimensional nullspace. The notion of partial
separability can also be extended to functions with more general domains, e.g. a discrete set. Partial
separability arises frequently in both estimation and control because of the fact that descriptions of
dynamical systems often involve difference equations that only introduce coupling between states
that are adjacent in time.
We will focus on the special case where the range of ATk is spanned by a small number of standard
basis vectors. This implies that each function fk (Ak x) only depends on a small subset of the entries
of x. The partial separability structure can be expressed in terms of index sets 𝛽k ⊂ ℕn , k ∈ ℕm ,
such that fk depends on xi if and only if i ∈ 𝛽k . We will assume that the index sets are maximal,
i.e. 𝛽i ⊈ 𝛽j if i ≠ j, and their union is ∪m 𝛽 = ℕn . Moreover, we define Ak to be a matrix with rows
k=1 k
{eTi | i ∈ 𝛽k }. We note that f is said to be separable in the special case, where 𝛽i ∩ 𝛽j = ∅ for i ≠ j.
∑n
For example, this is the case if f is of the form f (x) = i=1 fi (xi ).
The partial separability structure of f can also be characterized in terms of an undirected graph
with vertex set ℕn and an edge between nodes i and j, i ≠ j, if and only if {i, j} ∈ 𝛽k for some k ∈ ℕm .
We will refer to this graph as the interaction graph associated with (5.32). The interaction graph is
closely related to the sparsity graph associated with the Hessian matrix ∇2 f (x) provided that f is
twice continuously differentiable at x. This follows by noting that
∇²f(x) = Σ_{k=1}^{m} Akᵀ∇²fk(Ak x)Ak,
and hence, eTi ∇2 f (x)ej = 0 if Ak ei = 0 or Ak ej = 0 for all k ∈ ℕm . Each index set 𝛽k is a so-called
clique of the interaction graph, i.e. a set of pairwise adjacent vertices. The coupling between the
functions f1 , … , fm can be described in terms of a clique intersection graph, which is an undirected
graph with vertex set {𝛽1 , … , 𝛽m } and an edge between vertices 𝛽i and 𝛽j if and only if i ≠ j and
𝛽i ∩ 𝛽j ≠ ∅.
Figure 5.4 The interaction and clique intersection graphs associated with the partially separable function
(5.33).
corresponding to the four index sets 𝛽1 = {1}, 𝛽2 = {1, 2}, 𝛽3 = {2, 3} and 𝛽4 = {1, 4, 5}. Figure 5.4
shows the interaction and the clique intersection graphs associated with (5.33). Partial separability
allows us to compute inf_x f(x) recursively by noting that

inf_x f(x) = inf_{x1} { f1(x1) + inf_{x2} { f2(x1, x2) + inf_{x3} f3(x2, x3) } + inf_{x4,x5} f4(x1, x4, x5) }.
we may compute V3 and V4 in parallel. Once V3 has been obtained, we can compute
V2(x1) = inf_{x2} { f2(x1, x2) + V3(x2) },
Assuming that the infimum is attained, we can also compute an optimal point x⋆ as follows:
x1⋆ ∈ argmin_{x1} { f1(x1) + V2(x1) + V4(x1) }
The computations can be represented by a tree with four nodes as illustrated in Figure 5.5. This is a
so-called spanning tree of the clique intersection graph, and it is often referred to as an elimination
tree because of its connection to the order in which variables are eliminated in Gauss elimination
for sparse linear systems of equations; see, e.g. [106]. The computations start at the leaves of the
tree, possibly in parallel, and then proceed up the tree by adding the functions Vk to parent nodes.
At each node, partial minimization is carried out with respect to the variables that are not shared
with the parent node. Once the optimization problem at the root of the tree has been solved, an
optimal point x⋆ can be computed using the functions 𝜇k by passing the optimal value downward
from the root of the tree toward the leaves. We note that it is also possible to define any other node
Figure 5.5 A computational tree associated with the optimization problem (5.33).
of the tree to be the root, which results in a different organization of the computations. In this
example, computations cannot be carried out in parallel if one of the leaves of the tree in Figure 5.5
is used as the root.
Given a minimization problem with a partially separable objective function, it is generally diffi-
cult to find an elimination order that minimizes some notion of computational cost. However, in
practice, it is possible to use heuristics like the nested dissection algorithm to obtain a good elimi-
nation tree; see, e.g. [60]. Optimization over a tree is sometimes called message passing since the
functions Vk can be thought of as messages from a child node to its parent node, while the functions
𝜇k are used to pass information about optimizers in the opposite direction.
Another difficulty with the recursive approach to minimizing a partially separable function is
that it is generally hard or even impossible to obtain analytical expressions for the functions Vk .
A notable exception is when dom f is a finite set, and the values of the functions f1 , … , fm can be
tabulated. Another exception is when f is a quadratic function, in which case, the functions Vk
are also quadratic functions. The optimality condition ∇f (x) = 0 corresponds to a system of linear
equations, which has a unique solution if f is strongly convex. In this special case, the elimination
tree characterizes a partial order in which a sparse Cholesky factorization is computed and the
solution is found.
The optimization over the tree is also often called dynamic programming over trees. The motiva-
tion for this name comes from the fact that dynamical systems with difference equations couple
time-adjacent states. In these types of applications, it turns out that the computational tree is often
a chain, i.e. it is possible to choose the root of the tree in such a way that no node of the tree has
more than one child. This is also the case for the function (5.33), but it is not true in general.
where Vk ∶ ℝn → ℝ ̄ is defined as
Vk(xk) = min_{x_{k+1},…,x_m} { Σ_{i=k}^{m−1} fi(xi, x_{i+1}) + fm(xm) }, k ∈ ℕ_{m−1}, (5.36)
and Vm(xm) = fm(xm). This is often called the principle of optimality, which refers to the fact that an optimal point (x_{k+1}⋆, … , x_m⋆) in (5.36) is a subvector of an optimal point x⋆ in (5.34). It is a direct consequence of partial separability and the fact that for any function F : ℝp × ℝq → ℝ̄, it holds that

inf_{z1,z2} F(z1, z2) = inf_{z1} V(z1), where V(z1) = inf_{z2} F(z1, z2).
Applied to (5.36), this identity yields the recursion Vk(xk) = min_{x_{k+1}} { fk(xk, x_{k+1}) + V_{k+1}(x_{k+1}) }, k ∈ ℕ_{m−1}. This recursive definition of the functions Vk is often called the dynamic programming recursion, and the functions Vk are called value functions. The recursion starts with Vm(xm) = fm(xm) and proceeds with V_{m−1}(x_{m−1}), V_{m−2}(x_{m−2}), and so on. We see from (5.36) that p⋆ = min_{x1} V1(x1) and

x1⋆ ∈ argmin_{x1} V1(x1).
Moreover, we may define functions 𝜇k(x_{k−1}) ∈ argmin_{xk} { f_{k−1}(x_{k−1}, xk) + Vk(xk) }, 2 ≤ k ≤ m, such that an optimal point x⋆ can be computed using the recursion xk⋆ = 𝜇k(x_{k−1}⋆) for 2 ≤ k ≤ m.
The approach can readily be extended to the more general case, where xk is a vector instead of a
scalar. Moreover, dynamic programming can also be done over general trees, and although this
is a straightforward generalization, the proof is somewhat messy from a notational point of view,
and hence, we omit it. We will apply the results presented in this section to optimal control prob-
lems in Chapter 8. Dynamic programming over trees also has applications to probabilistic graphical
models, and we will see how it can be used for maximum likelihood estimation for hidden Markov
processes in Chapter 9.
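A minimal Python sketch of the dynamic programming recursion (5.36) over a chain with a finite domain is given below; the stage costs are random tables, purely for illustration, and a brute-force enumeration is included only to confirm the result on this small instance.

import numpy as np

rng = np.random.default_rng(7)
m, K = 5, 4                                            # m variables, each in {0,...,K-1}
f_pair = [rng.random((K, K)) for _ in range(m - 1)]    # f_k(x_k, x_{k+1})
f_last = rng.random(K)                                 # f_m(x_m)

# Backward pass: V_m = f_m, V_k(x_k) = min_{x_{k+1}} f_k(x_k, x_{k+1}) + V_{k+1}(x_{k+1})
V = [None] * m
policy = [None] * m                                    # mu_k maps x_{k-1} to x_k
V[m - 1] = f_last
for k in range(m - 2, -1, -1):
    Q = f_pair[k] + V[k + 1][None, :]                  # Q[x_k, x_{k+1}]
    V[k] = Q.min(axis=1)
    policy[k + 1] = Q.argmin(axis=1)

# Forward pass: recover an optimal point from the tabulated policies.
x = np.zeros(m, dtype=int)
x[0] = int(np.argmin(V[0]))
for k in range(1, m):
    x[k] = int(policy[k][x[k - 1]])

p_star = V[0].min()
brute = min(sum(f_pair[k][z[k], z[k + 1]] for k in range(m - 1)) + f_last[z[-1]]
            for z in np.ndindex(*([K] * m)))           # exhaustive check
print(np.isclose(p_star, brute), x)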
are convex in x for every fixed value of 𝜃. We will use f (x, 𝜃) as shorthand for (f1 (x, 𝜃), … , fm (x, 𝜃)).
The KKT conditions may then be expressed as follows: there exist x(𝜃) ∈ ℝn and λ(𝜃) ∈ ℝm such
that
∇x f0(x(𝜃), 𝜃) + Σ_{i=1}^{m} λi(𝜃)∇x fi(x(𝜃), 𝜃) = 0 (5.38a)
f(x(𝜃), 𝜃) ⪯ 0 (5.38b)
λ(𝜃) ⪰ 0 (5.38c)
λi(𝜃)fi(x(𝜃), 𝜃) = 0, i ∈ ℕm. (5.38d)
Assuming that Slater’s condition is satisfied for all 𝜃 ∈ Θ, the KKT conditions are necessary and
sufficient conditions for optimality.
As an example of a multiparametric program, we now consider a multiparametric quadratic
program (mpQP), where we let f0(x, 𝜃) = (1∕2)xᵀHx with H ∈ 𝕊n++ and f(x, 𝜃) = Gx − 𝑤 − S𝜃 with
G ∈ ℝm×n and S ∈ ℝm×r . The resulting KKT conditions are
Hx + GT λ = 0 (5.39a)
Gx − 𝑤 − S𝜃 ⪯ 0 (5.39b)
λ⪰0 (5.39c)
diag(λ)(Gx − 𝑤 − S𝜃) = 0. (5.39d)
Now, suppose that x⋆ (𝜃)̄ and λ⋆ (𝜃)̄ satisfy the KKT conditions for a given parameter vector 𝜃̄ ∈ Θ.
̄
We will drop the argument 𝜃 when it is obvious from the context and simply write x⋆ and λ⋆ .
In order to express the KKT conditions in terms of the active and inactive constraints, we introduce
the index sets
𝒜(𝜃̄) = { i ∈ ℕm | fi(x⋆, 𝜃̄) = 0 },   ℐ(𝜃̄) = ℕm ∖ 𝒜(𝜃̄),

and define G_𝒜 to be the matrix with the rows of G that correspond to the active constraints, whereas G_ℐ contains the rows that correspond to inactive constraints. We use the same notation for S, 𝑤, and λ⋆. It then follows from (5.39d) that λ_ℐ⋆ = 0, and from (5.39a) we find that
x⋆ = −H⁻¹G_𝒜ᵀλ_𝒜⋆. (5.40)

Moreover, from the definition of active constraints, we see that

G_𝒜 x⋆ − 𝑤_𝒜 − S_𝒜 𝜃̄ = 0, (5.41)

and combining (5.40) and (5.41), we conclude that

λ_𝒜⋆ = −( G_𝒜 H⁻¹G_𝒜ᵀ )⁻¹ ( 𝑤_𝒜 + S_𝒜 𝜃̄ ) (5.42a)
x⋆ = H⁻¹G_𝒜ᵀ( G_𝒜 H⁻¹G_𝒜ᵀ )⁻¹ ( 𝑤_𝒜 + S_𝒜 𝜃̄ ). (5.42b)
Here, we have tacitly assumed that G_𝒜 has full row rank. If this is not the case, we may redefine 𝒜 by removing some of its elements such that G_𝒜 has full row rank and G_𝒜ᵀ spans the same subspace. Notice that x⋆ and λ⋆ are not only optimal for 𝜃 = 𝜃̄; they are optimal for all 𝜃 that satisfy the KKT conditions (5.39), i.e. all 𝜃 such that

G H⁻¹G_𝒜ᵀ( G_𝒜 H⁻¹G_𝒜ᵀ )⁻¹ ( 𝑤_𝒜 + S_𝒜 𝜃 ) − 𝑤 − S𝜃 ⪯ 0
−( G_𝒜 H⁻¹G_𝒜ᵀ )⁻¹ ( 𝑤_𝒜 + S_𝒜 𝜃 ) ⪰ 0.
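The explicit formulas (5.42) are easy to evaluate once an active set has been guessed. The following Python sketch does so for random data and a hypothetical active set (both are assumptions made for illustration) and then checks whether the resulting point satisfies primal and dual feasibility, i.e. whether 𝜃 belongs to the corresponding critical region.

import numpy as np

rng = np.random.default_rng(8)
n, m, r = 3, 6, 2
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)
G = rng.standard_normal((m, n))
S = rng.standard_normal((m, r))
w = rng.random(m)
theta = rng.standard_normal(r)

active = np.array([0, 2])                       # a guessed active set (hypothetical)
Ga, wa, Sa = G[active], w[active], S[active]
K = Ga @ np.linalg.solve(H, Ga.T)               # G_A H^{-1} G_A'
lam_a = -np.linalg.solve(K, wa + Sa @ theta)    # (5.42a)
x = -np.linalg.solve(H, Ga.T @ lam_a)           # x = -H^{-1} G_A' lam_A, cf. (5.40)

primal_ok = np.all(G @ x - w - S @ theta <= 1e-9)
dual_ok = np.all(lam_a >= -1e-9)
print(primal_ok and dual_ok)   # True only if theta lies in this active set's region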
p⋆(𝜃) = f(x⋆(𝜃), 𝜃) = (1∕2)𝜃ᵀ( H22 − H21 H11⁻¹ H21ᵀ )𝜃.
2
Note that p⋆ (𝜃) is a quadratic function of 𝜃, and ∇2 p⋆ (𝜃) is the Schur complement of H11 in the
block matrix in (5.43).
𝔼[F(x, 𝜉)] ≈ (1∕m) Σ_{i=1}^{m} F(x, ai) (5.45)
where a1 , … , am are m independent observations of the random variable 𝜉.1 This is motivated by the
close connection between expected values and averages as discussed in Section 3.6. The resulting
problem is deterministic and of the form
minimize (1∕m) Σ_{i=1}^{m} fi(x), (5.46)
where fi ∶ ℝn → ℝ ̄ are defined as fi (x) = F(x, ai ). Problems of this form arise naturally in applica-
tions where a finite set of training examples is available. An example is empirical risk minimization
where the functions f1 , … , fm often take the form
fi (x) = l(aTi x, bi )
where l ∶ ℝ2 → ℝ is a loss function and (ai , bi ) ∈ ℝn × ℝ is one of m observations. We will see a
number of applications that involve optimization problems of this form in Chapter 10, including
linear regression, logistic regression, support vector machines, and artificial neural networks.
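As a simple illustration of the sample average approximation (5.45), the following Python sketch uses the squared loss l(aᵀx, b) = (1∕2)(aᵀx − b)², for which the sample average objective is a linear LS problem; the data-generating model and sample sizes are assumptions made only to show how the estimate improves with m.

import numpy as np

rng = np.random.default_rng(9)
n, x_true = 4, np.array([1.0, -2.0, 0.5, 3.0])
for m in (10, 100, 10000):
    A = rng.standard_normal((m, n))                  # observations a_1, ..., a_m
    b = A @ x_true + 0.1 * rng.standard_normal(m)    # noisy labels b_i
    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)    # minimizes (1/m) sum_i f_i(x)
    print(m, np.linalg.norm(x_hat - x_true))         # error shrinks as m grows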
Multistage stochastic optimization differs from the single-stage problem in that not all variables
are included for optimization at the same time; the optimization is performed in a sequential man-
ner in so-called stages, see, e.g. [34]. The partial optimization that is carried out at a stage is called a
decision. Usually, the decision and random outcomes at the current stage affect the value of future
decisions. To illustrate the basic principle, we now consider a random process 𝜉 ∶ Ω → ℤN , where
we assume that the set is finite,2 and we partition 𝜉 as 𝜉 = (𝜉0 , … , 𝜉N ), where 𝜉k is a random
variable associated with stage k. The decision variable xk at stage k is actually a function of the
random process and defined as xk ∶ ℤk → with (𝜉0 , … , 𝜉k ) → xk (𝜉0 , … , 𝜉k ), and where is also
a finite set. We let x = (x0 , … , xN ). We now realize that x is also a random process, and the way it
is defined it is said to be adapted to the random process 𝜉. We then define F ∶ ℤN × ℤN → ℝ and
the optimization problem
minimize 𝔼[F(x, 𝜉)],
where 𝔼[•] denotes expectation with respect to the random process 𝜉. This problem looks similar
to the single-stage stochastic optimization problem. However, because of the constraints imposed
on the decision variables, it is also possible to state the problem as
𝔼_{𝜉0}[ min_{x0(𝜉0)} 𝔼_{𝜉1}[ min_{x1(𝜉0,𝜉1)} 𝔼_{𝜉2}[ · · · 𝔼_{𝜉N}[ min_{xN(𝜉0,…,𝜉N)} F(x, 𝜉) ] · · · ]]] (5.47)

where 𝔼_{𝜉k}[•] denotes conditional expectation with respect to the probability distribution for 𝜉k given (𝜉0, … , 𝜉_{k−1}) for k ∈ ℕN, and where 𝔼_{𝜉0} is expectation with respect to 𝜉0. Here, we have made use of
the multiplication theorem discussed in Section 3.2. We realize that we may just as well consider 𝜉0
to be known and remove the expectation with respect to 𝜉0 , and this is often the way the problem
is stated. Also, notice that the innermost minimization has to be carried out parametrically with
respect to all variables except xN , and then expectation is taken with respect to 𝜉N conditioned on
given values of the random variables (𝜉0 , … , 𝜉N−1 ). One technicality is that the resulting random
variable after we carry out the parametric optimization might not be measurable in case we consider
nonfinite sets, in which case the expectation is not well defined. Another problem can be that the
parametric minimum does not exist. However, even for finite sets, the parametric optimization can
be very cumbersome to carry out, but we will see that there are applications of multistage stochas-
tic optimization to so-called “Markov decision processes,” for which the computational burden is
manageable; see Section 8.9
1 To be more precise, we let 𝜉1 , … , 𝜉m be independent identically distributed random variables, and we let
a1 , … , am be observations of each of these random variables, i.e. we repeat the same experiment m times.
2 The reason we restrict ourselves to a finite set is that the minima and random variables we implicitly define below
are not necessarily well-defined otherwise.
Exercises
5.3 Write a MATLAB script that uses YALMIP to solve the realization problem in Example 5.6
using the nuclear norm heuristic in (5.27). You can directly use norm(H,’nuclear’),
where H is the Hankel matrix, as your objective function in YALMIP, in which case you do
not need to include any constraints.
(a) Try your code on some problems generated with drss using a system order of
three. Neglect the direct term, and discard the system if it is close to being uncon-
trollable or unobservable. You can check this by computing the eigenvalues of the
controllability and observability Gramians of the system, i.e., using the commands
eig(dlyap(A,B*B’)) and eig(dlyap(A’,C’*C)). If the eigenvalues are close
to zero, then discard the system.
(b) Next, suppose that you know the first five Markov parameters, which you can easily
compute. You can then compute the 𝜖-rank of the optimal Hankel matrix by computing
its singular values. In this context, a reasonable threshold for considering a singular
value to be negligible is 𝜖 = 10−4 . Does the 𝜖-rank agree with the true order? Check at
least ten examples to get a fair statistic before you draw any conclusions.
(c) Use the result of Exercise 2.13 to obtain matrices (A, B, C). Do these matrices agree with
the ones you generated randomly, and if not, why? Do the Markov parameters agree?
(b) Show that p(x) is nonnegative for all x ∈ ℝ if and only if there exists a matrix X ∈ 𝕊+m+1
such that
ak = Σ_{0≤i,j≤m, i+j=k} X_{i+1,j+1}, 0 ≤ k ≤ 2m.
5.6 Recall that the nuclear norm of a matrix Z ∈ ℝm×n can be expressed as
||Z||∗ = sup_{W∈ℝm×n} { tr(WᵀZ) | ||W||2 ≤ 1 },
5.7 [14, Exercise 1.16] Suppose that we would like to compute the product of N matrices
M1 M2 · · · MN
where Mk ∈ ℝnk ×nk+1 , k ∈ ℕN , are given. Recall that matrix multiplication is associative, and
hence, there are (N − 1)! ways of carrying out the N − 1 matrix–matrix multiplications.
As a simple example, note that the product M1 M2 M3 may be computed as either (M1 M2 )M3
or M1 (M2 M3 ), where the parentheses indicate the order of operations. The result is always
the same in exact arithmetic, but the number of FLOPs is generally different.
Now, suppose that we would like to find a multiplication order that minimizes the total
number of FLOPs, which is known as the so-called matrix chain multiplication problem.
Derive a dynamic programming recursion using the following value functions or messages
V ∶ ℕN × ℕN → ℕ, where V(j, k) is the minimum number of FLOPs required to compute the
product Mj Mj+1 · · · Mk , where j ≤ k. Apply the resulting dynamic programming recursion
to the case, where N = 3 and (n1 , n2 , n3 , n4 ) = (2,10, 5,1).
5.8 Let A ∈ ℝ15×3 , b ∈ ℝ15 , and E ∈ ℝ15×2 , and consider the following multiparametric
quadratic program
minimize ||x − 𝟙||22
subject to Ax ⪯ b + E𝜃,
with variable x ∈ ℝ3 and parameter 𝜃 ∈ [−1, 1] × [−1, 1]. Solve an instance of this problem
using YALMIP and the multiparametric toolbox MPT3. The problem data (A, b, E) may be
generated randomly.
s1 = a1 x1
s2 = s1 + a2 x2
⋮
sm−1 = sm−2 + am−1 xm−1
b = sm−1 + am xm .
(b) Can the problem still be reformulated as a partially separable problem if some elements of a are equal to zero?
with variables xi ∈ ℝD, i ∈ ℕn. Let dij = ||xi − xj||2 and 𝛿ij = dij² for (i, j) ∈ ℰ and, similarly, let eik = ||xi − ak||2 and 𝜖ik = eik² for (i, k) ∈ ℰa, where ℰ and ℰa denote the sets of sensor–sensor and sensor–anchor pairs with distance measurements. Moreover, let X = [ x1 · · · xn ] ∈ ℝD×n and Y = XᵀX ∈ 𝕊n+.
(a) Show that the localization problem is equivalent to the optimization problem
minimize   Σ_{(i,j)∈ℰ} ( dij − rij )² + Σ_{(i,k)∈ℰa} ( eik − 𝑣ik )²
subject to Yii + Yjj − 2Yij = 𝛿ij
           𝛿ij = dij², dij ≥ 0, (i, j) ∈ ℰ
           Yii + ||ak||2² − 2xiᵀak = 𝜖ik
           𝜖ik = eik², eik ≥ 0, (i, k) ∈ ℰa
           Y = XᵀX,
with variables X, Y , dij and 𝛿ij for (i, j) ∈ , and eik and 𝜖ik for (i, k) ∈ a . Notice that the
objective function is convex, but many of the constraints are not.
where x1⋆, … , xn⋆ are the positions you obtain from the regularized relaxation and x1, … , xn are the positions you obtain using the Levenberg–Marquardt algorithm.
(e) Compute and report the mean square errors with respect to the true positions for
each of the three investigated algorithms (Levenberg–Marquardt, convex relaxation
with/without regularization). Moreover, report the value of the original objective
function for all three methods. To do this you have to compute the true distances
between the estimated positions, i.e., ||xi⋆ − xj⋆ ||2 and ||xi⋆ − ak ||2 , where (x1⋆ , … xn⋆ ) is
a solution obtained with one of the three methods. Notice that these distances are not
equal to dij and eij when you use the relaxed formulations. What method performs the
best in terms of the mean square error criterion, and what method performs the best in
terms of the original objective function?
Optimization Methods
We now turn our attention to numerical methods that can be used to solve different classes of
optimization problems. The methods are mostly iterative: given some initial point x0 , they gen-
erate a sequence of points xk , k ∈ ℕ, that converges to a local or global minimum. Such methods
were proposed already by Isaac Newton and Carl Friedrich Gauss. We will start by reviewing some
basic principles and properties that we will make use of throughout this chapter. We then discuss
first-order methods for unconstrained optimization, which are methods that make use of first-order
derivatives of the objective function. Second-order methods require that the Hessian of the objec-
tive function exists and is available, and we will see that the use of second-order information can
dramatically reduce the number of iterations required to find a solution. However, this typically
comes at the expense of more costly iterations, and we will explore the trade-off between the cost
per iteration and the number of iterations through the lens of variable metric methods, which use
first-order derivatives to approximate second-order derivatives. We will also consider methods for
nonlinear least-squares problem and methods for optimization problems that involve nonsmooth
functions and/or different types of constraints.
Many large-scale learning problems involve an objective function that is a sum of terms, and for
such problems, it is often very costly to compute the full gradient at each step. To overcome this
obstacle, we will discuss stochastic optimization methods that replace full gradients with stochastic
gradients that are cheap to compute. Moreover, many large-scale problems in learning are also
partially separable, and we will demonstrate how this can be utilized.
6.1.1 Smoothness
A function f ∶ ℝn → ℝ ̄ is said to be smooth if it is continuously differentiable on dom f , which is an
open set. Moreover, f is L-smooth on dom f if ∇f is Lipschitz continuous on dom f with Lipschitz
constant L > 0, i.e. ∇f satisfies
||∇f (y) − ∇f (x)||2 ≤ L||y − x||2 , ∀ x, y ∈ dom f . (6.1)
where B denotes the unit norm ball {p ∈ ℝⁿ | ||p|| ≤ 1}. Moreover, the directional derivative in
For the Euclidean norm, −∇f (x)∕||∇f (x)||2 is the unique normalized steepest descent direction
provided that ∇f (x) ≠ 0.
Before we turn our attention to any specific method, we will outline two different approaches to
the problem of finding a suitable step size and/or a descent direction, namely line search methods
and surrogation methods.
where 𝜙 ∶ ℝ → ℝ defined as 𝜙(t) = f (x + tΔx) is the restriction of f to the line defined by x and
the search direction Δx. This is a so-called exact line search, and although it is appealing to make
the most out of a descent direction, the minimization (6.6) can be expensive if a minimizer even
exists. We note that the exact line search is also called Cauchy’s rule in the special case, where
Δx = −∇f (x). An alternative to Cauchy’s rule is Curry’s rule, which can be stated as
t = min_{𝜏≥0} { 𝜏 | 𝜙′(𝜏) = 0 }, (6.7)
i.e. the step size is the smallest nonnegative t such that x + tΔx is a stationary point of f . This is a
root-finding problem, which can also be expensive to solve.
A more practical approach to the problem of finding a suitable step size is to choose t such that it
satisfies a sufficient descent condition known as the Armijo condition. This condition requires that
t > 0 satisfies
𝜙(t) ≤ 𝜙(0) + 𝛼1 t𝜙′ (0), (6.8)
where 𝛼1 ∈ (0, 1∕2) is a parameter. In other words, the reduction in the objective value must be
proportional to t𝜙′ (0), which is negative since Δx is assumed to be a descent direction. The Armijo
condition is illustrated in Figure 6.1. Note that it is always satisfied for some sufficiently small t > 0
since 𝜙(t) ≈ 𝜙(0) + 𝜙′ (0)t when t is close to zero. Thus, to ensure that the step makes a reason-
able amount of progress, we need to avoid short steps. One way to do this is to impose a so-called
curvature condition of the form
𝛼2 𝜙′ (0) ≤ 𝜙′ (t), (6.9)
where 𝛼2 ∈ (𝛼1 , 1) is a parameter. Roughly speaking, this means that a step size t > 0 is inadmissible
if the slope of 𝜙 at t is still downward and relatively steep. A step size t > 0 is said to satisfy the
Wolfe conditions if it satisfies the Armijo condition (6.8) as well as the curvature condition (6.9).
Figure 6.2 Illustration of the curvature conditions (6.9) and (6.10). The hatched regions correspond to step
sizes that violate the strong curvature condition, whereas step size in the gray regions violate both the
weak and strong curvature conditions.
Notice that the condition (6.9) does not rule out that 𝜙′ (t) may be large and positive. Thus, if we
would like t to be close to a stationary point of 𝜙, we may modify the curvature condition (6.9) to
include an upper bound on 𝜙′ (t), i.e.
𝛼2 𝜙′ (0) ≤ 𝜙′ (t) ≤ −𝛼2 𝜙′ (0). (6.10)
This may also be expressed as |𝜙′ (t)| ≤ 𝛼2 |𝜙′ (0)| and is often referred to as the strong curvature
condition. Figure 6.2 illustrates the curvature condition (6.9) and the stronger version (6.10).
Collectively, the Armijo condition and the strong curvature condition (6.10) comprise the strong
Wolfe conditions.
We end our discussion of line search methods by outlining two practical algorithms. The first
one is based on the Armijo condition, whereas the second one is based on the Wolfe conditions.
and the algorithm is summarized in Algorithm 6.1. The parameter 𝛽 controls how aggressively the
step size is reduced, whereas 𝛼1 controls the sufficient descent condition. To avoid unnecessarily
short steps, we require that 𝛼1 ∈ (0, 1∕2]. This bound can be motivated by considering the case
where 𝜙(t) is a quadratic function of t, i.e.
t2 ′′
𝜙(t) = 𝜙(0) + t𝜙′ (0) + 𝜙 (0).
2
The exact minimizer of 𝜙(t) is then t⋆ = −𝜙′ (0)∕𝜙′′ (0), and the Armijo condition (6.8) reduces to
t ≤ 2(1 − 𝛼1 )t⋆ . It follows that t = t⋆ does not satisfy the descent condition if 𝛼1 > 1∕2.
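A minimal Python sketch of a backtracking line search in the spirit of Algorithm 6.1 is given below; the parameter values and the quadratic test function are assumptions made for illustration, and the details of the algorithm in the text may differ.

import numpy as np

def backtracking(f, grad_f, x, dx, alpha1=0.25, beta=0.5):
    t = 1.0
    fx, slope = f(x), grad_f(x) @ dx            # phi(0) and phi'(0)
    while f(x + t * dx) > fx + alpha1 * t * slope:   # Armijo condition (6.8)
        t *= beta
    return t

# Example: one gradient descent step on f(x) = (1/2)||x||^2.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([2.0, -1.0])
t = backtracking(f, grad_f, x, -grad_f(x))
print(t, f(x - t * grad_f(x)) < f(x))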
and the trivial lower bound l = 0 and upper bound u = +∞. The Armijo condition is checked at
the beginning of each loop iteration. If it is violated, then the upper bound u is reduced to t and
the midpoint of the interval [l, u] is used as a new candidate step size. Otherwise, the curvature
condition is checked, and if it is violated, then the lower bound is increased to t, and a new candidate
step size is then the midpoint of [l, u] if u is finite and otherwise, 2t. Note that once the upper bound
u is finite, the width of the interval [l, u] is reduced by a factor of two in each loop iteration. This
observation can be used to show that the algorithm either terminates after finitely many iterations,
or alternatively, the upper bound u remains infinite and t doubles in every loop iteration, which
implies that 𝜙(t) → ∞ as t → ∞.
some order. We will introduce two kinds of surrogation methods, namely trust-region methods and
majorization minimization methods.
This implies that the sequence converges sublinearly. The sequence xk = 2−k , k ∈ ℕ, also converges
to x∗ = 0, and we find that
lim_{k→∞} |x_{k+1} − x∗|∕|x_k − x∗| = lim_{k→∞} 2^{−k−1}∕2^{−k} = 1∕2,
which shows that this sequence converges linearly. Finally, the sequence defined recursively as
xk+1 = 𝛼xk2 , k ∈ ℕ, with x1 = 1∕2 and |𝛼| ∈ (0, 2) also converges to x∗ = 0. It satisfies
lim_{k→∞} |x_{k+1} − x∗|∕|x_k − x∗|² = lim_{k→∞} |𝛼x_k²|∕|x_k|² = |𝛼|,
and hence, the sequence converges quadratically. We note that the sequence diverges if |𝛼| > 2,
and xk = 1∕2 for all k ∈ ℕ if |𝛼| = 2.
where tk > 0 is the step size at iteration k. The step size can be chosen in several ways, e.g. using
some form of line search. We will analyze the behavior of the iterates generated by (6.13) with
different assumptions on f and the step size sequence.
see Exercise 6.1. However, without additional assumptions on f, there is no guarantee that the gradient descent method (6.14) converges to either a global or a local minimum.
where we have used the fact that ||xk − x⋆ ||2 is a nonincreasing function of k if tk ∈ (0, 2∕L)
for all k. Assuming that x0 ≠ x⋆ , we may rewrite this inequality as
( f(xk) − p⋆ ) ∕ ||x0 − x⋆||2 ≤ ||∇f(xk)||2.
Combining this with (6.17), we have that
𝛿_{k+1} ≤ 𝛿k − tk( 1 − (L∕2)tk ) 𝛿k²∕R²,
where 𝛿k = f (xk ) − p⋆ and R = ||x0 − x⋆ ||2 . Dividing by 𝛿k 𝛿k+1 on both sides and rearranging the
terms, we arrive at
1∕𝛿_{k+1} ≥ 1∕𝛿k + tk( 1 − (L∕2)tk ) (𝛿k∕𝛿_{k+1}) (1∕R²)
≥ 1∕𝛿k + tk( 1 − (L∕2)tk ) (1∕R²)
≥ 1∕𝛿0 + (1∕R²) Σ_{i=0}^{k} ti( 1 − (L∕2)ti ),
where the last inequality follows by recursive application of the previous inequality. Choosing a
constant step size ti = 𝜌∕L with 𝜌 ∈ (0, 2) leads to the bound
𝛿_{k+1} ≤ 𝛿0 LR² ∕ ( LR² + (k + 1)𝛿0 𝜌(1 − 𝜌∕2) ),
where 𝜌 = 1 minimizes the right-hand side. Thus, it follows that when f is L-smooth and convex,
then the gradient descent iteration as outlined in Algorithm 6.4 satisfies
f (xk ) − p⋆ = O(1∕k),
which means that the worst-case rate of convergence is sublinear. However, note that this does not
guarantee that xk converges unless p⋆ is attained.
Example 6.2 Consider the function f ∶ ℝ2 → ℝ defined as f (x) = g(Ax + b), where g ∶ ℝ3 → ℝ
is the log-sum-exp function
g(z) = ln (ez1 + ez2 + ez3 ) ,
and
A = ⎡  2   1 ⎤        b = ⎡ −2 ⎤
    ⎢ −1   1 ⎥ ,           ⎢ −1 ⎥ .
    ⎣ −1  −2 ⎦             ⎣  0 ⎦
Figure 6.3 Examples of the gradient descent iteration with three different starting points. The starting
points and the first ten iterations are shown.
||∇2 f (x)||2 = ||AT ∇2 g(Ax + b)A||2 ≤ ||A||22 ||∇2 g(Ax + b)||2 ≤ ||A||22 ∕2.
The last inequality follows from the fact that ||∇2 g(z)||2 ≤ 1∕2, which can be shown using the
so-called “Gershgorin circle theorem”. Figure 6.3 shows the level curves of the function f along
with the iterates of the gradient descent iterations for three different starting points and the con-
stant step size tk = 2∕||A||22 . Observe that the starting point has a significant effect on the practical
performance but not the asymptotic behavior.
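A short Python sketch of the iteration used in this example is given below; the starting point and iteration count are arbitrary choices made here, and the gradient of f is ∇f(x) = Aᵀ softmax(Ax + b).

import numpy as np

A = np.array([[2.0, 1.0], [-1.0, 1.0], [-1.0, -2.0]])
b = np.array([-2.0, -1.0, 0.0])

def f(x):
    z = A @ x + b
    return np.log(np.sum(np.exp(z)))

def grad_f(x):
    z = A @ x + b
    w = np.exp(z - np.max(z))                 # numerically safe softmax
    return A.T @ (w / np.sum(w))

t = 2.0 / np.linalg.norm(A, 2) ** 2           # constant step size from the text
x = np.array([0.0, 1.0])                      # one of several possible starting points
for k in range(50):
    x = x - t * grad_f(x)
print(x, f(x), np.linalg.norm(grad_f(x)))     # gradient norm decreases slowly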
where x⋆ is the unique minimizer, and hence, ∇f (x⋆ ) = 0. Using (6.4), it follows that
Using this inequality recursively and employing a constant step size tk = t ∈ (0, 2∕(𝜇 + L)], we con-
clude that
( )k
2t𝜇L
||xk − x⋆ ||22 ≤ 1 − ||x0 − x⋆ ||22 . (6.19)
𝜇+L
The best upper bound is obtained with t = 2∕(𝜇 + L), in which case
1 − 2t𝜇L∕(𝜇 + L) = ( (L − 𝜇)∕(L + 𝜇) )² = ( (𝜅 − 1)∕(𝜅 + 1) )²,
where 𝜅 = L∕𝜇 may be viewed as a condition number. Indeed, if f is twice continuously differen-
tiable, 𝜇-strongly convex, and L-smooth, then the eigenvalues of ∇2 f (x) belong to the interval [𝜇, L]
for all x, or equivalently, 𝜇I ⪯ ∇2 f (x) ⪯ LI. Thus, the ratio 𝜅 = L∕𝜇 may be viewed as an upper
bound on the condition number of the Hessian.
With additional assumptions on f , a pure Newton step can be shown to yield a descent if xk is
sufficiently close to a stationary point of f . However, a full step does not necessarily yield a descent
if xk is far away from a stationary point, in which case a damped Newton step should be used.
whenever ||∇f (xk )||2 ≤ 𝛿. The Lipschitz condition (6.25) implies that
f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (1∕2)(y − x)ᵀ∇²f(x)(y − x) + (L2∕6)||y − x||2³,
and substituting xk for x and xk+1 = xk + Δxnt for y, we find that
f(x_{k+1}) ≤ f(xk) + ∇f(xk)ᵀΔxnt + (1∕2)Δxntᵀ∇²f(xk)Δxnt + (L2∕6)||Δxnt||2³
≤ f(xk) − λ(xk)² + (1∕2)λ(xk)² + (L2∕(6𝜇^{3∕2}))λ(xk)³,
where we have used the result that 𝜇||Δxnt ||22 ≤ λ(xk )2 , which follows from (6.24). Rewriting this
inequality as
f(x_{k+1}) ≤ f(xk) − ( 1∕2 − (L2∕(6𝜇^{3∕2}))λ(xk) ) λ(xk)²,
we see that the pure Newton step satisfies the backtracking line search condition (6.28) if
1∕2 − (L2∕(6𝜇^{3∕2}))λ(xk) ≥ 𝛼,

or, equivalently, if

λ(xk) ≤ (3𝜇^{3∕2}∕L2)(1 − 2𝛼).
Combining this inequality with the bound λ(xk ) ≤ 𝜇 −1∕2 ||∇f (xk )||2 , which is readily obtained from
(6.23), we conclude that a sufficient condition for the line search condition to be satisfied is that
||∇f(xk)||2 ≤ 𝛿 ≤ (3𝜇²∕L2)(1 − 2𝛼). (6.29)
Next, to derive the bound (6.26), we start by showing that the pure Newton update
xk+1 = xk + Δxnt satisfies
||∇f(x_{k+1})||2 ≤ (L2∕(2𝜇²))||∇f(xk)||2². (6.30)
This follows from the Lipschitz condition (6.25) by noting that
||∇f(x_{k+1})||2 = || ∇f(xk) + ∫₀¹ ∇²f(xk + 𝜏Δxnt)Δxnt d𝜏 ||2
= || ∫₀¹ [ ∇²f(xk + 𝜏Δxnt) − ∇²f(xk) ] Δxnt d𝜏 ||2
≤ (L2∕2)||Δxnt||2²,
and using the result that ||Δxnt ||22 = ||∇2 f (xk )−1 ∇f (xk )||22 ≤ 𝜇 −2 ||∇f (xk )||22 , we arrive at the inequality
(6.30). Applying (6.30) recursively, we find that after l pure Newton steps,
||∇f(x_{k+l})||2 ≤ ( (L2∕(2𝜇²))||∇f(xk)||2 )^{2^l − 1} ||∇f(xk)||2. (6.31)
Thus, to satisfy both (6.29) and (L2∕(2𝜇²))||∇f(xk)||2 ≤ 1∕2, we take

𝛿 = (𝜇²∕L2) min ( 1, 3(1 − 2𝛼) ), (6.32)

which results in the bound

||∇f(x_{k+l})||2 ≤ (1∕2)^{2^l − 1} 𝛿, l ∈ ℤ+.
This means that the sequence ||∇f (xk+l )||2 , l ∈ ℤ+ , converges at least quadratically to 0.
The strong convexity assumption can also be used to derive upper bounds on ||xk − x⋆||2 and the suboptimality f(xk) − p⋆. Indeed, using (6.26) and the fact that ||x − x⋆||2 ≤ (2∕𝜇)||∇f(x)||2 for all x ∈ S when f is 𝜇-strongly convex on S, we see that

||x_{k+l} − x⋆||2 ≤ (2∕𝜇)||∇f(x_{k+l})||2 ≤ 𝜇⁻¹ (1∕2)^{2^l − 2} ||∇f(xk)||2, l ∈ ℤ+.

Moreover, using the fact that f(x) − p⋆ ≤ (1∕(2𝜇))||∇f(x)||2² for all x ∈ S, cf. (4.30), we find that

f(x_{k+l}) − p⋆ ≤ 𝜇⁻¹ (1∕2)^{2^{l+1} − 1} ||∇f(xk)||2², l ∈ ℤ+.
where we have used the lower bound λ(x)2 ≥ 𝜇||Δxnt ||22 and the fact that ∇f (xk )T Δxnt = −λ(xk )2 .
Substituting 𝜇∕L for tk , which minimizes the right-hand side, yields the bound
f(x_{k+1}) ≤ f(xk) − (𝜇∕(2L))λ(xk)²
< f(xk) − 𝛼1(𝜇∕L)λ(xk)²,
since 𝛼1 ∈ (0, 1∕2). As a result, the step size 𝜇∕L satisfies the Armijo condition (6.8), and hence, the backtracking line search must terminate with a step size that satisfies tk ≥ 𝛽𝜇∕L. Now, using the bound λ(xk)² ≥ L⁻¹||∇f(xk)||2², which follows from (6.23), and the assumption that ||∇f(xk)||2 ≥ 𝛿, we arrive at
f(x_{k+1}) − f(xk) < −𝛼1𝛽(𝜇∕L)λ(xk)²
≤ −𝛼1𝛽(𝜇∕L²)||∇f(xk)||2²
≤ −𝛼1𝛽𝛿²𝜇∕L².
We conclude that (6.27) is satisfied with 𝜎 = 𝛼1 𝛽𝛿 2 𝜇∕L2 .
Example 6.3 To illustrate the typical behavior of Newton's method, we now revisit the smooth function f ∶ ℝ2 → ℝ from Example 6.2, i.e.
Figure 6.4 shows the iterates obtained using Newton’s method with a backtracking line search and
for three different starting points. Unlike the gradient method, Newton’s method clearly converges
very rapidly in the vicinity of the minimizer.
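A short Python sketch of a damped Newton iteration on the same function is given below; the starting point, the line search parameters, and the iteration count are assumptions made for illustration. The Hessian of the log-sum-exp function g at z is diag(p) − ppᵀ with p = softmax(z), so ∇²f(x) = Aᵀ(diag(p) − ppᵀ)A.

import numpy as np

A = np.array([[2.0, 1.0], [-1.0, 1.0], [-1.0, -2.0]])
b = np.array([-2.0, -1.0, 0.0])

def fgh(x):
    z = A @ x + b
    w = np.exp(z - np.max(z))
    p = w / np.sum(w)
    f = np.log(np.sum(w)) + np.max(z)              # f(x)
    g = A.T @ p                                    # gradient
    H = A.T @ (np.diag(p) - np.outer(p, p)) @ A    # Hessian
    return f, g, H

x = np.array([0.0, 1.0])
for k in range(6):
    f, g, H = fgh(x)
    dx = -np.linalg.solve(H, g)                    # Newton direction
    t = 1.0
    while fgh(x + t * dx)[0] > f + 0.25 * t * (g @ dx):   # Armijo backtracking
        t *= 0.5
    x = x + t * dx
    print(k, np.linalg.norm(g))                    # rapid decrease near the minimizer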
As an alternative to the second-order Taylor approximation in (6.20) that we used to derive the
Newton direction, we now consider a more general convex quadratic approximation mk ∶ ℝn → ℝ
of f at xk of the form
mk(y) = f(xk) + ∇f(xk)ᵀ(y − xk) + (1∕2)(y − xk)ᵀBk(y − xk), (6.36)
where Bk ≻ 0. The function mk can be viewed as a local surrogate of the function f at xk , and it is
easy to check that it satisfies
mk (xk ) = f (xk ), ∇mk (xk ) = ∇f (xk ).
The approximation mk may be used to define a search direction, i.e.
Δx = argmin_p mk(xk + p) = −Bk⁻¹∇f(xk),
which is the steepest descent direction in the quadratic norm || ⋅ ||Bk . This motivates an iteration of
the form
x_{k+1} = xk − tk Bk⁻¹∇f(xk), k ∈ ℤ+, (6.37)
where the step size is chosen using some form of line search. Note that the update corresponds to
the gradient descent method in (6.13) if we let Bk = I, and the choice Bk = ∇2 f (xk ) corresponds
to Newton’s method.
Generally speaking, the Newton direction is a better search direction than the negative gradient,
but it is also more expensive to compute. Variable metric methods can be viewed as a compromise
between the two methods in that they maintain an approximation of the Hessian or its inverse and
update the approximation without computing second-order derivatives. The main condition that
is used to update Bk or Hk = Bk^{-1} is the so-called secant equation
Bk+1 (xk+1 − xk) = ∇f(xk+1) − ∇f(xk). (6.38)
Recall that the definition of mk implies that ∇mk+1 (xk+1 ) = ∇f (xk+1 ), and the secant equation is
simply the additional condition that ∇mk+1 (xk ) = ∇f (xk ). We will define yk = ∇f (xk+1 ) − ∇f (xk ) and
sk = xk+1 − xk so that the secant equation can be expressed Bk+1 sk = yk . We will henceforth drop the
iteration index k from yk and sk to simplify the notation.
makes the method impractical for problems with large n. The limited-memory BFGS method,
which is also known as L-BFGS, addresses this issue by storing a limited history of, say, m BFGS
update pairs (sk−l , yk−l ), l ∈ ℕm , that are used to implicitly define Hk for k ≥ m. Specifically, Hk is
defined as a sequence of m BFGS updates starting with an initial approximate inverse Hessian Hk0 ,
which is typically chosen as a diagonal matrix or a scaled identity matrix such as Hk0 = 𝛾k I, where
𝛾k = yTk−1 sk−1 ∕||yk−1 ||22 . Note that rather than explicitly applying the m BFGS updates to form
Hk , the m update pairs are used to recursively compute matrix-vector products with Hk without
forming it. We note that the L-BFGS method requires O(n) memory for a fixed memory parameter m,
which is typically between 10 and 50, and hence, L-BFGS requires significantly less memory than
BFGS when n is large.
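To make the implicit computation of products with Hk concrete, the following sketch shows the standard two-loop recursion, assuming the m most recent pairs are stored in cell arrays S and Y (oldest to newest) and that the scaled identity Hk0 = 𝛾I is used; the function name and interface are illustrative and not taken from the text.

function d = lbfgs_direction(g, S, Y, gamma)
% Two-loop recursion: returns d = Hk*g without forming Hk explicitly.
% S{i}, Y{i} are the stored update pairs, ordered from oldest to newest.
m = numel(S);
alpha = zeros(m, 1);
rho = zeros(m, 1);
q = g;
for i = m:-1:1
    rho(i) = 1 / (Y{i}' * S{i});
    alpha(i) = rho(i) * (S{i}' * q);
    q = q - alpha(i) * Y{i};
end
d = gamma * q;                         % apply the initial approximation Hk0 = gamma*I
for i = 1:m
    beta = rho(i) * (Y{i}' * d);
    d = d + (alpha(i) - beta) * S{i};
end
end

The search direction in (6.37) would then be obtained as -lbfgs_direction(gk, S, Y, gamma) with gk = ∇f(xk).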
where r = y − Bk s. Taking the inner product with s on both sides of this equation yields 𝜎(𝑣T s)2 =
r T s. Combining this with (6.42), we find that
vv^T = rr^T/(𝜎 v^T s)^2 = rr^T/(𝜎 r^T s),
and hence,
Bk+1 = Bk + rr^T/(r^T s). (6.43)
Note that if r = 0 or r T s = 0, we simply skip the update and take Bk+1 = Bk . Unlike the BFGS and
DFP updates, the SR1 update does not guarantee that Bk+1 is positive definite, and hence, it is often
used in combination with a trust-region method.
The resulting method is known as the proximal gradient (PG) method, and it can also be expressed
as the iteration
xk+1 = prox_{tk h}( xk − tk ∇g(xk) ), k ∈ ℤ+, (6.46)
where tk = 1/L and prox_h : ℝ^n → ℝ^n is the so-called proximal operator associated with the function h, defined as
prox_h(x) = argmin_y { h(y) + (1/2)||y − x||2^2 }. (6.47)
From the definition of the proximal operator, we see that
u = proxh (x) ⟺ x − u ∈ 𝜕h(u),
and hence, the PG update in (6.46) can be expressed as
xk+1 = xk − tk ∇g(xk ) − tk 𝑣k+1 ,
for some 𝑣k+1 ∈ 𝜕h(xk+1). Note that evaluating the proximal operator amounts to solving an optimization problem, and hence, a single PG iteration is expensive unless the proximal operator is cheap to evaluate. The PG method is outlined in Algorithm 6.7. It can also be combined
with a line search as an alternative to the constant step size tk = 1∕L.
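As an illustration, the following sketch applies the PG iteration (6.46) to an ℓ1-regularized least-squares problem with g(x) = (1/2)||Ax − b||2^2 and h(x) = 𝛾||x||1, for which the proximal operator is elementwise soft thresholding; the data A, b and the parameter gamma are assumed given, and the constant step size 1/L with L = ||A||2^2 is used.

L = norm(A)^2;                        % Lipschitz constant of the gradient of g
t = 1 / L;
x = zeros(size(A, 2), 1);
for k = 1:500
    grad = A' * (A * x - b);          % gradient of g at the current iterate
    z = x - t * grad;                 % gradient step
    x = sign(z) .* max(abs(z) - t * gamma, 0);   % prox of t*gamma*||.||_1 (soft thresholding)
end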
When g is nonconvex, the PG method can be shown to converge to a stationary point under some
mild conditions. The first-order necessary optimality condition for (6.44) implies that a stationary
point x⋆ must satisfy
−∇g(x⋆ ) ∈ 𝜕h(x⋆ ). (6.48)
If we compare this condition to the stationarity condition for the majorization in (6.45) at x⋆ , i.e.
−∇g(x⋆ ) − L(y − x⋆ ) ∈ 𝜕h(y),
we see that the conditions coincide when y = x⋆ . This implies that a fixed-point of the iteration in
(6.46) is a stationary point of f = g + h. In the convex case, i.e. when both g and h are convex, the
PG iteration can be shown to satisfy
f (xk ) − p⋆ = O(1∕k).
An overview of other variants of the PG method, including a detailed analysis of both the convex
and the nonconvex case, can be found in [10]. We end this section by mentioning a few methods
that are closely related to the PG method.
with mass, e.g. a heavy ball, under friction in a potential field. The constants 𝜇 and m are friction and mass constants, respectively. Employing a finite-difference method leads to an iteration of the form
xk+1 = xk − 𝛾∇f(xk) + 𝜂(xk − xk−1),
where 𝛾 and 𝜂 are constants and xk = x(tk) is the state at time tk. This is an example of a so-called "multistep method." Notice that it reduces to the gradient method if 𝜂 = 0, and hence, it is the last term that introduces momentum.
Inspired by the heavy ball method, Nesterov [80] proposed an accelerated gradient method that
satisfies the improved worst-case bound
f (xk ) − p⋆ = O(1∕k2 ),
when f is convex and L-smooth. An extension of this method is the so-called accelerated proximal
gradient (APG) method, which is due to [11] and outlined in Algorithm 6.8. Unlike the PG method,
the APG method shown here is not a descent method. The next example illustrates the effect of the
acceleration.
[Figure 6.5: Suboptimality f(xk) − p⋆ versus iteration number k for the PG and APG methods.]
Figure 6.5 shows a numerical example for a problem instance with m = 200 and n = 800. The
plot clearly shows that the PG method is a descent method, whereas the APG method is not. Both
methods exhibit a sublinear convergence rate, but the effect of acceleration makes a significant
difference.
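A common accelerated variant, in the spirit of Algorithm 6.8 (whose exact form in the text may differ), adds an extrapolation step to the PG iteration. The sketch below again uses the ℓ1-regularized least-squares example with data A, b, parameter gamma, and step size 1/L.

L = norm(A)^2;
t = 1 / L;
x = zeros(size(A, 2), 1);
x_prev = x;
for k = 1:500
    theta = (k - 1) / (k + 2);            % extrapolation parameter
    y = x + theta * (x - x_prev);         % momentum/extrapolation step
    z = y - t * (A' * (A * y - b));       % gradient step at the extrapolated point
    x_prev = x;
    x = sign(z) .* max(abs(z) - t * gamma, 0);   % proximal (soft-thresholding) step
end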
We now consider optimization problems with an objective function that can be expressed as the difference g(x) − h(x) of two convex functions, where g : ℝ^n → ℝ̄ and h : ℝ^n → ℝ̄ are both proper, closed, and convex functions. Such problems
are often referred to as difference convex optimization problems. The difference g − h is generally
not a convex function, but a convex majorization of g − h at a point x can easily be constructed if
h is continuously differentiable or subdifferentiable at x. Indeed, since h is convex, we can use an
affine lower bound on h
for all y, x ∈ ℝ^n. The assumption that h is nondecreasing implies that h′(|xi|) ≥ 0, and hence, the left-hand side is a convex function of y, and it is a majorization of the right-hand side at x. This
leads to the majorization minimization iteration
where 𝑤k ⪰ 0 is the vector with elements 𝑤k,i = h′ (|xk,i |) for i ∈ ℕn . For example, the function
h(t) = ln(t + 𝛿) with 𝛿 > 0 is an increasing concave function on (−𝛿, ∞), which leads to the weight
update 𝑤k,i = (|xk,i | + 𝛿)−1 . We end this example by noting that the approach is also known as
iteratively reweighted 𝓁1 -regularization since each iteration involves a weighted 𝓁1 -regularized
optimization problem with new weights.
Recall the general nonlinear LS problem (5.1), which is an unconstrained optimization problem of
the form
minimize (1/2)||f(x)||2^2,
where f ∶ ℝn → ℝm is defined as f (x) = (f1 (x), … , fm (x)) and fi ∶ ℝn → ℝ, i ∈ ℕm . This is generally
a difficult nonlinear optimization problem. Several local optimization methods exist that are tai-
lored to the specific structure of the problem. One such method is the Gauss–Newton (GN) method,
which is applicable when the functions f1 , … , fm are continuously differentiable. The basic idea is
to replace f by its first-order Taylor approximation around the current iterate xk, i.e. f̃ : ℝ^n → ℝ^m defined as
f̃(x; xk) = f(xk) + (𝜕f(xk)/𝜕x^T)(x − xk).
The GN method can then be expressed as the iteration
xk+1 ∈ argmin_x (1/2)||f̃(x; xk)||2^2, k ≥ 0. (6.57)
Each iteration involves solving a linear least-squares problem since f̃ (x; xk ) is an affine function of x,
and the update is unique if the Jacobian matrix 𝜕f (xk )∕𝜕xT has full rank. Noting that (1∕2)||f̃ (x; xk )||22
is a quadratic approximation of (1∕2)||f (x)||22 , we see that the GN method can also be viewed as a
variable metric method.
If f is twice continuously differentiable, then the Hessian of f0 (x) = (1∕2)||f (x)||22 is given by
∇^2 f0(x) = ∑_{i=1}^{m} ( fi(x)∇^2 fi(x) + ∇fi(x)∇fi(x)^T ).
This suggests that the GN update (6.57) mimics a pure Newton step if the terms involving ∇^2 fi(x) on the right-hand side are negligible, in which case we can expect super-linear convergence. However, although the
GN method often works well in practice, it may fail to converge. This issue can be addressed by
combining the GN method with a line search provided that the sublevel set {x | f0 (x) ≤ f0 (x0 )} is
bounded and that 𝜕f (xk )∕𝜕xT has full rank for all k.
we see that if 𝜇k is sufficiently large, then the LM method will essentially behave like gradient descent with a step size that is roughly equal to 1/𝜇k. On the other hand, if 𝜇k is small, then the step is very close to a Gauss–Newton step. Algorithm 6.9 adjusts 𝜇k in an adaptive manner and is
one of many different variants of the LM algorithm. For example, there are trust-region variants of
the GN method, which are also sometimes referred to as LM algorithms; see, e.g. [78].
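One simple way to adjust 𝜇k adaptively is sketched below (the details of Algorithm 6.9 in the text may differ): the damped step solves (J^T J + 𝜇I)Δx = −J^T f, 𝜇 is decreased after a successful step and increased otherwise, and the method approaches a pure Gauss–Newton step as 𝜇 becomes small. The function handles resfun and jacfun, returning f(x) and the Jacobian 𝜕f(x)/𝜕x^T, and the starting point x are assumptions of this illustration.

mu = 1e-2;
for k = 1:100
    f = resfun(x);
    J = jacfun(x);
    dx = -(J' * J + mu * eye(numel(x))) \ (J' * f);   % damped Gauss-Newton step
    if 0.5 * norm(resfun(x + dx))^2 < 0.5 * norm(f)^2
        x = x + dx;  mu = mu / 2;     % accept the step and reduce the damping
    else
        mu = 2 * mu;                  % reject the step and increase the damping
    end
end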
where g ∶ ℝp → ℝm is defined as
g(𝛼) = f (x⋆ (𝛼), 𝛼) = −(I − A(𝛼)A(𝛼)† )b(𝛼).
Recall that P(𝛼) = I − A(𝛼)A(𝛼)† is a projection matrix, and hence g(𝛼) is the projection of −b(𝛼)
onto the nullspace of A(𝛼)T . We will henceforth assume that rank(A(𝛼)) = n, which implies that
A† (𝛼) = (A(𝛼)T A(𝛼))−1 A(𝛼)T , and we will omit 𝛼 from A(𝛼), b(𝛼), and P(𝛼) for notational conve-
nience.
The method we will derive is often called the variable projection method, since it is based on
minimizing the variable projection functional (6.60). In principle, one could simply apply the GN
method or the LM algorithm directly to this problem, but it turns out that there is a better way to
proceed. The idea is to use only an approximate gradient [61]. We will follow the derivation in [86].
The gradient of g(𝛼) is the transpose of 𝜕g/𝜕𝛼^T, which may be expressed as
𝜕g/𝜕𝛼^T = −(𝜕P/𝜕𝛼)b − P(𝜕b/𝜕𝛼^T),
where we define (𝜕P/𝜕𝛼)b to be the matrix with columns (𝜕P/𝜕𝛼_k)b. Now, using the fact that
In the context of the stochastic problem in (6.62), the nonlinear equation of interest is the sta-
tionarity condition ∇f (x) = 0, where f (x) = 𝔼[ F(x, 𝜉) ]. The resulting algorithm is an iteration of
the form
xk+1 = xk − tk gk , k ∈ ℤ+ , (6.63)
where x0 is an initial guess, tk > 0 is the step size at iteration k, and gk ∈ ℝ^n is a realization of an estimator Gk ∈ ℝ^n of ∇f(xk). It should be stressed that Gk is a random variable for each k. Hence,
the iteration (6.63) is a realization of a stochastic process
Xk+1 = Xk − tk Gk , k ∈ ℤ+ , (6.64)
where the random variable X0 is the initial state of the process. We consider an unbiased estimator
Gk of ∇f (xk ) with bounded variance, i.e.
𝔼[ Gk | Xk = xk ] = ∇f(xk), 𝔼[ ||Gk − ∇f(xk)||2^2 | Xk = xk ] ≤ c^2, (6.65)
for all k and for some scalar c ≥ 0. Note that in the special case where c = 0, the iteration in (6.63)
is essentially the gradient descent method (6.13). We note that the gradient of F(x, 𝜉) with respect
to x, which we will denote ∇F(x, 𝜉), is a random variable, and it is an unbiased estimator of ∇f (x) if
𝜕𝔼[ F(x, 𝜉) ]/𝜕xi = 𝔼[ 𝜕F(x, 𝜉)/𝜕xi ], i ∈ ℕn.
When this condition is satisfied, it is often natural to choose Gk = ∇F(Xk, 𝜉k), where 𝜉k has the same
distribution as 𝜉 for all k ≥ 0, and where 𝜉k and 𝜉j are independent for j ≠ k.
The iteration (6.63) is often referred to as a stochastic gradient (SG) method or a stochastic
gradient descent (SGD) method. However, it is important to note that it is not a descent method in
the deterministic sense, i.e. the search direction −gk is not necessarily a descent direction, so the
objective value may increase in some iteration. We note that the step size tk is often referred to as
the learning rate in machine learning literature.
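As a minimal illustration of (6.63) with a decaying step size of the form tk = t/(k + 1 + 𝜁)^𝛿 (cf. (6.70) below), consider the following sketch, where sgrad(x) is assumed to return a realization gk of the estimator Gk at x and x0 is an initial guess.

t0 = 0.1;  zeta = 0;  delta = 0.7;     % step-size parameters
x = x0;
for k = 0:999
    tk = t0 / (k + 1 + zeta)^delta;    % decaying step size
    x = x - tk * sgrad(x);             % stochastic gradient step
end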
To better understand the influence of the step size sequence tk , we will now analyze the process
under different assumptions on f .
where p⋆ = inf x f (x), and combining this inequality with (6.67) yields
∑_{j=0}^{k−1} tj (1 − Ltj/2) · min_{j=0,…,k−1} 𝔼[ ||∇f(Xj)||2^2 ] ≤ 𝔼[ f(X0) ] − p⋆ + (Lc^2/2) ∑_{j=0}^{k−1} tj^2.
Equivalently, if we divide by the sum ∑_{j=0}^{k−1} tj (1 − Ltj/2) on both sides, and assuming that it is positive, we see that
min_{j=0,…,k−1} 𝔼[ ||∇f(Xj)||2^2 ] ≤ (𝔼[ f(X0) ] − p⋆) / (∑_{j=0}^{k−1} tj (1 − Ltj/2)) + (Lc^2/2)(∑_{j=0}^{k−1} tj^2) / (∑_{j=0}^{k−1} tj (1 − Ltj/2)). (6.68)
It follows that a sufficient condition for the right-hand side to vanish as k → ∞ is that
lim_{k→∞} ∑_{j=0}^{k−1} tj (1 − Ltj/2) = ∞, lim_{k→∞} (∑_{j=0}^{k−1} tj (1 − Ltj/2)) / (∑_{j=0}^{k−1} tj^2) = ∞,
or equivalently,
lim_{k→∞} ∑_{j=0}^{k−1} tj = ∞, lim_{k→∞} (∑_{j=0}^{k−1} tj) / (∑_{j=0}^{k−1} tj^2) = ∞. (6.69)
Examples of step size sequences that satisfy these conditions are sequences of the form
tk = t/(k + 1 + 𝜁)^𝛿, k ∈ ℤ+, (6.70)
where t > 0, 𝜁 ≥ 0, and 𝛿 ∈ (0, 1] are fixed parameters. The parameter t scales the sequence,
𝛿 controls the asymptotic rate of decay, and 𝜁 may be used to reduce the rate of decay in early
iterations. The value of t has no effect on the asymptotic behavior, but it typically has a strong
effect on the nonasymptotic behavior. To see this, first note that step-size sequences of the form in
(6.70) satisfy
1/(∑_{j=0}^{k−1} tj) ∝ 1/t, (∑_{j=0}^{k−1} tj^2)/(∑_{j=0}^{k−1} tj) ∝ t.
Comparing with the right-hand side of (6.68), we see that the choice of t presents a trade-off between
the two terms: increasing t reduces the first term, but increases the second and vice versa.
We now analyze the worst-case bound in (6.68) for different step-size sequences. We start by
noting that if c = 0, which corresponds to the ordinary gradient method, then it suffices to choose
Figure 6.6 Construction of upper and lower bounds on sk = ∑_{j=1}^{k} (j + 𝜁)^{−p}, illustrated for k = 7. The gray area is equal to s7.
a constant step-size sequence tk = t ∈ (0, 2/L) such that ∑_{j=0}^{k−1} tj (1 − Ltj/2) = O(k). Indeed, this
implies that the right-hand side of (6.68) decays as O(1∕k). However, in the stochastic setting,
where c > 0, a constant step-size sequence does not make the right-hand side of (6.68) vanish
as k → ∞. In this case, we will instead consider the decreasing step-size sequence in (6.70) for
different values of the decay rate parameter 𝛿. The sum of the first k step sizes and the sum of their squares can be expressed as
∑_{j=0}^{k−1} tj = t ∑_{j=1}^{k} (j + 𝜁)^{−𝛿}, ∑_{j=0}^{k−1} tj^2 = t^2 ∑_{j=1}^{k} (j + 𝜁)^{−2𝛿},
both of which are proportional to a sum of the form sk = ∑_{j=1}^{k} (j + 𝜁)^{−p} with p = 𝛿 or p = 2𝛿, respectively.
To expose the asymptotic behavior of this sum, we now bound sk from above and below in terms of the definite integral ∫_1^k (𝜏 + 𝜁)^{−p} d𝜏, as illustrated in Figure 6.6. This leads to the inequalities
∫_1^k (𝜏 + 𝜁)^{−p} d𝜏 ≤ sk ≤ (1 + 𝜁)^{−p} + ∫_1^k (𝜏 + 𝜁)^{−p} d𝜏,
and using the result that
lim_{k→∞} ∫_1^k (𝜏 + 𝜁)^{−p} d𝜏 = (1 + 𝜁)^{1−p}/(p − 1) for p > 1, and = ∞ for p ∈ [0, 1],
we can conclude that sk converges when p > 1 and diverges otherwise. In the latter case, the upper
and lower bounds on sk allow us to establish the asymptotic equivalence
sk ∼ k^{1−p} for p ∈ [0, 1), and sk ∼ ln(k) for p = 1.
This result may be used to derive upper bounds on the right-hand side of (6.68) for different values
of 𝛿, which are summarized in Table 6.1. Note that asymptotically, the upper bound decays the
fastest when 𝛿 = 1∕2. However, this choice is not necessarily the best one in practice.
[Table 6.1: ∑_{j=0}^{k−1} tj, ∑_{j=0}^{k−1} tj^2, and the resulting upper bound in (6.68) for different values of the parameter 𝛿.]
where ik ∈ ℕm is chosen according to some index selection rule. Two of the most common rules
are the cyclic rule ik = (k mod m) + 1 and the random rule, where each ik is chosen uniformly at
random from ℕm . With the random index selection rule, the incremental gradient method may
be viewed as an SG method applied to (5.44) if the random variable 𝜉 is discrete with m equiprobable
outcomes. We note that more general incremental proximal gradient (IPG) methods that can handle
simple nondifferentiable functions akin to the proximal gradient method (6.46) have been proposed
and analyzed in [16].
We now return to the problem in (5.46), where the objective function consists of a sum of m
functions, e.g. corresponding to m observations. Applying the gradient method to this problem
requires the full gradient, i.e.
∇f(xk) = (1/m) ∑_{i=1}^{m} ∇fi(xk).
In other words, the gradient of all the functions must be computed in order to obtain the search
direction, and as a result, the gradient method is sometimes referred to as a batch gradient method.
In contrast to the gradient method, the incremental gradient method uses as search direction the
negative gradient of a single function, i.e. −∇fik (xk ), and as a consequence, an incremental gradient
iteration can be much cheaper than a gradient iteration when m is large. However, the gradient
of a single function may be viewed as a noisy approximation of the full gradient, and hence, the
search direction is not necessarily a descent direction. A compromise between the gradient and
incremental gradient methods may be obtained by using a subset of p functions at each iteration, i.e.
xk+1 = xk − (tk/p) ∑_{i∈ℐk} ∇fi(xk),
where ℐk ⊂ ℕm and |ℐk| = p. This approach, which is known as a mini-batch method, provides a
way to reduce the variance of the search direction at the expense of increased computation cost per
iteration.
minimize (𝛾/2)||x||2^2 + (1/m) ∑_{i=1}^{m} ln(1 + exp(−yi ai^T x)),
with variable x ∈ ℝn , regularization parameter 𝛾 > 0, and problem data y ∈ {−1, 1}m and
A ∈ ℝm×n , where aTi is the ith row of A. To apply the stochastic gradient method to this problem,
we define a realization of a stochastic gradient of the objective function at xk as
gk = 𝛾xk − (1/p) ∑_{i∈ℐk} ( yi exp(−yi ai^T xk) / (1 + exp(−yi ai^T xk)) ) ai,
where ℐk is a sample of p elements from ℕm drawn at random without replacement. Note that
gk = ∇f(xk) in the special case where p = m.
Figure 6.7 The relative suboptimality for the stochastic gradient method using step-size sequences of the form (6.70) with initial step size t and decay parameter 𝛿.
Figure 6.7 illustrates the typical behavior of the stochastic gradient method based on a data set with m = 32561 and n = 123 using the mini-batch size p = 326, which corresponds to roughly 1% of the data. The plots show the relative suboptimal-
ity |f (xk ) − f (x⋆ )|∕|f (x⋆ )| obtained with different step-size sequences of the form (6.70) with 𝜁 = 0
and different values of t and the decay parameter 𝛿. The primary axis shows the number of epochs,
which is the number of iterations scaled by p∕m. The figure clearly demonstrates that progress
can be very slow if the initial step size is too small or too large.
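For reference, the mini-batch stochastic gradient gk used in this example can be computed as in the following sketch, where A, y, gamma, the current iterate xk, the mini-batch size p, and the step size tk are assumed given; sampling without replacement is done with randperm.

[m, n] = size(A);
idx = randperm(m, p);                 % sample p indices without replacement
Ak = A(idx, :);
yk = y(idx);
u = exp(-yk .* (Ak * xk));            % elementwise exp(-y_i * a_i' * xk)
gk = gamma * xk - (1/p) * Ak' * (yk .* (u ./ (1 + u)));   % stochastic gradient
xk = xk - tk * gk;                    % stochastic gradient update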
replace a simple gradient estimator with a more sophisticated unbiased estimator with lower vari-
ance. For example, in the incremental setting where the objective function is a finite sum of m
functions, the so-called stochastic variance-reduced gradient (SVRG) method uses an update of the
form
xk+1 = xk − tk ( ∇f_{ik}(xk) − ∇f_{ik}(x̃) + 𝜇̃ ),
where ik ∈ ℕm is selected uniformly at random. The vector x̃ is an approximation of xk and is
updated only every M iterations, and 𝜇̃ = ∇f (̃x) is the full gradient at x̃ . In the smooth and strongly
convex setting, this can be shown to converge linearly, in expectation, with a suitable constant
step size. We note that the technique is closely related to the method of control variates, which is
illustrated in Exercise 3.11. Several other variance reduction techniques exist; see, e.g. [46] for an
overview.
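A sketch of one outer SVRG iteration for an objective of the form f(x) = (1/m)∑ fi(x) is shown below; gradi(x, i) is assumed to return ∇fi(x), fullgrad(x) the full gradient ∇f(x), and t a suitable constant step size.

xtilde = x;                        % snapshot point
mutilde = fullgrad(xtilde);        % full gradient at the snapshot
for j = 1:M
    i = randi(m);                                     % uniformly random index
    g = gradi(x, i) - gradi(xtilde, i) + mutilde;     % variance-reduced gradient estimate
    x = x - t * g;                                    % update with constant step size
end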
We end this section by outlining some examples of popular stochastic methods with adaptive
step-size strategies.
6.8.4.1 AdaGrad
The adaptive gradient method, or AdaGrad, is a method for stochastic problems of the form (5.44),
where the function F(x, 𝜉) is assumed to be of the form F(x, 𝜉) = G(x, 𝜉) + h(x) with G(⋅, 𝜉) and h
closed and convex. The method uses either a diagonal matrix or a full matrix to adaptively scale
stochastic gradients. Specifically, if gk ∈ 𝜕G(xk , 𝜉k ), where 𝜉k denotes a realization of 𝜉 at iteration
k and
Ĝk = ∑_{i=0}^{k} gi gi^T,
𝑣̃k = (1/(k + 1)) ∑_{i=0}^{k} gi ∘ gi.
The AdaGrad update may be expressed as
xk+1 = prox_h^{Bk}( xk − tk g̃k ),
with tk = 1/(𝛾√(k + 1)) and g̃k = diag(𝑣̃k + (𝜀/(k + 1)) 𝟙)^{−1/2} gk, which shows that AdaGrad implicitly employs a diminishing step-size sequence. In the convex setting, AdaGrad can be shown to satisfy the worst-case bound
𝔼[ f(Xk) ] − p⋆ ≤ O(1/√k).
We refer the reader to [35] for further analysis of AdaGrad and details regarding convergence.
6.8.4.2 RMSprop
The root mean square propagation (RMSprop) method is in many ways similar to AdaGrad. It is a
stochastic gradient iteration of the form
xk+1 = xk − Bk^{-1} gk,
where gk = ∇F(xk , 𝜉k ) is a gradient estimate, and Bk is a diagonal matrix that is chosen proportion-
ally to the elementwise root mean square of previous gradient estimates. Unlike AdaGrad that uses
all gradient estimates to compute an adaptive scaling, RMSprop emphasizes more recent stochastic
gradients through the use of an exponential moving average, i.e.
𝑣k = 𝛽𝑣k−1 + (1 − 𝛽)(gk ∘ gk)
Bk = 𝛾 diag(𝑣k + 𝜀𝟙)^{1/2}
xk+1 = xk − Bk^{-1} gk,
where the parameter 𝛽 ∈ (0, 1) controls the adaptiveness of the estimate. RMSprop was proposed in
a lecture note [103] along with the suggested parameter value 𝛽 = 0.9. Note that unlike AdaGrad,
RMSprop does not implicitly result in a diminishing step-size sequence, and in fact, the method
need not converge, as shown in [95].
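A minimal sketch of the RMSprop update with the suggested 𝛽 = 0.9 is given below; sgrad(x) is assumed to return a stochastic gradient estimate, and gam (𝛾) and epsv (𝜀) are fixed parameters.

beta = 0.9;
v = zeros(size(x));
for k = 1:maxit
    g = sgrad(x);
    v = beta * v + (1 - beta) * (g .* g);      % exponential moving average of squared gradients
    x = x - g ./ (gam * sqrt(v + epsv));       % elementwise scaled step, Bk = gam*diag(v + epsv)^(1/2)
end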
6.8.4.3 Adam
The adaptive moment estimation method, which is known as Adam, combines the adaptive scaling
approach of RMSprop with gradient aggregation or momentum. It employs an exponential moving
average of the form
𝜇k = 𝛽1 𝜇k−1 + (1 − 𝛽1 )gk ,
to compute a weighted average of previous gradient estimates. The vector 𝜇k can be viewed as an
estimate of the first raw sample moment of the weighted sequence of gradient estimates, and this
is used instead of gk to compute a search direction at iteration k, i.e.
𝜇k = 𝛽1 𝜇k−1 + (1 − 𝛽1)gk
𝑣k = 𝛽2 𝑣k−1 + (1 − 𝛽2)(gk ∘ gk)
Bk = 𝛾(1 − 𝛽1)( (1 − 𝛽2)^{−1/2} diag(𝑣k)^{1/2} + 𝜀I )
xk+1 = xk − Bk^{-1} 𝜇k.
The method was proposed in [63] with the recommended parameter values 𝛽1 = 0.9 and 𝛽2 = 0.999.
Like RMSprop, the method need not converge, as shown in [95]. A variant of Adam, known as
AdaMax, can be obtained by replacing 𝑣k and Bk by quantities based on an elementwise running maximum of past gradient magnitudes rather than an exponential moving average of their squares.
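As a rough illustration, the Adam update stated above can be implemented as in the following sketch (without the bias-correction factors that appear in some presentations of the method); sgrad, gam (𝛾), epsv (𝜀), and maxit are assumptions of this illustration.

beta1 = 0.9;  beta2 = 0.999;
mu = zeros(size(x));
v = zeros(size(x));
for k = 1:maxit
    g = sgrad(x);
    mu = beta1 * mu + (1 - beta1) * g;           % first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * (g .* g);      % second-moment estimate
    Bdiag = gam * (1 - beta1) * (sqrt(v) / sqrt(1 - beta2) + epsv);   % diagonal of Bk
    x = x - mu ./ Bdiag;                         % xk+1 = xk - Bk^{-1} * mu_k
end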
We now consider a class of methods for solving unconstrained optimization problems of the form
(6.5) using coordinatewise updates. The prototypical coordinate descent iteration is of the form
xk+1 = xk − tk [∇f (xk )]ik eik , (6.73)
where ik ∈ ℕn is a coordinate index and tk > 0 is a step size. Common index selection strategies
include the cyclic order ik = (k mod n) + 1, a randomized cyclic order where the order is reshuffled
every n iterations, and a fully randomized order where the indices are selected uniformly at random.
The step size can be chosen in a similar way to the gradient method, e.g. using some form of line
search. With an exact line search, the method performs coordinatewise minimization, which can
be expressed as
tk ∈ argmin_t f(xk + t e_{ik}),
xk+1 = xk + tk e_{ik}. (6.74)
Note that only the ik th element of x is updated at iteration k. This is a descent method by construc-
tion. However, without additional assumptions on f , the iteration in (6.74) does not necessarily
converge: one can construct an example, where (6.74) with the cyclic update order will enter a cycle
for some initializations, as demonstrated by Powell [90]. Moreover, even if the iteration (6.74) does
converge, it may not be to a stationary point if f is nonsmooth. Figure 6.8 shows an example of a
nonsmooth convex function where this can happen.
Next, we consider the case where f is smooth and convex, and we assume that the set
{x | f (x) ≤ f (x0 )} is nonempty and compact, which ensures that f attains its minimum. Using a
suitable step-size sequence, the coordinate descent iteration (6.73) can then be shown to converge
to a minimizer of f . The first-order condition for convexity (4.27) implies that
f(x + t ei) ≥ f(x) + (𝜕f(x)/𝜕xi) t, i ∈ ℕn,
for all x ∈ ℝn , and hence, we have that
∇f (x) = 0 ⟺ f (x + tei ) ≥ f (x) for all t ∈ ℝ, i ∈ ℕn .
In other words, x is a minimizer of f if and only if x minimizes f along all n coordinate directions.
Now, using the exact line search (6.74), iteration k yields a descent unless the ik th element of ∇f (xk )
is equal to zero, and hence, a cycle through all n coordinates yields a descent unless x is a minimum.
With additional assumptions on f and the step sizes, the iteration (6.74) can be shown to converge
to a minimizer of f . An overview of more sophisticated variants of the coordinate descent method
and detailed analyses can be found in, e.g. [12, 116].
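As an illustration, the cyclic variant of (6.73) with a fixed step size can be sketched as follows; gradfun is assumed to return ∇f(x), and in practice one would of course evaluate only the single partial derivative that is needed in each iteration.

x = x0;
n = numel(x0);
t = 0.1;                              % fixed step size (assumed suitable for f)
for k = 0:maxit-1
    ik = mod(k, n) + 1;               % cyclic index selection
    g = gradfun(x);
    x(ik) = x(ik) - t * g(ik);        % update only the ik-th coordinate
end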
Figure 6.8 Contour plot of the function f(x) = max(|2x1 − x2|, |2x2 − x1|), which is convex but nonsmooth. The function is nondifferentiable whenever x1 = x2 or x1 = −x2, and none of the coordinate directions is a descent direction when x1 = x2.
Example 6.7 Coordinate descent methods typically work well when the coupling between
variables is weak. To illustrate this, we now consider convex quadratic functions of the form
f (x) = xT Qx with a fixed condition number 𝜅(Q) = 20. Specifically, we take Q = Udiag(20, 1)U T
for different choices of U ∈ ℝ2×2 such that U T U = I. Figure 6.9 shows the coordinate descent
method in action for different choices of U corresponding to different orientations of the coordi-
nate system. The problem of minimizing f (x) is separable in the special case where Q is diagonal,
and in this case, the minimum point is reached in two iterations (one cycle). In contrast, progress
is slow when x1 and x2 are maximally coupled, which is the case when
U = ±(1/√2) [ 1, ±1; ∓1, 1 ].
Coordinate descent methods are often useful for regularized risk minimization problems of the
form
minimize (1/m) ∑_{i=1}^{m} g(ai^T x) + 𝜆 h(x), (6.75)
where a1 , … , am ∈ ℝn are problem data, and where the regularizer h(x) is a separable function
(e.g. ||x||1 or ||x||2^2). Notice that (6.75) is of the form (5.46) with fi(x) = (1/m)( g(ai^T x) + 𝜆 h(x) ). The dual
problem may be expressed as
maximize −(1/m) ∑_{i=1}^{m} g^*(−m zi) − 𝜆 h^*(𝜆^{-1} A^T z), (6.76)
Figure 6.9 Coordinate descent with exact line search applied to convex quadratic functions of the form
f (x) = xT Qx with Q = U diag(20, 1)UT , where U is an orthogonal matrix. Each plot includes 20 iterations in
addition to the initial guess x0 = (−1, −1). The minimum is reached in two iterations, i.e. one cycle through
all coordinate directions, when Q is diagonal (upper left).
with variable z ∈ ℝm , and where A ∈ ℝm×n is the matrix with rows aT1 , … , aTm . The sum is a separa-
ble function, whereas the last term involves all the dual variables. To apply coordinate ascent to the
dual problem, we restrict the dual objective to one of its coordinate directions. Letting z = z̄ + tei ,
the dual function reduces to
−(1/m) g^*(−m(z̄i + t)) − 𝜆 h^*(𝜆^{-1}(A^T z̄ + t ai)) + const.,
The methods that we have discussed so far are not suitable for problems that involve anything but simple inequality constraints. For example, the gradient projection method is inefficient unless projections onto the feasible set are cheap, and Newton's method cannot be applied directly to problems with inequality constraints. We will now look at a conceptually simple technique for handling inequality constraints in a general convex optimization problem of the form (4.52), i.e.
minimize f0 (x)
subject to fi (x) ≤ 0, i ∈ ℕm (6.77)
Ax = b
where we assume that f0 ∶ ℝn → ℝ and fi ∶ ℝn → ℝ, i ∈ ℕm , are twice continuously differentiable
convex functions and where A ∈ ℝp×n and b ∈ ℝp . We will assume that Slater’s condition holds
and that rank(A) = p. We remind the reader that the Lagrangian L ∶ ℝn × ℝm × ℝp → ℝ is
L(x, 𝜆, 𝜇) = f0(x) + ∑_{i=1}^{m} 𝜆i fi(x) + 𝜇^T (Ax − b),
minimize f0(x) + (1/t) ∑_{i=1}^{m} 𝜙(−fi(x))
subject to Ax = b (6.78)
as an approximation to (6.77), and where t > 0 controls the accuracy of the approximation. It is
natural to expect that a solution to (6.78) approaches a solution to (6.77) as t → ∞. We will soon
formalize this intuition. The problem (6.78) is an equality constrained convex optimization prob-
lem, which follows by noting that 𝜙(−fi (x)) is a convex function. Moreover, the domain of the barrier
function is ℝ++ , and this implies that the domain of the barrier problem (6.78) is the relative inte-
rior of the feasible set of the original problem (6.77). This gives rise to the name interior-point (IP)
method for a method that solves the barrier problem (6.78).
We will now compare the optimality conditions associated with the original problem (6.77) and
the barrier problem (6.78). The KKT conditions for (6.77) may be expressed as
∇f0(x) + ∑_{i=1}^{m} 𝜆i ∇fi(x) + A^T 𝜇 = 0 (6.79a)
fi (x) ≤ 0, i ∈ ℕm (6.79b)
Ax = b (6.79c)
λ ⪰ 0 (6.79d)
λi fi (x) = 0, i ∈ ℕm , (6.79e)
and the KKT conditions for the barrier problem are
∇f0(x) + ∑_{i=1}^{m} (1/(−t fi(x))) ∇fi(x) + A^T 𝜇 = 0 (6.80a)
Ax = b, (6.80b)
with the additional implicit constraints that fi (x) < 0, i ∈ ℕm , which are imposed by the domain
of the barrier function. Now, suppose that x⋆ (t) and 𝜇 ⋆ (t) satisfy the optimality conditions for the
barrier problem and let λ⋆ (t) be the vector with elements λ⋆i (t) = −1∕(tfi (x⋆ (t))), i ∈ ℕm . It then
follows that x⋆ (t), λ⋆ (t), and 𝜇 ⋆ (t) satisfy (6.79a)–(6.79d) but not the complementarity condition
(6.79e) since λ⋆i (t)fi (x⋆ (t)) = −1∕t. Moreover, x⋆ (t) minimizes the Lagrangian L(x, λ⋆ (t), 𝜇 ⋆ (t)), and
hence, λ⋆ (t) and 𝜇 ⋆ (t) are dual feasible and
p⋆ ≥ g(𝜆⋆(t), 𝜇⋆(t)) = L(x⋆(t), 𝜆⋆(t), 𝜇⋆(t))
= f0(x⋆(t)) + ∑_{i=1}^{m} ( −1/(t fi(x⋆(t))) ) fi(x⋆(t)) + (𝜇⋆(t))^T (Ax⋆(t) − b)
= f0(x⋆(t)) − m/t.
This shows that the duality gap at (x⋆ (t), λ⋆ (t), 𝜇 ⋆ (t)) is m∕t. Thus, the solution to the barrier prob-
lem x⋆ (t) defines a trajectory of strictly feasible (m∕t)-suboptimal points, which is known as the
central path.
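The basic path-following (barrier) method alternates between solving the barrier problem for a fixed t and increasing t. A conceptual sketch is given below, assuming a strictly feasible starting point x0 and a routine newton_eq(x, t) that solves the equality constrained barrier problem (6.78) for the current t starting from x; the details of Algorithm 6.10 in the text may differ.

t = 0.5;
gam = 20;                        % factor by which t is increased
epsilon = 1e-3;
x = x0;                          % strictly feasible starting point
while m / t > epsilon            % the duality gap of x*(t) is bounded by m/t
    x = newton_eq(x, t);         % centering step: solve (6.78) for the current t
    t = gam * t;                 % increase the barrier parameter
end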
Example 6.8 To illustrate the basic principle behind the path-following method, we will now
apply it to find an 𝜖-suboptimal solution to the convex optimization problem
minimize x1 + x2/5
subject to exp(x1) + x2 − 3 ≤ 0
x1^2 − x2 ≤ 0
with variable x ∈ ℝ^2. The corresponding barrier problem can be expressed as
minimize x1 + x2/5 + (1/t)( −ln(3 − x2 − exp(x1)) − ln(x2 − x1^2) ),
which is an unconstrained convex optimization problem. Figure 6.11 shows the iterates generated
using Algorithm 6.10 for two different values of the parameter 𝛾. In both cases, we used the point
x0 = (0, 0.5) as a strictly feasible starting point and the parameters t0 = 0.5 and 𝜖 = 10−3 .
Algorithm 6.10 requires a strictly feasible initial point x0 , and it is not always easy to find such a
point. One approach is to solve a so-called phase I problem, which can be expressed as
minimize s
subject to fi (x) ≤ s, i ∈ ℕm (6.81)
Ax = b,
with variables x ∈ ℝn and s ∈ ℝ. If the optimal value is attained at (x⋆ , s⋆ ), then x⋆ is clearly a
strictly feasible point for the original problem (6.77) if s⋆ < 0. On the other hand, if s⋆ > 0, the
original problem is infeasible. Note that it is straightforward to find a strictly feasible point for the
phase I problem, and hence, it can be solved using Algorithm 6.10.
Figure 6.11 The path-following method converges to an 𝜖-suboptimal point x by solving a sequence of
barrier problems. The two plots show the iterates obtained with (a) 𝛾 = 2 and (b) 𝛾 = 20, respectively. With
𝛾 = 2, 13 barrier problems were solved using a total of 48 Newton iterations, whereas with 𝛾 = 20, 4 barrier
problems were solved using a total of 21 Newton iterations.
with variable x ∈ ℝn and problem data A ∈ ℝp×n , b ∈ ℝp , and c ∈ ℝn , and where K ⊂ ℝn is a proper
convex cone. The Lagrangian L ∶ ℝn × ℝn × ℝp → ℝ ̄ is defined as
where 𝜙i : ℝ^{ni} → ℝ̄ with dom 𝜙i = int Ki is a barrier for Ki, i ∈ ℕm. Moreover, it is easy to verify that 𝜙 is logarithmically homogeneous with constant 𝜃 = ∑_{i=1}^{m} 𝜃i if, for all i ∈ ℕm, 𝜙i is logarithmically homogeneous with constant 𝜃i. Table 6.2 lists logarithmically homogeneous barrier functions
for some elementary proper convex cones.
a1 , … , ap .
is positive definite if and only if A1 , … , Ap are linearly independent, in which case the Cholesky
factorization H = LLT can be used to solve HΔ𝜇 = −g.
We end this section by mentioning that there are much more advanced IP methods than the basic path-following scheme that we have presented in this section. These methods generally maintain
both primal and dual variables, and rather than solving a sequence of barrier subproblems to high
accuracy, they stay inside some neighborhood of a primal–dual central path and update the param-
eter t adaptively; see, e.g. [81, 115]. We also note that there are IP methods for local optimization
of more general nonlinear optimization problems; see, e.g. [111].
We will now consider methods for equality constrained optimization problems of the form
minimize f0 (x)
(6.86)
subject to h(x) = 0,
with variable x ∈ ℝ^n, and where f0 : ℝ^n → ℝ̄ and h : ℝ^n → ℝ^p. We define the Lagrangian L : ℝ^n × ℝ^p → ℝ̄ as L(x, 𝜇) = f0(x) + 𝜇^T h(x) and the dual function g : ℝ^p → ℝ̄ as g(𝜇) = inf_x L(x, 𝜇).
A conceptually simple approach to finding an approximate local minimizer is to consider an
unconstrained problem as a proxy for the problem in (6.86). For example, we may consider a
so-called penalty problem
minimize f0(x) + (𝜌/2)||h(x)||2^2,
where 𝜌 > 0 is a penalty parameter. Roughly speaking, we can expect that the constraint viola-
tion will be small when 𝜌 is large, and this observation is the motivation behind penalty methods
that solve a sequence of penalty problems with increasing values of 𝜌. Unfortunately, the penalty
problem typically becomes very ill-conditioned when 𝜌 is large, which makes it difficult to solve it
reliably and accurately.
An alternative to the penalty approach is to consider the optimization problem
minimize f0(x) + (𝜌/2)||h(x)||2^2
subject to h(x) = 0, (6.87)
with penalty parameter 𝜌 > 0. This problem is equivalent to (6.86), which follows immediately by
noting that ||h(x)||22 = 0 whenever x is feasible. The Lagrangian for the problem in (6.87), which is
the so-called augmented Lagrangian for the problem in (6.86), is the function L𝜌 ∶ ℝn × ℝp → ℝ ̄
defined as
L𝜌(x, 𝜇) = f0(x) + 𝜇^T h(x) + (𝜌/2)||h(x)||2^2
= f0(x) + (𝜌/2)||h(x) + 𝜌^{-1}𝜇||2^2 − (1/(2𝜌))||𝜇||2^2.
The corresponding dual function g𝜌 : ℝ^p → ℝ̄ is given by
g𝜌(𝜇) = inf_x L𝜌(x, 𝜇) = inf_x { f0(x) + (𝜌/2)||h(x) + 𝜌^{-1}𝜇||2^2 } − (1/(2𝜌))||𝜇||2^2.
Notice the similarity with the penalty problem: the penalty term in the definition of g𝜌 includes a
shift that is determined by the dual variable 𝜇.
Example 6.10 Constrained nonlinear LS problems are an example of a class of problems that can be solved efficiently to local optimality with the augmented Lagrangian method, i.e. problems of the form
minimize (1/2)||f(x)||2^2
subject to h(x) = 0
with variable x ∈ ℝ^n, and where f : ℝ^n → ℝ^m and h : ℝ^n → ℝ^p are nonlinear functions. The augmented Lagrangian L𝜌 : ℝ^n × ℝ^p → ℝ can be expressed as
L𝜌(x, 𝜇) = (1/2)||f(x)||2^2 + 𝜇^T h(x) + (𝜌/2)||h(x)||2^2
= (1/2)||f(x)||2^2 + (𝜌/2)||h(x) + 𝜇/𝜌||2^2 − (1/(2𝜌))||𝜇||2^2
= (1/2)|| [ f(x); √𝜌 h(x) + 𝜇/√𝜌 ] ||2^2 − (1/(2𝜌))||𝜇||2^2.
Thus, for fixed value of 𝜇, the problem of minimizing L𝜌 with respect to x is an unconstrained
nonlinear LS problem, which can be minimized locally using, e.g. the LM algorithm.
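A sketch of the resulting method of multipliers for this example is given below; minLrho(x, mu, rho) is assumed to (locally) minimize L𝜌(·, 𝜇) over x, e.g. with the LM algorithm applied to the stacked nonlinear LS problem above, hfun(x) returns h(x), and the multiplier update 𝜇 ← 𝜇 + 𝜌h(x) is the standard dual step.

mu = zeros(p, 1);
rho = 10;
for k = 1:50
    x = minLrho(x, mu, rho);               % inner minimization of the augmented Lagrangian
    mu = mu + rho * hfun(x);               % multiplier (dual) update
    if norm(hfun(x)) < 1e-8, break; end    % stop when the constraint violation is small
end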
for k ∈ ℤ+ and where y0 ∈ ℝm and 𝜇0 ∈ ℝp are initial values. If L0 has a saddle-point, then the
assumption that f and g are proper, closed, and convex ensures that the updates (6.89a) and (6.89b)
are well-defined, i.e. a minimizer exists, but it is not necessarily unique. If we let uk = 𝜇k ∕𝜌, then the
ADMM updates can be expressed in a more convenient form as shown in Algorithm 6.12, which is
a basic implementation of ADMM. It can be advantageous to adaptively update the penalty param-
eter 𝜌 and/or make use of preconditioning techniques, which is often done in more sophisticated
variants.
The ADMM updates are often cheaper to compute compared to the cost of the joint minimization
over x and y in the method of multipliers. In the special case where A = I, the update of x can be
expressed as
xk+1 = prox𝜌−1 f (c − uk − Byk )
where prox𝜌−1 f is the proximal operator associated with 𝜌−1 f . Similarly, if B = I, then
yk+1 = prox𝜌−1 g (c − uk − Axk+1 ).
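A generic sketch of the scaled ADMM iteration for the constraint Ax + By = c is shown below; argminx and argminy are assumed to carry out the two partial minimizations, which reduce to the proximal operators above in the special cases A = I or B = I.

u = zeros(size(c));
y = y0;
for k = 1:maxit
    x = argminx(y, u);            % minimize f(x) + (rho/2)*||A*x + B*y - c + u||^2 over x
    y = argminy(x, u);            % minimize g(y) + (rho/2)*||A*x + B*y - c + u||^2 over y
    u = u + A * x + B * y - c;    % scaled dual update
end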
minimize ∑_{i=1}^{N} fi(xi) + g(y)
subject to xi − y = 0, i ∈ ℕN, (6.92)
with variables y ∈ ℝ^n and x = (x1, … , xN) with xi ∈ ℝ^n, i ∈ ℕN. Applying the ADMM to this prob-
lem, it is straightforward to see that the ADMM update of x becomes separable. In other words,
x1 , … , xN can be updated in parallel; see Exercise 6.14.
Exercises
6.1 Consider the gradient descent iteration (6.14), i.e.
xk+1 = xk − (1/L)∇f(xk), k ∈ ℤ+,
and assume that f is bounded below and L-smooth. Show that limk→∞ ||∇f (xk )||22 = 0.
6.2 Recall that a function f ∶ ℝn → ℝ is L-smooth if there exists a finite constant L > 0 such
that
||∇f (y) − ∇f (x)||2 ≤ L||y − x||2 , ∀x, y ∈ ℝn .
6.4 Many learning problems involve objective functions of the form f (x) = g(x) + h(x), where g
is convex and Lg-smooth and where h is a 𝜇-strongly convex and Lh-smooth regularization
function. Show that this implies that f is (Lg + Lh )-smooth and 𝜇-strongly convex.
6.5 Let g ∶ ℝn → ℝ and let f (x) = λg(x∕λ) for some λ > 0. Show that
proxf (x) = λproxλ−1 g (x∕λ).
6.6 (a) Show that if f : ℝ^n → ℝ is of the form f(x) = ∑_{i=1}^{n} fi(xi), where fi : ℝ → ℝ, then
prox_f(x) = ( prox_{f1}(x1), … , prox_{fn}(xn) ).
(b) Show that the proximal operator associated with f (x) = 𝛾||x||1 with 𝛾 > 0 may be
expressed as
prox_f(x) = ( S𝛾(x1), … , S𝛾(xn) ),
where S𝛾 : ℝ → ℝ is defined as
S𝛾(t) = 0 if |t| ≤ 𝛾, S𝛾(t) = t − 𝛾 if t > 𝛾, and S𝛾(t) = t + 𝛾 if t < −𝛾,
or, equivalently,
S𝛾 (t) = sgn(t) max (0, |t| − 𝛾).
i.e. the nuclear norm is the dual norm of the operator norm || ⋅ ||2 on ℝm×n . Let X = UΣV T
be an singular value decomposition (SVD) of X ∈ ℝm×n , where U ∈ ℝm×m and V ∈ ℝn×n
are orthogonal matrices, and Σ ∈ ℝm×n has the singular values of X on its main diagonal in
descending order and zeros elsewhere. Now, letting r = rank(X), a thin SVD of X may be
expressed as
X = U1 SV1T ,
where U1 ∈ ℝm×r and V1 ∈ ℝn×r are the first r columns of U and V, respectively, and
S = diag(𝜎1 , … , 𝜎r ) is a submatrix of Σ.
(a) Show that the nuclear norm of X is the sum of its singular values, i.e.
||X||_* = ∑_{i=1}^{r} 𝜎i(X).
f(x) = (𝛾/2)||x||2^2 + (1/m) ∑_{i=1}^{m} ln(1 + exp(−yi ai^T x)), (6.96)
with variable x ∈ ℝn , problem data y ∈ {−1, 1}m and A ∈ ℝm×n where aTi denotes the ith
row of A, and where 𝛾 > 0 is a given parameter.
(a) Show that f is 𝛾-strongly convex and L-smooth with L = (1/(4m))||A||2^2 + 𝛾.
(b) Implement Newton’s method for minimizing (6.96), and use the condition
(1∕2)𝜆(x)2 ≤ 10−6 as a stopping criterion. Test your implementation using the problem
data contained in the file classification_small.mat and with 𝛾 = 10−6 .
(c) Implement a quasi-Newton method with BFGS updates of the inverse Hessian approxi-
mation. Test your implementation and compare with your implementation of Newton’s
method.
(d) Implement the gradient method for minimizing (6.96) with the option to use a constant
step size or a backtracking line search.
(e) Implement a stochastic gradient method for minimizing (6.96). The expression
gk = 𝛾xk − (1/|ℐk|) ∑_{i∈ℐk} ( yi exp(−yi ai^T xk) / (1 + exp(−yi ai^T xk)) ) ai,
i.e.
tk = t/(k + 1 + 𝜁)^𝛿, k ∈ ℤ+.
Test your implementation with p/m ≈ 0.05 and compare with your implementation of
Newton’s method and the gradient method. Plot the objective value versus the number
of epochs for realizations with different values of the parameters 𝛿 ∈ (0, 1] and t > 0.
6.10 In this exercise, we will derive a method for solving the QP in (5.6) based on ADMM. We
will here state the QP slightly differently as
minimize (1/2)x^T Qx + r^T x
subject to Ax ∈ 𝒞
with variable x ∈ ℝ^n. The problem data are Q ∈ 𝕊^n_+, r ∈ ℝ^n, and A ∈ ℝ^{m×n}. We will assume that 𝒞 ⊆ ℝ^m is a nonempty, closed, and convex set of the form
𝒞 = { z ∈ ℝ^m | l ⪯ z ⪯ u }
with l, u ∈ ℝm . Notice that the constraints Bx = c and Cx ⪯ d in (5.6) can be cast in the
above format by defining A, l, and u appropriately, e.g. an equality constraint is obtained by
defining li = ui .
(a) Consider the equivalent optimization problem
minimize (1/2)x̄^T Qx̄ + r^T x̄ + I_𝒜(x̄, ȳ) + I_𝒞(y)
subject to x̄ − x = 0
ȳ − y = 0
with variables x, x̄ ∈ ℝ^n and y, ȳ ∈ ℝ^m, and where 𝒜 = { (x, y) ∈ ℝ^n × ℝ^m | Ax = y }.
Let (λ, 𝜇) be Lagrange multipliers associated with the equality constraints, and let
L𝜌 (x, y, x̄ , ȳ , λ, 𝜇) be the augmented Lagrangian. Show that the three ADMM updates
are equivalent to the following problems:
1. Minimize L𝜌(x, y, x̄, ȳ, 𝜆, 𝜇) with respect to x̄ and ȳ:
minimize (1/2)x̄^T Qx̄ + r^T x̄ + 𝜆k^T x̄ + 𝜇k^T ȳ + (𝜌/2)||x̄ − xk||2^2 + (𝜌/2)||ȳ − yk||2^2
subject to Ax̄ = ȳ.
2. Minimize L𝜌(x, y, x̄, ȳ, 𝜆, 𝜇) with respect to x and y:
minimize (𝜌/2)||x̄k+1 − x||2^2 + (𝜌/2)||ȳk+1 − y||2^2 − 𝜆k^T x − 𝜇k^T y
subject to y ∈ 𝒞.
3. Update Lagrange multipliers:
λk+1 = λk + 𝜌(̄xk+1 − xk+1 )
𝜇k+1 = 𝜇k + 𝜌(̄yk+1 − yk+1 ).
(b) Write down the optimality conditions for the first update with 𝜈 as Lagrange multiplier
for the equality constraint, and show that the optimality conditions are equivalent to
[ Q + 𝜌I, A^T; A, −𝜌^{-1}I ] [ x̄k+1; 𝜈 ] = [ −r + 𝜌xk − 𝜆k; yk − 𝜌^{-1}𝜇k ]
together with
ȳk+1 = 𝜌^{-1}(𝜈 − 𝜇k) + yk,
and that yk+1 is the projection of ȳk+1 + 𝜌^{-1}𝜇k onto the set 𝒞. Also, show that it follows that 𝜆k = 0 for all k, and hence, the update for 𝜆k can be omitted.
(d) Define
rk^p = Axk − yk,
rk^d = Qxk + r + A^T 𝜈,
which are primal and dual residuals for the optimization problem. Let
where 𝜖 a > 0 and 𝜖 r > 0 are some absolute and relative tolerances, respectively. The
termination criteria for the algorithm are
||rk^p||∞ ≤ 𝜖^p, ||rk^d||∞ ≤ 𝜖^d.
Write a MATLAB code that implements the ADMM algorithm you have derived above.
Make sure to use sparse linear algebra routines in MATLAB.
(e) Consider the following so-called “support vector machine” problem
minimize (1/2)x^T x + 𝟙^T t
subject to diag(b)Ax ≥ 𝟙 − t
t ≥ 0
with variables x ∈ ℝn and t ∈ ℝm and where b ∈ {−1, 1}m is a label vector and the
rows of A ∈ ℝm×n are m feature vectors of length n. We will return to this problem in
Section 10.5. Generate problem instances for given values of m and n as
bi = 1 if i ≤ m/2, and bi = −1 if i > m/2,
and let the elements Aij of the matrix A come from a normal probability distribution
with standard deviation 1∕n and mean given by
1/n if i ≤ m/2, and −1/n if i > m/2,
with 15% nonzeros in each row. This means that 85% of the entries of a row of the matrix
A should be zero. Which entries are zero should be chosen at random for each row. Notice that the matrix A of this subproblem is not the same as the matrix A we defined before. The same goes for the variable x. Solve several instances of the problem for different values of m and n. For how large values
of m and n does your code perform well? Make plots of the solution time versus the
total number of nonzero elements in the matrices A and Q. You may use 𝜖 a = 10−3 and
𝜖 r = 10−3 .
6.11 Let 𝒜 : ℝ^n → ℝ^{p×q} be a linear function, and let H ∈ 𝕊^n_+, A0 ∈ ℝ^{p×q}, and a ∈ ℝ^n. Consider the optimization problem
minimize ||𝒜(x) + A0||_* + (1/2)(x − a)^T H(x − a),
which is equivalent to
minimize ||X||_* + (1/2)(x − a)^T H(x − a)
subject to 𝒜(x) + A0 = X
with variables x ∈ ℝn and X ∈ ℝp×q .
Show that an ADMM algorithm for the above optimization problem can be expressed as the
following updates:
1. Compute xk+1 by solving
(H + 𝜌M)xk+1 = adj_𝒜( 𝜌Xk + A0 − Zk ) + Ha,
where the matrix M ∈ 𝕊^n is defined as the matrix that satisfies adj_𝒜(𝒜(z)) = Mz for all z ∈ ℝ^n, and where adj_𝒜 : ℝ^{p×q} → ℝ^n is the adjoint of 𝒜.
2. Compute
Xk+1 = argmin_X ( ||X||_* + (𝜌/2)|| (1/𝜌)Zk + 𝒜(xk) + A0 − X ||_F^2 )
= ∑_{i=1}^{min(p,q)} max(0, 𝜎i − 1/𝜌) ui vi^T,
where ui, vi, and 𝜎i are given by an SVD
(1/𝜌)Zk + 𝒜(xk) + A0 = ∑_{i=1}^{min(p,q)} 𝜎i ui vi^T.
3. Compute
Zk+1 = Zk + 𝜌( 𝒜(xk) + A0 − Xk ).
Hint: When defining the augmented Lagrangian, the appropriate inner product ⟨·, ·⟩ : ℝ^{p×q} × ℝ^{p×q} → ℝ is ⟨X, Y⟩ = tr(X^T Y), and the appropriate norm is the Frobenius norm ||X||_F = √⟨X, X⟩.
6.12 Show that if 𝜓 ∶ int K → ℝ is a generalized logarithm for a cone K ⊆ ℝn , then it holds that
∇𝜓(x)T x = 𝜃, x ∈ int K,
where 𝜃 is the degree of 𝜓.
6.13 Show that 𝜓 : int ℚ^n → ℝ defined as 𝜓(x) = ln( xn^2 − x1^2 − · · · − x_{n−1}^2 ) is a generalized logarithm for ℚ^n with degree 𝜃 = 2.
where g : ℝ^n → ℝ̄ and fi : ℝ^n → ℝ̄, i ∈ ℕN, are proper, closed, and convex functions, and the variables are y ∈ ℝ^n and x = (x1, … , xN) with xi ∈ ℝ^n, i ∈ ℕN. Show that the scaled form of the ADMM updates for this problem can be expressed as
xi^{(k+1)} = prox_{𝜌^{-1} fi}( y^{(k)} − ui^{(k)} ), i ∈ ℕN
x̃^{(k+1)} = (1/N) ∑_{i=1}^{N} xi^{(k+1)}
ũ^{(k)} = (1/N) ∑_{i=1}^{N} ui^{(k)}
y^{(k+1)} = prox_{𝜌^{-1} g}( x̃^{(k+1)} + ũ^{(k)} )
ui^{(k+1)} = ui^{(k)} + xi^{(k+1)} − y^{(k+1)}, i ∈ ℕN,
where we have used a superscript for the iteration index to avoid ambiguity.
Part III
Optimal Control
Calculus of Variations
So far we have discussed optimization when the variables belong to finite-dimensional vector
spaces. However, it is often of interest to also optimize over infinite-dimensional vector spaces.
The simplest case is when the variable is a real-valued function of a real variable. This
has applications in optimal control in continuous time, where the optimal control signal is
a real-valued function of time. Another important application is the derivation of probability
density functions from the principle of maximizing the entropy subject to moment constraints.
In this case, the variable is the probability density function. For probability density functions, the
argument is often a vector. How to solve these types of optimization problems is called calculus
of variations. The origin of this theory goes back to Newton’s minimal resistance problem. Major
contributions were made by Euler and Lagrange. The generalizations to optimal control were
made by Pontryagin. We will present the theory in this general form, but we will not be able to
prove all the results. The interested reader is referred to the vast literature on optimal control for
most of the proofs, especially for the most general results. However, we will provide the proofs for
some special cases to build intuition.
Example 7.1 The functional in (7.1) is differentiable if f is a differentiable function in its last two
arguments. This follows from a Taylor series expansion:
ΔJ[𝛿y] = ∫_a^b ( f(x, y(x) + 𝛿y(x), y′(x) + 𝛿y′(x)) − f(x, y(x), y′(x)) ) dx
= ∫_a^b ( (𝜕f(x, y(x), y′(x))/𝜕y) 𝛿y(x) + (𝜕f(x, y(x), y′(x))/𝜕y′) 𝛿y′(x) + h(y(x), y′(x)) ||(𝛿y(x), 𝛿y′(x))||2 ) dx,
where h ∶ ℝ × ℝ → ℝ is a function that goes to zero as (𝛿y(x), 𝛿y′ (x)) → 0. The latter is implied by
||𝛿y|| → 0. Hence, we may define 𝜖 as
𝜖 = ( ∫_a^b h(y(x), y′(x)) ||(𝛿y(x), 𝛿y′(x))||2 dx ) / ||𝛿y||
≤ ( ∫_a^b h(y(x), y′(x)) ||𝛿y|| dx ) / ||𝛿y|| = ∫_a^b h(y(x), y′(x)) dx,
which converges to zero as h goes to zero. Hence, this functional is differentiable with first variation
𝛿J[𝛿y] = ∫_a^b ( (𝜕f(x, y(x), y′(x))/𝜕y) 𝛿y(x) + (𝜕f(x, y(x), y′(x))/𝜕y′) 𝛿y′(x) ) dx. (7.2)
It is left as an exercise to show that this functional is linear, see Exercise 7.2. Here, we under-
stand why the norm we use in the definition of (a, b) has to include the derivative. However,
in case we can assure that bounded norm of y implies bounded norm of y′ , we may use another
definition.
1 The reason to use the word "weak" is that there is also a definition of strong extremum that is based on another norm for the linear function space, which is defined as ||y|| = max_{x∈𝒳} ||y(x)||. Clearly, strong extremum implies weak
extremum. Necessary conditions for weak extremum are, hence, also necessary conditions for strong extremum.
2 In case there are constraints on y for x = a or x = b, 𝛿y should be constrained to be zero at those values of x. This is
often stated as 𝛿y(x) is admissible. We will tacitly assume that we only consider such admissible 𝛿y.
where 𝜖 → 0 as ||𝛿y|| → 0. Hence for sufficiently small ||𝛿y||, the sign of ΔJ[𝛿y] will be the same as
the sign of 𝛿J[𝛿y]. Now assume that 𝛿J[𝛿y0 ] ≠ 0 for some 𝛿y0 . Then for any 𝛼 > 0, we have
𝛿J[−𝛼𝛿y0 ] = −𝛿J[𝛼𝛿y0 ],
since 𝛿J is linear. Hence, the increment can be made to have either sign for arbitrarily small 𝛿y, contradicting that J has an extremum.
Example 7.2 The functional in (7.1) is twice differentiable if f is a twice differentiable function
in its last two arguments. This follows from a Taylor series expansion similar to what was done in
the previous example. The second variation is given by
𝛿^2 J[𝛿y] = (1/2) ∫_a^b [𝛿y(x), 𝛿y′(x)] [ 𝜕^2 f/𝜕y^2, 𝜕^2 f/𝜕y𝜕y′; 𝜕^2 f/𝜕y𝜕y′, 𝜕^2 f/𝜕y′^2 ] [𝛿y(x); 𝛿y′(x)] dx,
where the second-order partial derivatives of f are evaluated at (x, y(x), y′(x)).
It can now, with similar techniques as used above, be proven that a necessary condition for y⋆ to
be a minimum for J is that 𝛿 2 J[𝛿y] ≥ 0 for all 𝛿y; see, e.g. [41], This is not a sufficient condition. We
say that the second variation is strongly positive if there exists a constant k > 0 such that 𝛿 2 J[𝛿y] ≥
k||𝛿y||2 for all y and 𝛿y. A sufficient condition for y⋆ to be optimal for J is that its first variation
vanishes and that its second variation is strongly positive. This is also straightforward to prove; see,
e.g. [41].
It then follows that ȳ is optimal for (7.3). To see this, we realize that ȳ by the first condition above minimizes L[y, 𝜇̄], since L is twice differentiable with a strongly positive second variation. Now, assume that ȳ is not optimal for the above optimization problem, but that ỹ is optimal. Then J[ỹ] < J[ȳ] and K[ỹ] = 0 implies that
L[ỹ, 𝜇̄] = J[ỹ] < J[ȳ],
which contradicts that ȳ minimizes L[y, 𝜇̄].
We will now show that strong duality holds. Let g : ℝ^p → ℝ be the Lagrange dual function defined via
g(𝜇) = min_{y∈𝒟(a,b)} L[y, 𝜇].
We have
g(𝜇) = min_{y∈𝒟(a,b)} L[y, 𝜇] ≤ min_{y∈𝒟(a,b), K[y]=0} L[y, 𝜇] = min_{y∈𝒟(a,b), K[y]=0} J[y] = p⋆,
and hence strong duality holds. The extension to inequality constraints is straightforward and sim-
ilar to what was done in Chapter 4. Necessary conditions for optimality of (7.3) are a bit tricky.
For finite-dimensional optimization problems, we saw in Chapter 4 that constraint qualifications,
such as Slater’s conditions, are needed in order to guarantee strong duality, on which the proof of
necessity was based.
for some constant k. We have f ′ (y) = log y + 1 and f ′′ (y) = 1∕y and hence, from (7.2) and from
Example 7.2, the variations of J are
𝛿J[𝛿y] = ∫_a^b ( log y(x) + 1 ) 𝛿y(x) dx,
𝛿^2 J[𝛿y] = (1/2) ∫_a^b (1/y(x)) 𝛿y(x)^2 dx.
The first variation of K is 𝛿K[𝛿y] = ∫_a^b x 𝛿y(x) dx, and the second variation of K is zero. We define the Lagrangian as L[y, 𝜇] = J[y] + 𝜇K[y], and we see that its first variation is
𝛿L[𝛿y] = ∫_a^b ( log y(x) + 1 + 𝜇x ) 𝛿y(x) dx.
By the du-Bois–Reymond lemma, it holds that if the first variation is zero, then
log y(x) + 1 + 𝜇x = 0,
and hence, we have that y(x) = exp(−1 − 𝜇x). The constraint K[y] = 0 can be used to determine
𝜇 in terms of a, b, and k. We should try to verify that the second variation of L is strictly positive
in order to show that y constitutes a minimum. This is however not so easy, or even possible, and
we will in Section 9.2 prove optimality in another way for similar problems. Understand that strict
positivity is just a sufficient condition that might be too strong in some cases.
7.1.5 Generalizations
Most of what has been discussed in this section generalizes to domains 𝒳 ⊆ ℝ^n. This will be important when we revisit the example above and extensions of it in Section 9.2. We will also consider y(x) to be vector valued, i.e. y : 𝒳 → ℝ^n. To this end, we just define the norm as
||y|| = sup_{x∈𝒳} ||y(x)||2 + sup_{x∈𝒳} ||y′(x)||2,
where || · ||2 as usual is the Euclidean vector norm. From now on, 𝒟^n(a, b) is the normed linear space of differentiable functions y : 𝒳 → ℝ^n with the above norm, where 𝒳 = [a, b]. We will be less formal in our derivation of results in the remaining part of this chapter.
We assume that the optimal u, which we denote by u⋆ , is continuous.3 We also assume that the
corresponding solution x⋆ to the differential equation is unique, and that in case u⋆ is perturbed
with a small amount, the corresponding perturbation of x⋆ is also small. We refrain from giving
detailed conditions when this is satisfied, but just mention that it is related to what is called
Lipschitz continuity of F.
We define the Lagrangian functional L : 𝒟^m(0, T) → ℝ as
L[u] = 𝜙(x(T)) + ∫_0^T ( f(t, x(t), u(t)) + 𝜆(t)^T ( F(t, x(t), u(t)) − ẋ(t) ) ) dt
= 𝜙(x(T)) + ∫_0^T ( H(t, x(t), u(t), 𝜆(t)) − 𝜆(t)^T ẋ(t) ) dt,
where H : ℝ × ℝ^n × ℝ^m × ℝ^n → ℝ is the Hamiltonian defined as
H(t, x, u, 𝜆) = f(t, x, u) + 𝜆^T F(t, x, u).
We could have defined the Lagrangian functional to also depend explicitly on x and λ, but we will
not need this in our derivations, and hence, we refrain from doing so. Similarly, as with the multi-
plier rule of Lagrange in (4.57a) and (4.57b) we expect to get a necessary condition for optimality
by letting the first variation of L be zero. We make a perturbation u = u⋆ + 𝛿u of u⋆ , where 𝛿u is
small, i.e. ||𝛿u|| < 𝜖. See Section 7.1 for the definition of the norm. The corresponding perturbed
trajectory x, which is the solution of (7.4) for u differs from the original solution x⋆ with the quan-
tity 𝛿x = x − x⋆ , which by our assumptions is small, i.e. ||𝛿x|| can be made as small as we like by
taking 𝜖 sufficiently small. We have that the increment of L is given by
ΔL[𝛿u] = L[u⋆ + 𝛿u] − L[u⋆] = 𝜙(x⋆(T) + 𝛿x(T)) − 𝜙(x⋆(T))
+ ∫_0^T ( H(t, x⋆(t) + 𝛿x(t), u⋆(t) + 𝛿u(t), 𝜆(t)) − H(t, x⋆(t), u⋆(t), 𝜆(t)) ) dt
− ∫_0^T 𝜆(t)^T (d/dt)( x⋆(t) + 𝛿x(t) ) dt + ∫_0^T 𝜆(t)^T (d/dt) x⋆(t) dt.
We now make a Taylor series expansion and obtain the first variation
𝛿L = (𝜕𝜙/𝜕x^T) 𝛿x(T) + ∫_0^T ( (𝜕H/𝜕x^T) 𝛿x(t) + (𝜕H/𝜕u^T) 𝛿u(t) − 𝜆(t)^T (d𝛿x(t)/dt) ) dt.
Here, we have not written out the arguments of the partial derivatives. This is clearly a linear
functional. The assumption on small perturbations of 𝛿u resulting in small perturbations of 𝛿x
is necessary. Otherwise, the remainder term does not converge to zero as ||𝛿u|| → 0. By integration by parts, it follows that
𝛿L = (𝜕𝜙/𝜕x^T) 𝛿x(T) + ∫_0^T ( (𝜕H/𝜕x^T) 𝛿x(t) + (𝜕H/𝜕u^T) 𝛿u(t) + (d𝜆(t)^T/dt) 𝛿x(t) ) dt − [ 𝜆(t)^T 𝛿x(t) ]_0^T.
Since the initial value x(0) is given, it follows that 𝛿x(0) = 0, and hence, we obtain
𝛿L = ( 𝜕𝜙/𝜕x^T − 𝜆(T)^T ) 𝛿x(T) + ∫_0^T ( (𝜕H/𝜕x^T) 𝛿x(t) + (𝜕H/𝜕u^T) 𝛿u(t) + (d𝜆(t)^T/dt) 𝛿x(t) ) dt.
We have so far made no assumptions on λ. We assume that it satisfies the adjoint equations defined
as
𝜆̇(t) = −𝜕H(t, x⋆(t), u⋆(t), 𝜆(t))/𝜕x, 𝜆(T) = 𝜕𝜙(x⋆(T))/𝜕x.
This is a linear time-varying differential equation for λ, and hence, it has a solution [97, Chapter 3],
under mild conditions on H.4 For this λ, it follows that
𝛿L = ∫_0^T (𝜕H/𝜕u^T) 𝛿u(t) dt,
where all functions are evaluated for (x⋆ , u⋆ ). It is possible to show that the result holds also for
piecewise continuous u⋆ . In case H does not explicitly depend on t, which is called an autonomous
optimal control problem, we realize that H is a constant independent of t.
We have now proven the following necessary conditions of Pontryagin, also called the Pontryagin
maximum principle (PMP). Given optimal u⋆ and x⋆ for (7.5), there exists an adjoint variable λ such
that
𝜆̇(t) = −𝜕H(t, x⋆(t), u⋆(t), 𝜆(t))/𝜕x, 𝜆(T) = 𝜕𝜙(x⋆(T))/𝜕x, (7.6a)
𝜕H(t, x⋆(t), u⋆(t), 𝜆(t))/𝜕u^T = 0, (7.6b)
dH(t, x⋆(t), u⋆(t), 𝜆(t))/dt = 𝜕H(t, x⋆(t), u⋆(t), 𝜆(t))/𝜕t. (7.6c)
We remark that the necessary conditions do not distinguish between maximum or minimum. They
hold for any extremum, and this is the reason why the conditions are called a maximum principle.
They could just as well have been called a minimum principle.
We are also interested in the sufficient conditions for a locally optimal solution of (7.5). We
will not provide details of the derivation. The condition is based on the second variation of the
Lagrangian L. Let (ū, x̄) and 𝜆 satisfy
4 The system matrix and the input signal should be bounded for the existence of a solution. This holds if 𝜕f(t, x⋆(t), u⋆(t))/𝜕x and 𝜕F(t, x⋆(t), u⋆(t))/𝜕x are bounded functions of t on [0, T].
with variables x and u for given initial value x(0) = x0 . The Hamiltonian is given by
H = (1/2)( x^T Qx + u^T Ru ) + 𝜆^T (Ax + Bu).
We realize that the adjoint equations are
𝜆̇ = −𝜕H/𝜕x = −Qx − A^T 𝜆, 𝜆(T) = Q0 x(T).
From the PMP, we have that
𝜕H/𝜕u = Ru + B^T 𝜆 = 0.
If we assume that R ≻ 0, i.e. positive definite, we have that u = −R−1 BT λ. If we insert this into the
differential equations for x and λ, we obtain
[ẋ; λ̇] = [A, −BR⁻¹Bᵀ; −Q, −Aᵀ] [x; λ],   x(0) = x0,   λ(T) = Q0 x(T).
This is a two-point boundary value problem. Such problems are in general not easy to solve. The
reason is that we do not know the initial value for λ. However, for the problem above, we may define
Φ(t, s) = [Φ11(t, s), Φ12(t, s); Φ21(t, s), Φ22(t, s)] = exp( (t − s) [A, −BR⁻¹Bᵀ; −Q, −Aᵀ] ),
where the partitions are conformable. The matrix Φ is called the transition matrix, and it can be
shown that it is always invertible, and that Φ(t, t) = I for any t, see [97]. It then follows that
[x0; λ(0)] = [Φ11(0, T), Φ12(0, T); Φ21(0, T), Φ22(0, T)] [x(T); Q0 x(T)] = [Φ11(0, T) + Φ12(0, T)Q0; Φ21(0, T) + Φ22(0, T)Q0] x(T).
We may now use these equations to express λ(0) in terms of x0 , i.e. we obtain
λ(0) = ( Φ21(0, T) + Φ22(0, T)Q0 )( Φ11(0, T) + Φ12(0, T)Q0 )⁻¹ x0,
assuming the inverse exists. It actually does; see, e.g. [20]. We are now in a position to solve the
linear differential equation, which will then give us the optimal control signal. More generally, it holds that
λ(t) = P(t)x(t),
where
P(t) = ( Φ21(t, T) + Φ22(t, T)Q0 )( Φ11(t, T) + Φ12(t, T)Q0 )⁻¹.
Example 7.4 In mechanical systems, the state x contains positions and angles, and ẋ is called the
generalized velocity. We define the potential energy of the system as V(x) and the kinetic energy as
Example 7.5 We will now derive the shape of a chain hanging from two given points by min-
imizing the total potential energy of the chain. A differential piece of the chain of length ds has
mass dm = 𝜌 ds, where 𝜌 is the mass density of the chain. Here, ds = √(1 + ẋ²) dt, where x(t) is the
height of the differential segment at horizontal position t. The height x(t) of the segment multiplied
by its mass and the gravitational constant g is the potential energy of the segment, i.e. g𝜌x(t) ds. The total
potential energy is therefore given by
∫_0^T f(x, ẋ) dt,
where f(x, ẋ) = g𝜌x√(1 + ẋ²). This integral should be minimized subject to the constraints x(0) = x0
and x(T) = xT, which defines where the endpoints of the chain are located. We see that for this
example we do not have a constraint on λ = −𝜕f/𝜕ẋ at t = T. Instead, we use a constraint on x(T).
Beltrami’s identity gives
g𝜌x√(1 + ẋ²) − g𝜌x ẋ²/√(1 + ẋ²) = C.
7.4 Extensions
We will now discuss different extensions of the PMP. These are for the cases when there are con-
straints on the control signal, when the final time T is optimized, and for the cases when the
initial value of the state x(0) and/or the final value of the state x(T) are constrained to belong to
manifolds. We will not give the proof for these cases, but we will instead state the corresponding
version of the PMP and solve the examples. We will from now on only discuss the autonomous
case.
Consider the optimal control problem

minimize   𝜙(x(T)) + ∫_0^T f(x(t), u(t)) dt,
subject to ẋ(t) = F(x(t), u(t)),
           x(0) ∈ S0,  x(T) ∈ ST,                    (7.8)
           u(t) ∈ U ⊂ ℝᵐ,
           T ≥ 0,
with variables x, u, and T, where f ∶ ℝⁿ × ℝᵐ → ℝ, F ∶ ℝⁿ × ℝᵐ → ℝⁿ, and 𝜙 ∶ ℝⁿ → ℝ are con-
tinuously differentiable. The sets S0 and ST are subsets of ℝⁿ and manifolds. We assume that S0 can
be described as
S0 = { x ∈ ℝⁿ ∶ G0(x) = 0 },
where G0 ∶ ℝn → ℝp with p ≤ n is differentiable with a full rank Jacobian for x in a neighborhood
of the optimal solution point on S0 . We assume a similar description of ST with a function GT .
Notice that we in this formulation also optimize the final time T.
Define the Hamiltonian H̃ ∶ ℝⁿ × ℝᵐ × ℝⁿ⁺¹ → ℝ as
H̃(x, u, λ̃) = λ0 f(x, u) + λᵀF(x, u),
where λ̃ = (λ0 , λ). Assume that (x⋆ , u⋆ , T ⋆ ) are optimal for the optimal control problem above. Then
there exists a nonzero adjoint function λ̃ ∶ [0, T] → ℝn+1 such that
(i) λ̇(t) = − 𝜕H̃(x⋆(t), u⋆(t), λ̃(t))/𝜕x,   λ0 = c ≥ 0, where c ∈ ℝ is a constant,
(ii) H̃(x⋆(t), u⋆(t), λ̃(t)) = min_{𝑣∈U} H̃(x⋆(t), 𝑣, λ̃(t)) = 0 for all t ∈ [0, T⋆],
(iii) λ(0) ⟂ S0,
(iv) λ(T) − 𝜕𝜙(x⋆(T⋆))/𝜕x ⟂ ST.
Above we have used the notation λ(0) ⟂ S0 to mean that λ(0)ᵀ𝑣 = 0 for all 𝑣 such that (𝜕G0(x(0))/𝜕xᵀ)𝑣 = 0,
where G0 is the function defining the manifold S0. In case the final time T is not optimized, condition
(ii) is replaced by the requirement that the Hamiltonian is constant, but not necessarily zero, along
the optimal solution. For many problems, it turns out that λ0 > 0, and then, since H̃ is homogeneous
in λ̃, there is no loss of generality in taking λ0 = 1. We then obtain the same definition of the
Hamiltonian as we had before. It is typically pathological cases for which λ0 = 0, and hence, we will
often in the examples we investigate assume that λ0 = 1.
When using the above conditions to try to find a solution to the optimal control problem, the
first step is to define the Hamiltonian. Then one should try to minimize this with respect to u.
This should be done parametrically with respect to λ̃ and x. This means that we obtain a function
𝜇 ∶ ℝⁿ × ℝⁿ⁺¹ → ℝᵐ such that u⋆ = 𝜇(x, λ̃). This function is then substituted into the dynamical
equations for x and λ̃, i.e.
ẋ = F(x, 𝜇(x, λ̃)),   x(0) ∈ S0,  x(T) ∈ ST,
λ̇ = − 𝜕H̃(x, 𝜇(x, λ̃), λ̃)/𝜕x,   λ(0) ⟂ S0,  λ(T) − 𝜕𝜙(x(T))/𝜕x ⟂ ST.      (7.9)
This is a so-called two-point boundary value problem (TPBVP). We should also use (ii) when solving
the above equations. The equations are by no means easy to solve in general. Carrying out the
parametric optimization of the Hamiltonian can be very difficult, especially when u is constrained by the set U.
We should also remember that the PMP provides only necessary conditions for optimality. Hence, they
may not provide enough information to uniquely determine the optimal control. Moreover, they do
not guarantee optimality, but only stationarity. Further investigations are necessary to prove that a
candidate solution obtained from the PMP is indeed optimal.
We will now investigate some examples that can be solved analytically. In the first example, we
illustrate how to carry out the parametric optimization of H for a constrained problem.
Such a control signal that only takes values on the boundary of the set of allowed values for the
control signal is often called bang–bang. The reason for this is that if implemented using a mechan-
ical actuator, a bang will often be heard when switching from one of the values to the other. The
Hamiltonian is given by
H(t, x, u, λ) = λu.
Pointwise minimization yields
𝜇(t, x) = argmin_{|u|≤1} {λu} = { 1, λ < 0;  −1, λ > 0;  ũ, λ = 0 },
where ũ ∈ [−1,1] is arbitrary. The adjoint equation is given by
λ̇(t) = − 𝜕H(t, x, u, λ)/𝜕x = 0,   λ(T) = 𝜕𝜙(x(T))/𝜕x = x(T),
which has the solution λ(t) = x(T). We now have two cases:
● x(T) ≠ 0: In this case, λ(t) ≠ 0 for all t, and we can write
𝜇(t, x) = −sgn(λ) = −sgn(x(T)) = −sgn(x(t)).
The last equality holds since x will have the same sign as x(T) during the whole time interval.
● x(T) = 0: In this case, λ = 0 for all t and we may use any control signal ũ ∈ [−1,1], which obeys
the constraint x(T) = 0. One such control signal is
𝜇(t, x) = −sgn(x(t)),
since this will drive x to zero and stay there.
Consequently, one optimal control is
𝜇 ∗ (t, x) = −sgn(x(t)).
Example 7.7 We are interested in finding the path with the shortest distance from a given point
x0 ∈ ℝ² to a manifold ST = { x ∈ ℝ² ∶ G(x) = 0 }, where G ∶ ℝ² → ℝ is a differentiable function.
This can be done by finding the shortest time T it takes to “drive” with constant speed of one from
the given point to the manifold. This driving is described by the differential equations:
ẋ1(t) = cos 𝜃(t),
ẋ2(t) = sin 𝜃(t),
where 𝜃(t) ∈ [0, 2𝜋) is the heading angle. We express the time as T = ∫_0^T dt. The corresponding
optimal control problem is hence
minimize   ∫_0^T dt,
subject to ẋ1(t) = cos 𝜃(t),
           ẋ2(t) = sin 𝜃(t),
           x(0) = x0,  x(T) ∈ ST,
           𝜃(t) ∈ [0, 2𝜋),
with variable 𝜃 and T, where the final time T should be optimized. The Hamiltonian is for this
example given by
H(x, 𝜃, λ) = 1 + λ1 cos 𝜃 + λ2 sin 𝜃.
The adjoint equations are
λ̇(t) = 0,
since the Hamiltonian does not depend on x. Hence, the adjoint variables are constants, i.e. λ(t) = λ0
for some constant λ0 ∈ ℝ2 . Since the Hamiltonian has to be zero along the optimal solution, we
have that
1 + √(λ1² + λ2²) sin(𝜃(t) + 𝛿) = 0,
where 𝛿 satisfies sin 𝛿 = λ1∕√(λ1² + λ2²) and cos 𝛿 = λ2∕√(λ1² + λ2²). This follows from the formula for
the sine of the sum of two angles. Since λ is a constant, this implies that 𝜃 also has to be
a constant. Hence, the shortest path is a straight line. Moreover, from the fact that the optimal 𝜃
should minimize the Hamiltonian, we obtain the necessary condition that
dH/d𝜃 = −λ1 sin 𝜃 + λ2 cos 𝜃 = 0,
from which it follows that
cos 𝜃 ∕ sin 𝜃 = λ1 ∕ λ2,
if 𝜃 ∈ (0,2𝜋). We will now make use of the condition λ(T) ⟂ ST , which is equivalent to
λ(T) = 𝛼 𝜕G(x(T))/𝜕x,
for some 𝛼 ∈ ℝ. From the differential equations, we have that the slope of the straight line is given
by
dx1∕dx2 = cos 𝜃∕sin 𝜃 = λ1∕λ2.
We hence conclude that the shortest path has to be perpendicular to the manifold ST . The case
𝜃 = 0 can be taken care of by investigating dx2 ∕dx1 instead.
7.5 Numerical Solutions
Except for very simple optimal control problems, it is not possible to obtain analytical solutions.
Hence, we must often resort to numerical solutions. All methods use ideas from ordinary
differential equation (ODE) solvers to either integrate or approximate differential equations. There
are essentially two types of methods:
1. Indirect methods
2. Direct methods
The indirect methods aim at solving the necessary conditions of the PMP by integrating the differ-
ential equations in a recursive manner. The direct methods aim at solving a discretization of the
optimal control problem in (7.8) directly.
Regarding the indirect methods, there are two main ideas that are used and which can be sum-
marized in the two methods
1. Shooting method
2. Gradient method
In the shooting method, the TPBVP is solved by integrating the dynamical equations in (7.9) for-
ward in time using an ODE solver. The challenge here is that normally no initial value is known
for the adjoint variable λ, and one has to guess an initial value and then modify this guess based on
what value is obtained for λ(T). The method is conceptually simple, and the control constraints are
easily accommodated. However, it is crucial to find a good initial guess for λ, and the method can be
numerically unstable due to the fact that the adjoint equations are often unstable when integrated
forward in time. A nonlinear equation solver is needed in order to find the correct initial value for
λ. This method was successfully used for launching satellites in the 1950s.
In the gradient method the control signal is used as an optimization variable, and the dynami-
cal equation relating the control signal and the state is integrated forward in time, and the adjoint
equation is integrated backward in time. This is done iteratively using an ODE solver until conver-
gence of the gradient of the objective function to zero. The gradient of the objective function is the
partial derivative of the Hamiltonian with respect to the control signal. The gradient method has
the advantage that both differential equations are integrated in their stable direction. Control sig-
nal constraints can be taken care of by projection on the feasible set of control signals. The method
has rapid convergence for the first iterates, but then tends to be slow. It was used successfully in
the 1960s for a large number of aeronautical problems. To speed up the convergence, second-order
derivatives may be used, however, with the drawback of making each iteration much more com-
putationally expensive.
Regarding the direct methods, there are three main ideas which can be summarized in the three
methods:
1. Discretization method
2. Collocation method
3. Multiple shooting method
The methods are all based on approximating the control signal as, e.g. piecewise constant or as a
polynomial. They basically differ in how the state trajectory is approximated. The first one uses a
very simple Euler forward difference. The latter two use ideas from ODE solvers. Collocation meth-
ods explicitly make use of the polynomial approximations used in ODE solvers to approximate the
state trajectory, whereas multiple shooting methods explicitly make use of an ODE solver with the
advantage of using adaptive step-lengths. This ODE solver has to be able to deliver derivatives with
respect to the control signal parameters. All methods rely on an efficient nonlinear programming
solver at the top level to optimize the control signal parameters.
In-between the indirect methods and the direct methods is the method of consistent approximation,
which borrows ideas from both. This method approximates the control signal as, e.g. piecewise
constant, as a polynomial, or using orthogonal functions. It also uses a nonlinear programming
solver at the top level. However, ideas from ODE solvers are not used directly. Instead, differential
equations for adjoint variables are first derived. Then an ODE solver is used to integrate first the
differential equation relating the state to the control signal forward in time, and then the adjoint
equation backward in time. This explicitly provides gradients for the nonlinear programming solver.
Hence, the ODE solver does not have to be able to deliver derivatives with respect to the control
signal parameters. More details regarding consistent approximations are given in, e.g. [75].
We will now in detail explain several of the different algorithms. We will assume that we have
numerical algorithms for integrating ODEs, for finding roots of systems of nonlinear equations, and
for solving finite-dimensional optimization problems. General purpose algorithms in MATLAB for
this are, e.g. ode45, fsolve, and fmincon, respectively.
In case one does not want to write one’s own code for solving optimal control problems, there
are dedicated solvers for optimal control problems such as ACADO [58], and CasADi [5].
1. Guess an initial value for the control signal u(t), t ∈ [0, T].
2. Solve the state equation ẋ = F(x, u) with its initial condition forward in time.
3. Solve the adjoint equation with its terminal condition backward in time.
4. Update the control signal as u(t) ∶= u(t) − 𝛼 𝜕H∕𝜕u(t).
5. Repeat from Step 2 until 𝜕H∕𝜕u is sufficiently close to zero.
Here, care has to be taken to integrate the adjoint equation backward in time. In case there are
constraints such that u(t) ∈ U for all t ∈ [0, T], this can be taken care of by projecting the new
value of u(t) in Step 4 onto U. The parameter 𝛼 > 0 is a step size that has to be chosen with care.
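As an illustration of these steps, the following is a minimal sketch (not the book's implementation) of the gradient method for an unconstrained problem with a quadratic terminal cost and a control-energy cost; the problem data, the Euler discretization of the forward and backward integrations, and the fixed step size are assumptions made for this example.

% Sketch of the gradient method for
%   minimize c*||x(T)-xf||^2 + int_0^T u(t)^2 dt,  xdot = A*x + B*u, x(0) = x0,
% using simple Euler integration; all data below are assumed.
A = [0 1; 0 0]; B = [0; 1];
x0 = [1; 1]; xf = [0; 0]; T = 2; c = 10;
N = 200; h = T/N; alpha = 0.005;           % grid and (conservative) step size
u = zeros(1, N);                           % Step 1: initial guess
for iter = 1:5000
    x = zeros(2, N+1); x(:,1) = x0;        % Step 2: state equation forward
    for k = 1:N
        x(:,k+1) = x(:,k) + h*(A*x(:,k) + B*u(k));
    end
    lam = zeros(2, N+1);                   % Step 3: adjoint equation backward,
    lam(:,N+1) = 2*c*(x(:,N+1) - xf);      % lamdot = -A'*lam, lam(T) = dphi/dx
    for k = N:-1:1
        lam(:,k) = lam(:,k+1) + h*(A'*lam(:,k+1));
    end
    grad = 2*u + B'*lam(:,1:N);            % dH/du = 2*u + B'*lam
    u = u - alpha*grad;                    % Step 4: gradient step
    if norm(grad)*sqrt(h) < 1e-4, break; end
end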
We assume that we can summarize what we know about the initial values in
G0(x(0), λ(0)) = 0,
where G0 ∶ ℝn × ℝn → ℝp . Similarly, we assume that we can summarize what we know about the
final values in
GT (x(T), λ(T)) = 0,
where GT ∶ ℝn × ℝn → ℝ2n−p . We also define a function G ∶ ℝn × ℝn → ℝn × ℝn that takes as
input the initial values (x(0), λ(0)), integrates the differential equations, and outputs the final
values (x(T), λ(T)). This can be implemented with an ODE solver, e.g. ode45 in MATLAB. Then
we need to solve the system of equations given by
G0 (x(0), λ(0)) = 0,
GT (G(x(0), λ(0))) = 0.
In MATLAB, this can be done with fsolve. One of the challenges is as mentioned above to choose
a good enough initial guess for λ(0). Another challenge is to carry out the minimization of the
Hamiltonian explicitly in order to obtain the function 𝜇 used in (7.9). In case this cannot be done
analytically, we need to resort to numerical solutions. In case the minimum of the Hamiltonian is
obtained when its partial derivative with respect to u is zero, then this equation can be added to the
differential equations in (7.9), i.e. we consider
ẋ = F(x, u),   x(0) ∈ S0,  x(T) ∈ ST,
λ̇ = − 𝜕H̃(x, u, λ̃)/𝜕x,   λ(0) ⟂ S0,  λ(T) − 𝜕𝜙(x(T))/𝜕x ⟂ ST,      (7.11)
𝜕H̃(x, u, λ̃)/𝜕u = 0,
instead of (7.9) when we define the function G. The equation above is not an ODE, but a differential
algebraic equation (DAE), which in MATLAB can be solved with, e.g. ode15i for given initial
conditions. In case the minimum of the Hamiltonian could be on the boundary of the domain U,
then we would need to carry out the minimization of the Hamiltonian at each step of the DAE
solver, and this would require us to write special purpose code.
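The following is a minimal single shooting sketch for the LQ two-point boundary value problem derived earlier in this chapter, using ode45 and fsolve as mentioned above; the problem data and the zero initial guess for λ(0) are assumptions made for the example.

% Sketch of single shooting for the LQ TPBVP; all data are assumed examples.
A = [0 1; 0 0]; B = [0; 1]; Q = eye(2); R = 1; Q0 = eye(2);
T = 2; x0 = [1; 1];
M = [A, -B*(R\B'); -Q, -A'];                 % combined dynamics for (x, lambda)
res  = @(lam0) shootResidual(lam0, M, x0, Q0, T);
lam0 = fsolve(res, zeros(2, 1));             % find lambda(0) such that GT = 0
[~, y] = ode45(@(t, y) M*y, [0 T], [x0; lam0]);
u = -(R\(B'*y(:, 3:4)'))';                   % optimal control u = -R^{-1}*B'*lambda

function r = shootResidual(lam0, M, x0, Q0, T)
    % integrate the combined dynamics forward and return lambda(T) - Q0*x(T)
    [~, y] = ode45(@(t, y) M*y, [0 T], [x0; lam0]);
    yT = y(end, :)';
    r = yT(3:4) - Q0*yT(1:2);
end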
We then realize that the whole optimal control problem in (7.12) may be approximated with the
discrete time optimal control problem
minimize   𝜙(xN) + Σ_{i=0}^{N−1} hi f(ti, xi, ui),
subject to xi+1 = xi + hi F(ti, xi, ui),  i ∈ ℤN−1,
           x0 ∈ S0,  xN ∈ ST,
           ui ∈ U ⊂ ℝᵐ,
with variables x = (x0 , … , xN ), and u = (u0 , … , uN−1 ). How this optimal control problem can be
solved as a finite-dimensional optimization problem is explained for a very similar problem in
Section 8.1. We will anyway give the details, since they are useful later on. We define f0 ∶ ℝ^{(N+1)n} × ℝ^{Nm} → ℝ as
f0(x, u) = 𝜙(xN) + Σ_{i=0}^{N−1} hi f(ti, xi, ui).
with variables x and u. It is straightforward to generalize to the case when there are also inequality
constraints related to the states x. The resulting problem is a finite-dimensional optimization
problem as discussed already in Chapter 4. It is often of very high dimension, but it has a lot of
structure. The objective function is what is called separable, and the constraint functions are what
is called partially separable, see Section 5.5. This can be utilized to solve the problem efficiently.
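As a small illustration of the discretization approach (not a listing from the book), the following sketch sets up the discretized problem for an assumed double integrator with a quadratic control cost and solves it with fmincon; since the dynamics are linear, the dynamic and boundary constraints can be collected in one linear equality constraint.

% Sketch of a direct discretization (transcription) method; data are assumed.
A = [0 1; 0 0]; B = [0; 1]; x0 = [1; 1]; xf = [0; 0];
T = 2; N = 40; h = T/N; n = 2;
Ad = eye(n) + h*A; Bd = h*B;                   % Euler-discretized dynamics
nx = (N+1)*n; nu = N;                          % y = [x_0; ...; x_N; u_0; ...; u_{N-1}]
obj = @(y) h*sum(y(nx+1:end).^2);              % separable objective
Aeq = zeros(N*n + 2*n, nx + nu); beq = zeros(N*n + 2*n, 1);
for k = 1:N                                    % x_k - Ad*x_{k-1} - Bd*u_{k-1} = 0
    ri = (k-1)*n + (1:n);
    Aeq(ri, (k-1)*n + (1:n)) = -Ad;
    Aeq(ri, k*n + (1:n))     = eye(n);
    Aeq(ri, nx + k)          = -Bd;
end
Aeq(N*n + (1:n), 1:n) = eye(n);            beq(N*n + (1:n)) = x0;    % x_0 = x0
Aeq(N*n + n + (1:n), nx-n+1:nx) = eye(n);  beq(N*n + n + (1:n)) = xf; % x_N = xf
y = fmincon(obj, zeros(nx + nu, 1), [], [], Aeq, beq);
u = y(nx+1:end); X = reshape(y(1:nx), n, N+1); % optimal controls and states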
where f0 ∶ ℝ^{(N+1)n} × ℝ^k → ℝ. To take the constraint u(t) ∈ U for all t ∈ [0, T] into account, one
possibility is to sample the constraint at each time ti. This means that for all i, we add the constraint
u(ti) ∈ U. We assume that this can equivalently be expressed as gi(ai) ⪯ 0 for some functions gi ∶
ℝ^{ki} → ℝ^q. All inequality constraints can then be expressed in terms of g ∶ ℝ^k → ℝ^{Nq}, where
g(a) = [ g0(a0); ⋮; gN−1(aN−1) ].      (7.13)
As in the previous section, we assume that the functions G0 and GT can be used to describe the
constraints on the initial state and final state values. All equality constraints can be described using
with variables s and a. It is straightforward to generalize to the case when there are also inequality
constraints related to the states x. The above optimization problem is also a finite-dimensional opti-
mization problem. Here, we are not able to compute analytical derivatives of the functions defining
the optimization problem. However, often ODE solvers can also deliver derivatives of the solutions
with respect to (a, s), which can be used to compute derivatives of all the involved functions. The
fact that the differential equations can be solved in parallel can be used to speed up the solver.
A good reference to multiple shooting methods is [32]. The ACADO toolkit which implements
multiple shooting methods is described in [58].
for t ∈ [ti , ti+1 ], where cik ∈ ℝn are the coefficients of the vector-valued polynomial. The polynomial
has to agree with the true state x(t) and its derivative ẋ(t) at the endpoints of the interval. This results
in the following four equations for the coefficients cik :
xa(ti) = xi,
xa(ti+1) = xi+1,
ẋa(ti) = F(ti, xi, ui),
ẋa(ti+1) = F(ti+1, xi+1, ui+1),
where
ẋa(t) = Σ_{k=1}^{3} (k cik ∕ hi) ((t − ti)∕hi)^{k−1}.
Since the equations are linear in cik , it is straightforward to obtain the solution:
ci0 = xi ,
ci1 = hi Fi ,
𝜑(ti , ai ) ∈ U ⊂ ℝm ,
with variables x = (x0, … , xN) and a = (a0, … , aN−1), where we as before assume that the con-
straint involving the control signal can be written as inequalities involving a function as in (7.13).
Note that the variables x and a are implicitly present in xa and ẋa. An advantage as compared
to the multiple shooting method is that analytical derivatives with respect to (x, a) of the functions
defining the optimization problem are possible to compute. However, the cubic approximation of
the state might be less accurate as compared to the numerical integration of the state performed in
the multiple shooting method. A good reference to the collocation method presented here is [53].
More general collocation methods involving orthogonal polynomials are discussed in, e.g. [93].
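To make the construction of the interpolating cubic concrete, the following sketch (not from the book) solves the four interpolation conditions numerically for one interval; the numerical values of the states and of the corresponding function values are assumed.

% Sketch: solving the four conditions for the cubic coefficients on one interval.
hi  = 0.1;
xi  = [1; 0];       Fi  = [0; -1];      % assumed x_i and F(t_i, x_i, u_i)
xi1 = [0.99; -0.1]; Fi1 = [-0.1; -1];   % assumed x_{i+1} and F(t_{i+1}, x_{i+1}, u_{i+1})
% With s = (t - t_i)/h_i, x_a = c0 + c1*s + c2*s^2 + c3*s^3 and
% xdot_a = (c1 + 2*c2*s + 3*c3*s^2)/h_i, so the four conditions become:
M = [1 0 0 0;       % x_a(t_i)              = x_i
     1 1 1 1;       % x_a(t_{i+1})          = x_{i+1}
     0 1 0 0;       % h_i * xdot_a(t_i)     = h_i * F_i
     0 1 2 3];      % h_i * xdot_a(t_{i+1}) = h_i * F_{i+1}
rhs = [xi'; xi1'; hi*Fi'; hi*Fi1'];
C = M\rhs;          % row k+1 holds c_{ik}'; note C(1,:) = xi' and C(2,:) = hi*Fi'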
Exercises
7.6 We are interested in computing optimal transportation routes in a circular city. The cost for
transportation per unit length is given by a function g(r) that only depends on the radial
distance r to the city center. This means that the total cost for transportation from a point P1
to a point P2 is given by
∫_{P1}^{P2} g(r) ds,
where s represents the arc length along the path of integration. In polar coordinates (𝜃, r),
the total cost reads
∫_{P1}^{P2} g(r) √(1 + (r𝜃̇)²) dr,
where 𝜃 = 𝜃(r), and 𝜃̇ = d𝜃∕dr.
(a) Formulate the problem of computing an optimal path as an optimal control problem.
(b) For the case of g(r) = 𝛼∕r for some positive 𝛼, show that any optimal path satisfies the
equation 𝜃 = a ln r + b for some constants a and b.
(c) Show that if the initial point and the final point are at the same distance from the origin,
then the optimal path is a circle segment. You may use the claim in (b).
7.8 Consider the motion of a robotic manipulator with joint angles q ∈ ℝn , which may be
described as a function of the applied joint torques 𝜏 ∈ ℝn as
𝜏 = M(q)q̈ + C(q, q̇)q̇ + G(q),      (7.14)
where M(q) ∈ ℝ^{n×n} is a positive definite mass matrix and C(q, q̇) ∈ ℝ^{n×n} is a matrix account-
ing for Coriolis and centrifugal effects, which is linear in the joint velocities q̇ = dq∕dt, and
where G(q) ∈ ℝⁿ is a vector accounting for gravity and other joint angle-dependent torques.
Consider a path q(s) as a function of a scalar path coordinate s. The path coordinate deter-
mines the spatial geometry of the path, whereas the trajectory’s time dependency follows
from the relation s(t) between the path coordinate s and time t.
(a) Show that (7.14) can be expressed in terms of s as
𝜏(s) = m(s)s̈ + c(s)ṡ² + g(s),
where
m(s) = M(q(s)) q′(s),
c(s) = M(q(s)) q′′(s) + C(q(s), q′(s)) q′(s),
g(s) = G(q(s)).
(b) Consider the time-optimal path-tracking problem
minimize   T,
subject to 𝜏(s) = m(s)s̈ + c(s)ṡ² + g(s),
           s(0) = 0,  s(T) = 1,
           ṡ(0) = 0,  ṡ(T) = 0,
           𝜏̲(s(t)) ≤ 𝜏(s(t)) ≤ 𝜏̄(s(t)),
with variables s and T, where the torque lower bounds 𝜏̲ and upper bounds 𝜏̄ may depend
on s. Using the fact that dt = (dt∕ds) ds, show that
T = ∫_0^1 (1∕ṡ) ds,
and that for the change of variables
a(s) = s̈,
b(s) = ṡ²,

q(s) = s,
M(q) = l²m = 1,
C(q, q̇) = 0,
G(q) = mlg cos(s) = cos(s),
𝜏̲(s) = −2,  𝜏̄(s) = 2.
7.9 Consider the problem of taking the system
ẋ1 = x2,
ẋ2 = x3,
ẋ3 = u,   |u| ≤ 1,
from a certain arbitrary initial condition x(0) to a final value x(T) = 0 in minimum time.
Show that the necessary conditions for optimality are satisfied by a control of the form
u(t) = sgn(p(t)),
where p(t) is a polynomial. What is the maximum degree of the polynomial? How many
times can u change sign? It is not necessary to compute the values of the coefficients
of p(t).
7.10 A community living around a lake wants to maximize the yield of fish taken out of the lake.
The amount of fish at a certain time is denoted as x. The growth rate of the fish is kx and
fish is caught at a rate of ux, where u is the control variable, which is assumed to satisfy
0 ≤ u ≤ umax . The dynamics of the fish population is then given by
ẋ = (k − u)x, x(0) = x0 .
Here, k > 0 and x0 > 0. The total amount of fish obtained during a time period T is
J = ∫_0^T u x dt.
(a) Derive the necessary conditions given by the PMP for the problem of maximizing J.
(b) Show that the necessary conditions are satisfied by a bang–bang control. How many
switching times are there?
(c) Determine an equation for calculating the switching time(s).
7.11 Consider a motion model of a particle with position z(t) and speed 𝑣(t). Define the state
vector x = (z, 𝑣) and the continuous time model
ẋ(t) = F(x(t), u(t)) = [0 1; 0 0] x(t) + [0; 1] u(t).      (7.15)
The problem is to go from the state xi = x(0) = (1, 1) to xf = x(tf) = (0, 0), where tf = 2,
and such that the control input energy ∫_0^{tf} u²(t) dt is minimized. Thus, the optimization
problem is
problem is
minimize   ∫_0^{tf} f(x, u) dt,
subject to ẋ(t) = F(x(t), u(t)),                     (7.16)
           x(0) = xi,
           x(tf) = xf,
Note that uN is superfluous, but it is included to make the presentation and the code
more convenient. Furthermore, define
f0(y) = h Σ_{k=0}^{N−1} u_k²      (7.19)
and
h(y) = [ h1(y); h2(y); ⋮; hN+2(y) ] = [ x1 − F̄(x0, u0); x2 − F̄(x1, u1); ⋮; xN − F̄(xN−1, uN−1); x0 − xi; xN − xf ].      (7.20)
Show that the optimization problem in (7.18) can be expressed as the constrained problem
minimize   f0(y),
subject to h(y) = 0,
Aeq = [ −F  E   0   0  …   0  0;
         0  −F  E   0  …   0  0;
         0   0  −F  E  …   0  0;
         ⋮               ⋱    ⋮;
         0   0   0   0  … −F  E;
         E   0   0   0  …  0  0;
         0   0   0   0  …  0  E ],      (7.21)
where F and E are 2 × 3 matrices.
(c) Now, suppose that the constraints are nonlinear. This case can be handled by passing
a function that computes h(y) and its Jacobian. Let the initial and terminal state con-
straints, hN+1(y) = 0 and hN+2(y) = 0, be handled as above and complete the function
secOrderSysNonlcon.m by computing
7.12 In this exercise, we will also use the discrete time model in (7.17). However, the terminal
constraints are removed by approximating them with a penalty term in the objective func-
tion. If the terminal constraints are not fulfilled, they are penalized in the objective function.
Now, the optimization problem can be defined as
minimize   c‖xN − xf‖₂² + h Σ_{k=0}^{N−1} u_k²,
subject to xk+1 = F̄(xk, uk),                     (7.24)
           x0 = xi,
with variables x and u, where F̄ is given in (7.17), N = tf ∕h and c is a predefined constant.
This problem can be rewritten as an unconstrained quadratic optimization problem:
minimize   J(u0, u1, … , uN−1),      (7.25)
with variable u. The optimization problem will be solved by using the MATLAB function
fminunc.
(a) Write down an explicit algorithm for the evaluation of J.
(b) Complete the file secOrderSysCostUnc.m with the cost function . Solve the prob-
lem by running the script mainDiscUncSecOrderSys.m.
(c) Compare the method in this exercise with the one in the previous exercise. What are the
advantages and disadvantages? Which method handles constraints best? Suppose the
task was also to implement the optimization algorithm, which method would be easiest
to implement, assuming that the state dynamics would be nonlinear?
(d) Implement an unconstrained gradient method that can replace the fminunc function
in mainDiscUncSecOrderSys.m.
7.13 We consider the same optimal control problem as in the previous exercise. We will investi-
gate the gradient method based on the PMP method. The optimal control problem is
minimize   J = c‖x(tf) − xf‖₂² + ∫_0^{tf} u²(t) dt,
subject to ẋ(t) = F(x(t), u(t)),
           x(0) = xi,
with variables x and u.
(a) Write down the Hamiltonian and show that the Hamiltonian partial derivatives with
respect to x and u are
𝜕H∕𝜕x = [0; λ1],
𝜕H∕𝜕u = 2u(t) + λ2,      (7.26)
respectively.
(b) What are the adjoint equation and its terminal constraint?
(c) Complete the files secOrderSysEq.m with the system model, the file secOrder-
SysAdjointEq.m with the adjoint equations, the file secOrderSysFinal-
Lambda.m with the terminal values of λ, and the file secOrderSysGradient.m
with the control signal gradient. Finally, complete the script mainGradientSec-
OrderSys.m and solve the problem.
(d) Try some different values of the penalty constant c. What happens if c is “small?” What
happens if c is “large?”
7.14 In this exercise, we will solve the problem in (7.16) using a shooting method as discussed in
the subsection on shooting methods in Section 7.5. Such methods are based on successive
improvements of the unspecified initial conditions of the TPBVP.
Complete the files secOrderSysEqAndAdjointEq.m with the combined system and
adjoint equation which implements the function G in the abovementioned section. Also,
complete the file theta.m with the final constraints, which implement the function
GT . We do not need to specify the function G0 , since the initial value for the state is fully
known. The main script is mainShootingSecOrderSys.m, and it solves the equation
GT (G(x(0), λ(0))) = 0 with respect to λ(0) for a given value of x(0).
7.15 (a) What are the advantages and disadvantages of discretization methods?
(b) Discuss the advantages and disadvantages of the problem formulation in Exercise 7.11
compared to the formulation in Exercise 7.12.
(c) What are the advantages and disadvantages of gradient methods?
(d) Compare the methods in Exercises 7.11–7.13 in terms of accuracy and complexity/speed.
Also, compare the results for different algorithms used by fmincon in Exercises 7.11.
Can all algorithms handle the optimization problem?
(e) Assume that there are constraints on the control signal, and you can use either the
discretization method in Exercise 7.11 or the gradient method in Exercise 7.13. Which
method would you use?
(f) What are the advantages and disadvantages of shooting methods?
7.16 Consider the problem of finding the curve with the minimum length from a point (0,0) to
(xf , yf ). The solution is of course obvious, but we will use this example as a starting point for
introducing CasADi, see https://web.casadi.org, to solve optimal control problems
using a collocation method. We will use a control signal that is constant in each discretiza-
tion interval.
The problem can be formulated by using an expression of the length of the curve from (0,0)
to (xf , yf ):
s = ∫_0^{xf} √(1 + y′(x)²) dx.      (7.27)
Note that x is the “time” variable. The optimal control problem is solved in minCurve-
LengthCol.m by using MATLAB and CasADi. Notice that (x, y) are represented as (t, x) in
the file.
(a) Derive the expression of the length of the curve in (7.27).
(b) Write the problem on standard optimal control form, i.e. determine 𝜙, f , F, and so on.
(c) Use the PMP to show that the solution is a straight line.
(d) Examine and run the script minCurveLengthCol.m and compare the solution with
what the theory says.
7.17 Consider the minimum length curve problem again. The problem can be reformulated by
using a constant speed model where the control signal is the heading angle. This problem is
solved in the CasADi/MATLAB file minCurveLengthHeadingCtrlMS.m.
(a) Examine the CasADi/MATLAB file and write down the optimal control problem that is
solved in this example, i.e. what are F, f , and 𝜙 in the standard optimal control formu-
lation.
(b) Run the script and compare with the result from the previous exercise.
7.18 In this exercise, we will investigate the so-called “Brachistochrone problem,” which was
posed by Johann Bernoulli in Acta Eruditorum in 1696. The history of this problem involves
several of the greatest scientists ever, such as Galileo, Pascal, Fermat, Newton, Lagrange, and
Euler. It is about finding the curve between two points, A and B, that is covered in the least
time by a body that starts in A with zero speed and is constrained to move along the curve to
point B, under the action of gravity only and assuming no friction, see Figure 7.1. The word
“brachistochrone” comes from the Greek language: brachistos – the shortest, chronos – time.
Let the motion of the particle, under the influence of gravity g, be defined by ż = F(z, 𝜃),
where the state vector is defined as z = (x, y, 𝑣) and (x, y) is the Cartesian position of the
particle in a vertical plane and 𝑣 is the speed, i.e.
ẋ = 𝑣 sin(𝜃),
ẏ = −𝑣 cos(𝜃). (7.28)
The motion of the particle is constrained by a path that is defined by the angle 𝜃(t).
(a) Give an explicit expression for F(z, 𝜃). Only the expression for 𝑣̇ is missing.
(b) Define the Brachistochrone problem as an optimal control problem based on this
state-space model. Assume that the initial position of the particle is at the origin and that
the initial speed is zero. The final position of the particle is (x(tf ), y(tf )) = (xf , yf ) = (10,3).
(c) Modify the script minCurveLengthHeadingCtrlMS.m of the minimum curve
length example in Exercise 7.17 above and solve the Brachistochrone problem with
CasADi.
7.19 The time it takes for a particle to travel on a curve between the points p0 = (0,0) and
pf = (xf , yf ) is
tf = ∫_{p0}^{pf} (1∕𝑣) ds,      (7.29)
Figure 7.1 The Brachistochrone problem: a particle starts at A = (0, 0) and moves under the influence of gravity g along a curve to B = (xf, yf).
with solution
x = (C∕2)(𝜙 − sin 𝜙),
y = (C∕2)(1 − cos 𝜙),      (7.32)
where 𝜙 parameterizes the curve.
7.20 An alternative formulation of the Brachistochrone problem, c.f. Exercise 7.18, can be
obtained by considering the “law of conservation of energy,” which is derived from the
principle of least action in Example 7.4. Consider the position (x, y) of the particle and its
velocity (ẋ, ẏ).
(a) Write the kinetic energy T and the potential energy V as functions of x, y, ẋ, ẏ, the mass
m, and the gravity constant g.
(b) Define the Brachistochrone problem as an optimal control problem based on the law of
conservation of energy.
Hint: You should introduce u = ẋ and 𝑣 = ẏ as control signals. Notice that the problem
will contain an algebraic constraint that is not present in a standard optimal control
problem as we have defined it. This means that the state evolution is not described by
an ODE but by a differential algebraic equation.
(c) Solve this optimal control problem with CasADi by modifying the file brachis-
tochroneHeadingCtrlMS.m. Assume that the mass of the particle is m = 1.
Hint: You need to use a value for N of at least 40.
7.21 When solving the Brachistochrone problem, we have used three different problem formula-
tions in the three previous exercises. We will discuss here the pros and cons of the different
formulations.
(a) Discuss why not only the choice of optimization algorithm is important but also the
problem formulation, when deciding how to solve an optimal control problem numeri-
cally.
(b) Compare the different approaches to the Brachistochrone problem in the three previous
exercises. Try to explain the advantages and disadvantages of the different formulations
of the Brachistochrone problem.
7.22 We will solve here the so-called “Zermelo” problem. From the point (0,0) on the bank of a
wide river, a boat starts with relative speed to the water equal to 𝜈. The stream of the river
becomes faster as it departs from the bank, and the speed is g(y) parallel to the bank. The
movement of the boat is described by
ẋ(t) = 𝜈 cos(𝜙(t)) + g(y(t)),
ẏ(t) = 𝜈 sin(𝜙(t)),
where 𝜙 is the angle between the boat direction and bank. We want to determine the move-
ment angle 𝜙(t) so that x(T) is maximized, where T is a fixed time. We will investigate the
case when g(y) = y, 𝜈 = 1, and T = 1. Use CasADi to solve this optimal control problem by
modifying the file minCurveLengthCol.m.
7.23 Solve the problem in Exercise 7.11 using CasADi and both the multiple shooting method
and the collocation method. How many iterations do the methods need to converge to a
solution?
Dynamic Programming
In Chapter 7, we discussed how to solve optimal control problems over a finite time interval.
We specifically considered the continuous time case, since for discrete time dynamics, it is fairly
straightforward how to solve the optimal control problem. The solutions we obtained are so-called
open-loop solutions, i.e. the value of the control signal at a certain time depends only on the
initial value of the state and not on the current value of the state. For the linear quadratic (LQ)
control problem, we were able to restate the solution as a feedback policy, i.e. to explicitly write the
control signal as u(t) = 𝜇(t, x(t)) for a feedback function 𝜇. This is very desirable since it is known
that feedback solutions are more robust to unmodeled dynamics and disturbances. We will in this
chapter look in more detail into the problem of obtaining feedback solutions. We will only treat
the discrete time case, since it is easier from a mathematical point of view. However, many of the
ideas can be extended to the continuous time case. We will first consider a finite time interval and
then we will discuss the case of an infinite time interval. Optimal feedback control goes back to
the work by Richard Bellman who in 1953 introduced what is known as dynamic programming to
solve these types of problems. We will see that this is a special case of message passing as discussed
in Section 5.5. The case of infinite time interval is of practical importance in many applications,
since under certain conditions stability can be proven. It is unfortunately often very difficult to
compute the solution, and hence, we will introduce what is known as model predictive control
(MPC) as a remedy. This is today commonly used in industry. We will at the end of the chapter
discuss how to treat uncertainty by considering a stochastic setting of the control problem. This
treatment is based on the stochastic multistage decision problem introduced in Section 5.7.
8.1 Finite Horizon Optimal Control
final state xN called the terminal cost or final cost 𝜙 ∶ 𝒳 → ℝ. The optimal control problem is then
the problem of solving
minimize 𝜙(xN) + Σ_{k=0}^{N−1} fk(xk, uk),
Here, the minimization over u should be carried out for all possible values of x and hence, the
optimal u is a function of x, i.e. we have a multiparametric optimization problem as discussed in
Section 5.6. If it is possible to carry out the minimizations above, then the optimal control signal for
(8.2) is given by the minimizing argument in the dynamic programming recursion, i.e. u⋆k = 𝜇k(x),
where 𝜇k ∶ 𝒳 → 𝒰 for k ∈ ℤN−1 is given by
𝜇k(x) = argmin_{u∈Uk(x), Fk(x,u)∈Xk+1} Qk(x, u),
Example 8.1 Consider the problem in (8.2) for the case when the dynamic equation is linear, i.e.
Fk(x, u) = Ak x + Bk u, where Ak ∈ ℝ^{n×n} and Bk ∈ ℝ^{n×m}, and where 𝒳 = ℝⁿ and 𝒰 = ℝᵐ. We assume that the
incremental costs and the final cost are quadratic functions given by fk(x, u) = xᵀSk x + uᵀRk u and
𝜙(x) = xᵀSN x, respectively, where Rk ∈ 𝕊ᵐ₊₊ and Sk ∈ 𝕊ⁿ₊. We also assume that there are no state
constraints or control signal constraints. This problem is called the linear quadratic (LQ) control
problem. Application of the dynamic programming recursion in (8.4) gives VN (x) = xT SN x and
Vk(x) = min_u { xᵀSk x + uᵀRk u + Vk+1(Ak x + Bk u) }.
We will make the guess that Vk (x) = xT Pk x for some Pk ∈ 𝕊n . This is clearly true for k = N with
PN = SN . We now assume that it is true for k + 1. Then the right-hand side of the above equation
reads
Q(x, u) = xᵀSk x + uᵀRk u + (Ak x + Bk u)ᵀPk+1(Ak x + Bk u)
        = [x; u]ᵀ [Sk + AkᵀPk+1Ak, AkᵀPk+1Bk; BkᵀPk+1Ak, Rk + BkᵀPk+1Bk] [x; u],
which should be minimized with respect to u. If Rk + BkᵀPk+1Bk ∈ 𝕊ᵐ₊₊, then the above optimization
problem is convex in u, and the solution is obtained similarly as in Example 5.7 as
u = −(Rk + BkᵀPk+1Bk)⁻¹BkᵀPk+1Ak x,
which defines the feedback policy 𝜇k(x). Back-substitution of this expression for u results in the
following expression for the right-hand side of the dynamic programming recursion:
xᵀ( Sk + AkᵀPk+1Ak − AkᵀPk+1Bk(Rk + BkᵀPk+1Bk)⁻¹BkᵀPk+1Ak )x.
the result follows. This is true for k = N − 1, since PN = SN ∈ 𝕊ⁿ₊. Assume that it is true for k + 1;
then the above minimization will result in a minimal value that is nonnegative. Hence, the result
also holds for k.
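The recursion in the example is easy to implement; the following MATLAB sketch (with assumed problem data and with the terminal weight assumed equal to S) computes the matrices Pk and the feedback gains backward in time.

% Sketch of the finite-horizon LQ dynamic programming (Riccati) recursion;
% the problem data A, B, S, R, the terminal weight, and N are assumed.
A = [1 0.1; 0 1]; B = [0; 0.1]; S = eye(2); R = 1; N = 20;
P = cell(N+1, 1); L = cell(N, 1);
P{N+1} = S;                                     % P_N = S_N (assumed equal to S)
for k = N:-1:1
    L{k} = (R + B'*P{k+1}*B)\(B'*P{k+1}*A);     % feedback gain at stage k-1
    P{k} = S + A'*P{k+1}*A - A'*P{k+1}*B*L{k};  % Riccati recursion
end
% With MATLAB's 1-based indexing, u_k = -L{k+1}*x_k and V_0(x_0) = x_0'*P{1}*x_0.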
Example 8.2 We will now look at the problem of finding the shortest path between two nodes in
a graph, see Figure 8.1. The nodes could represent different cities, and the numbers on the edges
could represent the distances between the cities. This is called a shortest path problem. We cast this
problem as an optimal control problem as in (8.2) with the following definitions. Let N = 5, x0 = 0,
Xk = {−1, 0, 1} for k = 1, … , N − 1 and XN = {0}. We define the control signal to take the values
−1, 0, 1, which mean to go down, stay, or go up in the graph, respectively. Hence, the control signal
constraint set is
Uk(x) = {−1, 0} if x = 1,  {−1, 0, 1} if x = 0,  {0, 1} if x = −1,
for k ∈ ℕN−2, and
UN−1(x) = {−1} if x = 1,  {0} if x = 0,  {1} if x = −1.
Figure 8.1 Graph for the shortest path problem in Example 8.2.
Moreover, we let Fk (x, u) = x + u, and hence, the next state value is the sum of the current state
value and the control signal value. The final cost is 𝜙(x) = 0. Finally, the incremental cost is
fk (x, u) = cki,j , where cki,j is the cost on the arrow from node i at stage k to node j at stage k + 1. It
should be stressed that the definitions above are not unique. We could, e.g. change the control
signal to take values −2, 0, 2 and change the function Fk to Fk (x, u) = x + 0.5u.
We now apply the dynamic programming recursion. At the final stage N = 5, we have V5 (x) = 0
for x = 0 and V5 (x) = ∞ for any other x, since X5 = {0}. Then
V4(x) = min_{u∈U4(x), F4(x,u)∈X5} { f4(x, u) + V5(F4(x, u)) }.
We realize that x and u must satisfy F4 (x, u) = 0 to obtain a minimum, since otherwise the second
term will be infinity. Hence, for Stage 4, we get
V4(x) = { c4_{1,0} = 2, x = 1;  c4_{0,0} = 3, x = 0;  c4_{−1,0} = 4, x = −1 },
where the value for x = 1 is obtained for u = −1, the value for x = 0 is obtained for u = 0, and the
value for x = −1 is obtained for u = 1. For Stage 3, we have
V3(x) = min_{u∈U3(x), F3(x,u)∈X4} { f3(x, u) + V4(F3(x, u)) }
      = { min{c3_{1,1} + V4(1), c3_{1,0} + V4(0)} = min{3, 5} = 3, x = 1;
          min{c3_{0,1} + V4(1), c3_{0,0} + V4(0), c3_{0,−1} + V4(−1)} = min{5, 7, 6} = 5, x = 0;
          min{c3_{−1,0} + V4(0), c3_{−1,−1} + V4(−1)} = min{4, 9} = 4, x = −1 },
where the value for x = 1 is obtained for u = 0, the value for x = 0 is obtained for u = 1, and the
value for x = −1 is obtained for u = 1. We can now continue for the remaining stages, and the result
is depicted in Figure 8.2. The shortest path corresponds to the thick arcs. Above each node, the value
of Vk(x) is given, and the small arrows show the optimal control signal at each node. From these
arrows, we are hence able to obtain the shortest path from any other node than the initial
one to the final node as well. Hence, the optimal cost to go from Node 1 at Stage 1 to the final node is 5,
and the optimal path is to go to Node 1 at each stage except for at Stage 4, where one has to go to
Node 0.
Figure 8.2 Graph for the shortest path problem in Example 8.2. The shortest path corresponds to the thick
arcs. Above each node, the value of Vk(x) is given, and the small arrows show the optimal control signal at
each node.
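The recursion in this example can be carried out mechanically for any stage graph; below is a small MATLAB sketch with assumed (random) stage costs, so it does not reproduce the numbers of Figure 8.1, but it illustrates the backward recursion and the recovery of the optimal decisions.

% Sketch of the dynamic programming recursion on a stage graph; the costs are
% assumed (random), and disallowed transitions could be given infinite cost.
N = 5; states = [-1 0 1]; ns = numel(states);
c = rand(ns, ns, N);                          % c(i,j,k): cost from state i to j at stage k
V = inf(ns, N+1); V(states == 0, N+1) = 0;    % only the final state 0 is allowed
mu = zeros(ns, N);                            % index of the optimal successor
for k = N:-1:1
    for i = 1:ns
        [V(i, k), mu(i, k)] = min(reshape(c(i, :, k), [], 1) + V(:, k+1));
    end
end
% V(:, 1) holds the optimal cost-to-go from each initial state; following mu
% from the initial state recovers the shortest path.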
It is clear that the dynamic programming recursion is difficult to solve analytically. Even if each
component of the state takes values in finite sets with say p elements, the task is challenging for
high dimension n of the state. The total number of possible values of a state is then pⁿ, which also
for moderate values of p and n can be a huge number. This is called the curse of dimensionality. For
problems where the state is a real-valued vector, one may approximate it with discrete values, obtaining
a piecewise constant approximation, but then very coarse approximations are necessary.
We will now discuss how the curse of dimensionality can be circumvented to some extent. We
introduce vector-valued functions 𝜑k ∶ 𝒳 → ℝᵖ called feature vectors and parameters ak ∈ ℝ^q for
k ∈ ℤN−1. We would like to approximate the functions Vk in (8.4) with Ṽk ∶ 𝒳 × ℝ^q → ℝ, where
Ṽk(x, ak) = V̂k(𝜑k(x), ak) for a suitable choice of V̂k ∶ ℝᵖ × ℝ^q → ℝ and ak ∈ ℝ^q. One possible
choice of V̂k is to use a linear regression model for which p = q, i.e.
V̂k(𝜑k(x), ak) = akᵀ𝜑k(x).
There are of course many more possibilities. The definitions made above are general enough to
model V̂ k as an artificial neural network (ANN), see Section 10.7. In case one has problem-specific
insight, more clever choices can be made. In a preprocessing step, features could be extracted from
data or from sufficient statistics, see [17]. For the sake of simplicity, we will in what follows only
discuss the linear regression case. Notice that a piecewise constant approximation of Vk can be
obtained by taking
𝜑k(x) = { 1, x ∈ Dk;  0, x ∉ Dk },
where Dk , k ∈ ℕp is a partition of ℝn . More general approximations are obtained by taking 𝜑k as
basis functions. In this section, we will not consider constraints on xk or uk .
iteration. It is based on sampling the state space 𝒳. This can be done in many ways, and how
well this is done is crucial for the success of the algorithm. A popular choice is to use some sort of
Monte Carlo technique. It is important that the states that are sampled are representative for what
states are typically visited by a close to optimal policy. We start by defining approximate Q-functions
Q̃k ∶ 𝒳 × 𝒰 × ℝ^q → ℝ as
Q̃k(x, u, a) = { fk(x, u) + Ṽk+1(Fk(x, u), a),  k ∈ ℤN−2;
                fk(x, u) + 𝜙(Fk(x, u)),  k = N − 1.
Here, we have just replaced Vk+1 with Ṽk+1 in the expression for Qk in (8.5). Notice that Q̃k does
not depend on any parameter a for k = N − 1, which is the iteration index we start with. We then
consider samples xks ∈ 𝒳, where s ∈ ℕr, and define the minimal values of the approximate
Q-functions as
𝛽ks = min_u Q̃k(xks, u, ak+1),      (8.6)
where ak+1 is a known value from the previous iterate. This is in general a nonconvex optimization
problem that can be challenging to solve. It should be stressed that the quality of the approximation
that is obtained depends critically on the choice of xks . After this, we define the following LS problem
for obtaining the next value of the parameter ak :
minimize (1∕2) Σ_{s=1}^{r} ( Ṽk(xks, a) − 𝛽ks )²,
with variable a. The iterations start with k = N − 1 and proceed backward in time to k = 0. For the
case when V̂ k is a linear regression model, the LS problem is a linear LS problem with closed-form
solution. Otherwise, we need to use some iterative optimization method as discussed in Chapter 6.
Once all the parameters ak have been computed, the approximate feedback function is given by
𝜇k(x) = argmin_u Q̃k(x, u, ak+1).      (8.7)
Example 8.3 We will in this example perform fitted-value iteration for the optimal control prob-
lem in Example 8.1. We will specifically consider the case when m = 1 and n = 2, and we let
Ak , Bk , Rk , Sk be independent of k, and we write them as A, B, R, S. Since we know that the value
function is quadratic, we will use a feature vector that is 𝜑(x) = (x1², x2², 2x1x2), where x = (x1, x2) ∈
ℝ². Notice that the indices refer to components of the vector and not to time. We let
V̂ k (𝜑(x), a) = aT 𝜑(x),
where a ∈ ℝ3 . With
P̃ = [a1, a3; a3, a2],
we may then write
Ṽk(x, a) = aᵀ𝜑(x) = xᵀP̃x.
Hence, the true value function Vk(x) = xᵀPk x and the approximate value function Ṽk(x, a) agree
if P̃ = Pk. We, moreover, have
Q̃k(x, u, a) = xᵀSx + uᵀRu + (Ax + Bu)ᵀP̃(Ax + Bu)
            = [x; u]ᵀ [S + AᵀP̃A, AᵀP̃B; BᵀP̃A, R + BᵀP̃B] [x; u].      (8.8)
For k = N − 1 down to k = 0, we then solve the minimization problem in (8.6) to obtain 𝛽ks. From
Example 5.7, we realize that
𝛽ks = (xks)ᵀ( S + AᵀP̃k+1A − AᵀP̃k+1B(R + BᵀP̃k+1B)⁻¹BᵀP̃k+1A )xks,
assuming that R + BᵀP̃k+1B is positive definite. Here, P̃N = S. We then obtain ak as the solution to
the linear LS problem
minimize (1∕2) Σ_{s=1}^{r} ( 𝜑ᵀ(xks)a − 𝛽ks )²,
with variable a. This defines P̃ k . The solution ak satisfies the normal equations, cf . (5.3),
ΦkᵀΦk ak = Φkᵀ𝛽k,
where
Φk = [𝜑ᵀ(xk1); ⋮; 𝜑ᵀ(xkr)],   𝛽k = [𝛽k1; ⋮; 𝛽kr].
It is crucial here to choose xks such that ΦkᵀΦk is invertible. We realize that we need r ≥ 3 for this to
hold. Moreover, we need to choose xks sufficiently different for ΦkᵀΦk to be well conditioned. For a
general n, we will need that r ≥ n(n + 1)∕2. From (8.7), (8.8), and Example 5.7 it follows that the
optimal control is given by
uk = −(R + BᵀP̃k+1B)⁻¹BᵀP̃k+1A xk.
In the example above, it should be possible to obtain the same solution as in Example 8.1. In
general, it is not the case that fitted value iteration will provide the exact solution to the problem in
(8.2). The reason is that in general, we cannot represent the value function exactly with the feature
vectors.
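For completeness, a minimal MATLAB sketch of the fitted value iteration above is given below; the problem data and the randomly sampled states are assumptions made for this example.

% Sketch of fitted value iteration for the LQ example; data and samples assumed.
A = [1 0.1; 0 1]; B = [0; 0.1]; S = eye(2); R = 1; N = 20; r = 10;
phi = @(x) [x(1)^2; x(2)^2; 2*x(1)*x(2)];     % feature vector
Ptil = S;                                     % \tilde{P}_N = S
for k = N-1:-1:0
    K = (R + B'*Ptil*B)\(B'*Ptil*A);          % minimizer of \tilde{Q}_k over u
    Pmin = S + A'*Ptil*A - A'*Ptil*B*K;       % so that beta_k^s = (x^s)'*Pmin*x^s
    Xs = randn(2, r);                         % sampled states x_k^s (assumed)
    Phik = zeros(r, 3); beta = zeros(r, 1);
    for s = 1:r
        Phik(s, :) = phi(Xs(:, s))';
        beta(s) = Xs(:, s)'*Pmin*Xs(:, s);
    end
    a = (Phik'*Phik)\(Phik'*beta);            % linear LS via the normal equations
    Ptil = [a(1) a(3); a(3) a(2)];            % defines \tilde{P}_k
end
% At stage k the feedback is u_k = -(R + B'*Pnext*B)\(B'*Pnext*A)*x_k, where
% Pnext denotes \tilde{P}_{k+1} from the previous pass of the loop.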
2 The point (x0 , u0 ) is a stationary point of xk+1 = F(xk , uk ) if x0 = F(x0 , u0 ), i.e. the state remains at x0 for all values
of k.
of coordinates to make this hold. We also assume that the incremental cost f is such that f (0, 0) = 0.
Hence, if the state reaches the stationary point, only zero value is added to the cost function for each
stage we remain in stationarity. If this assumption would not be made, it would not be possible to
obtain closed-loop stability, i.e. (xk , uk ) → (0, 0) as k → ∞ with finite cost for the case when 𝛾 = 1.
To simplify the presentation below, we will also restrict ourselves to the case when the incremental
cost is strictly positive definite.3
then V(x) = J⋆(x) and u⋆k = 𝜇(xk) is the optimal feedback control, where 𝜇 ∶ 𝒳 → 𝒰 is defined
as 𝜇(x) = argmin_{u∈U(x)} {f(x, u) + 𝛾V(F(x, u))}. If in addition 𝛾 is sufficiently close to one, this feed-
back results in closed-loop stability in the sense defined above. The proof of this result is given in
Section 8.10. For later reference, we define Q ∶ 𝒳 × 𝒰 → ℝ as
Q(x, u) = f(x, u) + 𝛾V(F(x, u)).
We next consider an example which is known as infinite horizon LQ control.
Example 8.4 Let us consider the case when F(x, u) = Ax + Bu for matrices A ∈ ℝ^{n×n} and B ∈
ℝ^{n×m}. We also assume that f(x, u) = xᵀSx + uᵀRu, where S ∈ 𝕊ⁿ₊ and R ∈ 𝕊ᵐ₊₊, and that U(x) = ℝᵐ
for all x. Clearly, we satisfy the assumptions on the functions f and F. We will make the guess that
V(x) = xᵀPx for some P ∈ 𝕊ⁿ₊₊. Then
Q(x, u) = [x; u]ᵀ [S + 𝛾AᵀPA, 𝛾AᵀPB; 𝛾BᵀPA, R + 𝛾BᵀPB] [x; u].
Since this expression is strictly convex in u, it follows from Example 5.7 that Q is minimized for
u = −𝛾(R + 𝛾BᵀPB)⁻¹BᵀPAx.
Back substitution of this results in
xᵀPx = xᵀ( S + 𝛾AᵀPA − 𝛾²AᵀPB(R + 𝛾BᵀPB)⁻¹BᵀPA )x.
This equation holds if P is the solution to the following discounted algebraic Riccati equation:
P = S + 𝛾AᵀPA − 𝛾²AᵀPB(R + 𝛾BᵀPB)⁻¹BᵀPA.      (8.11)
It can be shown that there is a unique solution to the above equation under our assumptions
and if (A, B) is controllable.5 This solution is such that P ∈ 𝕊n++ , and hence, the function V
is strictly positive definite and quadratically bounded. The assumptions can be relaxed, see,
e.g. [51].
3 A function V ∶ 𝒳 → ℝ is said to be strictly positive definite if V(0) = 0 and there exists 𝜖 > 0 such that V(x) ≥ 𝜖‖x‖₂².
4 A function V ∶ 𝒳 → ℝ is said to be quadratically bounded if there exists c > 0 such that V(x) ≤ c‖x‖₂².
5 It holds that (A, B) is controllable if and only if the matrix [B  AB  ⋯  A^{n−1}B] has full rank.
where f̂ ∶ 𝒳 × 𝒰 → ℝ is defined as
The equation above is called the variational form of the Bellman equation, and f̂ is called the tem-
poral difference corresponding to W. This equivalent formulation plays an important role when
solving the Bellman equation numerically.
The Bellman equation is in general a difficult equation to solve. It is possible to show that the
dynamic programming iteration in (8.4) converges to the solution V(x) of the Bellman equation
when k → −∞.
We will for convenience restate it in a format where the iteration index proceeds forward instead
of backward as
Vk+1(x) = min_{u∈U(x)} { f(x, u) + 𝛾Vk(F(x, u)) },      (8.13)
with initial value V0 (x) = 0. If one has a clever guess of an approximate solution to the Bellman
equation, this can be used as initial value instead. This will make the iterates converge much
faster. The iteration above is called value iteration (VI). The algorithm for VI is summarized in
Algorithm 8.2, where T is defined in (8.14).
Example 8.5 Here, we will consider VI for infinite horizon LQ control as in Example 8.4. To this
end, we let Vk(x) = xᵀPk x with P0 = 0. Similarly as in Example 8.1, we realize that if Pk satisfies
the recursion
Pk+1 = S + 𝛾AᵀPk A − 𝛾²AᵀPk B(R + 𝛾BᵀPk B)⁻¹BᵀPk A,
then the VI recursion is satisfied. Based on a similar argument as in Example 8.1, we also realize
that the inverse in the recursion exists and that Pk ∈ 𝕊ⁿ₊ for all k.
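A minimal MATLAB sketch of this value iteration, with assumed problem data and discount factor, is given below.

% Sketch of value iteration (the discounted Riccati recursion); data assumed.
A = [1 0.1; 0 1]; B = [0; 0.1]; S = eye(2); R = 1; gam = 0.95;
P = zeros(2);                                 % P_0 = 0
for k = 1:1000
    Pnext = S + gam*A'*P*A ...
          - gam^2*A'*P*B*((R + gam*B'*P*B)\(B'*P*A));
    if norm(Pnext - P, 'fro') < 1e-10, P = Pnext; break; end
    P = Pnext;
end
% P approximately solves (8.11), and u = -gam*((R + gam*B'*P*B)\(B'*P*A))*x.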
The proof of convergence of VI is based on the contraction property of the Bellman operator
T ∶ ℝ^𝒳 → ℝ^𝒳 defined as
T(V)(x) = min_{u∈U(x)} { f(x, u) + 𝛾V(F(x, u)) }.      (8.14)
We first show that the Bellman operator is monotone. Assume that V1 and V2 are two functions
from 𝒳 to ℝ such that V1(x) ≤ V2(x) for all x ∈ 𝒳. Then
T(V1)(x) − T(V2)(x) ≤ 𝛾V1(F(x, u⋆2)) − 𝛾V2(F(x, u⋆2)) ≤ 0,
where u⋆2 is the minimizer in the definition of T(V2)(x). Using this together with the fact that
T(V + c) = T(V) + 𝛾c for any constant c ≥ 0, it follows that the implication
V1(x) − c ≤ V2(x) ≤ V1(x) + c ⟹ T(V1)(x) − 𝛾c ≤ T(V2)(x) ≤ T(V1)(x) + 𝛾c,
holds. From this, it follows by the contraction-mapping theorem [73, p. 272] that VI converges to a
solution of the Bellman equation for the case when 𝛾 < 1. In case 𝛾 = 1, it is sometimes possible to
still prove convergence for VI. See, e.g. [19] for the LQ control case.
It turns out that the convergence of VI can be slow, and there is another approach that can be
pursued. This is called policy iteration (PI). Introduce the Bellman policy operator T𝜇 ∶ ℝ^𝒳 → ℝ^𝒳
defined as
T𝜇(V)(x) = f(x, 𝜇(x)) + 𝛾V(F(x, 𝜇(x))),
which is the same as the Bellman equation except for that we in place of the optimal u substitute a
feedback policy. The policy evaluation step amounts to solving the equation
Vk(x) = f(x, 𝜇k(x)) + 𝛾Vk(F(x, 𝜇k(x))),      (8.16)
for Vk. This is a linear system of equations for Vk. Depending on 𝒳, there could be finitely
many equations or infinitely many equations. We then obtain a new feedback policy by solving
𝜇k+1(x) = argmin_{u∈U(x)} { f(x, u) + 𝛾Vk(F(x, u)) }.      (8.17)
This is called the policy improvement step. We summarize the PI algorithm in Algorithm 8.3.
The proof that PI converges to a solution of the Bellman equation goes as follows: it holds
6 The set ℝ^𝒳 is the set of all functions from 𝒳 to ℝ, cf. the notation section.
that
Vk(x) = T𝜇k(Vk)(x) ≥ T𝜇k+1(Vk)(x),
where the equality is by definition of the policy evaluation step, and where the inequality is by def-
inition of the policy improvement step. Just like the Bellman operator, the Bellman policy operator
is a monotone operator, see Exercise 8.7. Hence, repeated application of T𝜇k+1 results in
Vk(x) ≥ T𝜇k+1(Vk)(x) ≥ (T𝜇k+1)²(Vk)(x) ≥ ⋯ ≥ lim_{n→∞} (T𝜇k+1)ⁿ(Vk)(x) = Vk+1(x),
where the equality follows from the fact that also the Bellman policy operator is a contraction
mapping when 𝛾 < 1, see Exercise 8.7, and therefore, the limit satisfies T𝜇k+1(Vk+1) = Vk+1. Hence, PI
results in an improving sequence of value functions, and in case Vk(x) = Vk+1(x) for all x ∈ 𝒳, we
realize that the first inequality above also holds with equality, which shows that Vk satisfies the
Bellman equation and hence that 𝜇k+1 is optimal. Hence, PI either results in a strict improvement of
the value function in each iteration or termination at a solution of the Bellman equation.
We remark that neither VI nor PI are normally tractable methods. Hence, approximations are in
most cases required. We now give an example where the calculations can be carried out exactly.
Example 8.6 We consider PI for the infinite horizon LQ control problem in Example 8.4. We
guess that Vk(x) = xᵀPk x for some Pk ∈ 𝕊ⁿ₊ and that 𝜇k(x) = −Lk x for some Lk ∈ ℝ^{m×n}. The policy
evaluation step is then given by finding a solution Pk of
xᵀPk x = xᵀSx + xᵀLkᵀRLk x + 𝛾xᵀ(A − BLk)ᵀPk(A − BLk)x,
for given Lk. This can be obtained by solving the algebraic Lyapunov equation
Pk − 𝛾(A − BLk)ᵀPk(A − BLk) = S + LkᵀRLk,
which has a positive definite solution Pk since the right-hand side is positive definite. This assumes
that √𝛾(A − BLk) has all its eigenvalues strictly inside the unit disk. The policy improvement step
is then
𝜇k+1(x) = argmin_u { xᵀSx + uᵀRu + 𝛾(Ax + Bu)ᵀPk(Ax + Bu) }.
It can be shown that if we start with L0 that is stabilizing, then so will all Lk be, see [51], where
it is also shown that convergence holds for the case when 𝛾 = 1. The iterations for the LQ control
problem derived above are called the Hewer iterations.
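A minimal MATLAB sketch of the Hewer iterations is given below; the problem data, the discount factor, and the initial stabilizing gain are assumed, and the Lyapunov equation is solved by vectorization.

% Sketch of policy iteration (the Hewer iterations) for infinite horizon LQ;
% data, discount factor, and the initial stabilizing gain are assumed.
A = [1 0.1; 0 1]; B = [0; 0.1]; S = eye(2); R = 1; gam = 0.95;
n = size(A, 1);
L = [1 2];                                      % assumed initial stabilizing gain
for k = 1:50
    % policy evaluation: solve P = gam*(A-B*L)'*P*(A-B*L) + S + L'*R*L
    Acl = sqrt(gam)*(A - B*L);
    Qcl = S + L'*R*L;
    P = reshape((eye(n^2) - kron(Acl', Acl'))\Qcl(:), n, n);
    P = (P + P')/2;                             % symmetrize
    % policy improvement
    Lnew = gam*((R + gam*B'*P*B)\(B'*P*A));
    if norm(Lnew - L) < 1e-10, L = Lnew; break; end
    L = Lnew;
end
% mu(x) = -L*x is the (approximately) optimal feedback, and V(x) = x'*P*x.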
8.5.1 Approximation
In general, it is not possible to carry out the computations in PI exactly. One has to resort to approx-
imations. We will here use a similar idea as in Section 8.2. It is based on defining Ṽ ∶ 𝒳 × ℝᵖ → ℝ
using an ANN or as a linear regression with p parameters. This function will be used to approximate
Vk in (8.16). Before we do that, we notice that (8.16) implies
( ) ( ( ))
Vk (x0 ) = f x0 , 𝜇k (x0 ) + 𝛾Vk F x0 , 𝜇k (x0 ) ,
( ) ( )
= f x0 , 𝜇k (x0 ) + 𝛾Vk x1 ,
( ) ( ) ( )
= f x0 , 𝜇k (x0 ) + 𝛾f x1 , 𝜇k (x1 ) + 𝛾 2 Vk x2 ,
(8.18)
⋮
∑
N−1
( ) ( )
= 𝛾 i f xi , 𝜇k (xi ) + 𝛾 N Vk xN ,
i=0
where xi+1 = F(xi , 𝜇k (xi )). In case N is large and 𝜇k is stabilizing, we have that xN is close to zero
and that also Vk (xN ) is close to zero. Hence, a way to evaluate Vk for a value x0 of the state is to just
simulate the dynamical system and add up the incremental costs. In case one has an idea about
( )
what Vk xN might be, that can also be used. This can in particular be beneficial in case xN is not
very small.
We define these sums for different initial values xs for s ∈ ℕr as
∑
N−1
( )
𝛽ks = 𝛾 i f xi , 𝜇k (xi ) ,
i=0
where xi+1 = F(xi , 𝜇k (xi )). We then find the approximation of Vk by solving
1 ∑( ̃ s
r
)2
minimize V(x , a) − 𝛽ks ,
2 s=1
with variable a. The solution is denoted ak . After this, we use the following exact policy improve-
ment step
{ ( )}
𝜇k+1 (x) = argmin f (x, u) + 𝛾 Ṽ F(x, u), ak . (8.19)
u∈U(x)
We remark that it is possible to reuse the simulated trajectory and use it to compute several costs.
This follows from the simple fact that we can use also x1 = F(xs , us ) as an initial value, for which
the simulated trajectory is obtained from the one starting at xs by omitting the first value. This
means that costs from any state on the simulated trajectory can be computed. These are just the
tail-costs of the overall cost when starting at xs . We should however stress that this might not
provide enough representative initial states to obtain a good approximation of Vk .
Example 8.7 We will in this example consider the optimal control problem in Example 8.4. We
will specifically consider the case when m = 1 and n = 2. Since we know that the value function is
quadratic, we will use a feature vector that is 𝜑(x) = (x12 , x22 , 2x1 x2 ), where x = (x1 , x2 ) ∈ ℝ2 . Notice
that the indices refer to components of the vector and not to time. We let
̃ a) = aT 𝜑(x),
V(x,
8.5 Policy Iterations 219
where a ∈ ℝ3 . With
[ ]
a a
P̃ = 1 3 ,
a3 a2
we may then write
̃ a) = xT Px.
V(x, ̃
Hence, the true value function V(x) = xT Px and the approximate value function V(x, ̃ a) agree if
P̃ = P. Here, with an abuse of notation ak ∈ ℝ , which defines P̃ k , is obtained as the solution to the
3
linear LS problem
1 ∑( T s
r
)2
minimize 𝜑 (x )a − 𝛽ks ,
2 s=1
with variable a. The solution ak satisfies the normal equations, cf . (5.3),
ΦTk Φk ak = ΦTk 𝛽k ,
where
⎡𝜑T (x1 )⎤ ⎡𝛽k1 ⎤
Φk = ⎢ ⋮ ⎥ , 𝛽k = ⎢ ⋮ ⎥ ,
⎢ T r ⎥ ⎢ r⎥
⎣𝜑 (x )⎦ ⎣𝛽k ⎦
and where
∑ (
N−1
)
𝛽ks = 𝛾 i xiT Sxi + 𝜇k (xi )T R𝜇k (xi ) ,
i=0
and where xi+1 = Axi + B𝜇k (xi ) with initial values xs , s ∈ ℕr . It is crucial here to choose xs such that
ΦTk Φk is invertible. We realize that we need r ≥ 3 for this hold. Moreover, we need to choose xs
sufficiently different for ΦTk Φk to be well conditioned. For a general n, we will need r ≥ n(n + 1)∕2.
We define
̃ k (x, u, a) = f (x, u) + 𝛾 V(Ax
Q ̃ + Bu, a),
= xT Sx + uT Ru + 𝛾(Ax + Bu)T P(Ax ̃ + Bu),
[ ]T [ ] [ ] (8.20)
x ̃
S + 𝛾AT PA 𝛾AT PB̃ x
= ̃ ̃ .
u 𝛾BT PA R + 𝛾BT PB u
From Example 5.7, we realize that the solution to (8.19) is given by
̃ k (x, u, ak ) = −𝛾(R + 𝛾BT P̃ k B)−1 BT P̃ k Ax,
𝜇k+1 (x) = argmin Q
u
assuming that R + 𝛾BT P̃ k B is positive definite. Here, P̃ k is defined from ak in the same way as P̃ is
defined from a above. We may hence write
𝜇k+1 (x) = −Lk+1 x,
where Lk+1 = 𝛾(R + 𝛾BT P̃ k B)−1 BT P̃ k A. It is a good idea to start with some L0 that is stabilizing.
We remark that it may be highly beneficial to consider the variational form of the Bellman
equation by replacing the incremental cost with the temporal difference in (8.12). Since the Bell-
man equation for the variational form looks the same as the original Bellman equation with the
only difference that the incremental cost is replaced with the temporal difference, all formulas in
this section remain the same when the temporal difference is used as incremental cost. The rea-
son that the variational form may be beneficial is that W(x) may be taken as an initial guess of the
solution of the Bellman equation, and then we only need to parameterize the difference between
the true solution and the initial guess, which might require less parameters if the initial guess
is good.
220 8 Dynamic Programming
We assume that 0 ≤ 𝛾 < 1. The LP formulation of the optimal control problem is then given by
maximize V(−1) + V(0) + V(1),
subject to V(−1) ≤ (−1)2 + (−1)2 + 𝛾V(−1),
V(−1) ≤ (−1)2 + 02 + 𝛾V(−1),
V(−1) ≤ (−1)2 + 12 + 𝛾V(0),
V(0) ≤ 02 + (−1)2 + 𝛾V(−1),
V(0) ≤ 02 + 02 + 𝛾V(0),
V(0) ≤ 02 + 12 + 𝛾V(1),
V(1) ≤ 12 + (−1)2 + 𝛾V(0),
V(1) ≤ 12 + 02 + 𝛾V(1),
V(1) ≤ (1)2 + 12 + 𝛾V(1),
over the variables (V(−1), V(0), V(1)). We immediately realize that the first constraint is implied
by the second constraint and that the last constraint is implied by the second last constraint. More-
over, the fifth constraint is equivalent to V(0) = 0, if 0 ≤ 𝛾 < 1, since V(x) ≥ 0 from the fact that
the objective function is bounded by zero from below. From this, we see that the fourth and sixth
constraint cannot be active at optimality, since they only provide lower bounds on the variables.
The remaining constraints can be summarized as V(−1), V(1) ≤ 𝑣, where 𝑣 = min {2, 1∕(1 − 𝛾)}.
8.7 Model Predictive Control 221
Hence, the optimal solution is (V(−1), V(0), V(1)) = (𝑣, 0, 𝑣), where 𝑣 ∈ (1, 2], since 0 < 𝛾 < 1.
It now remains to compute the optimal u for different values of x ∈ . We have for x = −1 that the
right-hand side of the Bellman equation is given by
⎧𝑣, u = −1,
⎪
(−1) + u + 𝛾 ⎨𝑣,
2 2
u = 0,
⎪0, u = 1.
⎩
We see that whatever value 𝑣 has it can never be that u = −1 is optimal. If 𝛾𝑣 < 1, it is optimal
to have u = 0, which can be seen to be the case when 0 < 𝛾 < 1∕2, and hence, for 1∕2 ≤ 𝛾 < 1, it
is optimal to take u = 1. A similar argument shows that when x = 1, u = 0 is optimal when 0 <
𝛾 < 1∕2 and u = −1 optimal when 1∕2 ≤ 𝛾 < 1. When x = 0 the right-hand side of the Bellman
equation is given by
⎧𝑣, u = −1,
⎪
0 + u + 𝛾 ⎨0,
2 2
u = 0,
⎪𝑣, u = 1,
⎩
and hence, u = 0 is optimal. We summarize our findings by noting that when 0 < 𝛾 < 1∕2 the opti-
mal solution is always zero. The future costs are discounted so much that it is not worth the effort
to take any action with the control signal. Otherwise, it pays off to steer away from a nonzero state
value to make it zero.
8.6.1 Approximations
The LP formulation is often not tractable in general, since there might be many variables. A remedy
to this is to approximate V(x) with, e.g. a feature-based linear regression Ṽ ∶ n × ℝp → ℝ as
̃ a) = aT 𝜑(x),
V(x,
where 𝜑 ∶ n → ℝp . The approximate optimization problem is
∑
maximize c(x)Ṽ (x, a),
x∈n
̃ a) ≤ f (x, u) + 𝛾 V(F(x,
subject to V(x, ̃ u), a), ∀(x, u) ∈ n × m such that u ∈ U(x),
with variable a. This might still be an intractable problem because of the many constraints. One way
to overcome this obstacle is to sample the constraints, i.e. omit some of them by just considering
(xs , us ) for s ∈ ℕr as above. Notice that the approach presented here can also be used to approxi-
mately evaluate a fixed policy 𝜇. Just replace u above with 𝜇(x). This might reduce the number of
constraints significantly and can then be used together with PI.
is an optimal feedback that results in closed-loop stability. The only difference as compared to our
previous presentation of the Bellman equation is that we have added a constraint on F(x, u).
For the case of f (x, u) = xT Qx + uT Ru, but when there still are constraints on the states
̂
and/or control signal, one might use V(x) = xT Px, where P solves the algebraic Riccati equation
in (8.11) for 𝛾 = 1 This solution will be optimal if x0 is such that there are no constraint
violations – otherwise not.
We remark that stability is not guaranteed for any of the approximations of this section without
further investigation.
x̃ k+N = 0,
224 8 Dynamic Programming
where x̃ k = xk is given; how will be defined below. We will call this problem the time k problem.
Denote the solution by
xk+1 = F(xk , uk ), k ∈ ℤ+ ,
with x0 given. This control strategy is often called model predictive control, since a model that pre-
dicts the values of the states are present in the time k problem. The time horizon N is for this reason
usually called the prediction horizon. Notice that only the optimal control signal corresponding to
time instant k in the solution of the time k problem is applied to the system to be controlled in the
MPC strategy. All the other computed control signals are disregarded, and the optimization is car-
ried out once again for k + 1, and so on. In this way feedback is obtained, even though the control
signal is not explicitly given as a feedback policy, i.e. we do not have an explicit function 𝜇 relating
the optimal uk to the state xk . We only have an implicit way to obtain the optimal uk once the state xk
is given. Hence, we have to resort to the on-line implementation to be carried out in real-time. This
can be infeasible for large-scale problems or short sampling times. How to overcome this obstacle
will be discussed later. We remark that the receding horizon optimal control problem can be cast
as a finite-dimensional optimization problem similarly as in (8.3). There are many different varia-
tions of the MPC strategy. One might add a penalty 𝜙(̃xk+N ) to the objective function. The constraint
x̃ k+N = 0 can be removed or relaxed to a less stringent constraint. This will usually make it possible
to use shorter prediction horizons N and still obtain feasibility of the time k problem. Stability of
MPC is investigated in Section 8.10. We will now consider an example.
Example 8.9 Consider F(x, u) = Ax + Bu with A ∈ ℝ3×3 and B ∈ ℝ3×2 , where the matrices A and
B are chosen randomly with entries drawn from a standard uniform distribution with support on
the interval [0, 1]. The incremental cost is f (x, u) = xT x + uT u, and the constraints are −𝟙 ⪯ xk ⪯ 𝟙
and −0.5 × 𝟙 ⪯ uk ⪯ 0.5 × 𝟙, where 𝟙 is a vector of ones of appropriate dimension. The initial point
[ ]T
is x0 = 0.9 −0.9 0.9 . In Figure 8.3, the cost for the infinite horizon criterion is plotted versus
3.6
3.4
3.2
0 2 4 6 8 10
Figure 8.3 The cost for the infinite horizon criterion as a function of the horizon for the finite time horizon
approximation (circles), and for MPC (triangle).
8.8 Explicit MPC 225
1 1 1
0.5 0.5 0.5
0 0 0
𝑥
Figure 8.4 The trajectories for the states (top), and the control signals (bottom), for the finite horizon
approximation with N = 3 (left), and N = 20 (middle), and for the MPC with N = 3 (right). The different
components of the vectors are shown as circles, crosses, and triangles, respectively.
different time horizons N for the finite time horizon approximation. It is seen that the cost decreases
as N becomes larger. In fact, it converges to the cost for the infinite time horizon problem. Notice
that there is no feasible solution for N ≤ 2. Also the results for MPC with different time horizons
are shown as comparison. We see that MPC is always performing better than the finite-horizon
open-loop strategy for the same N. The improvement is larger, the smaller N is.
In Figure 8.4, the trajectories for the states and the control signals for the finite horizon approx-
imation with N = 3 and N = 20 together with the MPC with N = 3 are shown. It is seen that for
the finite horizon approximation with N = 3 the control signals and the states are zero for k ≥ N.
We also see that the difference in behavior is not so large for the finite horizon approximation with
N = 20 and the MPC with N = 3.
Let and be block-diagonal matrices with Q and R, respectively, on the diagonal such that
∑
k+N−1
f (xi , ui ) = X̃ X̃ + Ũ T U.
̃
T
i=k
We assume that the sets X and U are defined via the inequalities
u Ũ ⪯ u , x X̃ ⪯ x ,
for some matrices u , u , x , and x . It is now straightforward to verify that (8.25) can be reformu-
lated as
minimize Ũ T Ũ + 2Ũ T xk + xkT ΦT Φxk ,
[ ] [ ]
u u (8.26)
subject to ̃
U⪯ ,
x Γ x − x Φx k
̃ where = ΓT Γ + and = ΓT Φ. Note that the last term in the objective
with variable U,
function can be omitted when carrying out the optimization. By completing the squares and trans-
forming the variables according to
√ ( )
z = 2 Ũ + −1 xk ,
an equivalent optimization problem is obtained, which is a convex multiparametric quadratic
program:
minimize 12 zT z,
(8.27)
subject to Gz ⪯ 𝑤 + S𝜃,
with variable z, where 𝜃 = xk , H is defined as above, and where
[ ] [ ] [ ]
1 u u −1
G= √ , S= −1 − Φ) , 𝑤= u .
2 x Γ x (Γ x
From this reformulation and the discussion in Section 5.6, we conclude that we can obtain an
explicit feedback for MPC in the case of quadratic objective function with affine constraints.
For this case, the feedback function is piecewise affine over a polyhedral partitioning of the
state-space, and it can be computed off-line. The only major on-line computational effort is related
to computing in which polyhedron the current state xk is in.
(X0 , … , Xk ) → uk (X0 , … , Xk ), i.e. u = (u0 , u1 , …) is a random process adapted to X. The initial state
X0 is assumed to have a probability distribution that is independent of the probability distribu-
tion of Wk , k ∈ ℤ+ . The sets and can be finite or infinite as in Section 8.1, but as discussed
in Section 5.7 care has to be taken in case these sets are not finite. The control signal is used to
control the dynamical system, i.e. the evolution of the states. To this end, we introduce the incre-
mental costs fk ∶ n × m → ℝ for k ∈ ℤN−1 . The incremental costs are functions of the states and
the control signal. We also introduce a cost associated with the final state XN called the final cost or
terminal cost 𝜙 ∶ n → ℝ. The finite-horizon stochastic optimal control problem is then the problem
of solving
[ N−1 ]
∑
minimize 𝔼 fk (Xk , uk ) + 𝜙(XN ) ,
k=0 (8.29)
subject to Xk+1 = Fk (Xk , uk , Wk ), k ∈ ℤN−1 ,
with variables (u0 , X1 , … , uN−1 , XN ), where u is adapted to X. This is clearly a multistage stochastic
optimization problem, and it has a special structure that we will now exploit. We will from now on
assume that the constraints have been used to eliminate the variables Xk , and hence, we only have
to consider the variables (u0 , … , uN−1 ) in the optimization problem.
where 𝔼Xk denotes conditional expectation with respect to the probability distribution for Xk given
(X0 , … , Xk−1 ) for k ∈ ℕN , and where 𝔼X0 is expectation with respect to X0 . Note that we do not have
any dependence on uN in the objective function, and hence, we do not need to minimize over uN .
We will skip the outer expectation since it does not affect the optimization problem, and we then
assume that we know the value of the initial state to be X0 = x0 . Hence, we consider
[ [ [N−1 ]]]
∑
min 𝔼X1 min 𝔼X2 · · · min 𝔼XN fk (Xk , uk ) + 𝜙(XN ) ,
u0 u1 uN−1
k=0
for all values of x0 . Once the optimal value of the objective function is known for all values of x0 ,
we can of course compute the mean over the probability distribution for X0 . We make use of the
additivity of the objective function and rewrite the minimization of the objective function as
[ [ ]]
[ ]
min 𝔼X1 f0 (X0 , u0 ) + min 𝔼X2 f1 (X1 , u1 ) · · · min 𝔼XN fN−1 (XN−1 , uN−1 ) + 𝜙(XN ) .
u0 u1 uN−1
after substitution of the dynamic equation. Since Wk is independent of Xl for l ≤ k it follows that
we equivalently have
[ ]
min 𝔼WN−1 fN−1 (xN−1 , uN−1 ) + 𝜙(FN−1 (xN−1 , uN−1 , WN−1 )) ,
uN−1
228 8 Dynamic Programming
where 𝔼Wk denotes the expectation with respect to the probability distribution for Wk .7 We now
realize a very important fact. When we carry out the optimization above with respect to uN−1 ,
this will only be a function of xN−1 and not of any previous values of xk for k < N − 1. This will
also be true when we continue to optimize over uN−2 , i.e. it will only depend on xN−2 , and so on.
Hence, the stochastic optimal control problem can be solved with the following stochastic dynamic
programming recursion. Define the value functions Vk ∶ n → ℝ as VN (x) = 𝜙(x) and
[ ]
Vk (x) = min 𝔼 fk (x, u) + Vk+1 (Fk (x, u, Wk )) , k ∈ ℤN−1 ,
u
where the optimal control is u⋆k = 𝜇k (xk ) where 𝜇k ∶ n → m for k ∈ ℤN−1 is given by
𝜇k (x) = argmin Qk (x, u),
u
Example 8.10 Consider the problem in (8.29) for the case when the dynamic equation is lin-
ear, i.e. Fk (x, u, 𝑤) = Ak x + Bk u + 𝑤, where Ak ∈ ℝn×n and Bk ∈ ℝn×m and where = = = ℝ.
We assume that W is a zero mean random process with Wi independent of Wj for i ≠ j with vari-
[ ]
ance 𝔼 Wk WkT = Σk ∈ 𝕊n+ . We assume that the incremental costs and the final cost are quadratic
functions given by fk (x, u) = xT Sk x + uT Rk u and 𝜙(x) = xT SN x, respectively, where Rk ∈ 𝕊m
++ and
Sk ∈ 𝕊n+ . Application of the stochastic dynamic programming recursion gives VN (x) = xT SN x and
Vk (x) = min Qk (x, u),
u
7 In some applications, it will be convenient to let Wk depend on Xk and then we should take expectation with
respect to the conditional probability function for Wk given Xk = xk instead.
8.9 Markov Decision Processes 229
where
[ ]
Qk (x, u) = xT Sx + uT Ru + 𝔼 Vk+1 (Ak x + Bk u + Wk ) .
We will make the guess that Vk (x) = xT Pk x + rk for some Pk ∈ 𝕊n and some rk ∈ ℝ. This is clearly
true for k = N with PN = SN and rN = 0. We now assume that it is true for k + 1. It then holds that
[ ]T [ ][ ]
x Sk + ATk Pk+1 Ak ATk Pk+1 Bk x
Qk (x, u) = + tr Pk+1 Σk + rk+1 .
u T
Bk Pk+1 Ak T
Rk + Bk Pk+1 Bk u
Generally, if the optimal policy is unaffected when a disturbance such as W is replaced with its
mean, we say that certainty equivalence holds.
Example 8.11 Consider the problem of ordering a quantity uk ∈ ℝ+ of an item at periods k rang-
ing from 0 to N − 1 such that a stochastic demand Wk ∈ ℝ is met. Denote by Xk ∈ ℝ the stock
available at the beginning of period k. Assume that Wk are independent random variables. The
stock evolves as
Xk+1 = Xk + uk − Wk .
There is a cost r ∶ ℝ → ℝ, which is a function of Xk , for keeping stock. Moreover, there is a pur-
chasing cost cuk , where c > 0. The objective is to minimize
[ ]
∑(
N−1
)
𝔼 r(XN ) + r(Xk ) + cuk ,
k=0
with respect to uk . We assume that r is a convex function that is bounded from below and such that
r(x) → ∞ as |x| → ∞. Moreover, we assume that
dr(x)
lim < −c.
x→−∞ dx
The stochastic dynamic programming recursion is
[ ]
Vk (x) = min 𝔼 r(x) + cu + Vk+1 (x + u − Wk ) .
u≥0
We notice that if Vk+1 is a convex function in x, then 𝜑k is a convex function in (x, u). This follows
since the argument z = x + u − Wk is an affine transformation and since taking expectation of a
function preserves convexity. Under the same assumption, it now follows that
r(x) + cu + 𝜑k (x, u)
is a convex function of (x, u), since it is the sum of convex functions. Moreover, it is bounded from
below if 𝜑k is bounded from below and
d𝜑k (x, u)
lim < −c.
u→−∞ du
This is implied by Vk+1 being bounded from below and
dVk+1 (x)
lim < −c.
x→−∞ dx
Under these assumptions, it follows by Exercise 4.9 that
( )
min cu + 𝜑k (x, u)
u≥0
is a convex function that it is bounded from below. Since Vk is the sum of this function and r, which
is convex and bounded from below, it follows that also Vk is convex and bounded from below. The
fact that also
dVk (x)
lim < −c
x→−∞ dx
follows from the same property for r and that Vk is the sum of r and a function that is bounded from
below. For k + 1 = N, we have that VN (x) = r(x) and hence, VN is convex, bounded from below and
satisfies
dVN (x)
lim < −c.
x→−∞ dx
By induction, it now follows that Vk is convex, bounded from below and satisfies
dVk (x)
lim < −c,
x→−∞ dx
for all k. Since cu + 𝜑k (x, u) is bounded from below, it has an unconstrained minimum that we
denote by u0k . The optimal u⋆k is obtained by projecting u0k onto the set [0, ∞). Hence, if the constraint
was not present, it would be desirable to order u0k . Let sk = xk + u0k , and we can write u0k = sk − xk .
Then we realize that
{
⋆ sk − xk , xk ≤ sk ,
uk = 𝜇k (xk ) =
0, xk > sk ,
where we can interpret sk as a target value, i.e. as long as the current stock is above the target value
we do not need to make any order. Otherwise, we should order the difference.
Example 8.12 We consider the problem of accepting offers Wk ∈ ℝ for selling an asset, where
Wk are independent random variables for k ∈ ℤN . If we accept an offer, we can invest the money at
a fixed interest rate of r > 0 for the remaining period of time. Otherwise, we wait for the next offer,
and make a new decision. This is called an optimal stopping problem. We can cast this as a stochastic
optimal control problem by letting the state space be = ℝ ∪ {t}, where the element t denotes
termination of the offers. We let the control space be = {u0 , u1 }, where u0 denotes keeping the
asset, and where u1 denotes selling the asset. We let the state evolve as
{
t, if Xk = t or Xk ≠ t and uk = u1 ,
Xk+1 = F(Xk , uk , Wk ) =
Wk , otherwise.
8.9 Markov Decision Processes 231
closed loop stability, in the sense that (Xk , uk ) is bounded in mean square, i.e. the second moments
of these random variables are bounded for all k ≥ 0. To simplify the presentation below, we will also
restrict ourselves to the case when the incremental cost is strictly positive definite. We define J ⋆ ∶
n → ℝ+ to be the minimal value for the optimization problem in (8.31) for initial value x, and we
define u⋆k to be the optimizing sequence of decisions or control signals that achieves this minimal
value.
Here, W = Wk for an arbitrary k. Then V(x) = J ⋆ (x) and with the Q-function Q ∶ n × m → ℝ
defined as
[ ]
Q(x, u) = 𝔼 f (x, u) + 𝛾V(F(x, u, W)) , (8.33)
it holds that u⋆k = 𝜇(Xk ), where 𝜇 ∶ n → m is given by
𝜇(x) = argmin Q(x, u) (8.34)
u
is an optimal feedback control. If in addition 𝛾 is sufficiently close to one, this feedback results in
closed-loop stability in the sense defined above. The proof of this result is given in Section 8.10. We
next consider an example which is called infinite horizon stochastic LQ control.
Example 8.13 Let us consider the case when F(x, u, 𝑤) = Ax + Bu + 𝑤 for matrices A ∈ ℝn×n and
B ∈ ℝn×m . We also assume that f (x, u) = xT Sx + uT Ru, where S ∈ 𝕊n++ and R ∈ 𝕊m
++ , and that Wk are
independent and define a weakly stationary random process with zero mean and with covariance
Σ ∈ 𝕊n+ . Clearly, we satisfy the assumptions on the functions f and F. We will make the guess that
V(x) = xT Px + r for some P ∈ 𝕊n++ and some r ≥ 0. Then
Q(x, u) = xT Sx + uT Ru + 𝛾(Ax + Bu)T P (Ax + Bu) + 𝛾(tr PΣ + r).
As in Example 8.4, Q is minimized for
u = −𝛾(R + 𝛾BT PB)−1 BT PAx.
Back substitution of this results in
( ( )−1 )
xT Px + r = xT S + 𝛾AT PA − 𝛾 2 AT PB R + 𝛾BT PB BT PA x
+ 𝛾(tr PΣ + r).
This equation holds if P is the solution to the discounted algebraic Riccati equation in (8.11), and if
r = tr(PΣ)∕(1 − 𝛾). We see that certainty equivalence holds also for infinite horizon stochastic LQ
control.
It is possible to extend the results on VI and PI to the stochastic setting in a straightforward man-
ner. What is needed to prove these extensions are the monotonicity and the contraction properties of
the stochastic versions of the Bellman operator and the Bellman policy operator, see Exercise 8.16.
Also, the LP formulation can be extended to the stochastic setting.
8.10 Appendix 233
8.10 Appendix
and if 𝛾 is sufficiently close to one, then V(x) = J ⋆ (x), where we recall that J ⋆ (x) is the optimal value
of the problem in (8.9) for initial value x. Moreover, u⋆k = 𝜇(xk ), where 𝜇 ∶ n → m with
𝜇(x) = argmin {f (x, u) + 𝛾V(F(x, u))}
u∈U(x)
where the last equality follows from the fact that uk is stabilizing and that V(0) = 0. Since the above
inequality holds with equality for uk = 𝜇(xk ) by the Bellman equation, we have that uk = 𝜇(xk ) is
optimal for the infinite-horizon optimal control problem.
234 8 Dynamic Programming
Vk+1 ≤ aVk + b,
where 0 < a < 1 for 𝛾 sufficiently close to one. From this, it follows that
and that
[ N−1 ] [ N−1 ]
∑ ∑ ( )
lim 𝔼 𝛾 f (Xk , uk ) ≥ lim 𝔼
k
𝛾 V(Xk ) − 𝛾V(Xk+1 ) ,
k
N→∞ N→∞
k=0 k=0
where the last equality follows from the fact that 𝔼V(XN ) ≤ c2 𝔼||XN ||22 + c2 < ∞ for all N ≥ 0 and
that 𝛾 < 1, and where the penultimate equality follows from the fact that X0 = x0 is known. Since
the above inequality holds with equality for uk = 𝜇(Xk ) by the stochastic Bellman equation, we have
that uk = 𝜇(Xk ) is optimal for the infinite horizon stochastic optimal control problem.
∑
k+N−1
= f (̃xi , ũ i ) + f (̃xk+N , ũ k+N ) − f (̃xk , ũ k ),
i=k
= Jk⋆ + f (0, 0) − f (̃x⋆k , ũ ⋆k ),
where f (0, 0) = 0 by assumption. Here, Jk⋆ denotes the optimal value of Jk . Hence,
⋆
0 ≤ Jk+1 ≤ Jk+1 = Jk⋆ − f (̃x⋆k , ũ ⋆k ).
It now follows that f (̃x⋆k , ũ ⋆k ) → 0, k → ∞, since otherwise, Jk⋆ → −∞, which is a contradiction.
Since f is strictly positive definite we have that
( )
f (̃x⋆k , ũ ⋆k ) ≥ 𝜖 ||̃x⋆k ||2 + ||ũ ⋆k ||2 ,
Exercises
8.1 Assume that we have a vessel whose maximum weight capacity is z and whose cargo is to
consist of different quantities of N different items. Let 𝑣k denote the value of the kth type of
item, and let 𝑤k denote the weight of the kth type of item.
236 8 Dynamic Programming
(a) Let xk be the used weight capacity of the vessel after the first k − 1 items have been
loaded and let the control uk be the quantity of item k to be loaded on the vessel. For-
mulate the dynamic equation:
xk+1 = Fk (xk , uk ),
describing the process.
(b) Determine the constraint set U(k, xk ) for the control signal uk .
(c) Formulate a DP recursion that solves the problem of finding the most valuable cargo
satisfying the maximal weight capacity. Observe that you do not need to solve the
problem.
8.3 A businessman operates out of a van that he sets up in one of the two locations each day.
If he operates in location i day k, where i = 1, 2, he makes a known and predictable profit
denoted rik . However, each time he moves from one location to the other, he pays a setup
cost c. The businessman wants to maximize his total profit over N days.
(a) The problem can be formulated as a shortest path problem (SPP), where node (k, i)
is representing location i at day k. Let s and e be the start node and the end node,
respectively. The costs of all edges are
– s to i1 with cost −ri1
– ik to ik+1 , i.e. no switch, with cost −rik+1 , k = 1, … , N − 1
k+1
– ik to ̄ik+1 , i.e. switch, with cost c − r k+1 , k = 1, … , N − 1
̄ik+1
– iN to e with cost 0,
where ̄i denotes the location that is not equal to i, i.e. 1̄ = 2 and 2̄ = 1. Draw a figure
to illustrate the SPP and the definitions of variables and parameters. Write down the
corresponding dynamic programming algorithm. Note that you do not have to solve the
problem.
(b) Suppose the businessman is at location i on day k − 1 and let
Rki = r̄ik − rik .
Show that if Rki ≤ 0, it is optimal to stay at location i, while if Rki ≥ 2c, it is optimal to
switch. You can use the following lemma.
Lemma: For every k = 1, 2, … , N, it holds that
|Vk (i) − Vk (̄i)| ≤ c,
where Vk (i) is the value of the optimal cost-to-go function at stage k for state i.
Exercises 237
8.5 Consider the problem of packing a knapsack with items labeled k ∈ ℕN of different value
𝑣k and weight 𝑤k . There is only one copy of each item, and the task is to pack the knapsack
such that the sum of the values of the items in the knapsack is maximized subject to the
constraint that the total weight of the items is less than or equal to the weight limit W.
The problem can be posed as a multistage decision problem in which at each stage k, it is
decided whether item k should be loaded or not. This decision can be coded in terms of the
binary variable uk ∈ {0, 1}, where uk = 1 in case item k is loaded and uk = 0, otherwise. If
xk denotes the total weight of the knapsack after the first k − 1 items have been loaded, then
the following relation holds:
xk+1 = xk + 𝑤k uk , x1 = 0,
for k ∈ ℕN . The constraint that xk+1 ≤ W can be reformulated in terms of uk as uk ≤ (W −
xk )∕𝑤k . From this, it follows that it is possible to calculate how to load the knapsack in an
optimal way using the dynamic programming recursion
{ }
Vk (x) = max 𝑣n u + Vk+1 (x + 𝑤n u) ,
u≤(W−x)∕𝑤n ,u∈{0,1}
with final value VN+1 (x) = 0. We will consider the case when W = 10 and when the values
of 𝑣k and 𝑤k are defined as in the table below, where N = 5.
k 1 2 3 4 5
𝑣k 2 8 7 1 3
𝑤k 1 5 4 3 2
(b) From the tables derived in (a) compute the optimal loading of the knapsack.
238 8 Dynamic Programming
(a) Compute an optimal feedback policy uk = 𝜇(xk ) for this problem when the control signal
constraint is neglected.
Hint: Try the value function V(x) = px2 with p > 0.
(b) Now, consider the case with constraints on the control signal. Compute an approxima-
tive solution by solving
{ }
minimize−1≤u≤1 x2 + u2 + V(x + u) ,
8.7 Show that the Bellman policy operator in (8.15) is a monotone operator and that it is a
contraction when 𝛾 < 1.
with given initial vale x0 . You may assume that f (x, u) = 𝜌x2 + u2 and F(x, u) = x + u, where
x and u are scalar-valued, and where 𝜌 > 0.
(a) Compute the optimal feedback policy using the Bellman equation by guessing that
V(x) = px2 for some p.
(b) In general, it is more tricky to solve the Bellman equation, and different iterative proce-
dures are available. Use VI, i.e. let Vk (x) be iteratively computed from
for k = 1, 2, … with V0 (x) = 0. Show that Vk (x) = pk x2 and that the minimizing argu-
ment is lk xk , where lk = −pk ∕(pk + 1) and where
(c) Now, instead use PI. In this method, one starts with an initial feedback policy 𝜇0 (x) and
repeats the following two steps iteratively for k = 0, 1, 2, …:
1. Compute Vk (x) such that
Assume that 𝜇0 (x) = l0 x and that Vk (x) = pk x2 . Show that 𝜇k+1 (x) = lk+1 x, where now
𝜌 + l2k
pk = ,
1 − (1 + lk )2
and
lk+1 = −pk ∕(pk + 1),
with l0 given.
(d) Compute the sequences pk and lk in (b) and (c) for k = 1, 2, … , 5 when 𝜌 = 0.5. Assume
that l0 = −0.1 for the method in (c). The iterates will converge to the solution in (a).
Which method converges the fastest?
8.9 In this exercise, we consider the LQ problem in Example 8.4. We will specifically consider
the case when m = 1 and n = 2, and the matrices
[ ] [ ]
0.5 1 0
A= , B= , R = 1 and S = I.
0 0.5 1
(a) Implement the Hewer iterations in Example 8.6. You may start with L0 = 0. Specifically
investigate how many iterations are needed for convergence of Lk .
(b) Implement the approach in Example 8.7. You may start with L0 = 0. How does the
choice of initial values xs and the number r of initial values affect the convergence of Lk .
How does the choice of N affect the convergence.
8.10 In this exercise, we investigate how to compute the explicit solution to the MPC problem
using the multiparametric toolbox MPT3 for MATLAB; see https://www.mpt3.org/.
Consider a second-order system
[ ] [ ]
0.8584 −0.0928 0.0928
xk+1 = xk + u ,
0.0464 0.9976 0.0024 k (8.38)
[ ]
yk = 0 1 xk .
This has been obtained from a continuous time system with transfer function
2
,
s2 + 3s + 2
by zero-order hold sampling with a sampling time of 0.05.
(a) Use MPT3 to calculate the explicit MPC for (8.38) using the weighted 2-norm for
the incremental costs defined by Q = I, R = 0.01, when N = 2 and for the following
constraints
[ ] [ ]
−10 10
−1 ≤ uk ≤ 1 and ⪯ xk ⪯ .
−10 10
How many regions are there? Present plots of the partitioning of the state-space for the
control signal, and for the time trajectory for the initial condition x0 = (1, 1).
(b) Find the number of regions in the control law for N = 1, 2, … , 13. How does the
number of regions depend on N, e.g. is the complexity polynomial or exponential?
Estimate the order of the complexity by computing
min ||𝛼N 𝛽 − nr ||22 and min ||𝛾(𝛿 N − 1) − nr ||22 ,
𝛼,𝛽 𝛾,𝛿
8.11 [14, Exercise 1.14] A farmer annually producing Xk units of a certain crop stores (1 − uk )Xk
units of his production, where uk ∈ [0, 1], and invests the remaining uk Xk units, which
increase the production for next year to a level of Xk+1 according to
Xk+1 = Xk + Wk uk Xk ,
where Wk are i.i.d. random variables that are independent of Xk and uk . Moreover, it holds
that 𝔼Wk = 𝑤̄ > 0. The total expected product stored over N years is given by
[ ]
∑
N−1
𝔼 XN + (1 − uk )Xk .
k=1
Show that the optimal solution that maximizes the expected product stored is independent
of xk and given by
⎧𝜇0 (x0 ) = · · · = 𝜇N−1 (xN−1 ) = 1, 𝑤̄ > 1,
⎪
⎨𝜇0 (x0 ) = · · · = 𝜇N−1 (xN−1 ) = 0, 0 < 𝑤̄ < 1∕N,
⎪
⎩𝜇0 (x0 ) = · · · = 𝜇N−k−1
̄ (xN−k−1 ̄ ) = 1, 𝜇N−k̄ (xn−k̄ ) = · · · = 𝜇N−1 (xN−1 ) = 0, 1∕N ≤ 𝑤̄ ≤ 1,
Xk+1 = (1 − 𝛿k )Xk + uk 𝛾k Xk ,
Yk+1 = (1 − 𝛿k )Yk + (1 − uk )𝛾k Xk ,
for k ∈ ℤN , where 𝛿k are i.i.d. random variables with values in [𝛿, 𝛿 ′ ], where 0 < 𝛿 ≤ 𝛿 ′ < 1,
where 𝛾k are i.i.d. random variables with values in [a, b], where a > 0. We assume that
0 < 𝛼 ≤ uk ≤ 𝛽 < 1. We may interpret Xk as the number of educators in a certain country at
time k and Yk as the number of research scientists. By means of incentives, a science policy
maker can determine the proportion uk of new scientists produced at time k who become
educators. The initial values of X0 and Y0 are known. We will derive the optimal policy for
maximizing 𝔼YN .
(a) Show that the value functions are given by Vk (x, y) = ck x + dk y for some ck , dk ∈ ℝ.
(b) Derive the optimal policy when 𝔼𝛾k > 𝔼𝛿k , and show that the optimal policy 𝜇k (x, y) is
independent of x and y.
Xk+1 = ak Xk + buk + Wk , k = 0, 1, … , N,
where ak , bk ∈ ℝ, and where Wk are i.i.d. random variables with a Gaussian distribution
with zero mean and variance 𝜎 2 < 1. We are interested in minimizing the objective function
[ ( )]
1 2 1∑( 2
N−1
)
𝔼 exp x + x + ruk 2
,
2 N 2 k=0 k
where r ∈ ℝ++ . Derive the optimal control law assuming that it is adapted to X.
Exercises 241
Hint: Start by showing that the following recursion for value functions Vk ∶ ℝ → ℝ,
[ ( ) ( )]
Vk (x) = min 𝔼Wk exp x2 + ru2 Vk+1 ax x + bk u + Wk ,
u
with final value VN (x) = x2 ∕2 provides the optimal solution. Also remember the results in
Exercise 3.10.
8.14 [14, Exercise 4.12] We want to run a machine to produce a certain item to meet a known
demand dk ∈ ℝ, k ∈ ℤN . The machine can be in a bad (B) state or a good (G) state. The state
of the machine evolves according to the transition probabilities
ℙ[ G ∣ G ] = λG , ℙ[ B ∣ G ] = 1 − λG , ℙ[ B ∣ B ] = λB , ℙ[ G ∣ B ] = 1 − λB ,
where λB , λG ∈ [0, 1]. Denote by Xk ∈ ℝ, k ∈ ℤN , the stock in period k. If the machine is in
̄ where ū > 0 is a known constant, and
a good state at period k, it can produce uk ∈ [0, u],
the stock evolves as
Xk+1 = Xk + uk − dk .
Otherwise, it evolves as
Xk+1 = Xk − dk .
where g ∶ ℝ → ℝ is a convex function bounded from below and such that g(x) → ∞ as
|x| → ∞. The objective function should be minimized. Show that the value functions in
the stochastic dynamic programming recursion are convex functions in x and that there
for each k is a target stock level Sk+1 such if the machine is in good state, it is optimal to
produce u⋆k ∈ [0, u]̄ that will bring Xk+1 as close as possible to Sk+1 .
Hint: Remember the footnote about when Wk depends on Xk in the derivation of the
stochastic dynamic programming recursion.
242 8 Dynamic Programming
where the assumptions on f , 𝛾, F, and W are the same as in Section 8.9. Show that if
V1 (x) ≤ V2 (x) for all x ∈ n , then
(V1 ) − (V2 ) ≤ 0.
Also show that is a contraction.
(b) Define the stochastic Bellman policy operator 𝜇 ∶ ℝ → ℝ as
n n
[ ]
𝜇 (V) = 𝔼 f (x, 𝜇(x)) + 𝛾V(F(x, 𝜇(x), W)) ,
where the assumptions on f , 𝛾, F, and W are the same as in Section 8.9. Show that if
V1 (x) ≤ V2 (x) for all x ∈ n , then
𝜇 (V1 ) − 𝜇 (V2 ) ≤ 0.
Also, show that 𝜇 is a contraction.
243
Part IV
Learning
245
Unsupervised Learning
We are now going to discuss unsupervised learning. This is about finding lower-dimensional
descriptions of a set of data {x1 , … , xN }. One simple such lower-dimensional description is the
mean of the data. Another one could be to find a probability function from which the data are the
outcome. We will see that there are many more lower-dimensional descriptions of data. We will
start the chapter by defining entropy, and we will see that many of the probability density functions
that are of interest in learning can be derived from the so-called “maximum entropy principle.”
Specifically, we will derive the categorical distribution, the Ising distribution, and the normal
distribution. There is a close relationship between the Lagrange dual function of the maximum
entropy problem and maximum likelihood (ML) estimation, which will also be investigated. Other
topics that we cover are prediction, graphical models, cross entropy, the expectation maximization
algorithm, the Boltzmann machine, principal component analysis, mutual information, and
cluster analysis. As a prelude to entropy we will start by discussing the so-called Chebyshev bounds.
maximize p(x)dx,
∫C
× ℝm+1 → ℝ be defined as
( )
∑ m
L[p, 𝜆] = p(x)dx + 𝜆i ai − p(x)fi (x)dx) .
∫C i=0
∫S
Optimization for Learning and Control, First Edition. Anders Hansson and Martin Andersen.
© 2023 John Wiley & Sons, Inc. Published 2023 by John Wiley & Sons, Inc.
Companion Website: www.wiley.com/go/opt4lc
246 9 Unsupervised Learning
if
1 − inf f (x, 𝜆) ≤ 0, − inf f (x, 𝜆) ≤ 0.
x∈C x∈S⧵C
[ ] ∑m
Since 𝔼 fi (X) = ai , we also have that these conditions imply that supp ℙ[ X ∈ C ] ≤ i=0 𝜆i ai . We
can therefore compute the smallest possible such upper bound by solving the dual problem
∑
m
minimize a i 𝜆i ,
i=0
subject to 1 − inf f (x, 𝜆) ≤ 0,
x∈C
− inf f (x, 𝜆) ≤ 0,
x∈S⧵C
with variable 𝜆. This is a convex optimization problem in 𝜆, which follows by noting that
∑
m
inf f (x, 𝜆) = inf 𝜆i fi (x)
x∈C x∈C
i=0
is the infimum over a family of linear functions of 𝜆, and hence, it is a concave function of 𝜆. The
same argument applies to the function in the second constraint.
Example 9.1 Let S = ℝ+ , C = [1, ∞), f0 (x) = 1 and f1 (x) = x. Assume that it is known that
𝔼f1 (X) = 𝔼X = 𝜇, where 0 ≤ 𝜇 ≤ 1. Then the so-called Markov bound
ℙ[ X ≥ 1 ] ≤ 𝜇
holds. The result is derived in Exercise 9.1.
9.2 Entropy
Let us consider a probability space (Ω, , ℙ). Entropy measures the amount of uncertainty in a
probability distribution. Assume that we observe the values of a random variable X ∶ Ω → ℝ for
outcomes of experiments and estimate the mean of the random variable. What is then the most
likely distribution for the random variable? The maximum entropy principle says that it is the one
that maximizes the entropy among all possible probability distributions that have the same esti-
mated mean. To formalize this, we consider a finite sample space Ω = ℕn and the set of probability
functions defined by
{ }
∑n
n = p ∈ [0, 1] ∣ pk ≥ 0, k ∈ ℕn ;
n
pk = 1 ,
k=1
Example 9.2 In this example, we consider the static ranking of web pages. To formalize this,
we let V = ℕn represent a set of n web pages and define a graph G = (V, E), where the edge set
E ⊂ V × V contains directed edges (i, j) describing that there is a link from web page i to web page j.
For an example see Figure 9.1, where V = {1, 2, 3, 4}, and E = {(1, 2), (2, 4), (4, 1), (4, 3), (3, 1)}. The
most well-known way to model the ranking is called PageRank. It uses a Markov chain model, cf .
Section 3.10, in which all outgoing links are assigned equal transition probability. The transition
probability that a user at page i jumps to page j is given by
⎧1
⎪ , (i, j) ∈ E,
pij = ⎨ di
⎪ 0, (i, j) ∉ E,
⎩
248 9 Unsupervised Learning
where di is the out-degree of node i, i.e. the number of edges (i, j) ∈ E, where
1 2
j ∈ V. The PageRank is then defined as the stationary distribution of the
Markov chain, i.e. as the solution of 𝜋 T = 𝜋 T P, where P ∈ ℝn×n is the matrix
of transition probabilities pij at position i, j. This is an eigenvalue problem,
and 𝜋 is the normalized eigenvector corresponding to the eigenvalue that is
one. For the example in Figure 9.1, we have d = (1, 1, 1, 2) and 3 4
⎡ 0 1 0 0⎤
⎢ ⎥ Figure 9.1 Graph
0 0 0 1⎥
P=⎢ , showing the links
⎢ 1 0 0 0⎥ between different
⎢1∕2 0 1∕2 ⎥
0⎦
⎣ web pages.
with eigenvector 𝜋 = (0.2857, 0.2857, 0.1429, 0.2857). We see that web page
number three has the lowest ranking.
Another way to model the transition probabilities is based on a network flow approach. Let yij be
the number of users following link (i, j) ∈ E per unit time. Assume that the web traffic is in a state
of equilibrium so that the traffic out of a node is equal to the in-coming traffic per unit time, i.e.
∑ ∑
yij = yji , i ∈ V.
j∶(i,j)∈E j∶(j,i)∈E
We now define the probabilities pij = yi,j ∕Y , which allow us to write the equilibrium condition as
∑ ∑
pij = pji , i ∈ V. (9.2)
j∶(i,j)∈E j∶(j,i)∈E
Note that these probabilities are not the transition probabilities in the PageRank model. They are
∑
obtained by normalization with j∶(i,j)∈E pij .
One solution to (9.2) is
Hi
pij = , (i, j) ∈ E,
Y di
which agrees with the PageRank model after normalization. However, there are many more solu-
tions to the equilibrium condition, which can be interpreted as moment constraints. The maximum
entropy solution under the moment constraints is obtained by solving
∑
maximize − pij ln pij ,
(i,j)∈E
∑ ∑
subject to pij = pji , i ∈ V,
j∶(i,j)∈E j∶(j,i)∈E
∑
pij = 1,
(i,j)∈E
pij ≥ 0, (i, j) ∈ E,
with variables pij , (i, j) ∈ E. We will investigate this optimization problem more in Exercise 9.6.
9.2 Entropy 249
Solving this equation with respect to 𝜆 will give us the pf. We now realize that if we do not want to
parameterize the probability function in terms of the expected value b, then we can do it in terms
of 𝜆, i.e.
e−𝜆fk
pk = ∑n . (9.3)
−𝜆fl
l=1 e
250 9 Unsupervised Learning
The parameter 𝜆 is called the natural parameter. The probability function we have derived belongs
to the family of exponential probability functions. The distribution is known under several different
names: the categorical, Gibbs, or Boltzmann distribution. Note that we may normalize such that
⎧ ezk
⎪ ∑n−1 z , k ∈ ℕn−1 ,
⎪ 1 + l=1 e l
pk = ⎨
⎪ 1
∑n−1 z , k = n,
⎪ 1 + l=1 el
⎩
where zk = 𝜆(fn − fk ), k ∈ ℕn−1 , and this is the form of the categorical probability function that is
used in logistic regression, which is discussed in Section 10.4.
Example 9.3 A construction company is ordering lumber every month. The lumber comes in
three different grades. The construction company cannot decide which quality to order, but it
can observe that the prices are different per unit. The different prices are f1 = 1 for the lowest
grade, f2 = 1.1 for the middle grade, and f3 = 1.2 for the highest grade, respectively. They have also
observed that on average the price is b = 1.05. We can then use the maximum entropy principle to
estimate what the probabilities are that low-, middle-, or high-grade lumber is delivered. Let p1 be
the probability that low-grade lumber is delivered, p2 that medium-grade lumber is delivered, and
p3 that high-grade lumber is delivered. We then have the following moment constraint:
1 × p1 + 1.1 × p2 + 1.2 × p3 = 1.05.
From the function
G(𝜆) = (1 − 1.05)e−𝜆×1 + (1.1 − 1.05)e−𝜆×1.1 + (1.2 − 1.05)e(−𝜆×1.2) ,
which is shown in Figure 9.2, we see that G(𝜆) is zero for 𝜆 = 6.9, and hence, we can compute the
probabilities from (9.3) to be
p1 = 0.5386, p2 = 0.2701, p3 = 0.1913.
⋅10−5
2
1.5
1
𝐺
0.5
6 6.5 7 7.5 8
𝜆
p(x) ≥ 0, x ∈ x ,
with variable p, and where, with abuse of notation, m is the vector of first moments, and M is
the matrix of second moments, for which we do not specify the diagonal. Ignoring the inequality
constraints, we introduce the partial Lagrangian L ∶ ℝn × ℝm × 𝕊m × ℝ → ℝ via
( )
∑
L(p, 𝜆, Λ, 𝜇) = Hn (p) + 𝜆 T
xp(x) − m
x∈x
( ( )) ( )
∑ ∑
+ tr Λ T
xx p(x) − M +𝜇 p(x) − 1 ,
x∈x x∈x
where Λ has a zero diagonal because of the unspecified diagonal in the constraints. This is a concave
function of p. It holds that
𝜕L ( )
= − ln p(x) − 1 + 𝜆T x + tr ΛxxT + 𝜇,
𝜕p(x)
from which it follows that
( )
p(x) = exp 𝜆T x + tr(ΛxxT ) − 1 + 𝜇 ,
and hence p(x) ≥ 0. The Lagrange dual function g ∶ ℝm × 𝕊m × ℝ → ℝ follows by inserting the
expression for p into the Lagrangian. Minimizing this function with respect to 𝜇 results in choosing
𝜇 such that the probabilities sum up to one. With A = 1 − 𝜇 this holds if
( )
∑ ( T ( ))
A = ln exp 𝜆 x + tr Λxx T
.
x∈x
We see that A is a function of 𝜆 and Λ, and we, therefore, have A ∶ ℝm × 𝕊m → ℝ. The resulting
distribution is called the Ising distribution,1 where 𝜆 and Λ are the natural parameters. We introduce
1 Strictly speaking, it is the random variable that is defined by the probability function that we have derived which
has the Ising distribution.
252 9 Unsupervised Learning
the so-called energy function E ∶ {0, 1}m → ℝ given by E(x) = −𝜆T x − tr(ΛxxT ). We may then write
(∑ )
the probability function as p(x) = exp (−E(x) − A(𝜆, Λ)), where A(𝜆, Λ) = ln x exp(−E(x)) .
In order to relate the natural parameters to the moments m and M, we substitute 𝜇 = 1 − A(𝜆, Λ)
into the Lagrange dual function and obtain the function h ∶ ℝm × 𝕊m → ℝ given by
We proceed to minimize this function with respect to (𝜆, Λ). The optimality conditions are
𝜕h 𝜕A
= − m = 0,
𝜕𝜆 𝜕λ
𝜕h 𝜕A
= − M = 0,
𝜕Λ 𝜕Λ
where
∑ ∑
𝜕A x∈x x exp (−E(x)) x∈ x exp (−E(x) − A(𝜆, Λ))
= ∑ = ∑ x ,
𝜕𝜆 x∈x exp (−E(x)) x∈x exp (−E(x) − A(𝜆, Λ))
∑
= xp(x),
x∈x
∑ T ∑ T
𝜕A x∈x xx exp (−E(x)) x∈x xx exp (−E(x) − A(𝜆, Λ))
= ∑ = ∑ ,
𝜕Λ x∈x exp (−E(x)) x∈x exp (−E(x) − A(𝜆, Λ))
∑
= xxT p(x).
x∈x
Here, we do not consider the equations related to the diagonal of the second equation. This is
because we have constrained the diagonal of Λ to be zero. We see that the equations say that we
should match the moments. To solve the above equations with respect to (𝜆, Λ) is not easy in gen-
eral, and especially not when the dimension of x is large.
where m ∈ ℝn and M ∈ 𝕊+ are the first and second moments of the distribution, respectively. The
maximum entropy problem is
maximize H(p),
subject to xp(x)dx = m,
∫ ℝn
p(x)dx = 1,
∫ ℝn
p(x) ≥ 0, x ∈ ℝn ,
with variable p ∈ , which is a generalization of the problem we discussed in Example 7.3. We
define the partial Lagrangian functional L ∶ × ℝ × 𝕊n × ℝ → ℝ by
( )
L[p, 𝜆, Λ, 𝜇] = − p(x) ln p(x)dx + 𝜆T m − xp(x)dx
∫ ℝn ∫ℝn
( ( ))
1
+ tr Λ M − xxT p(x)dx
2 ∫ℝn
( )
+𝜇 1− p(x)dx ,
∫ℝn
where we have ignored the constraint p(x) ≥ 0. The first variation of the Lagrangian is
( )
1
𝛿L[𝛿p] = − ln p + 1 + 𝜆T x + xT Λx + 𝜇 𝛿p dx,
∫ ℝn 2
which should be nonpositive for all 𝛿p when p is optimal. Hence, the expression in the parenthesis
must vanish by the du Bois Raymond lemma, see Section 7.1, and the optimal pdf is
T x− 1 x T Λx
p(x) = e−1−𝜇−𝜆 2 , (9.8)
which clearly is nonnegative. We will in Section 9.9 verify that p is indeed maximizing the entropy
and not merely is a stationary point. The Lagrange dual function g ∶ ℝ × 𝕊n × ℝ → ℝ is defined by
T x− 1 x T Λx 1
g(𝜆, Λ, 𝜇) = e−1−𝜇−𝜆 2 dx + 𝜆T m + tr (ΛM) + 𝜇,
∫ ℝn 2
1 T Λ−1 𝜆− 1 T
= e−1−𝜇+ 2 𝜆 2
(x+Λ−1 𝜆) Λ(x+Λ−1 𝜆) dx
∫ ℝn
1
+ 𝜆T m + tr (ΛM) + 𝜇,
2
where the second equality follows from completing the squares and assuming that Λ is invertible.
We will later on see under which assumption we have invertibility. The Lagrange dual function is
a convex function, and we determine the 𝜇 that minimizes it by setting the partial derivative of g
with respect to 𝜇 equal to zero. We find that p(x) should integrate to one, i.e.
1 T Λ−1 𝜆− 1 T
p(x)dx = e−1−𝜇+ 2 𝜆 2
(x+Λ−1 𝜆) Λ(x+𝜆−1 Λ) dx,
∫ℝn ∫ℝn
1
2n∕2 e−1−𝜇+ 2 𝜆
T Λ−1 𝜆
2 2
= √ e−̄x1 d̄x1 × · · · × e−̄xn d̄xn ,
det Λ ∫ℝ ∫ℝ
1
e−1−𝜇+ 2 𝜆 Λ 𝜆 (2𝜋)n∕2
T −1
= √ = 1.
det Λ
254 9 Unsupervised Learning
( )
Hence, 𝜇 = 12 𝜆T Λ−1 𝜆 − 1 − 12 ln det Λ
(2𝜋)n
from which, we obtain
√
det Λ − 12 (x+Λ−1 𝜆)T Λ(x+Λ−1 𝜆)
p(x) = e .
(2𝜋)n
If we insert the optimal 𝜇 into the Lagrange dual function, we can define a function h ∶ ℝn × 𝕊n →
ℝ as
( ( ))
1 T −1 1 det Λ
h(𝜆, Λ) = g 𝜆, Λ, 𝜆 Λ 𝜆 − 1 − ln ,
2 2 (2𝜋)n
( )
1 T −1 1 det Λ 1
= 𝜆 Λ 𝜆 − ln + 𝜆T m + tr (ΛM) .
2 2 (2𝜋)n 2
Note that this is also a convex function. In order to find 𝜆 and Λ, we minimize this function, which
can be done by setting the derivatives with respect to 𝜆 and Λ equal to zero, i.e.
𝜕h
= Λ−1 𝜆 + m = 0,
𝜕𝜆
𝜕h 1 1 1 1 1 1
= − Λ−1 𝜆𝜆T Λ−1 − Λ−1 + M = − mmT − Λ−1 + M = 0,
𝜕Λ 2 2 2 2 2 2
where the second equality in the second equation follows from the first equation. Noting that
Σ = M − mmT is the covariance matrix, we immediately have that Λ = Σ−1 , and hence, the invert-
ibility of Λ is equivalent to the covariance matrix belonging to 𝕊n++ . Moreover, we have 𝜆 = −Σ−1 m
and
1 1 T −1
p(x) = √ e− 2 (x−m) Σ (x−m) .
(2𝜋)n det Σ
Note that we may just as well use the natural parameter (𝜆, Λ) instead of (m, Σ). If h is expressed
in terms of (m, Σ) instead of in terms of (𝜆, Λ) it will not be a convex function. As we will see, this
makes it convenient to use the natural parameters.
9.3 Prediction
Let fX,Y ∶ ℝm × ℝn → ℝ+ be the joint pdf of two random variables X and Y with marginal pdfs
fX ∶ ℝm → ℝ+ and fY ∶ ℝn → ℝ+ , and suppose we are given an observation x of X and would
like to predict a value y for Y . The fact that X and Y are not independent will be utilized.
Clearly, we could just compute fY |X ∶ ℝm × ℝn → ℝ+ , the conditional pdf for Y given X, which is
defined as
fX,Y (x, y)
fY |X (y|x) = ,
fX (x)
cf . (3.7). This relationship is the foundation for the Bayesian approach to statistics, a topic we return
to in Section 10.3. In Section 3.11, we derived the conditional pdf for an hidden Markov model
(HMM). If one wants a single value for the prediction, one may consider the argument y of fY |X (y|x)
that maximizes it for the observation x, i.e.
ŷ = argmax{fY |X (y|x)}.
y
This is called the maximum a posteriori (MAP) estimate. The reason for the name a posteriori is
that the conditional pdf is the pdf for Y resulting after we observe that X = x.
9.3 Prediction 255
with the function g as variable. This criterion is often called the mean squared error (MSE). We will
now show that this infinite-dimensional optimization problem has a simple solution in terms of a
conditional expectation. To this end, write the above objective function as
where the second equality follows by completing the squares. Thus, the above integral is minimized
by g(x) = 𝔼[Y|X = x], i.e. the conditional expectation of Y given X = x.
In general, neither the conditional pdf nor the conditional expectation is easy to compute. One
important exception is when fX,Y is the normal pdf, i.e.
1 1 T Σ−1 (z−𝜇)
fX,Y (z) = √ e− 2 (z−𝜇) ,
(2𝜋)m+n det Σ
where z = (x, y), 𝜇 = (𝜇x , 𝜇y ) and where
[ ]
Σx Σxy
Σ= T .
Σxy Σy
From Example 3.6, we then have that the conditional pdf is given by
1 1 T Σ−1 (y−𝜇 )
fY |X (y|x) = √ e− 2 (y−𝜇y|x ) y|x y|x
,
(2𝜋)n det Σy|x
where
The conditional expectation is given by 𝜇y|x , which is an affine function of x. Note that the max-
imizing argument y of fY |X (y|x) is y = 𝜇y|x , and hence, we have shown that the conditional mean
is also the MAP estimate for the normal distribution case. As a consequence, the Kalman filter
in Section 3.11 provides the MAP estimate as well as the estimate that minimizes the MSE of the
prediction when the noise sequences are Gaussian.
Another special case is a so-called Gaussian mixture model, which has a joint pdf of the form
∑
N
fX,Y (z) = 𝛼j fXi ,Yi (z),
i=1
where
1 1 T Σ−1 (z−𝜇 )
fXi ,Yi (z) = √ e− 2 (z−𝜇i ) i i
,
(2𝜋)m+n det Σi
256 9 Unsupervised Learning
∑
N
fX (x) = 𝛼i fXi (x),
i=1
where
1 1 T −1
fXi (x) = √ e− 2 (x−𝜇x,i ) Σx,i (x−𝜇x,i ) ,
(2𝜋)m det Σx,i
and hence, the conditional pdf is given by
∑N
i=1 𝛼i fX ,Y (x, y)
fY |X (y|x) = ∑N i i .
i=1 𝛼i fXi (x)
We again make use of the fact that fXi ,Yi (x, y) = fXi (x)fYi |Xi (y|x) and obtain
∑N ∑N
i=1 𝛼i fXi (x)∫ℝn yfYi |Xi (y|x)dy i=1 𝛼i fX (x)𝜇i (x)
𝔼[Y |X = x] = ∑N = ∑N i ,
i=1 𝛼i fXi (x) i=1 𝛼i fXi (x)
where
are the linear predictors for Yi given Xi = x. We see that the overall predictor of Y given X = x is a
convex combination of these linear predictors, where the weights are functions of x, and hence, it
is a nonlinear predictor.
with 𝛼i = 1∕3, i ∈ ℕ3 . Figure 9.3 shows the level curves of the pdf together with the nonlinear pre-
dictor given by the conditional expectation.
𝑦
0
−1
−3 −2 −1 0 1 2 3
𝑥
𝜕J
= −my + Amx + b = 0,
𝜕b
𝜕J
= −DTxy + ADx − my mTx + bmTx + Amx mTx = 0.
𝜕A
Inserting the expression for b from the first equation into the second equation and simplifying
results in A = DTxy D−1 T −1
x and b = my − Dxy Dx mx . It follows that
( )
Ax + b = my + DTxy D−1
x x − mx ,
which is in agreement with the normal distribution case in (9.3.1). We have just replaced the
moments of the normal distribution with the moments of the general distribution. Thus, if we are
satisfied with the best linear predictor, we only need to know the first- and second-order moments
( )
of the distribution. The minimal value of J is tr Dy − DTxy D−1
x Dxy . This value is in general larger
that the trace of the covariance of Y conditioned on X = x. The predictor for Y has the very nice
property that
[ ] [ ]
𝔼 AX + b = my + DTxy D−1 x 𝔼 X − mx = my ,
i.e. its expected value agrees with the expected value of Y . This is what is called an unbiased pre-
dictor.
258 9 Unsupervised Learning
where the latter formula only holds when Xi and Yi are independent. A formula for the general case
of Dxy is more complicated. For the mixture in Example 9.4, it holds that mx = −1∕3, my = 2∕3,
Dx = 71∕27, and Dxy = 24∕27. Hence, the affine predictor is given by
( )
2 24 1
Ax + b = + x+ .
3 71 3
Figure 9.4 shows the affine predictor together with the nonlinear predictor and the level curves of
the pdf.
1
𝑦
−1
−3 −2 −1 0 1 2 3
𝑥
9.4 The Viterbi Algorithm 259
𝑦
0
−2
−4 −2 0 2 4
𝑥
moments of the random variables with their estimates from the observations (xi , yi ), and then the
same formulas for A and b apply. Figure 9.5 shows 100 samples from the Gaussian mixture in
Example 9.4 together with both the true affine predictor and the affine predictor estimated from
the 100 samples. We see that the two predictors are close to one another.
We consider an HMM, as defined in Section 3.11. For ease of reference, the definition is repeated.
Consider two random processes X ∶ Ω → ℤ+ and Y ∶ Ω → ℤ+ that are correlated. The sets and
will be defined later. We will assume that X is a Markov process satisfying (3.10), and that Yj given
Xj are independent of Yk given Xk for j ≠ k. In Section 3.11, we derived the filtering equations for
the conditional pdf pXk |Ȳ k , where Ȳ k = (Y0 , … , Yk ). In this section, we are interested in predicting
or estimating X̄ k = (X0 , … , Xk ) from the observation ȳ k of Ȳ k using MAP estimation, cf . Section 9.3.
Note that we are interested in estimating X̄ k rather than just Xk . This is often called a smoothing
problem. To this end, the joint probability function or pdf pX̄ k ,Ȳ k ∶ k+1 × k+1 → ℝ+ for (X̄ k , Ȳ k ) is
needed.2 We also need the conditional probability function or pdf for Ȳ k given X̄ k : pȲ k |X̄ k ∶ k+1 ×
k+1 → ℝ+ . Because of the conditional independence, it can be expressed as
∏
k
pȲ k |X̄ k (̄yk |̄xk ) = pYi |Xi (yi |xi ),
i=0
where pYi |Xi ∶ × → ℝ+ are the conditional probability functions or pdfs for Yi given Xi . We also
define the marginal probability function or pdf for X̄ k : pX̄ k ∶ k+1 → ℝ+ . From the above assump-
tions, it follows that
∏
k
pX̄ k ,Ȳ k (̄xk , ȳ k ) = pȲ k |X̄ k (̄yk |̄xk )pX̄ k (̄xk ) = pYi |Xi (yi |xi )pX̄ k (̄xk ). (9.13)
i=0
2 We notice that maximizing this joint pdf will result in the MAP estimate, since the only difference between the
conditional pdf used for MAP estimation and the joint pdf is a normalization with the marginal pdf for the
observations.
260 9 Unsupervised Learning
We now make use of the multiplication theorem, see Section 3.2, and obtain with obvious defini-
tions of the involved functions
pX̄ k (̄xk ) = pX0 (x0 )pX1 |X0 (x1 |x0 )pX2 |X0 ,X1 (x2 |x0 , x1 ) · · · pXk |X0 ,…,Xk−1 (xk |x0 , … , xk−1 ),
= pX0 (x0 )pX1 |X0 (x1 |x0 )pX2 |X1 (x2 |x1 ) · · · pXk |Xk−1 (xk |xk−1 ),
where the last equality follows from the Markov property. From (9.13) it then follows that
Therefore,
⋮
× max {pX2 |X1 (x2 |x1 )pY1 |X1 (y1 |x1 )
x1
× max {pX1 |X0 (x1 |x0 )pY0 |X0 (y0 |x0 )pX0 (x0 )}}}}}.
x0
We now introduce functions Vk ∶ → ℝ+ for k ∈ ℤN defined via V0 (x) = pY0 |X0 (y0 |x)pX0 (x) and the
recursion
and that the optimal x̄ k is such that xi−1 is the maximizing u above for i ∈ ℕk , and xk is the maximiz-
ing x above. The recursions above are summarized as the famous Viterbi algorithm in Algorithm 9.1.
We remark that it was much easier to derive the Viterbi algorithm than to derive the filtering
equations in Section 3.11. The reason for this is that we are only interested in computing the MAP
estimate and not in obtaining the conditional probability function or pdf.
There is also a logarithmic version of the Viterbi algorithm which is obtained by defining
Ji ∶ → ℝ for 0 ≤ i ≤ k as Ji (x) = − log Vi (x). It then follows that the recursion reads
Ji (x) = − log pYi |Xi (yi |x) + min {− log pXi |Xi−1 (x|u) + Ji−1 (u)},
u
9.5 Kalman Filter on Innovation Form 261
with initial value J0 (x) = − log pY0 |X0 (y0 |x) − log pX0 (x). This is known to often have better numerical
properties.
A typical example of an HMM for = ℝn and = ℝp is obtained by considering the random
processes defined by the recursion
Xk+1 = Fk (Xk , Vk ),
(9.14)
Yk = Gk (Xk , Ek ),
where Fk ∶ ℝn × ℝn → ℝn , Gk ∶ ℝn × ℝp → ℝp , and where Ek are i.i.d. p-dimensional random vec-
tors and Vk are i.i.d. n-dimensional random vectors, and where X0 ∈ ℝn is a random vector with
known distributions. We will assume that X0 is independent of Ek and Vk for all k ≥ 0. It is straight
forward to verify the Markov property and the conditional independence property.
An important special case of the HMM in (9.14) is obtained when Fk (x, 𝑣) = Ax + 𝑣 and Gk (x, e) =
Cx + e, where A ∈ ℝn×n , and C ∈ ℝp×n , and where X0 , Vk , and Ek all have Gaussian distributions
with expectations x̄ , 0, and 0, respectively, and covariances R0 , R1 , and R2 , respectively. We derived
the Kalman filter in Section 3.11 for this HMM. We will see that we can obtain similar recursions
for the MAP estimate, and we will show that they actually provide the same estimate. We have
where is defined as in Example 3.4, i.e. all the involved pdfs are Gaussian. Applying the loga-
rithmic Viterbi recursion, we find that
1 1
J0 (x) = (x − x̄ 0 )T R−1 ̄ 0 ) + (y0 − Cx)T R−1
0 (x − x 2 (y0 − Cx), (9.15)
2 2
{ }
1 1
Ji (x) = (y − Cx)T R−1
2 (yk − Cx) + min 1 (x − Au) + Ji−1 (u) ,
(x − Au)T R−1 (9.16)
2 k u 2
for i ∈ ℕk modulo constant terms. Finally, we obtain the optimal xk as the solution of
minimize Jk (x).
262 9 Unsupervised Learning
We will now find a more explicit solution to the above recursions by verifying that
1 T
Ji (x) =x Pi x + qTi x + ri ,
2
for some Pi ∈ 𝕊n+ , qi ∈ ℝn , and ri ∈ ℝ. For i = 0 this holds with
P0 = R−1 T −1
0 + C R2 C, q0 = −R−1
0 x̄ 0 − CT R−1
2 y0 . (9.17)
The actual value of the constant term r0 will be of no interest, and this is true for the whole sequence
ri . The argument of the min operator on the right-hand side of (9.16) is a strictly convex function
of u, and hence, its minimizer must satisfy the stationary condition
−AT R−1
1 (x − Au) + Pi−1 u + qi−1 = 0.
where we again omit the constant terms. By making use of the definition of Gi−1 , the expressions
can be simplified to
2 C + R1 − R1 AGi−1 A R1 ,
Pi = CT R−1 −1 −1 −1 T −1
2 yi + R1 AGi−1 qi−1 .
qi = −CT R−1 −1 −1
f
2 yi − Yk APi−1 qi−1 ,
qi = −CT R−1 −1
(9.19)
where
f ( )
−1 T −1
Yi = R1 + APi−1 A . (9.20)
This also shows that Pi is positive definite. The estimate at the final time i = k is now given by
xk = −Pk−1 qk .
The above recursions and this expression can be used to obtain the solution for any value of k, i.e.
we can obtain the solution for the problem ending at k + 1 from the solution for the problem ending
9.5 Kalman Filter on Innovation Form 263
at k with just one more step in the recursion. We summarize the Kalman filter on the innovation
form in Algorithm 9.2. It is a good idea to avoid computing the inverse of Pk . It is better to use a
Cholesky factorization. Sometimes the algorithm is presented with qk having the opposite sign. We
may also use the fact that u = xi−1 to obtain the estimates of xi for i ∈ ℤk−1 from
( T −1 )
i−1 A R1 xi − qi−1 ,
xi−1 = G−1 (9.21)
which are the so-called “smoothed estimates.” Here, we may use the SMW identity to express
G−1
i−1
as
( )
−1 T −1 −1
G−1 −1 −1 T
i−1 = Pi−1 − Pi−1 A R1 + APi−1 A Pi−1 .
The smoothed estimates will be different for different values of k, i.e. we cannot find the smoothed
solution for the problem ending at k + 1 from the solution of the problem ending at k without
re-running the backward recursion in (9.21). The more intuitive explanation for this is that the
new measurement yk+1 affects all the smoothed estimates.
f ( )−1
Yk+1 ← R1 + APk−1 AT
f
qk+1 ← −Yk+1 APk−1 qk − CT R−1 y
2 k+1
end
The form of the Kalman filter that we have derived is called the information form. It has advan-
tages when the inverses of the covariance matrices are sparse. It is also advantageous when we have
0 → 0. Then we can initialize
little information about the initial value of the state and need to let R−1
f
with Y0 = 0. However, the Kalman filter is often presented in another way, cf . Section 3.11, which
we will now derive from the innovation form summarized in Algorithm 9.2. Let us define Σk = Pk−1
( )−1
f f
and Σk = Yk . From the update formula for Pk in (9.18), we then have
( )−1
f
Σ−1
k
= C T −1
R 2 C + Σk .
f
We will now show that the recursion for qk in (9.19) is related to recursions for xka and xk ,
defined as
( )−1 ( )
f f f f
xka = xk + Σk CT CΣk CT + R2 yk − Cxk ,
f
xk+1 = Axka .
f
with x0 = x̄ 0 , and where the estimate at time k is given by xk = xka . More precisely, we will show
that with the definition qk = −Pk xka , the above recursions are the same as the recursion for qk that
we derived previously in (9.19). It holds that
( ( )−1 )
f T f T f
Pk xk = Pk I − Σk C CΣk C + R2 C xk
( )−1
f f
+ Pk Σk CT CΣk CT + R2 yk ,
where we in the first equality have used the SMW identity and where we in the second equality
f f
have added and subtracted Yk (Yk + CT R−1
2
C)−1 inside the parenthesis. Thus, we have that
( ) f
2 C xk + Pk Σk C R2 yk ,
Pk xk = Pk I − Σk CT R−1 T −1
f
2 yk ,
a
−1
= Yk APk−1 Pk−1 xk−1 + CT R−1
which agrees with the recursion for qk in (9.19). It is also straightforward to show using similar
techniques that the initial values agree. The Kalman filter on standard form is summarized in
Algorithm 3.1.
In this section, we are going to investigate decoding of a coded message. We will specifically con-
sider the so-called convolutional codes, for which a signal u ∈ {0, 1}ℤ+ is coded into y ∈ {0, 1}ℤ+
using a convolution
∑
n
yk = ci uk−i ,
i=1
where ci ∈ {0, 1} for i ∈ ℕn represents the code. The above summation is carried out as modulo
two. We assume that uk = 0 for k < 0 to make the convolution well defined. It is then possible to
introduce a state xk ∈ {0, 1}n such that
It is then straightforward, cf . Section 10.1, to see that the ML problem of estimating uk for k ∈ ℤN−1
is equivalent to solving the optimal control problem
∑
N
minimize (rk − Cxk )2 ,
k=0
subject to xk+1 = Axk + Buk , k ∈ ℤN−1 ,
with variables (u0 , x1 , … , uN−1 , xN ), where x0 = 0, see [84]. Omura used dynamic programming
as presented in Chapter 8 to solve the problem. This does, however, not result in a very practi-
cal algorithm, since a solution for N cannot be used to solve a problem where N is replaced with
N + 1. This is often of importance in decoding applications. i[109] proposed an algorithm that does
not suffer from this limitation. It is based on performing dynamic programming forward in time
instead of backward in time. This can be derived from the general approach of partially separa-
ble optimization problems as presented in Section 5.5.3 We introduce for k ∈ ℤN−1 the functions
fk ∶ {0, 1}n × {0, 1} × {0, 1}n → ℝ as
where ID ∶ {0, 1}n × {0, 1} × {0, 1}n → ℝ is the indicator function for the set
We also let 𝜙 ∶ {0, 1}n → ℝ be defined as 𝜙(x) = (yN − Cx)2 . With this notation, we may write the
optimal control problem above as
∑
N−1
minimize 𝜙(xN ) + fk (xk , uk , xk+1 ),
k=0
with variables (u0 , x1 , … , uN−1 , xN ), where x0 = 0. This is clearly a partially separable optimization
problem.
We then introduce the functions Vk ∶ {0, 1}n → ℝ defined as
{ }
Vk+1 (x+ ) = min Vk (x) + fk (x, u, x+ ) , k ∈ ℤN−1 ,
u,x
where V0 (x) = 0. We also define 𝜇k+1 ∶ {0, 1}n → {0, 1} × {0, 1}n as the minimizing argument in
the above minimization, i.e.
{ }
𝜇k+1 (x+ ) = argmin Vk (x) + fk (x, u, x+ ) .
x,u
The function VN (x) + (rN − Cx)2 is then finally minimized with respect to x to obtain the optimal xN .
After this has been done, all optimal variables can be recovered from the recursion
3 We may interpret this as a variation of the Viterbi algorithm if we take yk in the Viterbi algorithm equal to rk ,
consider the state transition probability to be degenerate, and introduce an additional variables uk to optimize over.
266 9 Unsupervised Learning
with variables (x, u). Because of the very specific structure of the constraints, we can say much
more. To avoid cluttering the notation, we look at n = 3. Then we have that the constraints are
u = x1+ ,
x1 = x2+ ,
x2 = x3+ ,
and hence, the only optimization variable is x3 . Therefore, the optimization problem is
minimize (rk − c1 x2+ − c2 x3+ − c3 x3 )2 + Vk ((x2+ , x3+ , x3 )),
where the minimizing argument x3 will be a function of x2+ and x3+ . Also, notice that Vk+1 will only
be a function of x2+ and x3+ . The minimizing argument is therefore
+
⎡ x1 ⎤
⎢ + ⎥
⎢ x2 ⎥
𝜇k (x) = ⎢ ⎥.
+
⎢ x3 ⎥
⎢ ⎥
⎣x3 (x2 , x3 )⎦
+ +
Only the last component is nontrivial to compute, and it can be done by enumerating of all possible
values of x2+ and x3+ . We now realize that we need tables for Vk and 𝜇k that have 2n−1 entries.
In the practical use of Viterbi decoding, the value of N is not fixed, but it is increasing and rep-
resents time. The decoding is done with some fixed delay d measured from N, i.e. it is uN−d that is
estimated. Thus, we need to store one table for VN and d tables for 𝜇k , where N − d + 1 ≤ k ≤ N. In
case d2n−1 is large, this could be costly. Approximations for how to circumvent this were proposed
by Viterbi; see, e.g. [110].
where
( ( ))
∑ ∑ ∑
A(𝜆, Λ) = ln exp 𝜆k xk + Λi,j xi xj .
x∈x k∈ℕn (i,j)∈E
We notice that we do not specify all the entries of neither Λ nor the matrix M of second moments. If
we let Λi,j = 0 for (i, j) ∉ E, then we may express the probability function in terms of tr(ΛxxT ). The
9.7 Graphical Models 267
Lagrange dual function will therefore look the same, and so will the optimality conditions, except
for the fact that those related to (i, j) ∉ E are omitted, i.e. we have
𝜕h 𝜕A
= − m = 0,
𝜕𝜆 𝜕𝜆
𝜕h 𝜕A
= − Mij = 0, (i, j) ∈ E.
𝜕Λij 𝜕Λij
Also, for this case, it is in general difficult to solve the optimality conditions.
𝐶
𝐴
Figure 9.6 A graph where the subset of nodes in C separates the subset of nodes in A from the ones in B.
where pA∣C ∶ ℝ|A|+|C| → ℝ+ and pB∣C ∶ ℝ|B|+|C| → ℝ+ are the conditional pdfs and where
pC ∶ ℝ|C| → ℝ+ is the marginal pdf for the variables indexed by C. This means that the random
variables indexed by A and B conditioned on the random variables indexed by C are independent,
which is called conditional independence. This is the motivation for the name Markov random field
for a distribution specified as above.
From the above definitions, it follows that Λ must have the following structure:
⎡× 0 × 0⎤
⎢ ⎥
0 × × 0⎥
Λ=⎢ .
⎢× × × ×⎥
⎢0 0 × ×⎥⎦
⎣
We realize that we with no loss of generality may consider the components indexed by D to be part
of A and/or B, or we may assume that V = A ∪ B ∪ C, which we from now on do, see Figure 9.6.
From this, it follows that Λ must have the following structure:
⎡× 0 ×⎤
Λ = ⎢ 0 × ×⎥ ,
⎢ ⎥
⎣× × ×⎦
which is called an arrow structure. To prove the conditional independence property, we partition
the covariance matrix Σ as
⎡ ΣA ΣAB ΣAC ⎤
Σ = ⎢ΣTAB ΣB ΣBC ⎥ .
⎢ T ⎥
⎣ΣAC ΣTBC ΣC ⎦
We also let
[ ] [ ] [ ] [ ]T
Σ̃ Σ̃ Σ Σ Σ ΣAC
Σ̃ = T1 12 = TA AB − AC Σ−1 ,
Σ̃ 12 Σ̃ 2 ΣAB ΣB ΣBC C ΣBC
which is the covariance matrix for the variables indexed by A and B conditioned on the variables
indexed by C. Hence, it only remains to prove that Σ̃ 12 = 0. From the formula for the inverse of a
blocked matrix in (2.54), it follows that
[ −1 ]
Σ̃ ×
−1
Σ = .
× ×
We once again apply this formula to obtain that
( )−1 ( )−1
⎡ Σ̃ 1 − Σ̃ 12 Σ̃ −1 Σ̃ T
− ̃
Σ − ̃
Σ ̃
Σ ̃
−1 T
Σ Σ̃ 12 Σ̃ 2 ⎤
−1
Σ̃ = ⎢ ⎥.
−1 2 12 1 12 2 12
( )−1
⎢−Σ̃ −1 Σ̃ T Σ̃ − Σ̃ Σ̃ −1 Σ̃ T ⎥
⎣ 2 12 1 12 2 12 × ⎦
9.8 Maximum Likelihood Estimation 269
and hence, Σ̃ 12 = 0, which is what we wanted to prove. Because of this, (9.23) holds.
There may be many more ways to partition V such that the above property holds. It is possible to
show that the pdf in general can be factorized as
∏
p(x) = fC (xC ),
C∈
for some functions fC ∶ ℝ|C| → ℝ+ , where is the set of all cliques of G, i.e. the set of all complete
subgraphs of G, see [112]. Here, C has a different meaning than before. It is possible to take to be
the set of maximal cliques of G, where a clique is maximal if it is not contained in any other clique.
We will later discuss how this structure may be utilized in more detail. The above Markov property
holds for general graphical models defined on undirected graphs, and specifically, also for the Ising
model.
We have already seen how we can estimate distributions by maximizing entropy. Now, we will
consider the problem of estimating parameters in a distribution by maximizing the so-called “like-
lihood function.” This is called maximum likelihood (ML) estimation. For a probability function
p ∈ n that depends on a parameter 𝜆 ∈ ℝ, we may define the likelihood function ∶ ℝN × ℝ →
[0, 1] based on N samples xk , i ∈ ℕN , of the random variable X ∶ ℕn → ℝ as
∏
N
(x1 , … , xN ; 𝜆) = pi ,
i=1
[ ]
where pi = ℙ X = xi . Then an estimate of 𝜆 is obtained by maximizing , or equivalently, the
logarithm of .
Note that the subindex i now refers to the ith sample and not the ith component of x. If we take
∑N ∑N
m = N1 i=1 xi and M = N1 i=1 xi xiT in Section 9.2, then minimizing h(𝜆, Λ) = A(𝜆, Λ) − 𝜆T m −
tr(ΛM) is equivalent to maximizing the likelihood function. This results also hold for the case
when we define the Ising model on a graph.
∏
N
(x1 , … , xN ; 𝜃) = p(xi , 𝜃).
i=1
As before, the estimate of 𝜃 is obtained by maximizing . Now, consider the normal distribution
discussed in Section 9.2, and suppose that 𝜃 = (𝜆, Λ) ∈ ℝn × 𝕊n . The log-likelihood function is then
1 ∑(
N
)T ( )
ln (x1 , … , xN ; 𝜃) = − xi + Λ−1 𝜆 Λ xi + Λ−1 𝜆
2 i=1
( )
N det Λ
+ ln ,
2 (2𝜋)n
( (N ))
1 ∑ ∑N
N
= − tr Λ xi xiT
− 𝜆T xi − 𝜆T Λ−1 𝜆
2 i=1 i=1
2
( )
N det Λ
+ ln . (9.24)
2 (2𝜋)n
If we take
1∑ 1∑ T
N N
m= x, M= xx
N i=1 i N i=1 i i
( )
in Section 9.2, then minimizing h(Λ, Λ) = 12 𝜆T Λ−1 𝜆 − 12 ln det Λ
(2𝜋)n
+ 𝜆T m + 12 tr (ΛM) is equivalent
to maximizing the likelihood function. Hence, the relationship between maximum entropy and
ML estimation is the same also for continuous distributions. Note that the problem of minimizing
h is a convex optimization problem. The solution has already been derived in Section 9.2, and with
Σ = M − mmT , we have Λ = Σ−1 and 𝜆 = −Σ−1 m. Thus, the solution to the ML problem is simply
the sample mean and covariance in case we use the nonnatural parameterization. We may also
consider ML estimation for the normal distribution when it is defined on a graph, and we obtain
similar results.
9.8.4 Generalizations
There are several ways in which we could generalize the ML problem. For example, if we have prior
information such as upper or lower bounds on the matrix Λ, e.g.
Bl ⪯ Λ ⪯ Bu ,
9.9 Relative Entropy and Cross Entropy 271
with Bl , Bu ∈ 𝕊n++ , then we have a convex constraint that easily can be incorporated when we min-
imize h(𝜆, Λ). Also, an upper bound 𝜅max on the condition number of Λ can be incorporated by
noting that it is equivalent to the existence of u > 0 such that uI ⪯ Λ ⪯ 𝜅max uI. We may also include
prior information by modifying m and M. For example, if we have reason to believe from previous
experience that m0 and M0 are good values, then we could take
1∑
N
m = 𝛼m0 + (1 − 𝛼) x,
N i=1 i
1∑ T
N
M = 𝛽M0 + (1 − 𝛽) xx ,
N i=1 i i
with 𝛼, 𝛽 ∈ [0, 1], where the values of 𝛼 and 𝛽 are related to how much we trust our prior informa-
tion as compared to the information in the data {x1 , … , xN } that we have collected.
Sometimes it is of interest to quantify the difference between different distributions, and this is what
relative entropy, or equivalently the Kullback–Leibler divergence does. We will give the definition for
two pdfs p and q defined on ℝn with obvious modifications for probability functions. Let
{ }
|
ℝn |
n = p ∈ ℝ | p(x)dx = 1, p(x) ≥ 0, ∀x ∈ ℝ n
.
|∫ℝn
Define the relative entropy D ∶ n × n → ℝ from q to p as
q(x)
D(p, q) = − p(x) ln dx.
∫ℝn p(x)
D(p, q) ≥ 0,
with equality if and only if p(x) = q(x) for almost all x. From this, we see that the relative entropy
measures the “difference” between two pdfs. However, it is not a metric, since in general D(p, q) ≠
D(q, p). The proof of Gibbs’ inequality is based on Jensen’s inequality that says that for a convex
function 𝜑 ∶ ℝ → ℝ, any function f ∶ ℝn → ℝ, and any pdf p, it holds that
( )
𝜑 f (x)p(x)dx ≤ 𝜑 (f (x)) p(x)dx.
∫ℝn ∫ℝn
This is just a slight modification of Jensen’s inequality in Exercise 4.8, where dx is replaced with
p(x)dx. If we let f (x) = q(x)∕p(x), and 𝜑(f ) = − ln(f ) in Jensen’s inequality, we find that
q(x)
D(p, q) ≥ − ln p(x)dx = − ln q(x)dx = 0.
∫ℝn p(x) ∫ℝn
We will now use Gibbs’ inequality to show that the pdf
( )
1
p(x) = exp −1 − 𝜇 − 𝜆T x − xT Λx
2
272 9 Unsupervised Learning
in (9.8) indeed maximizes the entropy. Suppose that q is another pdf that satisfies the same moment
constraints, i.e.
xp(x)dx = xq(x)dx = m,
∫ℝn ∫ℝn
= −D(q, p) + H(p),
and since D(q, p) ≥ 0 with equality when q = p, it follows that H(q) ≤ H(p) with equality when
q = p. Thus, p maximizes the entropy under the given moment constraints.
C(p, q) ≥ H(p),
with equality if and only if p(x) = q(x) for almost all x. We also realize that cross entropy is the
expected value of − ln q with respect to the pdf p, i.e.
[ ]
C(p, q) = −𝔼 ln q ,
with variable 𝜃, where xk , k ∈ ℕN , are the observed data. If we assume that xk are observations of
a random variable with pdf p, then the objective function is proportional to an SAA of the cross
entropy.
9.10 The Expectation Maximization Algorithm 273
with variable 𝜃. Unfortunately, it is sometimes complicated to write down , and the expression
for its gradient with respect to 𝜃. Hence, the evaluations of function and gradients might be
time-consuming and/or error-prone to implement. However, in case we had some more observa-
tions z ∈ ℝM , then sometimes the ML problem with x = (y, z) as observation happens to not suffer
from the above difficulties. This is often the case when some data are missing because of errors
in the collection of the data, so-called missing data, or in case there are data that are difficult to
measure, so-called latent variables.
We will now show how to circumvent the above problems. To this end, let fX ∶ ℝN × ℝM ×
ℝ → ℝ+ be the joint pdf for the random variable X = (Y , Z) with parameter 𝜃, of which we have
p
the partial observation Y = y.4 We also define the conditional pdf fZ|Y ∶ ℝN × ℝM × ℝp → ℝ+ via
fX (y, z; 𝜃) fX (y, z; 𝜃)
fZ|Y (z|y; 𝜃) = = .
fY (y; 𝜃) (y; 𝜃)
This follows since is the marginal pdf for the observation y. Moreover, we consider an arbitrary
pdf q ∈ M . Note that this is not the marginal pdf for the latent variable z. We now consider the
infinite-dimensional optimization problem
with variables 𝜃 and q, where D is the relative entropy. The second term in the objective function
depends on 𝜃, since fZ|Y is a function of 𝜃. We have that only the second term in the objective
function depends on q, and therefore, we may first carry out the maximization with respect to this
term over q with the trivial maximum q = fZ|Y with D(fZ|Y , fZ|Y ) = 0. Then we are left with the ML
problem originally defined, and hence, the problems are equivalent, i.e. we can trivially obtain the
solution for one of them from the other. Furthermore, it holds that
where H is the entropy and C is the cross-entropy. The above results follow from the fact that cross
entropy is the sum of entropy and relative entropy, and from the definition of the conditional pdf
for Z given Y . We now apply block-coordinate accent to the optimization problem above, i.e. we
repeat the following steps:
This is not a procedure that in general is guaranteed to converge to the optimal solution. However,
our derivation shows that we can never obtain a worse value of the objective function by iterating
4 In this section, we do not have several observations of a scalar-valued random variable. Instead, we have one
observation of a vector-valued random variable. The former case can be treated as a special case of the vector-valued
case by taking each component of the vector-valued random variable to be a random variable with the same
distribution.
274 9 Unsupervised Learning
as above. A detailed discussion of the convergence properties is given in, e.g. [117]. The first step
above is trivial, since by the original formulation of the objective function, we immediately have
that q = fZ|Y is optimal. In the second step q is fixed, and therefore, it is by the reformulation of
the objective function equivalent to maximizing −C(fZ|Y , fX ) with respect to 𝜃, since the term H(q)
only depends on the previous value of 𝜃 which we will denote 𝜃. ̄ Hence, we first need to evaluate
the function Q ∶ ℝp × ℝp → ℝ given by
[ ]
̄ = −C(fZ|Y , fX ) = 𝔼 ln(fX (Y , Z; 𝜃))|Y = y; 𝜃̄ ,
Q(𝜃, 𝜃)
i.e. we need to compute the conditional expected value of the log of the joint pdf fX . This is often
easy, and it is called the expectation step in the expectation maximization (EM) algorithm. After
this, we need to solve
̄
maximize Q(𝜃, 𝜃),
with variable 𝜃, which is often easier to solve than the original ML problem. This is called the
maximization step. We will later exemplify the claims regarding what is difficult and easy to com-
pute. The E-step is sometimes approximated using an SAA also called empirical cross-entropy,
i.e.
1∑
M
[ ]
𝔼 ln fX (Y, Z; 𝜃)|Y = y; 𝜃̄ ≈ ln fi ,
M i=1
where fi = fX (y, zi ; 𝜃) is obtained by drawing a sample zi from the conditional pdf fZ|Y . This is called
Monte Carlo EM.
where 𝛼 ∈ Δk . In other words, the random variable Y may be viewed as an overall population that
is derived from a mixture of subpopulations.
An important special case is when the k components are Gaussian random variables. The corre-
sponding mixture model is called a Gaussian mixture model (GMM), and the mixture pdf can be
expressed as
∑
k
fY (y; 𝜃) = 𝛼j (y, 𝜇j , Σj ), (9.25)
j=1
where 𝜃 represents the model parameters (𝛼j , 𝜇j , Σj ), j ∈ ℕk . The problem of computing a ML esti-
mate of the model parameters based on a given set of m independent observations, y1 , … , ym ∈ ℝd ,
9.11 Mixture Models 275
subject to 𝛼 ∈ Δ , k
Σj ⪰ 0, j ∈ ℕk ,
with variables (𝛼j , 𝜇j , Σj ), j ∈ ℕk . This problem is generally nonconvex and intractable, but local
optimization methods may be used in the pursuit of a local maximum. We note that the ML
estimation problem becomes much easier if, in addition to the m observations y1 , … , ym , we are
given labels z1 , … , zm ∈ ℕk such that zi identifies which of the k components the ith observation
originates from. However, such labels are typically not available.
We will now show how the EM algorithm can be used to derive a relatively simple iterative pro-
cedure that converges to a local maximum. To this end, we introduce a discrete random variable
Z which takes the value j with probability 𝛼j , j ∈ ℕk , i.e. Z is a latent variable that identifies one of
the k components. Moreover, we define the pdf of Y given Z = z as
∑
k
fY |Z (y|z; 𝜃) = 𝛿j (z) (y, 𝜇j , Σj ),
j=1
It is easy to check that (9.25) follows from (9.26) by marginalizing over Z. Moreover, the probability
function of Z conditioned on Y may be expressed as
∑k
fY ,Z (y, z; 𝜃) j=1 𝛿j (z)𝛼j (y, 𝜇j , Σj )
fZ|Y (z|y; 𝜃) = = ∑k . (9.27)
fY (y; 𝜃)
l=1 𝛼l (y, 𝜇l , Σl )
Now, given m observations y1 , … , ym and model parameters 𝜃, ̄ e.g. an initial guess or the param-
eters from the previous iteration of the EM algorithm, the E-step of the EM algorithm may be
expressed as
[ m ]
∑
̄ =𝔼
Q(𝜃, 𝜃) ln(fY ,Z (Yi , Zi ; 𝜃)) ∣ Y1 = y1 , … , Ym = ym ; 𝜃̄ ,
i=1
∑
m
[ ]
= 𝔼 ln(fY ,Z (Yi , Zi ; 𝜃)) ∣ Yi = yi ; 𝜃̄ ,
i=1
∑∑
m k
= ̄ ln(fY ,Z (yi , j; 𝜃)),
fZ|Y (j|yi ; 𝜃)
i=1 j=1
where (Yi , Zi ), i ∈ ℕm , are independent pairs of random variables with the same joint pdf as that of
̄ as
(Y , Z). Using (9.27), we can write Q(𝜃, 𝜃)
∑∑
m k
( )
̄ =
Q(𝜃, 𝜃) 𝑤̄ ij ln 𝛼j (yi , 𝜇j , Σj ) , (9.28)
i=1 j=1
5 Note that Y is a continuous random variable and Z is a discrete random variable, for which we define a joint pdf.
276 9 Unsupervised Learning
̄ is the probability that yi originates from the jth mixture component under
where 𝑤̄ ij = fZ|Y (j|yi , 𝜃)
the mixture model defined by the model parameters 𝜃. ̄
The M-step of the EM algorithm is the problem of maximizing Q(𝜃, 𝜃) ̄ with respect to the model
parameters 𝜃. This is a block separable optimization problem, which follows by writing Q(𝜃, 𝜃) ̄ as
( )
∑ k
∑
k
∑
m
( )
Q(𝜃, 𝜃)̄ = c̄ j ln(𝛼j ) + 𝑤̄ ij ln (yi , 𝜇j , Σj ) ,
j=1 j=1 i=1
∑m
where we define c̄ j = i=1 𝑤̄ ij . Thus, the update of 𝛼 may be expressed as
{ }
∑
k
𝛼 + = argmax c̄ j ln(𝛼j )| 𝛼 ∈ Δk ,
𝛼 j=1
which is concave in 𝜇j for a fixed Σj and concave in Σ−1 j for a fixed 𝜇j . The first-order optimality
conditions are
∑m
∑
m
( )
𝑤̄ ij Σ−1
j (y i − 𝜇 j ) = 0, 𝑤̄ ij Σj − (yi − 𝜇j )(yi − 𝜇j )T = 0,
i=1 i=1
1∑
m
𝜇j+ = 𝑤̄ y , j ∈ ℕk , (9.31)
c̄ j i=1 ij i
and the update covariance matrix is
1∑
m
Σ+j = 𝑤̄ (y − 𝜇j+ )(yi − 𝜇j+ )T , j ∈ ℕk . (9.32)
c̄ j i=1 ij i
We note that the weights 𝑤̄ ij are positive, and hence, Σ+j is nonsingular if and only if span({y1 −
𝜇j+ , … , ym − 𝜇j+ }) = ℝd . Equivalently, Σ+j is singular if and only if
dim aff({y1 , … , ym }) < d.
For example, this is the case if the number of observations m is less than or equal to d or if all
observations lie on a hyperplane.
Example 9.6 We now illustrate the use of GMMs to approximate the pdf of a random variable
based on m independent observations. Figure 9.7 shows examples in one and two dimensions. The
one-dimensional example in Figure 9.7a shows the GMM pdf for a model with k components and
with parameter estimates based on m = 1000 observations and computed using the EM algorithm.
The observations are shown as a normalized histogram. The model seemingly fits the histogram
well when we use a mixture model with three components. The two-dimensional example in
Figure 9.7b shows m = 500 observations as dots and an ellipse for each component of the GMM
obtained using the EM algorithm. Each ellipse defines the superlevel set containing 95% of the
probability mass for the corresponding component. We see that the model with four components
visually appears to be a reasonable approximation.
9.12 Gibbs Sampling 277
0 0 0
−4 −2 0 2 4 −4 −2 0 2 4 −4 −2 0 2 4
(a)
𝑘=2 𝑘=3 𝑘=4
4 4 4
2 2 2
0 0 0
−2 −2 −2
−4 −4 −4
−5 0 5 −5 0 5 −5 0 5
(b)
Figure 9.7 Observations and estimated GMM with k components based on maximum likelihood
estimation: (a) one-dimensional GMMs and (b) two-dimensional GMMs.
2 2
𝑥2
𝑥2
1 1
0 0
0 2 4 0 2 4
𝑥1 𝑥1
(a)
Direct sampling Gibbs sampling
4 4
3 3
2 2
𝑥2
𝑥2
1 1
0 0
0 1 2 3 4 0 1 2 3 4
𝑥1 𝑥1
(b)
Figure 9.8 Realizations of K = 50 samples obtained via direct sampling and Gibbs sampling; the ellipse
marks the superlevel set of fX that contains 95% of the probability mass, and o marks the starting point
x(0) = (2, 1∕2) used in the Gibbs sampler. (a) Low correlation: 𝜎 = 0.2, 𝜃 = 𝜋∕30, and 𝜌 ≈ 0.45. (b) High
correlation: 𝜎 = 0.1, 𝜃 = 𝜋∕4, and 𝜌 = 12∕13.
∑N
where Ē ∶ Nx × ℝm × 𝕊m → ℝ is given by E(x,
̄ 𝜆, Λ) = i=1 E(xi ). The conditional pdf for h given
𝑣 is s ∶ N𝑣 × Nh × ℝm × 𝕊m → ℝ given by
̄
r(𝑣, h, 𝜆, Λ) e−E(𝑣,h,𝜆,Λ)
s(𝑣, h, 𝜆, Λ) = ∑ = ∑ ̄ ,
h r(𝑣, h, 𝜆, Λ)
−E(𝑣,h,𝜆,Λ)
he
where 𝑣 and h are defined such that 𝑣 × 𝑣 = x . The summation over h above is carried
out over all h ∈ h . From now on, we tacitly assume that all summations are carried over all the
elements in the corresponding sets unless otherwise stated. We define Q ∶ ℝm × 𝕊m → ℝ via
∑ ( )
Q(𝜆, Λ) = Es ln r = ̄ h, 𝜆, Λ) − NA(𝜆, Λ) ,
s(𝑣, h, 𝜆− , Λ− ) −E(𝑣,
h
∑
̄ h, 𝜆, Λ) − NA(𝜆, Λ),
= − s(𝑣, h, 𝜆− , Λ− )E(𝑣,
h
where (𝜆− , Λ− )
are the values of the parameters from the previous iteration in the EM algorithm.
In order to maximize Q, the gradient with respect to (𝜆, Λ) is typically needed. We have
𝜕Q ∑ 𝜕 Ē 𝜕A
= − s(h, 𝑣, 𝜆− , Λ− ) −N ,
𝜕𝜆 h
𝜕𝜆 𝜕𝜆
280 9 Unsupervised Learning
∑ ∑
N
∑
= s(h, 𝑣, 𝜆− , Λ− ) xi − N 𝜉p(𝜉),
h i=1 𝜉
∑
N
∑ ∑
= si (hi , 𝑣i , 𝜆− , Λ− )xi − N 𝜉p(𝜉),
i=1 hi 𝜉
Similarly, we obtain
𝜕Q ∑∑ ∑
N
= si (hi , 𝑣i , λ− , Λ− )xi xiT − N 𝜉𝜉 T p(𝜉).
𝜕Λ i=1 h 𝜉
i
Note that above x = (𝑣, h), and that 𝑣 is known and that we sum over all possible values of the
hidden variables h. Hence, it is not trivial to compute the gradients if h and 𝜉 are high dimensional
because of the fact that we need to sum over many variables. We notice that the expressions for
the gradients of Q are differences of expectations, and therefore, it is possible to approximate them
using Monte Carlo techniques like Gibbs sampling; see, e.g. [38]. Note that the gradients can be
computed as sums of gradients for each observation i. We will later discuss what further structure
can be utilized when we have a graphical Ising model.
1∑
N
J(W, c1 , … , cN ) = ||x − W T ci ||22 .
2 i=1 i
minimize J(W, c1 , … , cN ),
subject to WW T = I,
with variables W, c1 , … , cN . The solution to this problem will give us a lower-dimensional descrip-
tion of the original data. The rows of W are called the principal components, and the analysis we
carry out is called principal component analysis (PCA).
9.14.1 Solution
This is a convex problem with respect to ci for fixed value of W. Hence, we first carry out this
minimization. The optimal values are obtained by setting the gradient equal to zero, i.e.
𝜕J
= WW T ci − Wxi = 0, i ∈ ℕN .
𝜕ci
9.14 Principal Component Analysis 281
Since WW T = I, we get that ci = Wxi . We realize that the principal components are used to com-
press the original data xi to the lower-dimensional data ci . Back substitution into J gives
1 ∑ T(
N
)2
J(W, Wx1 , … , WxN ) = x I − W T W xi .
2 i=1 i
Since I − W T W is a projection matrix, we may remove the square. From this, we see that it is equiv-
alent to maximize
1∑ T T
N
1( T ) 1 ( )
x W Wxi = trX XW T W = tr WX T XW T ,
2 i=1 i 2 2
where
⎡ x1T ⎤
⎢ ⎥
X = ⎢ ⋮ ⎥.
⎢ T⎥
⎣xN ⎦
Let X = UDV T be a singular value decomposition such that the diagonal matrix D has the elements
sorted in decreasing order. Let Y = V T W T . Then we may define the criterion above as J̃ ∶ℝn×m → ℝ+ ,
where
1 ( )
J̃ (Y ) = tr Y T D2 Y ,
2
and define the optimization problem equivalently as
maximize J̃ (Y ),
subject to Y T Y = I,
where we have made use of the fact that WW T = I if and only if Y T Y = I. This is a nonconvex
optimization problem. It can be shown that the gradient of the constraint function h ∶ ℝn×n → 𝕊n
defined by h(Y ) = I − Y T Y is full rank for all orthogonal Y , and hence, the linear independence
condition is satisfied for the necessary optimality conditions in Section 4.7. To see this notice that
it can be shown that
( )T
𝜕 svec h(Y ) 𝜕 svec h(Y )
𝜕(vec Y )T 𝜕(vec Y )T
is a diagonal matrix with positive diagonal. It is important to only consider the symmetric vector-
ization of h since otherwise, the condition does not hold. Because of this the Lagrange multiplier
has to be a symmetric matrix. Define the Lagrangian L ∶ ℝn×m × 𝕊m → ℝ as
( ( ))
L(Y , Λ) = J̃ (L) + tr Λ I − Y T Y .
Then a necessary condition for Y to be optimal is that there exist Λ such that the gradient of the
Lagrangian vanishes, i.e.
𝜕L
= D2 Y − Y Λ = 0.
𝜕Y
[ ] [ ]T [ ]
Let Z be such that Y Z is orthogonal, i.e. Y Z Y Z = I, and multiply the above equation with
[ ]T
Y Z from the left to obtain the equivalent equations
Y T D2 Y − Λ = 0,
Z T D2 Y = 0.
282 9 Unsupervised Learning
Note that the first equation always has a solution Λ for any Y , and hence, the necessary conditions
of optimality are equivalent to existence of Z such that
[ ]T [ ]
Y Z Y Z = I,
Z T D2 Y = 0.
[ ] [ ]
Clearly, one solution to these equations is Y Z = I. It is actually optimal. Moreover, Y Z =
blkdiag(X1 , X2 ), for any orthogonal X1 ∈ ℝm×m and X2 ∈ ℝ(n−m)×m are also optimal. We will show
this by noting that the objective function can be written as
1∑
N
||Dyi ||22 ,
2 i=1
[ ]
where Y = y1 · · · ym . Now we start by optimizing with respect to y1 . We have
⎡ d1 y11 ⎤
⎢ ⎥
dy
Dy1 = ⎢ 2 21 ⎥ ,
⎢ ⋮ ⎥
⎢d y ⎥
⎣ n n1 ⎦
where di are the diagonal elements of D, which we remember are ordered such that di ≥ dj when
i < j. The constraint Y T T = I implies that yT1 y1 = 1, and therefore, it is optimal to take y1 = e1 , which
is the first basis vector. All the other yi , for 2 ≤ i ≤ m has to have its first component equal to zero
in order to be orthogonal to y1 . Because of this, we by repeating the arguments above, find that
y2 = e2 , i.e. we pick out the second largest diagonal element of D. The remaining yi now have to
have the first
[ ] two components equal to zero in order [ to ]be orthogonal to y1 and y2 . We thus conclude
I X
that Y = is optimal. We then notice that Y = 1 with X1 orthogonal will result in the same
0 0
objective function value. This follows from the fact that
([ ]) ([ ]T [ ])
X 1 X1 X 1 ( ) 1 ( )
J̃ 1
= tr D2 1 = tr X1T D21 X1 = tr X1 X1T D21 ,
0 2 0 0 2 2
where D = blkdiag(D1 , D2 ). Hence, the PCA picks out the components of xi corresponding to the
m largest singular values of X T .
Example 9.8 In this example, we will perform PCA analysis on the Fisher Iris flower data set
[39]. It contains measurements of four different characteristics of three different iris species. There
are 150 rows in the data set. Each subset of 50 rows corresponds to three different iris species. Each
of the four columns corresponds to the a different characteristic. We preprocess the data by sub-
tracting the mean value of each column from all the values in that column. We also divide all values
in a column with the standard deviation of its column values. This is then the X-matrix that we use
for PCA. Hence, each row of it is a scaled observation xiT . We compute the two principal compo-
nents corresponding to the largest singular values, and then compute the compressed data ci ∈ ℝ2 .
In Figure 9.9, we have plotted the first component of each ci versus its second component. The
different species are marked differently in the plot. The PCA analysis makes it possible to visualize
high-dimensional data in low-dimensional plots and, hence, makes the data more understandable.
𝑐𝑖,2
−2
−3 −2 −1 0 1 2 3
𝑐𝑖,1
A quantity that is closely related to entropy is mutual information. For a joint pdf r ∶ ℝm × ℝn → ℝ+
of two random variables with marginal pdfs p ∶ ℝm → ℝ+ and q ∶ ℝn → ℝ+ we define the mutual
information I ∶ m+n → ℝ+ as
r(x, y)
I(r) = r(x, y) ln dxdy,
∫ℝm ×ℝn p(x)q(y)
where m+n is defined as in Section 9.9.
variables Z = WX and Y = Z + E, where W ∈ ℝm×n with m < n.6 We can interpret Z as information
that is transmitted over a channel W with additive noise E. We would like to choose W to maximize
the mutual information between Y and Z, i.e. between what is transmitted and what is received.
We realize that (Y , Z) has a zero mean normal pdf r with covariance
[ ]
WΣW T + I WΣW T
.
WΣW T WΣW T
We let p and q be the marginal pdfs for Y and Z and define J ∶ ℝm×n → ℝ+ as
J(W) = I(r).
Z̄ D2 Ȳ = 0,
T
[ ]
where Z̄ is such that Ȳ Z̄ is orthogonal. The second equation follows since I + Ȳ D2 Ȳ is invertible
T
for all Ȳ . From the formula A(I + A)−1 = I − (I + A)−1 , we may rewrite the above equations as
( )−1
I − I + Ȳ D2 Ȳ
T
− Λ = 0,
Z̄ D2 Ȳ = 0.
T
6 Note that the dimensions of X and Y are not the same as the dimensions of x and y in the definition of mutual
information in Section 9.14.
9.15 Mutual Information 285
This now shows that the first equation has a solution in terms of Λ for any Ȳ , since
( )−1
I − I + Ȳ D2 Ȳ is symmetric for any Ȳ . Hence, the optimality conditions simplify to the
T
[ ]T [ ]
Ȳ Z̄ Ȳ Z̄ = I.
These are similar to the optimality conditions for PCA if we identify Σ with X T X.7 However, the
objective functions are not the same, and therefore, we have to proceed slightly differently.
[ ] [ ]
As in Section 9.14, Ȳ Z̄ = I is a solution to the optimality conditions and so is Ȳ Z̄ =
blkdiag(X1 , X2 ) with X1 ∈ ℝm×m and X2 ∈ ℝ(n−m)×(n−m) orthonormal. It is straightforward to verify
that the objective function evaluates to
1∑ (
m
)
J̃ (Ȳ ) = ln 1 + d2k .
2 k=1
[ ]
Can we consider more general Ȳ Z̄ ? Any orthogonal matrix with determinant equal to one can
be written as a product of Givens rotations. Notice that there is no restriction in assuming the
determinant to be equal to one, since we multiply both from left and right. A Givens rotation is
defined as G ∶ ℕn × ℕn × [0, 2𝜋] → ℝn×n , where
⎡1 ··· 0 ··· 0 ··· 0⎤
⎢⋮ ⋱ ⋮ ⋮ ⋮⎥
⎢ ⎥
⎢0 ··· c · · · −s ··· 0⎥
G(i, j, 𝜃) = ⎢ ⋮ ⋮ ⋱ ⋮ ⋮⎥,
⎢ ⎥
⎢0 ··· s ··· c ··· 0⎥
⎢⋮ ⋮ ⋮ ⋱ ⋮⎥
⎢ ⎥
⎣0 ··· 0 ··· 0 ··· 1⎦
where c = cos 𝜃, s = sin 𝜃, and where c and s are positioned on the ith and jth rows and columns,
cf . (2.17). We assume that i < j. From this, it follows that
2
⎡ d1 ··· 0 ··· 0 ··· 0 ⎤
⎢⋮ ⋱ ⋮ ⋮ ⋮ ⎥
⎢ ⎥
⎢0 ··· c2 d2i + s2 d2j ··· −scd2i + scd2j ··· 0 ⎥
G(i, j, 𝜃) D G(i, j, 𝜃) = ⎢ ⋮
T 2
⋮ ⋱ ⋮ ⋮ ⎥.
⎢ ⎥
⎢0 ··· −scd2i + scd2j ··· s2 d2i + c2 d2j ··· 0 ⎥
⎢⋮ ⋮ ⋮ ⋱ ⋮ ⎥
⎢ ⎥
⎣0 ··· 0 ··· 0 · · · d2n ⎦
We understand that it is only if i ≤ m and j > m that we do not have the case G(i, j, 𝜃) =
blkdiag(X1 , X2 ) discussed above. For these values of i and j, the top left m × m-dimensional block
of the above matrix is given by
7 We will discuss this in more detail later on. There should actually be a normalization with N.
286 9 Unsupervised Learning
We realize that the Givens rotation replaces the ith diagonal element d2i with c2 d2i + s2 dj , which is a
convex combination of the values d2i and d2j . Since dj < di it follows that c2 d2i + s2 dj < d2i , and hence,
the Givens rotation has resulted in a smaller value of the objective function value. Any further rota-
tions for which i ≤ m and j > m will only further decrease the objective function value. Any [ ]other
X
rotation will not affect the objective function value. From this, we realize that all Ȳ = 1 with
0
X1T X1 = I are optimal, and hence, maximizes the mutual information under the constraint that
Ȳ Ȳ = I. There are actually several local stationary points. The reason for this is that for 𝜃 = k𝜋∕2,
T
k ∈ ℕ it holds that −scd2i + scd2j = 0, and hence, Z̄ DȲ = 0 irrespective of what values i and j have.
T
Z̄ D2 Ȳ = 0,
T
[ ] [ ]
where Z̄ is such that Ȳ Z̄ has full column rank. Notice that Ȳ Z̄ is not necessarily a square
matrix. Similarly as above, we may rewrite the above equations as
( )−1
I − I + Ȳ D2 Ȳ − Ȳ Ȳ diag(𝜆) = 0,
T T
Z̄ D2 Ȳ = 0.
T
[ ]
It is straightforward to verify that Ȳ Z̄ = I satisfy also these optimality conditions. However, any
[ ]
orthogonal Ȳ Z̄ will not satisfy them, since for orthogonal Ȳ it must hold that I + Ȳ DȲ is diago-
T
nal for there to exist 𝜆 that satisfies the equation. There are however nonorthogonal Ȳ that satisfy
[ ]T
them. Consider Ȳ = 𝟙 0 , where 𝟙 ∈ ℝm is a vector of all ones. This means that all yi = e1 , where
e1 is the first basis vector in ℝn . It is straightforward to verify that 𝜆 = d21 ∕(1 + md21 )𝟙 satisfies the
necessary optimality conditions for this Ȳ . It actually holds that we may take yi equal to any basis
vector for ℝn , and that they may be linearly dependent.
Now, the question arises if such a Ȳ where some of the columns are not linearly independent may
be optimal and not only constitute a stationary point. To investigate this, we consider the case when
[ ]T
m = 2, d1 > d2 and let Ȳ = 𝟙 0 . Then ln det (I + Ȳ D2 Ȳ ) = ln(1 + 2d21 ) ≈ 2d21 for small values
T
[ ]
I
of d21 . For Ȳ = we have ln det (I + Y T D2 Y ) = ln(1 + d21 + d22 + d21 d22 ) ≈ d21 + d22 + d21 d22 for small
0
values of d21 and d22 . It is easy to find values of d1 > d2 for which the second approximation is smaller
9.15 Mutual Information 287
2𝜋
3𝜋
2
𝜋
𝜃2
𝜋
2
0 𝜋 𝜋 3𝜋 2𝜋
2 2
𝜃1
Figure 9.10 Level curves for the objective function between 2.15 (dark gray) and 2.4 (light gray).
than the first. Hence, we realize that when the signal-to- noise ratio is small, i.e. when Σ is small
as compared to I, then it is better to only consider the signal with the largest variance. In general
there are cases in-between, where one should pick out more than one of the largest, but not all m.
Also there are cases when it optimal for the vectors yi to have an angle in-between them such that
they are neither aligned nor orthogonal [28]. [A small] example with
[ m= ] n = 2, d1 = 2, and d2 = 1 is
cos 𝜃1 cos 𝜃2
visualized in Figure 9.10, where we let y1 = and y2 = for 𝜃1 , 𝜃2 ∈ [0, 2𝜋] and plot
sin 𝜃1 sin 𝜃2
the level curves of the objective function values. It is seen that
[ ] [ ] [ ] [ ]
1 1 1 0 0 1 0 0
Ȳ = ; ; ;
0 0 0 1 1 0 1 1
correspond to saddle points. There are several optima, and one is given by
[ ]
̄ 0.83 0.83
Y= ,
−0.56 0.56
which corresponds to 𝜃1 = 5.7 and 𝜃2 = 0.6. Note that the other optima have the property[ that ] the
X
angle between y1 and y2 are all the same. If the signal-to-noise ratio is large, then Ȳ = 1
with
0
X1 having orthonormal columns satisfies the necessary optimality conditions also without orthog-
onality constraints. This follows from the fact that the first equation of the optimality conditions in
the limit when D is much larger than I is given by
I − Ȳ Ȳ diag(𝜆) = 0.
T
The problem discussed is a special case of optimization on a matrix manifold, which is discussed
in more detail in [1].
1 ̄T ̄
matrix of X with Σ = N
X X, where
⎡ x1T ⎤
⎢ T⎥
⎢x ⎥
X̄ = ⎢ 2 ⎥ .
⎢⋮⎥
⎢xT ⎥
⎣ N⎦
We then let X̄ = UDV T be a singular value decomposition of X. From this it follows that
( 2
)
1 TD
J(W) = ln det I + Ȳ Ȳ ,
2 N
where Ȳ = V T W T . The approximation of X̄ will be XW
̄ T W, where W T W is not necessarily a projec-
̄ T W will still be m. Notice that we cannot relax the condition
tion matrix. However, the rank of XW [ T]
̄ 𝟙
on orthogonality in the principal component analysis without obtaining Y = as the optimal
0
8
solution. The signal-to-noise ratio will not help in making the correct choice from an information
point of view. This is why optimizing mutual information seems to be more appropriate.
1∑ ∑ ∑
K
f (C) = d(xi , xj ),
2 k=1 i∈ j∈
k k
where k = {i ∈ ℕN ∶ C(i) = k}. Define the mean vectors for each cluster as
1 ∑
mk = x , k ∈ ℕK ,
Nk i∈ i
k
1∑ ∑ ∑ ∑ ∑
K K
f (C) = d(xi , xj ) = Nk d(xi , mk ),
2 k=1 i∈ j∈ k=1 i∈
k k k
∑
if d is the squared Euclidian norm. Notice that for any yk ∈ ℝn , it holds that i∈k d(xi , mk ) ≤
∑
i∈k d(xi , yk ) with equality for yk = mk and hence, an equivalent optimization problem is
minimize F(C, y1 , … , yK ),
8 Notice that what is called X̄ and Ȳ in this section is called X and Y in the section on PCA. The reason is that we
used X and Y for random variables in this section.
Exercises 289
𝑥2
2
0
0 1 2 3 4 5 6
𝑥1
The sets k are functions of the encoder C. The so-called K-means algorithm tries to solve the above
formulation using block coordinate descent, i.e. it iteratively solves the following two optimization
problems:
1. For fixed C, minimize F(C, y1 , … , yK ) with respect to y1 , … , yK .
2. For fixed (y1 , … , yK ) minimize F(C, y1 , … , yK ) with respect to C.
The first problem is a least-squares problem with solution equal to the average of the xi over
i ∈ k , i.e.
1 ∑
yk = x.
Nk i∈ i
k
This is the reason for the name K-means. The second problem also has an explicit solution given
by assigning observation xi to cluster k if d(xi , yk ) ≤ d(xi , yj ) for all j ≠ k. The algorithm can unfor-
tunately be trapped in local minima.
Example 9.9 We now consider a problem in two dimensions for which N = 30. We want to per-
form the clustering using K = 3. We initialize yk as the first three values of xi that we are given.
They actually come from the same cluster, showing that initialization is not extremely critical. The
result is shown in Figure 9.11, where we see that we get excellent clustering.
Exercises
9.1 Consider the special case of finding a Chebyshev bound as detailed in Section 9.1 when
S = ℝ+ , C = [1, ∞), f0 (x) = 1 and f1 (x) = x. Assume that it is known that 𝔼f1 (X) = 𝔼X = 𝜇,
where 0 ≤ 𝜇 ≤ 1. Show that this implies the so-called Markov bound
ℙ[ X ≥ 1] ≤ 𝜇.
290 9 Unsupervised Learning
9.2 Show how the function h in (9.5) for the Ising distribution can be derived by back-
substitution of 𝜇 = 1 − A(𝜆, Λ) into the Lagrange dual function.
Hint: You need to first obtain the Lagrange dual function by back substitution of p into the
Lagrangian.
9.3 Consider minimizing (9.12) with respect to (A, b). Show that the solution is the same as the
solution obtained by minimizing the criterion in (9.11) if mx , my , Dx , and Dxy are replaced
with their sample averages
1∑ 1∑
N N
̂x =
m x, ̂y=
m y,
N i=1 i N i=1 i
and
∑ N
∑ N
̂x = 1
D x̄ x̄ T , ̂ xy = 1
D x̄ ȳ T ,
N i=1 i i N i=1 i i
respectively, where x̄ i = xi − m
̂ x and ȳ i = yi − m
̂ y.
9.4 [22, Exercise 4.57] Consider two discrete random variables X and Y , where (X, Y ) ∶ ℕn ×
ℕm → ℝ. The mutual information I ∶ mn → ℝ is defined as
∑ ∑
n m
pX,Y (i, j)
I(pX,Y ) = pX,Y (i, j) log ,
i=1 j=1
pX (i)pY (j)
where pX,Y ∶ ℕn × ℕm → [0, 1] is the joint probability function, and where pX ∶ ℕn → [0, 1]
and pY ∶ ℕm → [0, 1] are the marginal probability functions for X and Y , respectively.
(a) Show that the mutual information can be expressed as
∑ ∑
n m
pY |X (j|i)
I(pX,Y ) = pX (i)pY |X (j|i) log ∑n ,
i=1 j=1 k=1 pX (k)pY |X (j|k)
subject to y = Px,
𝟙T x = 1,
x ≥ 0,
Exercises 291
with variables (x, y). Here, is it assumed that we use log2 . Let m = 2, n = 2 and consider
the channel defined by
[ ]
1−p p
P= ,
p 1−p
and show that
C = 1 + plog2 p + (1 − p)log2 (1 − p).
9.5 Let f be a pdf defined on ℝ+ and let F be the corresponding distribution function
x ∞
F(x) = ∫0 f (t)dt. The expected value is 𝜇 = ∫0 xf (x)dx. The Lorenz curve LF ∶ [0, 1] → ℝ is
defined as
u
1
LF (u) = F −1 (x)dx.
𝜇 ∫0
If F is the income distribution in a country, then L(u) represents the fraction of the total
income which is in the hands of the uth fraction of the population with the lowest income.
The Gini index G ∶ [0, 1]ℝ → [0, 1], defined as
1 ( ) 1
G(F) = 2 u − LF (u) du = 1 − 2 LF (u)du
∫0 ∫0
is twice the area between the Lorenz curve of the line of perfect equality LF0 (u) = u. The
index is zero for perfect equality and close to one if most of the income is with a very small
portion of the population. We are interested in finding the probability density function f
that maximizes the entropy for given values of the mean 𝜇 and of the Gini index 𝛾, i.e. we
want to solve
∞
minimize f (x) ln f (x)dx,
∫0
∞
subject to f (x)dx = 1,
∫0
∞
xf (x)dx = 𝜇,
∫0
G(F) = 𝛾.
Plot the probability density functions obtained from the maximum entropy solution for
the three countries.
9.6 In this exercise we consider the static ranking of web pages in Example 9.2 and the corre-
sponding maximum entropy problem.
(a) Show that the solution of the maximum entropy problem is pij = exp(−𝜆i + 𝜆j )∕Φ(𝜆),
∑
(i, j) ∈ E, where Φ(𝜆) = (i,j)∈E exp(−𝜆i + 𝜆j ), and where 𝜆 ∈ ℝn are the natural param-
eters. Notice that you get different answers depending on how you rewrite the equilib-
rium conditions as equations with a zero right-hand side.
Exercises 293
(b) Let ai = exp(−𝜆i ), i ∈ V and let A = diag(ai ). Moreover, define the sparse matrix
M ∈ ℝn×n as
{
Φ(𝜆)−1 , (i, j) ∈ E,
Mij =
0, (i, j) ∉ E.
Show that P = AMA−1 .
(c) The so-called iterative scaling or matrix balancing approach for finding the optimal solu-
tion to the maximum entropy problem is based on the expression above for P and the
∑
fact that Φ(𝜆) = (i,j)∈E ai ∕aj . It is an iterative algorithm and can be described as follows,
where superscripts are used to denote iteration index. Given a tolerance 𝜖 > 0, initialize
∑
as a(0) = 1.0, Z (0) = (i,j)∈E 1, and then repeat starting with k = 0:
i ( )
1. p(k)
ij
= a(k)
i
∕ a(k)j
Z (k) , (i, j) ∈ E
∑
2. 𝜌(k)
i
= j∶(i,j)∈E p(k)
ij
, i∈V
∑
3. 𝜎i(k) = j∶(j,i)∈E p(k)
ji
, i∈V
( )1∕2
4. 𝜂i(k) = 𝜎i(k) ∕𝜌(k)
i
, i∈V
i.e. choose 𝜖 for the iterative scaling approach to match the accuracy you get from
MOSEK with its default settings.
(f) It is also possible to solve the maximum entropy problem using an interior-point solver.
An especially good solver for this problem can be downloaded at http://web.stanford
.edu/group/SOL/software/pdco/. Compare the performance of this solver with the two
previous ones for the static ranking of web pages. Also, here try to make the comparison
fair with respect to tolerances. Check the link to the presentation at ISMP2003 on the
above web page. It seems that they are able to solve maximum entropy problems for
web traffic with roughly 50 000 nodes using MATLAB. Are you able to?
9.7 Consider a primitive clinic in a village. People in the village have the property that they are
either healthy or have a fever. They can only tell if they have a fever by asking the doctor in
the clinic. The doctor makes a diagnosis of fever by asking patients how they feel. Villagers
only answer that they feel normal, dizzy, or cold. This defines a HMM (X, Y ) with Xk ∈ =
{𝛼1 , 𝛼2 } and Yk ∈ = {𝛽1 , 𝛽2 , 𝛽3 }, where 𝛼1 =healthy, 𝛼2 =fever, 𝛽1 =normal, 𝛽2 =cold, and
𝛽3 =dizzy. Introduce the notation:
Also, let the matrices A and B be defined such that element (i, j) of the matrix is equal to aij
and bij , respectively. Consider the case when
[ ] [ ]
0.7 0.3 0.5 0.4 0.1
A= , B= ,
0.4 0.6 0.1 0.3 0.6
and assume that ℙ[X0 = 𝛼1 ] = 0.6 and ℙ[X0 = 𝛼2 ] = 0.4. The doctor has for a patient
observed the first day normal, the second day cold, and the third day dizzy. What is the
most likely value of the condition for the patient for the different days?
9.8 Show that the smoothing problem in Section 9.4 can be formulated as an optimal control
problem
∑
N
minimize 𝜙(x0 ) + fk (xk , uk ),
k=1
subject to xk−1 = Fk (xk , uk ), k ∈ ℕN ,
with variables (x0 , u1 , … , uN , xN ) for some fk and Fk . Notice that the time index is running
in reverse order as compared to a standard control problem.
Hint: Take the logarithm of the joint pdf pX̄ N ,Ȳ N .
9.9 Recall the Gaussian mixture model from Section 9.11, which involves a pdf of the form
∑
k
fY (y; 𝜃) = 𝛼j (y, 𝜇j , Σj ),
j=1
9.10 Implement the EM algorithm for estimating the model parameters of a Gaussian mix-
ture model; cf . Section 9.11. Test your implementation on a set of samples from a
one-dimensional Gaussian mixture model with three components. The following MATLAB
code illustrates how to generate m = 1000 samples y1 , … , ym from a mixture of three
univariate Gaussian distributions:
⎡ 1 𝛾 1⎤
Σ = ⎢𝛾 2 1⎥
⎢ ⎥
⎣ 1 1 3⎦
is positive semidefinite and such that its inverse has a zero in the position of the variable 𝛾.
9.12 In this exercise, we will perform PCA analysis on the Fisher Iris flower data that we investi-
gated in Example 9.8. The data set can be download in MATLAB with the command load
fisheriris.mat. It contains measurements of four different characteristics of three dif-
ferent iris species in each column. There are 150 rows in the data set. Each subset of 50 rows
corresponds to three different iris species. You should preprocess the data by subtracting
the mean value of each column from all the values in that column. Then you should divide
all values in a column with the standard deviation of its column values. This is then the
X-matrix that you should use for PCA. Hence, each row of it is an observation xiT . You should
compute the two principal components corresponding to the largest singular values, and
then compute the compressed data ci ∈ ℝ2 . Finally, plot for each ci its second component
versus its first component. Make sure to mark the different species differently in your plot.
You should obtain the same plot as in Figure 9.9. Would it have been possible to separate the
different species from one another using this plot in case you had not known from which
species the data originated?
9.13 In this exercise you are asked to implement the K-means algorithm in Section 9.16 for cluster
analysis. You should try out the algorithm on the Fisher Iris flower data set that you used in
Exercise 9.12. Preprocess the data set in the same way as you did in that exercise, i.e. make
sure that all columns have zero mean and unit standard deviation.
296 9 Unsupervised Learning
(a) Try the algorithm using two and three clusters, respectively, i.e. for the cases K = 2, 3.
How does the algorithm cope? You do not need to make any plots. It is enough to inves-
tigate the resulting encoders, and how good they perform.
(b) Instead of using the 4-dimensional data that you used in (a) instead use the
2-dimensional data ci that resulted from the PCA analysis in Exercise 9.12. Try
the cases K = 2, 3.
(c) Instead of using the 2-dimensional data that you used in (b) instead use a 3-dimensional
data ci that you obtain from PCA analysis similarly as in Exercise 9.12 by instead using
the three principal components corresponding to the three largest singular values. Try
the cases K = 2, 3 again.
(d) Relate your results in (a)–(c) to what you visually observed in Exercise 9.12.
297
10
Supervised Learning
In this chapter, we will discuss supervised learning. What distinguishes supervised learning
problems from unsupervised learning problems is that the data come in pairs, i.e. we may say
(xk , yk ) ∈ ℝ × ℝ for k ∈ ℕN and we would like to find a relationship between the pairs of data. We
will start with linear regression. This does not mean that the data pairs are related to one another
in a linear way. Instead, it is the class of functions that we consider that is parameterized in a linear
way. First, we will do this in a finite-dimensional space, and there we will also discuss statistical
interpretations and generalizations such as maximum likelihood estimation, maximum a poste-
riori estimation, and regularization. We will then also do regression in an infinite-dimensional
space, i.e. in a Hilbert space. We will see that this is equivalent to maximum a posteriori estimation
for so-called Gaussian processes. Then we will discuss classification both using linear regression,
logistic regression, support vector machines, and the restricted Boltzmann machine. The chapter
is finished off with artificial neural networks and the so-called back-propagation algorithm. We
also discuss a form of implicit regularization known as dropout.
1 ∑( T
N
)2
minimize a 𝛽(xk ) − yk (10.2)
2 k=1
Optimization for Learning and Control, First Edition. Anders Hansson and Martin Andersen.
© 2023 John Wiley & Sons, Inc. Published 2023 by John Wiley & Sons, Inc.
Companion Website: www.wiley.com/go/opt4lc
298 10 Supervised Learning
with variable a ∈ ℝm . This is a linear LS problem. We may consider more general functions than
monomials, i.e. we let 𝜑j ∶ ℝ → ℝ, j ∈ ℕm , be any functions and define
∑
m
f (x) = aj 𝜑j (x) = aT 𝛽(x),
j=1
( )
with 𝛽(x) = 𝜑1 (x), … , 𝜑m (x) . Often, one lets 𝜑1 (x) = 1. It is also possible to generalize the
regression model to x ∈ ℝn . We just need to define 𝜑j ∶ ℝn → ℝ, j ∈ ℕm , and 𝛽 ∶ ℝn → ℝm .
An important special case is when n = m and 𝛽(x) = x.
Another generalization is when f (x) is vector-valued. We consider the case when f ∶ ℝn → ℝp is
given by
f (x) = A𝛽(x),
( )
where A ∈ ℝp×m , 𝛽 ∶ ℝn → ℝm with 𝛽(x) = 𝜑1 (x), … , 𝜑m (x) and 𝜑j ∶ ℝn → ℝ, j ∈ ℕm . The LS
criterion is then the sum of the LS criteria for each row in the regression model, and the LS problem
can be written as
1∑ ‖
N
y − A𝛽(xk )‖
2
minimize
2 k=1 ‖ k ‖2 ,
with variable A. The solution to this problem is closely related to the SAA approximation of the
single-stage stochastic optimization problem for the affine predictor in Section 9.3. This is one of
the motivations for calling f (x) a predictor – it can be used to predict values of yk when only xk is
known.
Example 10.1 We are given pairs of data (xk , yk ) that happen to satisfy yk = sin xk . We are not
aware of this relationship, and instead, we want to find a linear regression model as in (10.1) that
solves (10.2) for 𝛽(x) = (1, x, x2 , x3 , x4 ). We solve the resulting normal equations as in (5.3), and then
plot the resulting polynomial and compare it with the 11 points we used for fitting, see Figure 10.1.
The fit is pretty good inside the interval of the available data. We do not expect it to be very good
outside this interval. The polynomial is only of fourth degree.
0.5
0
𝑦
−0.5
−1
0 1 2 3 4 5 6
𝑥
Figure 10.1 Plot showing the fourth degree polynomial (solid line), and the 11 data points (+), used to fit
the polynomial to a sinusoidal.
10.1 Linear Regression 299
ℒ(y_1, …, y_N; a) = ∏_{k=1}^N f_{Y_k}(y_k) = ∏_{k=1}^N f_{E_k}( y_k − a^T β(x_k) ).
pdf f_A : ℝ^m → ℝ_+. It is natural to assume that A and E_k are independent, in which case the joint
pdf f_{Y,A} : ℝ^N × ℝ^m → ℝ_+ is given by
f_{Y,A}(y, a) = f_A(a) ∏_{k=1}^N f_{E_k}( y_k − a^T β(x_k) ),
with variable a. Compared to the ML problem, the only difference is that we have added the term
− ln fA (a) to the objective function. This can be interpreted as a regularization of the LS problem.
Different pdfs fA result in different regularizations, and hence, reflect different prior knowledge.
We now consider the generalization of regression to infinite dimension, i.e. we let β(x) =
(φ_1(x), φ_2(x), …), where φ_i : ℝ^n → ℝ, i ∈ ℕ. For a = (a_1, a_2, …), where a_i ∈ ℝ, i ∈ ℕ, we say that
a ∈ ℓ_2 if ∑_{i=1}^∞ a_i² < ∞. We define the inner product ⟨·, ·⟩_{ℓ_2} : ℓ_2 × ℓ_2 → ℝ as ⟨a, b⟩_{ℓ_2} = ∑_{i=1}^∞ a_i b_i. The
corresponding norm ||·||_{ℓ_2} : ℓ_2 → ℝ_+ is defined by ||a||²_{ℓ_2} = ⟨a, a⟩_{ℓ_2}. For functions f : ℝ^n → ℝ,
we define the space of square integrable functions L_2, i.e. the set of functions such that
∫_{ℝ^n} f(x)² dx < ∞.
We define the inner product ⟨·, ·⟩_{L_2} : L_2 × L_2 → ℝ by
⟨f, g⟩_{L_2} = ∫_{ℝ^n} f(x) g(x) dx.
The corresponding norm ||·||_{L_2} : L_2 → ℝ_+ is defined by ||f||²_{L_2} = ⟨f, f⟩_{L_2}. We remark that both ℓ_2
and L_2 are Hilbert spaces. Now, suppose that φ_i ∈ L_2, i ∈ ℕ, is a family of orthonormal functions, i.e.
minimize   (1/2) ∑_{k=1}^N ( y_k − f(x_k) )² + (ν/2) ||f||²_{L_2},
minimize   (1/2) ∑_{k=1}^N e_k² + (ν/2) ||a||²_{ℓ_2}
subject to  e_k = y_k − ∑_{i=1}^∞ a_i φ_i(x_k),   k ∈ ℕ_N,
with variables (a, e), where e = (e1 , … , eN ). To ease the notation, we will write this as
minimize   (1/2) e^T e + (ν/2) a^T a
subject to  e = y − Xa,
where X is the infinite-dimensional matrix given by
X = [ β^T(x_1); ⋮ ; β^T(x_N) ].
We then introduce the Lagrangian L ∶ ℝN × 𝓁2 × ℝN → ℝ defined by
L(e, a, λ) = (1/2) e^T e + (ν/2) a^T a + λ^T (e − y + Xa).
Completing the squares in the Lagrangian results in
L(e, a, λ) = (1/2)(e + λ)^T(e + λ) + (ν/2)( a + (1/ν) X^T λ )^T ( a + (1/ν) X^T λ )
             − (1/2) λ^T λ − (1/(2ν)) λ^T X X^T λ − λ^T y.
Thus, for a given λ, the Lagrangian is minimized by taking
e = −λ,   a = −(1/ν) X^T λ,
which yields the Lagrange dual function g : ℝ^N → ℝ defined by
g(λ) = −(1/2) λ^T λ − (1/(2ν)) λ^T X X^T λ − λ^T y.
We call this function the kernel function. We then define the matrix 𝒦 ∈ 𝕊^N with elements
𝒦_{ij} = K(x_i, x_j). We may then write the dual optimization problem as
maximize   −(1/2) λ^T ( I + (1/ν) 𝒦 ) λ − λ^T y,
with variable λ, which has solution
λ = −( I + (1/ν) 𝒦 )^{−1} y.
We may also write f(x) in terms of the kernel function as
f(x) = a^T β(x) = −(1/ν) λ^T X β(x) = −(1/ν) ∑_{k=1}^N λ_k K(x_k, x).
The fact that the solution to this infinite-dimensional regression problem can be obtained from a
finite-dimensional optimization problem by introducing the kernel function is sometimes called
the kernel trick.
From the above expression for the regressor f (x), we realize that one could just as well start with a
series expansion in terms of the kernel function. A natural question that arises is for what functions
K there are orthonormal 𝜑i . The answer is that K should be a positive semidefinite kernel, i.e. for
any N ∈ ℕ, and any ci ∈ ℝ and xi ∈ ℝn , i ∈ ℕN , it should hold that
∑_{i=1}^N ∑_{j=1}^N c_i c_j K(x_i, x_j) ≥ 0.
This result is known as Mercer's theorem, [94, p. 96]. Popular choices are dth degree polynomials
given by K(x, x̄) = (1 + x^T x̄)^d and radial basis functions, i.e. functions that only depend on
||x − x̄||₂. The dth degree polynomials correspond to finitely many orthonormal φ_i. The Gaussian
radial basis kernel
K(x, x̄) = exp( −(1/(2σ²)) ||x − x̄||₂² ),
where σ ∈ ℝ_{++}, corresponds to an infinite series of orthonormal φ_i [94].
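To make the kernel trick concrete, the following MATLAB sketch performs regression with the Gaussian radial basis kernel using the dual solution derived above. It is not taken from the book; the data, the kernel width σ, and the regularization parameter ν are assumptions.

xk = linspace(0, 2*pi, 11)'; yk = sin(xk);      % assumed training pairs
nu = 1e-3; sigma = 1;                           % assumed parameters
K = exp(-(xk - xk').^2/(2*sigma^2));            % kernel matrix with entries K(x_i, x_j)
lambda = -(eye(11) + K/nu) \ yk;                % dual solution lambda
f = @(x) -(1/nu)*sum(lambda.*exp(-(xk - x).^2/(2*sigma^2)));  % regressor f(x)
f(1.0)                                          % evaluate the regressor at x = 1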
f_{A|Y}(a|y) = ( 1/√((2π)^n det Σ_{a|y}) ) exp( −(1/2)(a − μ_{a|y})^T Σ_{a|y}^{−1} (a − μ_{a|y}) ),
where 𝜇a|y = Σa X T R−1 y is the conditional mean for A given Y = y. Clearly, the conditional pdf is
maximized by a = 𝜇a|y , which proves that the MAP estimate is the conditional mean for Gaussian
distributions, as we saw in Section 9.3.
We realize that Σa only appears in terms of expressions of the form zT Σa z̄ for z ∈ ℝn and z̄ ∈ ℝn .
Because of this we could instead write the predictor in terms of the covariance function K ∶ ℝn ×
ℝ^n → ℝ_+ defined by K(z, z̄) = z^T Σ_a z̄. We then let 𝒦 ∈ 𝕊_+^N be defined by 𝒦_{ij} = K(x_i, x_j), and we let
η_k, k ∈ ℕ_N, be defined by η_k = K(x_k, x). Then the predictor can be written as
f(x) = η^T ( Σ_e + 𝒦 )^{−1} y.
This is the same predictor as we obtain when we do regression in Hilbert spaces in Section 10.2 if
we let Σe = 𝜈I and if we define the orthonormal functions 𝜑i such that the resulting kernel func-
tion is K(z, z̄ ) = zT Σa z̄ . We notice that this is the covariance between AT z and AT z̄ . Thus, we can
generalize the regression model in (10.4) by instead considering
Y_k = ℱ(x_k) + E_k,
with ℱ : ℝ^n → ℝ, where we specify the covariance between ℱ(z) and ℱ(z̄) by specifying the covari-
ance function for any z ∈ ℝ^n and z̄ ∈ ℝ^n. We still assume that the joint distribution is Gaussian.
This defines a zero mean real-valued Gaussian random process on ℝn . Such a process is some-
times called a random field, cf . Section 3.9. Notice that any of the kernel functions discussed in
Section 10.2 may be used. The Gaussian radial basis function is called the squared exponential
covariance function. Another common covariance function is
K(z, z̄) = exp( −(1/σ) ||z − z̄||₂ ),
which defines the so-called Ornstein–Uhlenbeck process, where σ ∈ ℝ_{++}. When the covariance
function only depends on z − z̄, the process is stationary, and when it only depends on ||z − z̄||₂ it is
also isotropic. Stationary and isotropic together is sometimes called homogeneous. In practice, these
properties reflect the differences, or rather the lack of them, in the behavior of the process given
the location of the observer. Actually, there are many more possibilities, see [94]. We specifically
Figure 10.2 Plot showing the Gaussian regression model (solid line), and the 11 data points (+), used to
compute the model.
Example 10.2 Here we again consider data from a sinusoidal, but this time collected from
three periods of the sinusoidal. We take Σe = 0, since we have no measurement errors of
the sinusoidal. We use the squared exponential covariance function with parameter 𝜎 = 1. In
Figure 10.2, the predicted values together with the 11 data points used to compute the predictor are
shown.
10.4 Classification
In classification problems, we are given pairs of data (xk , yk ) for k ∈ ℕN with xk ∈ ℝn , and where
yk is qualitative in the sense that it belongs to a discrete finite set with cardinality K. Without
loss of generality, we may take this set to be ℕK . We say that the data (xk , yk ) belongs to class l
if yk = l. These types of data are sometimes also called categorical or discrete as well as factors.
We are interested in finding functions fl ∶ ℝn → ℝ, l ∈ ℕK , that are such that fl (xk ) > 0 if yk = l
and f_l(x_k) < 0 if y_k ≠ l. The set {x ∣ f_l(x) = 0} then separates class l from the other classes. The left
plot of Figure 10.3 shows two classes of data that can easily be separated with, e.g. a straight line.
The right plot of the same figure shows two classes that cannot be separated with any connected line.
z_l(x) = a_l^T x + b_l,
Figure 10.3 Plots showing data points from two classes marked with + for the first class and ⚬ for the
second class. In the left plot, the data points for the two classes are well separated. In the right plot, the
data points for the two classes are mixed.
where al ∈ ℝn and bl ∈ ℝ. The objective is to choose the regression parameters (al , bl ) such that zl
is close to one if x belongs to class l and otherwise close to zero. We obtain this by solving the LS
problem
minimize   (1/2) ∑_{l=1}^K ( ∑_{k: y_k = l} ( a_l^T x_k + b_l − 1 )² + ∑_{k: y_k ≠ l} ( a_l^T x_k + b_l )² ),
with variables a = (a1 , … , aK ) and b = (b1 , … , bK ). Then we use the functions 𝛿l ∶ ℝn → ℝ defined
by 𝛿l (x) = zl (x) as discriminant functions, i.e. we classify x to belong to class l if 𝛿l (x) > 𝛿k (x) for all
k ≠ l. Hence, we get fl (x) = 𝛿l (x) − max k≠l 𝛿k (x). Unfortunately, this method is prone to give bad
results for K ≥ 3. However, there is a simple remedy which is to consider polynomial regression
models, and it might be necessary to have polynomial terms up to degree K − 1, see [54]. More gen-
eral basis functions may also be considered in this framework just as for the curve fitting problem
in Section 10.1. The LS problems will still be linear.
Example 10.3 We consider an example where K = 2 and where n = 2. There are in total 20 data
points from each class, and hence, N = 40. In Figure 10.4, the data points are shown together with
the line defined by 𝛿1 (x) = 𝛿2 (x), which separates the two classes.
where we consider pl ∶ ℝK−1 → [0, 1] for l ∈ ℕK to be functions of the scalars zl ∈ ℝ. Notice that
any value of zl will make this a valid distribution. We let zl (x) = aTl x + bl as above and define the
likelihood function ℒ : ℝ^{Nn} × ℝ^{(K−1)(n+1)} → [0, 1] by
ℒ(x_1, …, x_N; a_1, …, a_{K−1}, b_1, …, b_{K−1}) = ∏_{l=1}^K ∏_{k: y_k = l} p_l( z_1(x_k), …, z_{K−1}(x_k) ).
which is a concave function. This follows from Exercise 4.4 and the fact that concavity
is preserved under an affine transformation. We use the functions 𝛿l ∶ ℝn → ℝ defined by
𝛿l (x) = pl (aT1 x + b1 , … , aTK−1 x + bK−1 ) as discriminant functions similarly as above, i.e. we let
fl (x) = 𝛿l (x) − max k≠l 𝛿k (x).
The classifiers discussed so far work with discriminant functions, and then the function fl that
separates the classes is obtained indirectly as fl (x) = 𝛿l (x) − max k≠l 𝛿k (x). In Section 10.5, we will
discuss how to work directly with the functions fl .
It is always possible to formulate feasibility problems as optimization problems, and there are
many ways of doing this. One is to define the function g ∶ ℝ → ℝ+ by
g(y) = max(0, y),
and define the optimization problem
minimize   ∑_{k=1}^N g( 1 − y_k f(x_k; a, b) )     (10.5)
with variables (a, b). Clearly, the optimal value is zero if and only if the two classes can be separated
by a hyperplane. This formulation is the basis for Hebbian learning, cf . Exercise 10.6, where we
carry out the minimization using a subgradient algorithm. The function g is called the rectifier
function or the rectified linear unit (ReLU) function.
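A minimal MATLAB sketch of this idea is given below. It is not the code used in the book; the data are synthetic, f(x; a, b) = a^T x + b is assumed, and a fixed step size is used in the subgradient method for (10.5).

rng(0);
X = [randn(20,2)+1; randn(20,2)-1];      % two assumed point clouds in R^2
y = [ones(20,1); -ones(20,1)];           % class labels in {-1, 1}
a = zeros(2,1); b = 0; t = 0.01;         % assumed fixed step size
for iter = 1:1000
    m = y.*(X*a + b);                    % margins y_k f(x_k; a, b)
    idx = m < 1;                         % terms where g(1 - y_k f(x_k; a, b)) > 0
    ga = -X(idx,:)'*y(idx);              % a subgradient with respect to a
    gb = -sum(y(idx));                   % a subgradient with respect to b
    a = a - t*ga;  b = b - t*gb;         % subgradient step
end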
with variables (a, b, 𝜉), where 𝜈 ≥ 0 can be used to make a trade-off between the amount of mis-
fit and the distance to the hyperplane. This is the formulation that is the basis of support vector
machines (SVMs). The optimization problem is a convex optimization problem and because of this
it is also tractable. We remark that in the linear regression we may use nonlinear functions of xk
instead of xk itself without changing the type of optimization problems. However, the two classes
will then not be separated by a hyperplane but by a general surface.
= (ν/2) ||a||₂² − ∑_{k=1}^N λ_k y_k a^T β(x_k) − ∑_{k=1}^N λ_k y_k b + ∑_{k=1}^N (1 − λ_k − μ_k) ξ_k + ∑_{k=1}^N λ_k.
∂L/∂a = ν a − ∑_{k=1}^N λ_k y_k β(x_k) = 0,
i.e.
a = (1/ν) ∑_{k=1}^N λ_k y_k β(x_k).
This follows from the fact that the Lagrangian is convex in a. We know from complementary
slackness that 𝜆k = 0, if the first constraint in (10.6) is satisfied strictly at optimality. This means
that the optimal solution a⋆ only depends on xk if yk f (xk ; a⋆ , b⋆ ) = 1 − 𝜉k⋆ , and these vectors xk
are called support vectors. Back substitution into the Lagrangian gives the Lagrange dual function
g ∶ ℝN × ℝN → ℝ given by
g(λ, μ) = ∑_{k=1}^N λ_k − (1/(2ν)) ∑_{k=1}^N ∑_{l=1}^N λ_k λ_l y_k y_l β^T(x_k) β(x_l),
with dom g = {(λ, μ) ∣ 1 − λ_k − μ_k = 0, k ∈ ℕ_N, ∑_{k=1}^N λ_k y_k = 0}. Since Slater's condition is fulfilled
for (10.6), we may obtain the optimal (λ, μ) by solving the dual optimization problem
maximize   g(λ, μ)
subject to  λ_k + μ_k = 1,  k ∈ ℕ_N
            ∑_{k=1}^N λ_k y_k = 0
            λ_k ≥ 0,  μ_k ≥ 0,  k ∈ ℕ_N,
with variables (𝜆, 𝜇). This is a convex optimization problem with quadratic objective function
and simple inequality constraints, which can be solved efficiently. There are actually special
purpose algorithms that are very efficient, e.g. the so-called sequential minimal optimization
algorithm [91].
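As an illustration, the dual problem can also be solved with a general-purpose QP solver. The MATLAB sketch below is not from the book and assumes β(x) = x, synthetic data, and the Optimization Toolbox function quadprog; the variable μ has been eliminated using μ_k = 1 − λ_k, which turns the constraints into 0 ≤ λ_k ≤ 1 and ∑_k λ_k y_k = 0.

rng(0);
X = [randn(20,2)+1; randn(20,2)-1];              % assumed data, beta(x) = x
y = [ones(20,1); -ones(20,1)]; nu = 20; N = 40;
H = (1/nu)*(y*y').*(X*X');                       % quadratic term of -g(lambda, mu)
f = -ones(N,1);                                  % linear term of -g(lambda, mu)
lam = quadprog(H, f, [], [], y', 0, zeros(N,1), ones(N,1));
a = (1/nu)*X'*(lam.*y);                          % recover a from the dual solution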
Example 10.4 We consider an example where we have 20 points x1 , … , x20 ∈ ℝ2 of one class and
20 points x21 , … , x40 ∈ ℝ2 of another class. We use the parameter value 𝜈 = 20 and compute a and
b for the function f (x) = aT x + b that should separate the two classes. In Figure 10.5, we see the
points together with the line f (x) = 0, which almost separates the two classes of points. It is easy
to see that they cannot be separated with a straight line, but the SVM solution only misclassifies a
few points.
f(x; a, b) = (1/ν) ∑_{k=1}^N λ_k y_k β(x_k)^T β(x) + b,
Figure 10.5 Plot showing the data points from the two classes marked with + for the first class and ⚬ for
the second class. Also shown is the line f (x) = 0 that almost separates the two classes.
Figure 10.6 Graph showing how the visible and hidden layers in an RBM are connected.
where
Z = ∑_{h,v} e^{λ_1^T v + λ_2^T h + 2 v^T Λ_{12} h}.
The convention is that if there are no limits for the summations, the sum should be taken over
all possible values of the variables. It follows that the conditional probability function for h
given 𝑣 may be factorized. To see this, notice that the marginal probability function for 𝑣 can be
written as
∑_h p(v, h) = (e^{λ_1^T v}/Z) ∑_h e^{(λ_2 + 2Λ_{12}^T v)^T h} = (e^{λ_1^T v}/Z) ∑_h ∏_i e^{(λ_2 + 2Λ_{12}^T v)_i h_i}
            = (e^{λ_1^T v}/Z) ∏_i ∑_{h_i} e^{(λ_2 + 2Λ_{12}^T v)_i h_i} = (e^{λ_1^T v}/Z) ∏_i ( 1 + e^{(λ_2 + 2Λ_{12}^T v)_i} ).
s_j(h_j, v) = e^{(λ_2 + 2Λ_{12}^T v)_j h_j} / ∏_i ( 1 + e^{(λ_2 + 2Λ_{12}^T v)_i} ),
and the overall conditional probability function is obtained as the product of these factors. The
expression of the factors can be simplified considerably. Notice that for h_j = 0 and h_j = 1 they are
given by
s_j(0, v) = 1 / ∏_i (1 + e^{x_i}),    s_j(1, v) = e^{x_j} / ∏_i (1 + e^{x_i}),
where x_i = (λ_2 + 2Λ_{12}^T v)_i. Since s_j(0, v) + s_j(1, v) = 1, we have that ∏_i (1 + e^{x_i}) = e^{x_j} + 1, and hence
s_j(1, v) = e^{x_j}/(1 + e^{x_j}) = 1/(1 + e^{−x_j}) = σ(x_j),
where 𝜎 ∶ ℝ → ℝ is the logistic function, which is an example of a so-called sigmoid function.
∂Q/∂Λ_{12} = 2 ( ∑_h s(h, v, λ⁻, Λ⁻) v h^T − ∑_ξ p(ξ_v, ξ_h) ξ_v ξ_h^T ),
where s(h, v, λ⁻, Λ⁻) = ∏_k s_k(h_k, v), and where ξ = (ξ_v, ξ_h). We let p_v(ξ_v) = ∑_{ξ_h} p(ξ_v, ξ_h) be the
marginal probability function for the observations. We may then write the gradients as
∂Q/∂λ = ∑_h s(h, v, λ⁻, Λ⁻) [v; h] − ∑_{ξ_v} p_v(ξ_v) ∑_{ξ_h} s(ξ_h, ξ_v, λ⁻, Λ⁻) [ξ_v; ξ_h]
∂Q/∂Λ_{12} = 2 ( ∑_h s(h, v, λ⁻, Λ⁻) v h^T − ∑_{ξ_v} p_v(ξ_v) ∑_{ξ_h} s(ξ_h, ξ_v, λ⁻, Λ⁻) ξ_v ξ_h^T ).
We realize that
∑_h s(h, v, λ⁻, Λ⁻) h_l = ∑_h ∏_k s_k(h_k, v) h_l = ∑_h ∏_{k≠l} s_k(h_k, v) s_l(h_l, v) h_l
                        = ∏_{k≠l} ∑_{h_k} s_k(h_k, v) ∑_{h_l} s_l(h_l, v) h_l = s_l(1, v),
where the last equality follows from the fact that s_k(0, v) + s_k(1, v) = 1 and s_l(0, v) × 0 = 0. From
this, it follows that
∂Q/∂λ_1 = v − ∑_{ξ_v} p_v(ξ_v) ξ_v
( ∂Q/∂λ_2 )_i = s_i(1, v) − ∑_{ξ_v} p_v(ξ_v) s_i(1, ξ_v)
( ∂Q/∂Λ_{12} )_{i,j} = 2 ( s_j(1, v) v_i − ∑_{ξ_v} p_v(ξ_v) s_j(1, ξ_v) (ξ_v)_i ).
The first term in each of the expressions is cheap to evaluate. The second terms can be approximated
using Monte Carlo methods. In [38], the so-called contrastive divergence method based on Gibbs
sampling is described. We want to draw a sample 𝜉𝑣 from p𝑣 . We initialize the Gibbs sampler with
𝜉𝑣(0) = 𝑣, cf . Section 9.12. Because of the graphical structure of the RBM, we then draw a sample
h(1) from the conditional distribution of h given 𝑣, i.e. from sj (hj , 𝜉𝑣(0) ) above. We then finally draw a
sample 𝜉𝑣(1) from the conditional distribution of 𝑣 given h, where we use h = h(1) . This conditional
distribution can be obtained similarly to sj (hj , 𝑣) above. Here we have run the Gibbs sampler for
only one step, but it is, of course, possible to run more steps. However, empirical evidence suggests
that one step is enough. After the samples have been obtained, we approximate the above sums as
ξ_v^{(1)},   s_i(1, ξ_v^{(1)}),   s_j(1, ξ_v^{(1)}) (ξ_v^{(1)})_i,
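A minimal MATLAB sketch of one contrastive divergence step is given below. It is not the book's implementation; the sizes, the initialization, and the step size are assumptions, and the conditional distribution of v given h is formed by symmetry with s_j(h_j, v).

nv = 4; nh = 3;                                  % assumed numbers of visible and hidden units
lambda1 = zeros(nv,1); lambda2 = zeros(nh,1); Lambda12 = 0.01*randn(nv,nh);
v = double(rand(nv,1) > 0.5);                    % an observed visible vector (assumption)
sig = @(x) 1./(1 + exp(-x));                     % logistic function
ph0 = sig(lambda2 + 2*Lambda12'*v);              % s_j(1, v)
h1  = double(rand(nh,1) < ph0);                  % Gibbs step: sample h given v
pv1 = sig(lambda1 + 2*Lambda12*h1);              % conditional for v given h (by symmetry)
v1  = double(rand(nv,1) < pv1);                  % Gibbs step: sample xi_v^(1)
ph1 = sig(lambda2 + 2*Lambda12'*v1);             % s_j(1, xi_v^(1))
t = 0.1;                                         % assumed step size
lambda1  = lambda1  + t*(v - v1);                % approximate gradient ascent on Q
lambda2  = lambda2  + t*(ph0 - ph1);
Lambda12 = Lambda12 + t*2*(v*ph0' - v1*ph1');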
where zi ∈ ℝni is the output of layer i. The function Φi ∶ ℝni−1 → ℝni is called the propagation
function. Typically, it can be a linear or an affine function, and then we may write
Φi (x) = Wi x + 𝑣i , i ∈ ℕL , (10.7)
where Wi ∈ ℝni ×ni−1 and 𝑣i ∈ ℝni . The input to the next layer is obtained by
xi = hi (zi ), i ∈ ℕL−1 , (10.8)
where hi ∶ ℝni → ℝni is called the activation function. It is often the case that this function can be
written as
h_i(z) = ( h_{i1}(z_1), …, h_{in_i}(z_{n_i}) ),     (10.9)
and each component hij is typically a saturation function or a sigmoid function, i.e. a function
𝜎 ∶ ℝ → [0, 1] such that limt→∞ 𝜎(t) = 1 and limt→−∞ 𝜎(t) = 0. Another popular choice is the ReLU,
which was defined in Section 10.5. Figure 10.7 shows an example of a neural network.
We now define the functions fi ∶ ℝni−1 × ℝpi → ℝni , i ∈ ℕL , as
f_i(x_{i−1}, θ_i) = h_i(Φ_i(x_{i−1})) = h_i(W_i x_{i−1} + v_i),     (10.10)
where
θ_i = vec([ W_i  v_i ]) ∈ ℝ^{p_i}.     (10.11)
It follows that
xi = fi (xi−1 ; 𝜃i ), i ∈ ℕL . (10.12)
Finally, we define f : ℝ^{n_0} × ℝ^{p_1} × ⋯ × ℝ^{p_L} → ℝ^{n_L} as f(x_0; θ_1, …, θ_L) = x_L. With x = x_0 ∈ ℝ^{n_0} and
θ = (θ_1, …, θ_L) ∈ ℝ^p, where p = ∑_{i=1}^L p_i, we have hence defined a nonlinear regression model or
predictor f (x; 𝜃). The recursive structure of the predictor is illustrated in Figure 10.8.
Figure 10.7 A neural network with four hidden layers. The nodes represent the activation functions, and
the edges illustrate how the output from one layer is propagated to the next.
Figure 10.8 Figure illustrating the recursive definition of the ANN predictor.
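Evaluating the predictor amounts to iterating (10.7) and (10.8). The MATLAB sketch below is not from the book; the layer sizes, weights, and activation functions are assumptions chosen only for illustration.

W = {randn(5,3), randn(4,5), randn(1,4)};     % assumed weight matrices W_i
v = {zeros(5,1), zeros(4,1), 0};              % assumed biases v_i
h = {@tanh, @tanh, @(z) z};                   % assumed activation functions h_i
x = randn(3,1);                               % input x_0
for i = 1:numel(W)
    z = W{i}*x + v{i};                        % z_i = W_i x_{i-1} + v_i, cf. (10.7)
    x = h{i}(z);                              % x_i = h_i(z_i), cf. (10.8)
end
x                                             % x_L = f(x_0; theta_1, ..., theta_L)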
where 𝜙p,q ∶ [0, 1] → [0, 1] are continuous increasing functions, and where g ∶ ℝ → ℝ is a con-
tinuous function [64, 72]. As a result, a two-layer ANN is sufficient to represent any continuous
function. However, very little is known about how to choose g. A result by Cybenko [30] says that
any continuous function f ∶ [0, 1]n → ℝ can be approximated arbitrarily well as
f(x) ≈ ∑_{j=1}^N α_j σ( a_j^T x + b_j )     (10.13)
if N is sufficiently large, and where 𝜎 is any continuous sigmoid function, and 𝛼j ∈ ℝ, aj ∈ ℝn and
bj ∈ ℝ, j ∈ ℕN .
V(θ) = (1/2) ∑_{k=1}^N ||y_k − f(x_k; θ)||₂²,     (10.14)
the minimization of which is a nonlinear LS problem.
Finally, we interpret the optimization problem for Hebbian learning in (10.5) as a one-layer ANN.
Let 𝑣1 = b, W1 = aT and h(z1 ) = g(z1 ). The training data is (xk , yk ) ∈ ℝn × {−1, 1}, k ∈ ℕN and the
function to maximize is the sum of the outputs of the ANN over all training data.
Because of these interpretations, it is easy to see how one can generalize logistic regression, the
RBM, and Hebbian learning using multilayer ANNs.
x_0 ← x
for i ← 1 to L do
    z_i ← W_i x_{i−1} + v_i
    x_i ← h_i(z_i)
end
H_L ← ∂h_L(z_L)/∂z_L^T
∂f/∂θ_L^T ← H_L ([x_{L−1}^T  1] ⊗ I)
P ← H_L W_L
for i ← L − 1 to 1 do
    H_i ← ∂h_i(z_i)/∂z_i^T
    P ← P H_i
    ∂f/∂θ_i^T ← P ([x_{i−1}^T  1] ⊗ I)
    P ← P W_i
end
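The MATLAB sketch below applies the same idea to one term of the nonlinear LS criterion (10.14) for a small network with tanh activations. It is not the book's code, and it computes the gradient of the squared residual with respect to W_i and v_i rather than the full Jacobian of f; the network and data are assumptions.

W = {randn(5,3), randn(1,5)}; v = {zeros(5,1), 0};   % an assumed two-layer network
x0 = randn(3,1); y = 0.5;                            % one assumed training pair
xs = {x0}; zs = {};
for i = 1:2                                          % forward pass, storing z_i and x_i
    zs{i} = W{i}*xs{i} + v{i};
    xs{i+1} = tanh(zs{i});
end
e = xs{3} - y;                                       % residual f(x0; theta) - y
d = e.*(1 - tanh(zs{2}).^2);                         % back-propagated error at layer 2
for i = 2:-1:1
    gW{i} = d*xs{i}';                                % gradient with respect to W_i
    gv{i} = d;                                       % gradient with respect to v_i
    if i > 1
        d = (W{i}'*d).*(1 - tanh(zs{i-1}).^2);       % propagate the error backwards
    end
end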
which has infinitely many solutions if the system of equations is consistent. It would then be natural
to look for a solution that minimizes the norm ||a||2 . In Exercise 10.11, we show that if X has full
row rank, then this solution is given by
a = X^T (X X^T)^{−1} y.
We will now see how the incremental stochastic optimization method in (6.72) can be used to obtain
this least-norm solution by solving the optimization problem
minimize   (1/N) ∑_{k=1}^N V_k(a^T x_k, y_k),
with variable a, [120]. Here Vk ∶ ℝ × ℝ → ℝ are functions such that Vk (0, 0) = 0 and such that the
incremental method converges to a vector a that satisfies Xa = y. This holds true for many convex
functions. One example is
V_k(z_k, y_k) = (1/2)(y_k − z_k)²,     (10.16)
which corresponds to the LS criterion. We define f_k(a) = V_k(a^T x_k, y_k) to get agreement with
the optimization problem in (5.46). The incremental method reads
ak+1 = ak − tk ∇fik (ak ).
We have that
∇f_{i_k}(a) = ( ∂V_{i_k}(a^T x_{i_k}, y_{i_k}) / ∂z_{i_k} ) x_{i_k},
and hence, if we start with a0 = 0, we find that the incremental method converges to a = X T 𝛼 for
some 𝛼 ∈ ℝN . Together with Xa = y, we obtain the equation
XX T 𝛼 = y,
which has a unique solution if X has full row rank. It then follows that
a = X^T (X X^T)^{−1} y,
which is the least-norm solution. This could also have been obtained by solving the optimization
problem
minimize   ∑_{k=1}^N (y_k − a^T x_k)² + γ ||a||₂²,
with variable a, which is a regularized LS problem, and then letting 𝛾 → 0, see Exercise 10.11.
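The MATLAB sketch below illustrates this numerically. It is not from the book; the data are synthetic and a fixed step size is used, so the iterate only approaches the least-norm solution approximately.

rng(0);
N = 20; n = 50;                              % fewer equations than unknowns
X = randn(N,n); y = randn(N,1);              % consistent underdetermined system
a = zeros(n,1); t = 0.005;                   % start at zero, assumed step size
for k = 1:200000
    i = randi(N);                            % pick one term, cf. (6.72)
    a = a - t*(a'*X(i,:)' - y(i))*X(i,:)';   % gradient of V_i in (10.16)
end
aln = X'*((X*X')\y);                         % least-norm solution
norm(a - aln)/norm(aln)                      % typically small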
10.8.2 Dropout
Another mechanism that results in implicit regularization is the concept of dropout. We will study
it for the linear regressor in (10.15). We consider Vk (z, y) given by (10.16), and the resulting LS
problem is
minimize   (1/2) ∑_{k=1}^N (y_k − a^T x_k)²,
with variable a ∈ ℝn . Dropout is achieved by replacing ai , i.e. the ith component of a with 𝛿i ai ,
where 𝛿i is an outcome of a random variable Δi with a Bernoulli distribution, i.e. P(Δi = 1) = pi
and P(Δi = 0) = 1 − pi . We assume that Δi is independent of Δj for i ≠ j. Let Δ = diag(Δ1 , … , Δn )
and δ = diag(δ_1, …, δ_n). We may then write the kth term in the objective function as the outcome of
(1/2)(y_k − x_k^T Δa)². We actually also assume that we have different outcomes δ of Δ for each k that are
independent of one another, but since we only need to analyze one fixed value of k to understand how
the incremental method performs with dropout, we will neglect all dependence on k and consider
V d ∶ ℝn → ℝ defined as
V^d(a) = (1/2)(y − x^T Δa)²,
for y ∈ ℝ and x ∈ ℝn . Once we have determined a, we would like to use this parameter to predict y
given a new value of x. We therefore define the predictor Ŷ ∶ ℝn × {0, 1}n → ℝ as Ŷ (x; Δ1 , … , Δn ) =
xT Δa. This is, however, not so useful, since it involves the random variable Δi . We are more inter-
ested in the expected value of this predictor, which is often called the ensemble average predictor
ŷ ∶ ℝn → ℝ defined as
ŷ(x) = 𝔼[ Ŷ(x; Δ_1, …, Δ_n) ] = x^T P a,
where P = diag(p1 , … , pn ). The interesting question is if running the incremental method on the
problem
minimize   𝔼[ V^d(a) ],
with variable a will result in a good ensemble average predictor. We therefore introduce the function
V e ∶ ℝn → ℝ as
V^e(a) = (1/2)(y − x^T P a)²,
which we like to be small for the ensemble average predictor to be good. We will now
compare the gradient of this function with the expected value of the gradient of V d . Remem-
ber that this expected value is what determines the behavior of the incremental method,
cf . Section 6.8. We have
∂V^e(a)/∂a = −(y − x^T P a) P x
∂V^d(a)/∂a = −(y − x^T Δa) Δ x.
From this, we obtain
𝔼[ ∂V^d(a)/∂a ] = −y P x + 𝔼[ Δ x x^T Δ ] a,
where element (i, j) of 𝔼[ Δ x x^T Δ ] is given by
𝔼[ Δ_i Δ_j x_i x_j ] = p_i p_j x_i x_j for i ≠ j, and 𝔼[ Δ_i Δ_j x_i x_j ] = p_i x_i² for i = j.
This implies that
𝔼[ Δ x x^T Δ ] = P x x^T P + diag(σ_1² x_1², …, σ_n² x_n²),
where 𝜎i2 = pi (1 − pi ) is the variance of Δi . This means that we have
𝔼[ ∂V^d(a)/∂a ] = ∂V^e(a)/∂a + diag(σ_1² x_1², …, σ_n² x_n²) a,
which is the gradient of
V^e(a) + (1/2) a^T diag(σ_1² x_1², …, σ_n² x_n²) a.     (10.17)
We see that the second term is a ridge regularization, and this explains how dropout implicitly
provides regularization. The largest possible regularization is obtained when pi = 1∕2, since this
value maximizes 𝜎i2 . For more information about dropout see [8].
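The equivalence in (10.17) can be checked numerically. The MATLAB sketch below is not from the book; it compares a Monte Carlo estimate of the expected dropout gradient with the gradient of the regularized criterion for assumed values of x, y, a, and p_i = 1/2.

rng(0);
n = 3; x = randn(n,1); y = randn; a = randn(n,1); p = 0.5*ones(n,1);
P = diag(p); S2 = diag(p.*(1-p).*x.^2);
g = zeros(n,1); M = 2e5;
for s = 1:M                                  % Monte Carlo estimate of E[dV^d/da]
    d = double(rand(n,1) < p);               % one outcome of Delta
    g = g + (-(y - x'*(d.*a))*(d.*x))/M;
end
ge = -(y - x'*P*a)*P*x + S2*a;               % gradient of (10.17)
[g ge]                                       % the two columns nearly agree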
Exercises
(a) Show that this can be done with the following update formula:
a_{N+1} = a_N + P_{N+1} x_{N+1} ( y_{N+1} − x_{N+1}^T a_N ),
where P_N^{−1} = X_N^T X_N and where P_{N+1}^{−1} = P_N^{−1} + x_{N+1} x_{N+1}^T.
(b) Show that the recursion above for P_N can be equivalently written as
P_{N+1} = P_N − ( 1/(1 + x_{N+1}^T P_N x_{N+1}) ) P_N x_{N+1} x_{N+1}^T P_N.
You may use the matrix inversion lemma, which says that for matrices A, C, U, and V such that the dimensions are compatible and such that the inverses below exist, it holds that
(A + UCV)^{−1} = A^{−1} − A^{−1} U ( C^{−1} + V A^{−1} U )^{−1} V A^{−1}.
10.3 Consider the MAP optimization problem in (10.3) restated for ease of reference below
minimize   (1/(2σ²)) ∑_{k=1}^N ( y_k − a^T β(x_k) )² + N ln(√(2π) σ) − ln f_A(a),
with variable a, where fA is the pdf for the prior.
(a) Consider the case of a double-sided exponential distribution given by
f_A(a) = ∏_{i=1}^m (1/(2λ)) e^{−|a_i|/λ},
for some λ > 0, and show that an equivalent optimization problem is
minimize   (1/2) ∑_{k=1}^N ( y_k − a^T β(x_k) )² + c ∑_{i=1}^m |a_i|,
with variable a for some constant c > 0. This is known as lasso regularization.
(b) Consider the case when each component A_i of the random variable A has a uniform
distribution on the interval [−λ, λ], where λ > 0. Show that an equivalent MAP opti-
mization problem is
minimize   (1/2) ∑_{k=1}^N ( y_k − a^T β(x_k) )²
subject to  ||a||_∞ ≤ λ,
with variable a.
10.4 Show that the logistic regression problem in Section 10.4 can be equivalently written as a
conic optimization problem involving the exponential cone.
10.5 Consider Example 10.2 and write MATLAB code that reproduces the result in the
example.
10.8 Consider the Fisher Iris data set that you investigated in Exercise 9.13 using PCA. You
are going to use the compressed data ci that you obtained using the two principal com-
ponents corresponding to the two largest singular values. You are asked to classify these
using an SVM as in (10.6) with f being an affine function, i.e. f (c) = aT c + b, where a ∈ ℝ2
and b ∈ ℝ are the parameters that define the separating straight line. The data set has
three classes, and the SVM is only able to separate between two classes. However, you will
instead carry out three classifications, where you classify species 1 against species 2 and 3,
species 2 against species 1 and 3, and finally species 3 against species 1 and 2. In this way,
you will obtain three separating straight lines. Plot the lines on top of the figure you plot-
ted in Exercise 9.13. How well do the lines separate the different species? Are there species
that cannot be classified? Are there species that are not uniquely classified? How do the
results depend on the choice of the regularization parameter ν? Notice that you may use
different values of 𝜈 in the three different classifications.
Hint: You can directly solve the primal problems using, e.g. YALMIP, since it is of low
dimension.
minimize   (1/n) ∑_{i=1}^n φ(a_i^T w) + γ g(w)     (10.18)
with variable 𝑤 ∈ ℝd and where 𝜙(t) = max (0, 1 − t) and g(𝑤) = (1∕2)||𝑤||22 . The problem
data are ai ∈ ℝd for i = 1, … , n, and 𝛾 > 0 is a parameter.
(a) We start by deriving a dual coordinate ascent method for the problem (10.18).
(1) Show that the dual problem is equivalent to the problem
maximize   −∑_{i=1}^n φ*(−x_i) − (1/(2nγ)) ||A^T x||₂²     (10.19)
(b) Show that the proximal operator associated with gi (𝑤) = 𝜙(aTi 𝑤) + 𝛾g(𝑤) can be
expressed as
prox_{t g_i}(w) = argmin_u { φ(a_i^T u) + (γ/2)||u||₂² + (1/(2t))||u − w||₂² }
   = (1/(1 + tγ)) ( w + t a_i )                                   if t(||a_i||₂² − γ) < 1 − a_i^T w,
   = (1/(1 + tγ)) ( w + ((1 − a_i^T w + tγ)/||a_i||₂²) a_i )      if −tγ ≤ 1 − a_i^T w ≤ t(||a_i||₂² − γ),
   = (1/(1 + tγ)) w                                               if 1 − a_i^T w < −tγ.
Hint: Derive the optimality condition associated with the prox-problem and show that
u⋆ has the form u⋆ = (1 + t𝛾)−1 (𝑤 − 𝛽tai ), where 𝛽 ∈ 𝜕𝜙(aTi u⋆ ).
(c) The so-called linear soft-margin support vector machine training problem is a special
case of the problem (10.18). Specifically, if we partition w as w = (w̃, b) ∈ ℝ^{d−1} × ℝ
and define a_i = (y_i x_i, y_i), where x_i ∈ ℝ^{d−1} corresponds to a so-called feature vec-
tor with corresponding label y_i ∈ {−1, 1}, then a_i^T w = y_i(x_i^T w̃ + b), and hence,
φ(a_i^T w) = max(0, 1 − y_i(x_i^T w̃ + b)). Given a vector of labels y = (y_1, …, y_n) and a
matrix X ∈ ℝ^{n×(d−1)} with rows x_1^T, …, x_n^T, we can express A as
A = diag(y) [ X  𝟙 ].
The file classification.mat is an example of such a data set.
The solution to the SVM training problem defines a hyperplane that can be used to
define a classifier: the function f (x) = sgn(𝑤̃ T x + b) provides a label prediction for an
unlabeled feature vector x.
(1) Implement an incremental proximal method for solving (10.18), and test your
implementation on the provided dataset.
(2) Implement a dual coordinate ascent method for solving (10.19), and test your
implementation on the provided dataset.
(3) Compare the two methods: plot the objective value as a function of the number of
iterations, e.g. integer multiples of n iterations.
10.10 The soft-margin support vector machine training problem is a convex quadratic problem:
minimize   𝟙^T v + (γ/2)||w||₂²
subject to  diag(y)(Aw + 𝟙b) ⪰ 𝟙 − v     (10.20)
            v ⪰ 0
with variables w ∈ ℝ^n, b ∈ ℝ, and v ∈ ℝ^m, and with problem data A ∈ ℝ^{m×n} and
y ∈ {−1, 1}^m. The optimal v can be expressed in terms of w and b as
𝑣 = max (0, 𝟙 − diag(y)(A𝑤 + 𝟙b)).
The Lagrangian is given by
L(v, w, b, z, λ) = 𝟙^T v + (γ/2)||w||₂² + z^T( 𝟙 − v − diag(y)(Aw + 𝟙b) ) − λ^T v,
and the dual problem can be expressed as
maximize   −(1/(2γ)) z^T diag(y) A A^T diag(y) z + 𝟙^T z
subject to  y^T z = 0     (10.21)
            0 ⪯ z ⪯ 𝟙.
z_i^⋆ = 0  ⇔  y_i u_i ≥ 1
0 < z_i^⋆ < 1  ⇔  y_i u_i = 1        for i = 1, …, m     (10.22)
z_i^⋆ = 1  ⇔  y_i u_i ≤ 1
where u = Aw + 𝟙b = γ^{−1} A A^T diag(y) z + 𝟙b.
The dual problem (10.21) can be solved using the so-called sequential minimal optimization
(SMO) method. Suppose zk is the value of the dual variable at the beginning of iteration k.
The SMO iteration then consists of the following three steps:
● Working set selection: select two dual variables zk,i and zk,j (i ≠ j) where at least one of the
two variables violates the optimality conditions (10.22).
● Two-variable subproblem: compute zk+1,i and zk+1,j by solving the dual problem (10.21)
with
z = zk + (zi − zk,i )ei + (zj − zk,j )ej . (10.23)
● Update intercept: compute bk+1 such that the updated dual variables zk+1,i and zk+1,j
satisfy the optimality conditions (10.22).
The main advantage of this seemingly simple method is that the two-variable subproblem
is cheap to solve. Each iteration increases the dual objective, and the method stops when
all dual variables satisfy the optimality conditions within some tolerance. The method can
be shown to converge, but the working set selection can have a significant impact on per-
formance: there are m(m − 1)∕2 potential working sets at each iteration, so a good working
set selection heuristic is crucial. A popular heuristic is the so-called maximal violating pair
working set selection rule which chooses a pair (i, j) as
i ∈ argmin_{l∈I_1} {y_l − u_l},   j ∈ argmax_{l∈I_2} {y_l − u_l},     (10.24)
where
I_1 = {l ∣ z_l > 0, y_l = 1} ∪ {l ∣ z_l < 1, y_l = −1}
I_2 = {l ∣ z_l > 0, y_l = −1} ∪ {l ∣ z_l < 1, y_l = 1}.
This selection heuristic can be motivated from the optimality conditions (10.22): yl ul must
satisfy yl ul ≤ 1 if zl > 0, and similarly, yl ul ≥ 1 if zl < 1. By multiplying both sides of both
yl ul ≤ 1 and yl ul ≥ 1 by yl , we can express the optimality conditions as
0 ≤ yl − ul , l ∈ I1
0 ≥ yl − ul , l ∈ I2 .
It follows that z satisfies the optimality conditions when (i, j) is chosen according to maxi-
mal violating pair heuristic (10.24) and yi − ui ≥ 0 ≥ yj − uj .
(a) Derive a simple method for solving the two variable subproblem with working set (i, j).
You may assume that zk is feasible.
Hint: The constraint yT z = 0 implies that zi and zj can be parameterized as zi (t) = zk,i +
tyi and zj (t) = zk,j − tyj , and hence, z(t) = zk + t(ei yi − ej yj ).
(b) Derive an expression for bk+1 so that (zk+1,i , uk+1,i ) and (zk+1,j , uk+1,j ) satisfy the optimal-
ity conditions (10.22).
(c) Derive a recursive update of u, i.e. express uk+1 in terms of uk , bk , and bk+1 .
(d) Implement the SMO algorithm with the working set selection heuristic (10.24) and
test it on the same data as in the previous exercise.
10.12 Implement a stochastic gradient algorithm that uses dropout. Generate pairs of data
(xk , yk ) ∈ ℝn × ℝ for k ∈ ℕN randomly in the following way. First, generate a0 ∈ ℝn
randomly from a standardized Gaussian distribution. Then generate xk randomly from a
standardized Gaussian distribution and also generate ek randomly from a standardized
Gaussian distribution. Then let yk = xkT a0 + 0.1ek . Use dropout probabilities pi = 0.5 in
the following stochastic gradient method
a_{k+1} = a_k + t_k ( y_{i_k} − x_{i_k}^T δ a ) δ x_{i_k},
where σ² = p_i(1 − p_i).
10.13 When dropout was used with an incremental method for solving the LS problem in
Section 10.8, we realized that the predictor was given by the expected value of the stochastic
predictor. This could be computed exactly. For general ANNs, it is only possible to approx-
imately compute the expectation as will be discussed in this exercise. Let Δi , i ∈ ℕn , be
random variables with a Bernoulli distribution, i.e. P(Δi = 1) = pi and P(Δi = 0) = 1 − pi .
Show that
G(Ẑ) / ( G(Ẑ) + G′(Ẑ) ) = σ( 𝔼 Ŷ ),
where the left-hand side is called the normalized weighted geometric mean.
(b) Show that
G(Ẑ) ≤ G(Ẑ) / ( G(Ẑ) + G′(Ẑ) ) ≤ 𝔼 Ẑ,
if 0 < Ẑ_i ≤ 1/2. Hence, σ(𝔼 Ŷ) is a good approximation for 𝔼 Ẑ = 𝔼[ σ(Ŷ) ] for the case
when 0 < Ẑ_i ≤ 1/2.
Hint: Use the Ky Fan inequality from Exercise 4.10.
10.14 We will use the Deep Learning Toolbox in MATLAB to fit an ANN to pairs of data (xi , yi ).
We will use a two-layer network implementing the function in (10.13). The following com-
mands read in the data, define a net with two layers, three hidden neurons, train the net,
display the net, compute the predicted outputs of the net, and plot a comparison of the
predicted values ŷ i and the original values of yi versus the values xi :
[x,y] = simplefit_dataset;
net = fitnet(3);
net = train(net,x,y);
view(net),
haty = net(x);
plot(x,haty,'-',x,y,'--');
Play around with the number of hidden neurons and try to figure out how many are needed
to get an almost perfect fit.
11
Reinforcement Learning
Reinforcement learning is an area of machine learning concerned with how agents should take
actions in an environment in order to maximize a so-called cumulative reward.
The idea goes back to the work by the Russian Nobel laureate Ivan Pavlov on classical conditioning
where he showed that you can train dogs by rewarding or punishing them. To formalize this, we
consider the dog to be in different states xk at different stages k in time. The value of the state xk+1
depends on the current state xk and the action uk we take during our training. Each and every state
xk is also related to a reward rk . It is then desirable to maximize the sum, possibly discounted, of
rewards over some set of stages, say {0, 1, … , N − 1}, where N could be infinity. In reinforcement
learning, the action is determined by an agent, and the next state and the reward are determined
by the environment. One can view reinforcement learning as a way of solving the optimal control
problems in Chapter 8 without explicitly knowing the details of the mathematical model that
describes how the state evolves. The control signal in optimal control corresponds to the action in
reinforcement learning. The incremental costs in optimal control are the negative of the rewards
in reinforcement learning.
to directly learn the Q-function instead of computing it using Fk from a learned value function Vk .
We will show that the following recursion holds
Q_k(x, ū) = f_k(x, ū) + min_u Q_{k+1}( F_k(x, ū), u ),     (11.2)
starting with k = N − 2 and finishing with k = 0, and where Q_{N−1}(x, u) = f_{N−1}(x, u) + φ(F_{N−1}(x, u)).
The above recursion follows from the fact that (8.4) and (8.5) imply that
V_{k+1}(x) = min_u Q_{k+1}(x, u),
and hence,
V_{k+1}( F_k(x, ū) ) = min_u Q_{k+1}( F_k(x, ū), u ),
for k ∈ ℤN−1 . This is a linear regression model, but we could also have considered a more general
model as we did in Section 8.2. We leave the details of generalizing to this case for the reader to
carry out. We then consider samples (x_k^s, u_k^s) ∈ 𝒳_n × 𝒰_m, where s ∈ ℕ_r, and define
β_{N−1}^s = φ( F_{N−1}(x_{N−1}^s, u_{N−1}^s) ),
and
β_k^s = min_u Q̃_{k+1}( F_k(x_k^s, u_k^s), u, a_{k+1} ),     (11.3)
where ak+1 is a known value from a previous iterate defined below. We see that we do not need
an analytical expression for Fk in order to define 𝛽ks for k ∈ ℤN−1 . It is enough to know the value
of Fk (xks , usk ) which is the next value of the state in a simulation. Moreover, depending on how the
feature vectors are chosen, the minimization above could become very tractable. This can be a great
advantage for the approximation method based directly on the Q-function. After this, we define the
following LS problem:
minimize   (1/2) ∑_{s=1}^r ( Q̃(x_k^s, u_k^s, a) − f_k(x_k^s, u_k^s) − β_k^s )²,     (11.4)
with variable a for k ∈ ℤN−1 . Denote the optimal solution by ak . The iterations start with k = N − 1
and go down to k = 0, where we alternate between solving (11.4) and (11.3). Once all parameters
a_k have been computed, the approximate optimal control is u_k^⋆ = μ_k(x_k), where μ_k : 𝒳_n → 𝒰_m is
given by
μ_k(x) = argmin_u Q̃(x, u, a_k).     (11.5)
u
We notice that using the Q-function instead of using the value function as in Section 8.2 comes at
the price of also having to sample the control signal space m .
Example 11.1 We will in this example perform fitted value iteration for the optimal control prob-
lem in Example 8.1 based on the iteration of the Q-functions. We will as in Example 8.3 specifically
consider the case when m = 1 and n = 2, and we let Ak , Bk , Rk , Sk be independent of k and we write
them as A, B, R, S. Since we know that the Q-functions are quadratic, we will use a feature vec-
tor that is φ(x, u) = (x_1², x_2², u², 2x_1x_2, 2x_1u, 2x_2u), where x = (x_1, x_2) ∈ ℝ² and u ∈ ℝ. Notice that the
indices refer to components of the vector and not to time. We let
Q̃_k(x, u, a) = a^T φ(x, u),
β_k^s = (x_+^s)^T ( P̃_{k+1} − r̃_{k+1} q̃_{k+1}^{−1} r̃_{k+1}^T ) x_+^s,
where x_+^s = A x_k^s + B u_k^s. We then obtain a_k for k ∈ ℤ_{N−1} as the solution to the linear LS problem in
(11.4):
minimize   (1/2) ∑_{s=1}^r ( φ^T(x_k^s, u_k^s) a − (x_k^s)^T S x_k^s − (u_k^s)^T R u_k^s − β_k^s )².
Notice that the subindex k for a now refers to stage index. The solution a_k satisfies the normal
equations, cf. (5.3),
Φ_k^T Φ_k a_k = Φ_k^T γ_k,
where Φ_k has rows φ^T(x_k^s, u_k^s) and γ_k has entries (x_k^s)^T S x_k^s + (u_k^s)^T R u_k^s + β_k^s for s ∈ ℕ_r.
It is here crucial to choose (xks , usk ) such that ΦTk Φk is invertible. We realize that we need r ≥ 6 for this
to hold. Moreover, we need to choose (xks , usk ) sufficiently different to have ΦTk Φk well conditioned.
For a general n, we need r ≥ (m + n)(m + n + 1)/2. Compared to Example 8.3, we require a larger
value of r. The optimal feedback function is by (11.5) and (11.6) given by
μ_k(x) = −q̃_k^{−1} r̃_k^T x,
where we again have used the results in Example 5.7.
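A generic backward step of (11.3)-(11.4) can be sketched in MATLAB as below. This is not the book's code: the dynamics, incremental cost, samples, and the previous parameter vector are placeholders, and the minimization over u in (11.3) is done over a sampled grid rather than analytically as in the example above.

F   = @(x,u) [0.5 1; 0 0.5]*x + [0; 1]*u;            % assumed dynamics F_k
f   = @(x,u) x'*x + u^2;                              % assumed incremental cost (S = I, R = 1)
phi = @(x,u) [x(1)^2; x(2)^2; u^2; 2*x(1)*x(2); 2*x(1)*u; 2*x(2)*u];
ugrid = linspace(-5, 5, 101);                         % sampled control values
r = 20; xs = 4*rand(2,r) - 2; us = 4*rand(1,r) - 2;   % assumed samples (x_k^s, u_k^s)
anext = randn(6,1);                                   % placeholder for a_{k+1}
Phi = zeros(r,6); gam = zeros(r,1);
for s = 1:r
    xp = F(xs(:,s), us(s));                           % next state F_k(x_k^s, u_k^s)
    q  = arrayfun(@(u) phi(xp,u)'*anext, ugrid);      % Qtilde_{k+1}(xp, u, a_{k+1})
    Phi(s,:) = phi(xs(:,s), us(s))';
    gam(s)   = f(xs(:,s), us(s)) + min(q);            % f_k + beta_k^s, cf. (11.3)
end
ak = (Phi'*Phi) \ (Phi'*gam);                         % solution of the LS problem (11.4)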
and
γ V(F(x, ū)) = min_u γ Q( F(x, ū), u ).
By adding f(x, ū) to both sides of the above equation, the desired equation for Q is obtained.
We introduce the Bellman Q-operator T_Q : ℝ^{𝒳_n × 𝒰_m} → ℝ^{𝒳_n × 𝒰_m} defined as
T_Q(Q)(x, ū) = f(x, ū) + min_u γ Q( F(x, ū), u ).     (11.8)
11.2.1 Q-Learning
We will now see how VI can be generalized by developing an alternative iterative method for solving
(11.7). To this end, we introduce the error function e : ℝ^{𝒳_n × 𝒰_m} → ℝ^{𝒳_n × 𝒰_m} defined as
e(Q) = Q − T_Q(Q),
where TQ is defined in (11.8). With this definition, we may write the equation in (11.7) as e(Q) = 0.
We now apply the following standard algorithm
Qk+1 = Qk − tk e(Qk ), k ∈ ℤ+ , (11.10)
to find a root of the equation e(Q) = 0. You can initialize with Q0 = f , but there are better ways.
The step lengths tk should satisfy tk ∈ (0, 1]. We will prove convergence of the above iterations to a
solution Q⋆ of e(Q) = 0, i.e. Q⋆ = TQ (Q⋆ ). Let Δk = Qk − Q⋆ . Then it holds that
( )
Δk+1 = (1 − tk )Δk + tk TQ (Qk ) − TQ (Q⋆ ) .
From this it follows for tk ∈ (0, 1] that
||Δk+1 ||∞ ≤ (1 − tk )||Δk ||∞ + tk ||TQ (Qk ) − TQ (Q⋆ )||∞ ≤ (1 − tk + tk 𝛾)||Δk ||∞ ,
where the last inequality follows from the contraction property of TQ for 𝛾 ∈ [0, 1) shown in
Exercise 11.6. We see that we have a contraction for Δk for tk ∈ (0, 1], if 𝛾 ∈ (0, 1), and therefore,
Δk converges to zero proving that Qk converges to Q⋆ . We remark that we recover VI for tk = 1.
The algorithm is summarized in Algorithm 11.3.
Instead of considering all values of (x, u) ∈ 𝒳_n × 𝒰_m in each iteration k, it is possible to consider
only one sample (x_k, u_k) at a time. These samples could be generated in a cyclic order or in a randomized
cyclic order such that each sample is visited infinitely many times. We assume that 𝒳_n × 𝒰_m is a
finite set. Then it holds that
Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) − t_k [ Q_k(x_k, u_k) − f(x_k, u_k) − min_u γ Q_k( F(x_k, u_k), u ) ],     (11.11)
converges to a solution of e(Q) = 0 as k goes to infinity when tk ∈ (0, 1] and 𝛾 ∈ (0, 1). The results
follow trivially from the convergence proof above. The algorithm has similarities with the coordi-
nate descent method in Section 6.9, since Q is only updated for one discrete value in each iteration.
This algorithm is often referred to as Q-learning, and it is summarized in Algorithm 11.4.
It is possible to develop VI methods for the stochastic infinite horizon optimal control problem
in (8.33) using the Q-function in (8.35) which satisfies the Bellman equation for the Q-function in
(11.32). The above algorithm can be used by replacing F(xk , uk ) with F(xk , uk , 𝑤k ), where 𝑤k is a
realization of the random process W. If the step sizes satisfy ∑_{k=0}^∞ t_k = ∞ and ∑_{k=0}^∞ t_k² < ∞, then
the algorithm can be shown to converge in a stochastic sense [114].
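For a finite problem, the sample-based iteration (11.11) is a few lines of MATLAB. The sketch below is not from the book; the system is an assumed random lookup table, and a fixed step size t ∈ (0, 1] is used.

rng(0);
nx = 5; nu = 3; gamma = 0.9; t = 0.5;        % assumed sizes, discount, and step size
Ftab = randi(nx, nx, nu);                    % F(x,u) as a lookup table (assumption)
ftab = rand(nx, nu);                         % incremental cost f(x,u) (assumption)
Q = zeros(nx, nu);
for k = 1:20000
    x = randi(nx); u = randi(nu);            % visit all pairs infinitely often
    xp = Ftab(x,u);                          % next state F(x,u)
    Q(x,u) = Q(x,u) - t*(Q(x,u) - ftab(x,u) - gamma*min(Q(xp,:)));   % (11.11)
end
[~, mu] = min(Q, [], 2);                     % greedy policy mu(x) = argmin_u Q(x,u)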
Reinforcement learning based on PI is called self-learning. The policy evaluation step is referred to
as a critic and the policy improvement is referred to as an actor. These types of methods are called
actor-critic methods. In case parametric approximations using artificial neural networks ANNs are
involved, the actor and critic are called actor networks and critic networks, respectively. In order to
do PI for the Q-function formulation in (11.7), we define Q_k : 𝒳_n × 𝒰_m → ℝ as Q_k(x, u) = f(x, u) +
𝛾Vk (F(x, u)). It then follows that
V_k(x) = Q_k( x, μ_k(x) ),
from (8.16), and therefore
V_k( F(x, u) ) = Q_k( F(x, u), μ_k(F(x, u)) ).
Multiply with γ and add f(x, u) to obtain that Q_k is the solution of
Q_k(x, u) = f(x, u) + γ Q_k( F(x, u), μ_k(F(x, u)) ).     (11.12)
This is the policy evaluation step in terms of the Q-function. We then obtain a new feedback policy
by solving
μ_{k+1}(x) = argmin_u Q_k(x, u),     (11.13)
which is the policy improvement step in terms of the Q-function. These iterations are exactly the
same as the iterations in (8.16)–(8.17) except that we need to solve for a function Qk that depends
also on the control signal in the policy iteration step. The PI algorithm for the Q-function is sum-
marized in Algorithm 11.5.
Example 11.2 We consider the infinite horizon LQ control problem in Example 8.4. We guess
that
Q_k(x, u) = [x; u]^T [ U_k  W_k ; W_k^T  V_k ] [x; u],
for some
[ U_k  W_k ; W_k^T  V_k ] ∈ 𝕊_+^{m+n},
where V_k ∈ 𝕊_{++}^m. It then follows from (11.13) that
μ_{k+1}(x) = −L_{k+1} x,
where L_{k+1} = V_k^{−1} W_k^T. The recursion for Q_k in (11.12) is seen to be satisfied if
[ U_k  W_k ; W_k^T  V_k ] = [ S  0 ; 0  R ] + γ [A  B]^T [ I ; −L_k ]^T [ U_k  W_k ; W_k^T  V_k ] [ I ; −L_k ] [A  B],
for a given L_k. This is an algebraic Lyapunov equation, which has a positive semidefinite solution
since
[ S  0 ; 0  R ]
is positive semidefinite. This assumes that
√γ [ I ; −L_k ] [A  B]
has all its eigenvalues strictly inside the unit disc. This is true if √γ(A − BL_k) has all its eigenvalues
strictly inside the unit disc by Exercise 11.1. As in Example 8.6, it can be shown that if we start with
a stabilizing L0 , then all subsequent Lk , will also be stabilizing. Moreover, it can be shown that all
Vk are positive definite so that the inverse in the formula for Lk exists.
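The iterations of Example 11.2 can be sketched in MATLAB as below. This is not the book's code; the system matrices and the discount factor are assumptions, and the Lyapunov equation is solved with dlyap from the Control System Toolbox.

A = [0.5 1; 0 0.5]; B = [0; 1]; S = eye(2); R = 1; gamma = 0.95;   % assumptions
L = zeros(1,2);                               % a stabilizing initial policy
for k = 1:20
    G = [eye(2); -L]*[A B];                   % maps (x,u) to (F(x,u), mu_k(F(x,u)))
    M = dlyap(sqrt(gamma)*G', blkdiag(S,R));  % policy evaluation: M = blkdiag(S,R) + gamma*G'*M*G
    U = M(1:2,1:2); W = M(1:2,3); V = M(3,3);
    L = V \ W';                               % policy improvement: L_{k+1} = V_k^{-1} W_k'
end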
where xi+1 = F(xi , 𝜇k (xi )) for i ∈ ℕN−1 , and x1 = F(x0 , u0 ). In case N is large and 𝜇k is stabilizing,
we have that xN is close to zero and that also Qk (xN ) is close to zero. We realize that the only differ-
ence for the approximate evaluation of the Q-function as compared to the approximate evaluation
of the value function in (8.18) is that the first incremental cost is evaluated using u0 and not 𝜇k (x0 ).
where xi+1 = F(xi , 𝜇k (xi )) for i ∈ ℕN−1 , and x1 = F(xs , us ). We then find the approximation of Qk by
solving
minimize   (1/2) ∑_{s=1}^r ( Q̃(x^s, u^s, a_k) − β_k^s )²,
with variable a_k. After this, we use the following exact policy improvement step
μ_{k+1}(x) = argmin_u Q̃(x, u, a_k).     (11.14)
A drawback of this method compared to when using value functions is that the reuse of trajectories
is more problematic, since sufficiently many different values of the control signal might then not
be explored. For a more detailed discussion on this see [17, Section 5.3].
Example 11.3 We will in this example consider the optimal control problem in Example 8.4.
We will specifically consider the case when m = 1 and n = 2. Since we know that the value func-
tion is quadratic, we will use a feature vector that is φ(x, u) = (x_1², x_2², u², 2x_1x_2, 2x_1u, 2x_2u), where
x = (x_1, x_2) ∈ ℝ² and u ∈ ℝ. Notice that the indices refer to components of the vector and not to
time. We let
Q̃(x, u, a) = a^T φ(x, u),
minimize   (1/2) ∑_{s=1}^r ( φ^T(x^s, u^s) a − β_k^s )²,
with variable a. The solution ak satisfies the normal equations, cf . (5.3),
ΦTk Φk ak = ΦTk 𝛽k ,
where
Φ_k = [ φ^T(x^1, u^1); ⋮ ; φ^T(x^r, u^r) ],   β_k = [ β_k^1; ⋮ ; β_k^r ],
with
β_k^s = (x^s)^T S x^s + (u^s)^T R u^s + ∑_{i=1}^{N−1} γ^i ( x_i^T S x_i + μ_k(x_i)^T R μ_k(x_i) ),
where x1 = Axs + Bus and xi+1 = Axi + B𝜇k (xi ) for i ∈ ℕN−2 with initial values xs , s ∈ ℕr . It is here
crucial to choose (xs , us ) such that ΦTk Φk is invertible. We realize that we need r ≥ 6 for this to hold.
Moreover, we need to choose (xs , us ) sufficiently different for ΦTk Φk to be well conditioned. For a
general n, we will need r ≥ (m + n)(m + n + 1)∕2.
From Example 5.7, we realize that the solution to (11.14) is given by
μ_{k+1}(x) = argmin_u Q̃_k(x, u, a_k) = −q̃_k^{−1} r̃_k^T x,
assuming that q̃_k is positive. Here q̃_k and r̃_k are defined from a_k. We may hence write
μ_{k+1}(x) = −L_{k+1} x,
where L_{k+1} = q̃_k^{−1} r̃_k^T. It is a good idea to start with some L_0 such that μ_0 is stabilizing.
11.3.2 SARSA-Algorithm
We will now present an alternative approach. We define ε : 𝒳_n × 𝒰_m × 𝒰_m × ℝ^p → ℝ as
ε(x, u, v, a) = Q̃(x, u, a) − f(x, u) − γ Q̃( F(x, u), v, a ).
Then the error we obtain when we approximate Qk in (11.12) with Q ̃ can be written as
𝜀(x, u, 𝜇(F(x, u)), a), where 𝜇 = 𝜇k . We then define the LS problem
minimize   (1/2) ∑_{i=0}^{N−1} ε( x_i, u_i, μ_k(x_{i+1}), a )²,
with variable a and solution a_k, where
x_{i+1} = F(x_i, u_i)
u_i = μ_k(x_i)
with x0 given. Then 𝜇 is updated as in (11.14). For each value of k, we should choose different
initial values x0 . This should be done such that many different states and controls are explored. It is
common to add a realization 𝑤 of a zero mean white noise random process W to the control signal
in order to obtain more exploration, i.e. we use
x_{i+1} = F(x_i, u_i + w_i)
u_i = μ_k(x_i).
It is important to gather enough information that the above LS problems have a unique solution.
Also, when updating a, the information from the previous batches should not be disregarded.
Hence, batch-recursive LS should be used.
Sometimes k = i is used, i.e. the policy improvement step is carried out for each i. Then a pure
recursive LS technique can be used to solve the LS problem. These types of algorithms are often
referred to as state-action-reward-state-action (SARSA) algorithms. They are possible to run in real
time by letting xi+1 be given by measurements from a real system when the control signal ui has
been applied. This will result in what is known as adaptive control, see [7].
Exact recursive LS techniques are only available for linear LS problems, but there are approximate
techniques available for nonlinear LS problems for the case when Q̃ is a nonlinear function.
It should be stressed that the theoretical convergence properties of the above schemes are not well
understood.
Example 11.4 We will in this example consider the optimal control problem in Example 8.4.
We will specifically consider the case when m = 1 and n = 2. Since we know that the value function
is quadratic, we will use a feature vector that is φ(x, u) = (x_1², x_2², u², 2x_1x_2, 2x_1u, 2x_2u, 1), where
x = (x_1, x_2) ∈ ℝ² and u ∈ ℝ. Notice that the indices refer to components of the vector and not to
time. We let
Q̃(x, u, a) = a^T φ(x, u),
where a ∈ ℝ⁷. With P̃ ∈ 𝕊², r̃ ∈ ℝ², and q̃ ∈ ℝ defined as
[ P̃  r̃ ; r̃^T  q̃ ] = [ a_1  a_4  a_5 ; a_4  a_2  a_6 ; a_5  a_6  a_3 ],
we may write
Q̃_k(x, u, a) = [x; u]^T [ P̃  r̃ ; r̃^T  q̃ ] [x; u] + a_7.     (11.16)
Here we have added a constant term compared to what we did in Example 11.3, since we have a
random process W involved. From Example 5.7, we realize that the solution to (11.14) is given by
μ_{k+1}(x) = argmin_u Q̃_k(x, u, a_k) = −q̃_k^{−1} r̃_k^T x,
assuming that q̃_k is positive. Here q̃_k and r̃_k are defined from a_k. We may hence write
μ_{k+1}(x) = −L_{k+1} x,
where L_{k+1} = q̃_k^{−1} r̃_k^T. With
x̄_{i+1} = φ(x_i, u_i) − γ φ(x_{i+1}, −L_i x_{i+1}),
and
ȳ_{i+1} = x_i^T S x_i + u_i^T R u_i,
we may write the residual as
ε(x_i, u_i, −L_i x_{i+1}, a) = x̄_{i+1}^T a − ȳ_{i+1},
and hence, the LS problem reads
minimize   (1/2) ∑_{i=1}^N ( x̄_i^T a − ȳ_i )²,
where we have shifted the summation index. We now use recursive LS as in Exercise 10.2 to update
the parameter a for an LS problem with N terms to an LS problem with N + 1 terms in order to
obtain adaptive control. The recursion for this reads
a_{N+1} = a_N + P_{N+1} x̄_{N+1} ( ȳ_{N+1} − x̄_{N+1}^T a_N ),
where
P_{N+1} = P_N − ( 1/(1 + x̄_{N+1}^T P_N x̄_{N+1}) ) P_N x̄_{N+1} x̄_{N+1}^T P_N.
For the definition of P_N, see the solution of Exercise 10.2. We then have that L_{N+1} = q̃_{N+1}^{−1} r̃_{N+1}^T, where
q̃ N+1 and r̃ N+1 are defined by aN+1 . We may start with L0 = 0 and a0 = 0 if we have no better initial
guess. We should take P0 to have a large value, e.g. a large multiple of the identity matrix.
parameters of the linear regression or the ANN, see Section 11.5. One advantage of this is that no
optimization has to be carried out in real time when the controller is running, since the feedback
function will be known explicitly in terms of the linear regression or ANN. Also, it will in many
cases improve the speed of the learning.
In Section 8.6, we showed how to formulate the Bellman equation as an LP. Our intention is now
to show that this is also possible to do for the Bellman equation for the Q-function. To this end, we
start the VI in (11.9) with a Q_0 such that Q_1 = T_Q(Q_0) ≥ Q_0 for all (x, u) ∈ 𝒳_n × 𝒰_m. One possible
choice is Q_0 = 0. We obtain T_Q(Q_1) ≥ T_Q(Q_0) from the monotonicity property, see Exercise 11.6,
and hence Q_2 = T_Q(Q_1) ≥ Q_1 ≥ Q_0. If we repeat this, we obtain Q_k = T_Q(Q_{k−1}) ≥ Q_{k−1} ≥ Q_0.
Since Q_k converges to the solution Q of (11.7), we have shown that Q is the maximum element,
see [22, p. 45], of the set of functions Q that satisfy the linear inequalities
Q(x, u) ≤ f(x, u) + γ Q( F(x, u), v ),   ∀(x, u, v) ∈ 𝒳_n × 𝒰_m × 𝒰_m.
The maximum element can be obtained by solving the LP
maximize   ∑_{(x,u) ∈ 𝒳_n × 𝒰_m} c(x, u) Q(x, u)     (11.17)
minimize   (1/2) ∑_{k=1}^N ||u_k − μ̃(x_k, a)||₂²,
with respect to a, where uk , k ∈ ℕN are solutions to (11.19) for the samples x = xk . One benefit of
the approach is that fewer optimization problems need to be solved when the policy is implemented
in real time, and moreover evaluating 𝜇̃ might be done much faster than solving the optimiza-
tion problem. Hence, smaller sampling times in real-time applications are possible, which may be
important for some applications. Notice that this approach can be used for any policy for which we
know its values for a discrete number of samples, i.e. it does not necessarily have to be related to
(11.19).
From this we realize that not only the states but also their derivatives can be obtained from sim-
ulation using the same dynamical equations or an experiment involving the dynamical system.
For the gradient, one simulation or experiment has to be carried out for each value of l and each
component of the control signal. In case Ak and Bk do not depend on k, the time-invariant case, we
realize that
dx_k/du_l^T = dx_{k−l}/du_0^T,
since the initial value is zero. For the linear time-invariant case, this means that one just needs to
obtain the so-called “impulse response” of the dynamical system. It is of course debatable if it is a
good idea to carry out impulse response experiments to obtain the impulse response. The reason is
that the impulse response is given by A^{k−1}B, k ≥ 1, and hence it can be computed from knowledge
of the system matrices (A, B). They can as will be discussed in Chapter 12 be obtained using system
identification. To use an impulse as the input signal for system identification is normally not a good
experiment design, see Section 12.9. However, using system identification in conjunction with the
methods of Chapter 8 is a more involved approach than just using an impulse response which can
be obtained from a very simple experiment.
For the case of nonlinear Fk , the nonlinear dynamical equations may be used instead. The result
will then not be exact, but only hold approximately, i.e.
dx_{k+1}/du_{i,j} ≈ F_k( dx_k/du_{i,j}, du_k/du_{i,j} ).
Here index i refers to stage index and index j to component index. This all assumes that the coor-
dinate system has been chosen such that Fk (0, 0) = 0. For the nonlinear time-invariant case one
simulation for each component of uk is enough. What has been presented above is strongly related
to what is called iterative learning control (ILC), see, e.g. [49, 77]. This approach is typically used in
applications where the same repeated task is carried out over and over again. Any reference value
is incorporated in the definition of the incremental costs fk . In case the whole state cannot be mea-
sured, it is still possible to use impulse responses for obtaining derivatives assuming that only what
is measured is penalized in the objective function, see Exercise 11.12.
is zero for a given so-called reference value r ∈ ℝNm for y. The following root-finding algorithm,
see Appendix 11.6:
uk+1 = uk − t𝜀(uk ),
will be used. Here the subindex k refers to iteration index and not to stage index or time. Notice
that the computations involved in the root-finding algorithm only require us to be able to evaluate
the error e for a known input u. This can be done with simulations or with experiments on a real
dynamical system, i.e. no explicit knowledge of the functions Fk and Gk is needed.
The assumptions for convergence are that 𝜀 is Lipschitz continuous and strongly monotone.
These conditions are not easy to investigate for a general H. However, we may phrase them in
systems theory terms. The Lipschitz constant 𝛽 is the incremental gain of the dynamical system,
i.e. the smallest 𝛽 such that
||H(u) − H(𝑣)||2 ≤ 𝛽||u − 𝑣||2 , ∀u, 𝑣 ∈ ℝNm .
The strong monotonicity condition is the same as saying that the dynamical system is incrementally
strictly passive with dissipation 𝛼, i.e.
(H(u) − H(v))^T (u − v) ≥ α ||u − v||₂²,   ∀u, v ∈ ℝ^{Nm}.
Example 11.5 We now assume that F_k(x, u) = Ax + Bu and that G_k(x, u) = Cx + Du for matrices
(A, B, C, D) of compatible dimensions. Then H is a linear function, and we may instead write
y = Hu. Here H is the matrix defined as
H = ⎡ h_0      0    ⋯    0   ⎤
    ⎢ h_1      ⋱    ⋱    ⋮   ⎥
    ⎢ ⋮        ⋱    ⋱    0   ⎥
    ⎣ h_{N−1}  ⋯    h_1  h_0 ⎦ ,
where h_0 = D and h_i = CA^{i−1}B ∈ ℝ^{m×m} for i ∈ ℕ are the Markov parameters, or equivalently the
impulse response coefficients for the linear dynamical system. It is straightforward to see that the
Lipschitz constant is 𝛽 = ||H||2 , i.e. the largest singular value of H. Moreover, the criterion for
strong monotonicity is that
(u − v)^T H^T (u − v) ≥ α ||u − v||₂²,   ∀u, v ∈ ℝ^{Nm}.
We realize that this is equivalent to
(1/2) x^T ( H + H^T ) x ≥ α ||x||₂²,   ∀x ∈ ℝ^{Nm},
which is satisfied if and only if the smallest eigenvalue 𝜆min of H + H T is greater than or equal to
2𝛼. If we then denote the largest eigenvalue of H T H with 𝜆max it follows that the above algorithm
converges for t ∈ (0, 4∕(𝜆min + 2𝜆max )], assuming that 𝜆min > 0, see Appendix 11.6.
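For the linear case, the resulting ILC iteration is a few lines of MATLAB. The sketch below is not from the book; the SISO system, the reference, and the step size are assumptions chosen so that the iteration is contractive.

A = 0.5; B = 1; C = 0.4; D = 1; N = 50;      % an assumed stable SISO system
h = [D, C*(A.^(0:N-2))*B];                   % Markov parameters h_0, ..., h_{N-1}
H = toeplitz(h, [h(1) zeros(1,N-1)]);        % lower triangular Toeplitz matrix
r = ones(N,1);                               % assumed reference for the output
u = zeros(N,1); t = 0.5;                     % assumed step size
for k = 1:100
    y = H*u;                                 % simulate, or run an experiment
    u = u - t*(y - r);                       % u_{k+1} = u_k - t*eps(u_k)
end
norm(H*u - r)                                % the tracking error is small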
of this, it is also possible to use feedback in order to make H itself close to the identity matrix,
see e.g. [23].
We realize that there is a strong advantage of using the root-finding algorithm, since no gradients
are needed, as would be the case for formulations involving minimization of, e.g. ε^T ε. However, this
comes at the price of assumptions on Lipschitz continuity and strong monotonicity.
Example 11.6 For the case when F_k(x, u) = A_k x + B_k u, where A_k ∈ ℝ^{n×n} and B_k ∈ ℝ^{n×m}, and
when μ(x, a) = Lx, where L ∈ ℝ^{m×n} with a^T = [L_1 ⋯ L_m], where L_i are the rows of L, we have
dx_{k+1}/da^T = A_k dx_k/da^T + B_k du_k/da^T
du_k/da^T = L dx_k/da^T + blkdiag(x_k^T, …, x_k^T).
We realize that the derivatives are obtained by simulation of the closed-loop system or from exper-
iments involving the closed-loop system with the current xk as an additional input.
The results of the example also hold true approximately for the nonlinear case when Fk (0, 0) is
zero. Clearly, many simulations have to be carried out since one simulation is needed for every
component of a, i.e. there are p = mn simulations. This idea is the basis for what is called iterative
feedback tuning (IFT). There, output feedback is often considered, and reference values for the output
signal are defined explicitly. Then it is possible with transfer function manipulations to show
that only two additional simulations are needed for the case when m = 1 and when there is only
one output signal, see e.g. [56].
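The recursions in Example 11.6 are straightforward to simulate. The following MATLAB sketch propagates dx_k/da^T for a made-up time-invariant system with m = 1; the matrices, policy, and horizon are assumptions used only for the illustration.

% Made-up linear system and current policy u_k = L*x_k (m = 1).
n = 2; N = 20;
A = [0.9 0.2; 0 0.8]; B = [0; 1]; L = [-0.3 -0.4];
x    = [1; -1];                  % state trajectory starting at x_0
dxda = zeros(n, n);              % dx_k/da^T, where a = L^T has n components
for k = 0:N-1
    duda = L*dxda + x.';         % du_k/da^T = L*dx_k/da^T + x_k^T (one block since m = 1)
    dxda = A*dxda + B*duda;      % dx_{k+1}/da^T = A*dx_k/da^T + B*du_k/da^T
    x    = A*x + B*(L*x);        % closed-loop state update
end
% The same quantities could also be generated by experiments on the closed-loop
% system with x_k injected as an additional input, as discussed above.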
Before we end this section, it should be mentioned that just using different initial values to obtain
good experimental conditions for the simulations might not be a good idea. It might be the case
that the simulations are experiments where one does not control the initial values. Also, large initial
values might be needed to obtain informative data, in which case the approximations for nonlinear state
dynamics might not be good. A remedy to this is to inject a perturbation by changing the dynamical
equations to
xk+1 = Fk (xk , uk + 𝑤k ),
where 𝑤k is a realization of white noise. This should only be done in the first experiment in each
iteration for IFT. Sometimes the experimental conditions are such that one has to live with pertur-
bations also in the other experiments. One should then resort to stochastic optimization methods
as discussed in Section 6.8. We will discuss this more in an exercise.
where the random variable X0 is the initial state of the process. We consider an unbiased estimator
Gk of g(xk ) with bounded variance, i.e.
𝔼[ G_k | X_k = x_k ] = g(x_k),   𝔼[ ||G_k − g(x_k)||_2^2 | X_k = x_k ] ≤ c^2,   (11.30)
for all k and for some scalar c ≥ 0. Note that in the special case where c = 0, the iteration (11.28) is
essentially the root-finding method (11.26). We note that G(x, 𝜉) is a random variable, and it is an
unbiased estimator of g(x) and hence, it is natural to choose Gk = G(Xk , 𝜉k ), where 𝜉k has the same
distribution as 𝜉 for all k ≥ 0, and where 𝜉k and 𝜉j are independent for j ≠ k. We assume that g
satisfies (11.24) and (11.25). For a constant step size tk = t, it then follows from the proof preceding
(6.71) that (11.28) converges in mean square to a ball centered at x⋆ , where g(x⋆ ) = 0, and with
a radius that is proportional to the constant step size t. A more elaborate analysis can be done to
show that we have convergence in mean square to x⋆ for a diminishing step size.
Exercises
11.1 Let A ∈ ℝm×n and B ∈ ℝn×m , where m ≤ n. Show that the eigenvalues of BA are the eigen-
values of AB together with n − m zero eigenvalues.
11.2 We will in this exercise compare the fitted value iterations in Example 8.3 with the fitted
Q-function iterations in Example 11.1 for a finite horizon LQ control problem. Implement
the two algorithms in MATLAB and investigate their performance for the case when
A_k = [ 0.5  1 ; 0  0.5 ],   B_k = [ 0 ; 1 ],
and Rk = 1 and Sk = I. Consider a time horizon of N = 10. Also compare the resulting feed-
back gains Lk you obtain with the ones obtained from the Riccati recursion in Example 8.1.
11.3 Show that the Q-functions defined in (8.32) satisfy the recursion
Q_k(x, ū) = 𝔼[ f_k(x, ū) + min_u Q_{k+1}( F_k(x, ū, W_k), u ) ].   (11.31)
11.4 We now investigate how to do fitted Q-function iterations for the stochastic LQ problem in
Example 8.10. We then need to extend Q̃_k in Example 11.1 with a term a_7, i.e.
Q̃_k(x, u, a) = [ x ; u ]^T [ P̄  r̄ ; r̄^T  q̄ ] [ x ; u ] + a_7.
Make the necessary additional modifications to accommodate this in the algorithm.
Specifically, you should replace x_+^s with x_+^s = Ax_k^s + Bu_k^s + 𝑤_k, where 𝑤_k is a realization
from a white noise random process W with unit variance. Use the same problem data as in
Exercise 11.2 and run the algorithm in MATLAB. We know that the same feedback gains Lk
are optimal for both the stochastic and deterministic case by certainty equivalence. Do you
get the same solution? Does the quality of the solution depend on the number of samples
you consider?
11.6 (a) Show that the Bellman Q-operator in (11.8) is such that
T_Q(Q_1)(x, ū) ≤ T_Q(Q_2)(x, ū)
for all (x, ū) ∈ ℝ^n × ℝ^m if Q_1(x, ū) ≤ Q_2(x, ū) for all (x, ū) ∈ ℝ^n × ℝ^m. Also, show that
it is a contraction.
(b) We let the stochastic Bellman Q-operator 𝒯_Q be defined as
𝒯_Q(Q)(x, ū) = 𝔼[ f(x, ū) + min_u 𝛾Q( F(x, ū, W), u ) ],   (11.33)
where F ∶ ℝ^n × ℝ^m × ℝ^p → ℝ^n, and the expectation is with respect to W. Show that the same
results as in the previous subexercise also hold for the stochastic case.
11.8 We now consider a stochastic version of the problem in the previous exercise. This is
obtained by defining F(x, u, 𝑤) as
         u = −1   u = 0   u = 1
x = −1     −1      −1       0
x = 0      −1       0       1
x = 1       0       1       1
for 𝑤 = 0, as
         u = −1   u = 0   u = 1
x = −1     −1       0       1
x = 0       0       1       1
x = 1       1       1       1
for 𝑤 = 1, and as
         u = −1   u = 0   u = 1
x = −1     −1      −1      −1
x = 0      −1      −1       0
x = 1      −1       0       1
for 𝑤 = −1. We assume that the random variable Wk takes the values (−1, 0, 1) with prob-
abilities (p, 1 − 2p, p), where p ∈ [0, 0.5]. Notice that the deterministic case is recovered for
p = 0.
(a) Solve the problem using (11.11) modified to the stochastic setting as discussed at the
end of Section 11.2. For what values of 𝛾 and p do you obtain a nontrivial solution?
(b) Solve the problem using the LP formulation in Section 11.4 for the Q-function. See the
end of Section 11.4 for how to modify the LP to the stochastic case.
Q_𝜇(Q) = 𝔼[ f(x, u) + 𝛾Q( F(x, u, W), 𝜇(F(x, u, W)) ) ],
where F ∶ ℝ^n × ℝ^m × ℝ^p → ℝ^n, and the expectation is with respect to W. Show that if
Q_1(x, u) ≤ Q_2(x, u) for all (x, u) ∈ ℝ^n × ℝ^m, then
Q_𝜇(Q_1) − Q_𝜇(Q_2) ≤ 0.
Also show that Q_𝜇 is a contraction.
11.11 In this exercise, we consider the same stochastic LQ problem as in the previous subexercise.
Implement the approach in Example 11.4. Are you able to obtain convergence of Lk to the
correct value? Does it depend on the initial value P0 ?
11.12 We are given matrices A ∈ ℝn×n , B ∈ ℝn×1 , C ∈ ℝ1×n and D ∈ ℝ. We consider the following
optimal control problem:
minimize   (1/2) ∑_{k=0}^{N} (y_k − r_k)^2
subject to xk+1 = Axk + Buk , k ∈ ℤN−1
yk = Cxk + Duk , k ∈ ℤN ,
for a given initial value x0 with variables (u0 , x1 , y1 … , uN−1 , xN , yN ). Denote by J the objec-
tive function. Let u = (u0 , … , uN ) and y = (y0 , … , yN ).
(a) Show that
dy/du^T = H,
where
H = [ D            0    ⋯    0
      CB           ⋱    ⋱    ⋮
      ⋮            ⋱    ⋱    0
      CA^{N−1}B    ⋯    CB   D ].
(b) Show that the optimality conditions for the optimal control problem are
dJ/du = ( dy/du^T )^T (y − r) = 0,
where
y = 𝒪x_0 + Hu,
with
𝒪 = [ C
      CA
      ⋮
      CA^N ].
(c) Show that the optimal u is the solution to
H^T Hu = H^T r,
if x0 = 0.
11.13 We are given matrices A ∈ ℝn×n , B ∈ ℝn×1 , C ∈ ℝ1×n and D ∈ ℝ. We consider the following
stochastic optimal control problem:
minimize   (1/2) 𝔼[ ∑_{k=0}^{N} (Y_k − r_k)^2 ]
previous exercise. We now want to use a stochastic gradient method to find a solution to the
problem.
(a) Show that for arbitrary u we have that
Y = 𝒪x_0 + Hu + 𝒢W + E,
where
𝒢 = [ 0           0   ⋯   0
      C           ⋱   ⋱   ⋮
      ⋮           ⋱   ⋱   0
      CA^{N−1}    ⋯   C   0 ].
Here 𝒪 and H are defined as in the previous exercise.
(b) We now define
dŶ/du^T = [ Y_imp   SY_imp   ⋯   S^{N−1}Y_imp ],
where S is a shift matrix that is all zeros except for ones on the first subdiagonal, and where
Y_imp = 𝒪x_0 + H_1 + 𝒢W + E
is the impulse response, where H_1 is the first column of H. Notice that this is
obviously not the true value of the Jacobian dY/du^T = H. However, it can be obtained from
an impulse response experiment. Let
G = ( dŶ/du^T )^T (Y − r),
and show that this is an unbiased estimator of the gradient of the objective function
for the stochastic optimal control problem when x_0 = 0. You may assume that we have
evaluated Y and Y_imp for different uncorrelated noise sequences.
(c) Consider a simple example where
A = [ 0.5  1 ; 0  0.5 ],   B = [ 1 ; 0 ],   C = [ 0  1 ],   D = 2.
Let N = 50, generate a reference value given by
r_k = 10 for 0 ≤ k ≤ 24 and r_k = −10 for 25 ≤ k ≤ 50,
assume that the initial value x0 is zero, and let the noise sequences both have zero
mean, unit variance and have Gaussian distribution. Solve the stochastic optimal con-
trol problem with a stochastic gradient method using the gradient suggested above.
11.14 We will now investigate a stochastic version of Example 11.5. To this end, we consider the
Markov process in the previous exercise for which we have that
Y = 𝒪x_0 + Hu + 𝒢W + E,
where W and E are zero mean random vectors. We define the error function
𝜀(u, W, E) = Y − r = 𝒪x_0 + Hu + 𝒢W + E − r,
similarly as for the deterministic case, and then we apply the root-finding algorithm
uk+1 = uk − tgk ,
where gk = yk − r, and yk is a realization of Y for the kth iteration. Notice that the subindex
k does not refer to the stage index for the Markov process.
(a) Show that 𝜀(u, W, E) is an unbiased estimator of 𝔼[ 𝜀(u, W, E) ].
(b) Consider the same numerical example as in the last subexercise of the previous
exercise. Compute the largest singular value of H and the smallest eigenvalue of
H + H T . What is the largest possible step-length that can be used in order to guarantee
convergence in m.s. to a ball centered around the true solution?
(c) Use the stochastic root-finding algorithm applied to the same Markov process as in
the last subexercise in the previous exercise. Use the same reference value. Compare
the convergence for the root-finding algorithm with that of the stochastic gradient
algorithm of the previous exercise. Experiment both with fixed step-size and with
diminishing step-size for the root-finding algorithm. You may assume that x0 = 0.
11.15 We are given matrices A ∈ ℝn×n , B ∈ ℝn×1 , C ∈ ℝ1×n and D ∈ ℝ. We consider the following
stochastic optimal control problem:
minimize   (1/2) 𝔼[ ∑_{k=0}^{N} (Y_k − r_k)^2 ]
where
dX_{k+1}/da_i = A dX_k/da_i + B du_k/da_i + W_k
du_k/da_i = L dX_k/da_i + X_{i,k},
where X_{i,k} is the ith component of X_k for i ∈ ℤ_n. The initial value is dX_0/da^T = 0.
(b) Solve again the same stochastic control problem numerically that we have considered
in the previous two exercises. This time use a stochastic gradient method to compute
the optimal policy discussed above. You may assume that x_0 = 0. You should in each
iteration use three independent noise sequences for computing X_k and dX_k/da_i, and the
noise sequences should of course also be independent for each iteration. Are you able
to obtain as small a value of the objective function as you were before? If not, why? How
does the optimal policy perform for a different reference value than the one it was trained
for? How do the approaches in the previous two exercises perform for another reference
value in case no new training is performed for this new reference value?
11.16 Use the reinforcement learning toolbox in MATLAB to solve the shortest path problem
in Example 8.2 using Q-learning. Use the MDP environment. You need to code the states
and actions differently as compared to what is done in the example. You should label the
states from 1 to 14. You also need to label the actions with positive integers, say 1, 2, and 3.
Moreover, you need to code what happens if you, for example, choose action 1 when you
are at stage k = 1 in node 1. A good idea is to say that you stay in the same node and that
the distance is very long, say 10. In this way, this action cannot be optimal. Also remember
that the rewards in reinforcement learning are the negative of the costs in optimal control,
and hence, all the rewards you code have to be the negative of the distances.
12
System Identification
System identification is about learning models for dynamical systems. Examples include the flow of
water in a river, the planetary motions, and the number of cars on a segment of a freeway. We will
in this chapter limit ourselves to discrete-time dynamical systems. However, many of the results
can be generalized to continuous-time dynamical systems using sampling. The origin of system
identification goes back to the work by Karl Åström and Torsten Bohlin in 1965.
We start by defining what we mean by a dynamical system in state-space form. We define a regres-
sion problem for learning/estimating the dynamical system, and specifically, we define it as an
ML problem. From this, we then derive input–output models and the corresponding ML problem.
We discuss in detail how the parameters can be estimated by solving a nonlinear LS problem.
Then, we discuss how to estimate the model when some of the data are missing. Nuclear norm
system identification is also discussed in this context. Prior information can be incorporated into
system identification easily using Gaussian processes and empirical Bayes. We show how this can
be implemented using the sequential convex optimization technique based on the majorization
minimization principle. Recurrent neural networks and temporal convolutional neural networks
are shown to be generalizations of the linear dynamical models to nonlinear dynamical models.
The chapter is finished off with a discussion on experiment design for system identification.
where K ∈ ℝn×p . The variables x and e in the innovation model are not the same x and e as used
in the previous model. For details about how they are related, see Exercise 3.17. There is no loss in
generality to consider the innovations form. Then we define 𝜃 = (A, B, C, D, K, R2 , x0 ), and the ML
problem for identification is now to solve
minimize   (1/2) ∑_{k=0}^{N} e_k^T R_2^{−1} e_k + ((N+1)/2) ln det R_2
subject to  x_{k+1} = Ax_k + Bu_k + Ke_k,  k ∈ ℤ_{N−1}                    (12.7)
            y_k = Cx_k + Du_k + e_k,  k ∈ ℤ_N
with variables (𝜃, x, e). For the case when p = 1, it follows that we can solve a constrained LS prob-
lem with no weighting with R2 , and then estimate R2 as eT e∕(N + 1), where e is the optimal solution,
cf . Section 9.8. The optimization problem is however not convex due to bilinearity of the variables
in the constraints. Even worse is the fact that there are uncountably many solutions to the prob-
lem because of the fact that the input–output relations are unaffected by state transformations, cf .
Exercise 2.12. How to circumvent the latter problem will be discussed next. We will from now on
restrict ourselves to the case when m = p = 1. Most of the results presented below carry over to the
general case with some slight modifications.
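In practice, problems of this type are attacked with dedicated software. As a small illustration, the following MATLAB sketch, which assumes that the System Identification Toolbox is installed and uses made-up data, estimates an innovations-form state-space model with ssest; it is a practical counterpart to (12.7) rather than a direct implementation of it.

% Made-up data from a first-order system, followed by state-space estimation.
N = 500;
u = randn(N,1);
y = filter([0 0.5], [1 -0.7], u) + 0.1*randn(N,1);
dat = iddata(y, u, 1);               % input-output data with sampling time 1
sys = ssest(dat, 2, 'Ts', 1);        % discrete-time innovations-form model of order 2
compare(dat, sys)                    % assess the fit of the estimated model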
Multiply this equation with z−k ∕(2𝜋i) for k = −n, −n + 1, … N − n and integrate along the unit cir-
cle C in the complex plane.1 Then, since
∮_C z^k dz = 2𝜋i for k = −1, and ∮_C z^k dz = 0 for k ≠ −1,
it follows that
Ta y = 𝜉 + Tb u + Tc e, (12.8)
where y = (y0 , … , yN ), u = (u0 , … , uN ), and e = (e0 , … , eN ), and where
Ta = I + a1 S + · · · + an−1 Sn−1 + an Sn
Tb = b0 I + b1 S + · · · + bn−1 Sn−1 + bn Sn
Tc = I + c1 S + · · · + cn−1 Sn−1 + cn Sn
are Toeplitz matrices with the shift matrix S ∈ ℝ(N+1)×(N+1) being a matrix of all zeros except for
ones on the first subdiagonal, cf . (2.24). The vector 𝜉 ∈ ℝN+1 is a vector of all zeros except for
the first n elements, which are functions of the initial value x0 . We may hence write 𝜉 = (𝜉0 , 0),
where 𝜉0 ∈ ℝn . Each and every row in the above equation is related to a specific k above. Except
for the first n rows in the above equation, one may equivalently write
𝒜(q)y_k = ℬ(q)u_k + 𝒞(q)e_k,
where we have introduced the shift operator q, which shifts the time index of a signal,
i.e. qy_k = y_{k+1}. This is called an autoregressive-moving-average model with exogenous terms
(ARMAX). The case when 𝒞(q) = 1 is called an autoregressive model with exogenous terms
(ARX), the case when 𝒞(q) = 𝒜(q) is called an output error (OE) model, and the case when
𝒜(q) = 𝒞(q) = 1 is called a finite impulse response (FIR) model.
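As a small numerical illustration of the Toeplitz operator notation, the following MATLAB sketch assembles T_a, T_b, and T_c from coefficient vectors and generates data according to (12.8); the coefficient values are made up for the example (n = 1).

% Made-up ARMAX coefficients and data generated via (12.8).
N = 100;
a = 0.7; b = [0.7 0]; c = 0.5;
S  = diag(ones(N,1), -1);                  % (N+1)x(N+1) shift matrix, cf. (2.24)
Ta = eye(N+1) + a(1)*S;                    % Ta = I + a_1*S + ... + a_n*S^n
Tb = b(1)*eye(N+1) + b(2)*S;               % Tb = b_0*I + b_1*S + ... + b_n*S^n
Tc = eye(N+1) + c(1)*S;                    % Tc = I + c_1*S + ... + c_n*S^n
u  = randn(N+1,1); e = randn(N+1,1);
xi = zeros(N+1,1);                         % zero initial condition
y  = Ta \ (xi + Tb*u + Tc*e);              % output according to (12.8)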
cf . Section 6.9, where in every other step (ym , 𝜉0 ) and (a, b, c) are fixed, respectively. When (ym , 𝜉0 )
is fixed, we have a standard system identification problem with no missing data and known initial
value. When (a, b, c) is fixed, we have a linear LS problem for (ym , 𝜉0 ).
Also, we define 𝒯_m = T_y T_m^T. Then it can be shown, see [52, 113], that the ML problem is equivalent to
minimize   (1/2) ‖ ( det 𝒯_m^T 𝒯_m )^{1/(2n_o)} ( T_y y − T_i 𝜉_0 − T_u u ) ‖_2^2     (12.12)
with variables (𝜃, y_m), where n_o is the number of observed outputs. The difference as compared
to (12.11) is the factor ( det 𝒯_m^T 𝒯_m )^{1/(2n_o)}, which makes the optimization problem much more challenging. It is still a separable nonlinear LS problem. For the case when also inputs are missing, see
[52, 113].
Example 12.1 In this example, we consider identification of an ARMAX model for which a = 0.7,
b = (0.7, 0), and c = 0.5. The details of the experimental conditions are given in Exercise 12.1.
In total, 40% of the data is missing. In Figure 12.1, the results of 100 experiments are presented
for the estimates of a and c when using both the criterion in (12.11) and the one in (12.12). It is
seen that the first criterion results in biased estimates, i.e. estimates that are not centered around
the true values. This is not the case for the second criterion, which is the ML criterion.
Figure 12.1 Plots showing the estimated values of a and c for 100 runs of system identification when data
are missing. The crosses show the result when the criterion in (12.11) is used, and the circles shows the
result when the criterion in (12.12) is used. The true values are a = 0.7 and c = 0.5.
So far we have assumed that the system we would like to estimate a model for can be described
within the model class we consider, i.e. that both the model and the true system are a linear system
with state-dimension n. However, it can be shown that ARMAX systems can be arbitrarily well
approximated with ARX models, see [69], if the model order n in the model is taken large enough.
If both the number of data N and the model order n go to infinity, with N growing faster than n, then the
estimate is consistent. Since computing ARX models is equivalent to solving a linear LS problem,
the solution can be obtained both efficiently and accurately. A drawback with the approach of using
high model order is that even for a modest number of data, the variance of the estimate of the model
parameters will be large. A remedy to this is to model the dynamical system as a Gaussian process,
which by our previous discussions in Chapter 10 equivalently can be interpreted as regression in
a Hilbert space or as a MAP estimate. For simplicity, we will in this section only consider the FIR
case, but the extension to the ARX case is immediate.
y = 𝜉 + Tb u + e,
which is a special case of the regression model derived in Section 12.3. It models a FIR system.
Define
U = u0 I + u1 S + · · · + uN S N ,
where S is a lower triangular shift matrix. Now, we let Un+1 be the first n + 1 columns of U. Then
it holds with 𝜃 = (b, 𝜉0 ) that
y = Φ𝜃 + e,
where
Φ = [ U_{n+1}   J ],
with J = [ I  0 ]^T ∈ ℝ^{(N+1)×n}.
Section 10.3 if we take X = Φ and a = 𝜃. We now assume that e is the outcome of a zero mean
normally distributed random vector with covariance 𝜎 2 I and that 𝜃 is the outcome of a zero
mean normally distributed random vector with covariance Σ and independent of e. It then follows
that the estimate of 𝜃 is given by the conditional mean of 𝜃 given observations of y as
𝜃̂ = ΣΦ^T ( 𝜎^2 I + ΦΣΦ^T )^{−1} y,
similarly as in Section 10.3. It can be shown that this is a consistent estimate if Σ ∈ 𝕊^N_{++}, Φ^TΦ∕N
converges to an invertible matrix, and Φ^T e∕N converges to zero as N goes to infinity [26].
Parameterizing Σ with a hyperparameter 𝜂 ∈ ℝ^q, it follows that y is the outcome of a zero mean normally distributed random vector with covariance Z ∶ ℝ^q × ℝ_+ → 𝕊^{N+1}_{++} defined by
Z(𝜂, 𝜎^2) = ΦΣ(𝜂)Φ^T + 𝜎^2 I.
Therefore, the ML problem is equivalent to
minimize yT Z(𝜂, 𝜎 2 )−1 y + ln det Z(𝜂, 𝜎 2 ),
with variables (𝜂, 𝜎 2 ), where we also consider 𝜎 2 as a hyper parameter. This approach is known as
Empirical Bayes or Type II Maximum Likelihood. The advantage with this parameterization is that
the ML criterion then is the difference between the two convex functions g ∶ ℝq → ℝ and h ∶ ℝq →
ℝ defined by g(𝜂) = yT Z(𝜂, 𝜎 2 )−1 y and h(𝜂) = − ln det Z(𝜂, 𝜎 2 ), see exercises 4.5 and 4.7. Hence,
the sequential convex optimization technique based on the majorization minimization principle
applies, see Section 6.2.
How should then a suitable subset be chosen? Some insight can be obtained from the fact that for
a FIR model, the parameter b is the impulse response of a linear dynamical system. One possible
choice is to take Σ_b ∶ ℝ^3 → 𝕊^{n+1}_+ defined by
{Σ_b(𝜂)}_{j,k} = 𝜆 𝛼^{(k+j)/2} 𝜌^{|j−k|},
for (j, k) ∈ ℕn+1 × ℕn+1 , where 𝜂 = (𝜆, 𝛼, 𝜌) with 𝜆 ∈ ℝ+ , 0 ≤ 𝛼 < 1, and |𝜌| ≤ 1, see [26]. This is
called a diagonal/correlated (DC) kernel.3 We then let Σ = bdiag(Σb , Σ𝜉 ), where Σ𝜉 ∈ 𝕊n+ . Here 𝛼
accounts for an exponential decay along the diagonal of Σb and 𝜌 describes the correlation between
neighboring impulse response coefficients. Exponential decay is expected for a stable linear system.
There are several other choices for Σb discussed in the above reference. One interesting choice is to
take Σ(𝜂) = ∑_{i=1}^{q} 𝜂_i Σ_i, where Σ_i ∈ 𝕊^{2n+1}_+ and 𝜂 ∈ ℝ^q_+. This is motivated by the fact that for a linear
dynamical system, the impulse response is the sum of the impulse responses of its partial fraction
expansion. The different Σi could be obtained from fixed DC kernels modeling the different modes
expected to be present in the model. For a survey on kernel methods in system identification, we
refer the reader to [88].
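To make the construction concrete, the following MATLAB sketch forms the regressor matrix Φ for an FIR model, builds a DC kernel, and computes the regularized estimate 𝜃̂; the data, the hyperparameters (𝜆, 𝛼, 𝜌, 𝜎²), and the choice Σ_𝜉 = I are all assumptions made for the illustration, and in practice the hyperparameters would be tuned by minimizing the Empirical Bayes criterion above.

% Made-up data and fixed DC-kernel hyperparameters.
N = 200; n = 50;
u = filter(0.3, [1 -0.7], randn(N+1,1));            % low-pass-like input
btrue = 0.5*0.8.^(0:n)';                            % made-up impulse response
y = filter(btrue, 1, u) + 0.1*randn(N+1,1);

S = diag(ones(N,1), -1);                            % shift matrix
U = zeros(N+1,N+1); v = u;
for k = 1:N+1, U(:,k) = v; v = S*v; end             % U = [u  Su  ...  S^N*u]
Phi = [U(:,1:n+1), [eye(n); zeros(N+1-n, n)]];      % Phi = [U_{n+1}  J]

lambda = 1; alpha = 0.9; rho = 0.95; sigma2 = 0.01;
[jj, kk] = ndgrid(1:n+1, 1:n+1);
Sigma_b = lambda*alpha.^((jj+kk)/2).*rho.^abs(jj-kk);   % DC kernel
Sigma   = blkdiag(Sigma_b, eye(n));                     % Sigma = bdiag(Sigma_b, Sigma_xi)
theta   = Sigma*Phi'*((sigma2*eye(N+1) + Phi*Sigma*Phi')\y);
bhat    = theta(1:n+1);                                 % regularized impulse response estimate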
We consider ek to be realizations of i.i.d. Gaussian random variables with zero mean and unit vari-
ance. The input signal uk is generated in a similar way. However, it is then low-pass filtered with a
filter with cut-off frequency of 0.9. We generate data for k ∈ ℕN , where N = 200. We then estimate
an FIR model of order n = 50 using Empirical Bayes with the DC kernel and another FIR model
without using Empirical Bayes. After this, we generate new data denoted (u, ̄ ȳ ) in the same way as
we generated the data (u, y) for estimation of the FIR model. We then generate ŷ from
ŷ_k = ∑_{j=1}^{n} b̂_j ū(k − j),
where b̂ j are the coefficients for the FIR models we estimated. This results in one ŷ for the Empirical
Bayes estimate and one for the estimate without using Empirical Bayes. We compare these ŷ with
one another and with ȳ in Figure 12.2. We see that Empirical Bayes results in a much better model,
since for that model ȳ and ŷ are much closer to one another.
Figure 12.2 The left plot shows ȳ k (solid), and ŷ (dashed), as function of k, for Empirical Bayes. The right
plot shows ȳ k (solid), and ŷ (dashed), as function of k, without Empirical Bayes.
The vector 𝜃 contains all the parameters that define the predictor. The idea in temporal convolu-
tional networks (TCNs) is to build up the function f using a tree structure, where each node in the
tree is defined by a nonlinear ARX-model as well. The formal definition is as follows: let xk = (yk , uk )
and let
ŷ_{k+1} = f^{(L)}( Z_k^{(L−1)} )
z_k^{(l)} = f^{(l)}( Z_k^{(l−1)} ),  l ∈ ℕ_{L−1}
z_k^{(0)} = x_k,
where
Z_k^{(l−1)} = ( z_k^{(l−1)}, z_{k−d_l}^{(l−1)}, …, z_{k−(n̄−1)d_l}^{(l−1)} ),
and where dl is the so-called dilation factor. Typically, dl increases exponentially with l, e.g.
d_l = 2^{l−1}. We notice that n − 1 = (n̄ − 1)d_L is the effective memory of the overall predictor. Each
function f (l) can be defined in many different ways, and we refer to [4] for some examples. A very
simple example would be to take it as a standard one-layer ANN. Because of the tree-structure of
the TCN, it is possible to introduce further parallelism when computing gradients as compared to
the back-propagation algorithm.
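The following MATLAB sketch evaluates a predictor of this form; the layer maps f^{(l)}, the dilations, and the data are made up, and for simplicity every layer is given the same output dimension.

% Made-up TCN-style predictor with dilations d_l = 2^(l-1).
Lno = 3; nbar = 2; d = 2.^(0:Lno-1);
f = cell(Lno,1);
for l = 1:Lno
    W = 0.1*randn(2, 2*nbar);             % made-up parameters of layer l
    f{l} = @(Z) tanh(W*Z(:));             % a simple one-layer map f^{(l)}
end
x = randn(2, 100);                        % x_k = (y_k, u_k), k = 1,...,100
z = x;                                    % z^{(0)} = x
for l = 1:Lno
    znew = zeros(2, size(z,2));
    for k = 1 + (nbar-1)*d(l) : size(z,2)
        Z = z(:, k - (0:nbar-1)*d(l));    % Z_k^{(l-1)} = (z_k, z_{k-d_l}, ...)
        znew(:,k) = f{l}(Z);
    end
    z = znew;                             % z^{(l)}
end
yhat = z(1, end);                         % prediction formed by the top layer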
4 The estimate 𝜃̂ satisfies the normal equations X T X 𝜃̂ = X T Y , where Y = X𝜃 + E with 𝜃 being the true parameter
value. Hence, X T X𝔼𝜃̂ = X T (X𝜃 + 𝔼E) = X T X𝜃. This shows that 𝔼𝜃̂ = 𝜃. Then the covariance is given by
𝔼[ (𝜃̂ − 𝔼𝜃̂)(𝜃̂ − 𝔼𝜃̂)^T ] = 𝔼[ (𝜃̂ − 𝜃)(𝜃̂ − 𝜃)^T ] = 𝔼[ (X^TX)^{−1}X^T EE^T X(X^TX)^{−1} ] = 𝜎^2 (X^TX)^{−1}.
where Ru ∶ ℤ → ℝ is the covariance function for the input signal of which uk is a realization, see
Sections 3.6 and 3.9, and (5.45). Our idea is now to find a good covariance function Ru for the input.
We consider the vector r ∈ ℝ^{n+1} defined as r = [ R_u(0) ⋯ R_u(n) ]^T as a variable that we are going to
find a good value for. We define the function R ∶ ℝ^{n+1} → 𝕊^{n+1}_+ as
R(r) = N [ r_1       ⋯   r_{n+1}
           ⋮         ⋱   ⋮
           r_{n+1}   ⋯   r_1 ],
which is a linear function of r. We also define the function P ∶ ℝn+1 → 𝕊+n+1 as
P(r) = 𝜎 2 R(r)−1 .
We have that P̄ ≈ P assuming that we are able to generate an input with the covariance function
defined by r.
We now give different measures of what is a good covariance P(r). The following quantities
should be small:
A-optimality: tr P(r)
D-optimality: ln det P(r)
E-optimality: 𝜆_max(P(r))
L-optimality: tr(WP(r)), where W ∈ 𝕊^{n+1}_{++}.
The reason for the name optimality above is that we would like to make the scalar functions of
r above small. Then we will have a small covariance of the estimate 𝜃̂. All optimality criteria can be
related to the eigenvalues of P(r), and the eigenvalues of a symmetric positive definite matrix are
related to the length of the different principal axes of the confidence ellipsoids for the estimate that
are given by {𝜃 ∈ ℝ^{n+1} | (𝜃̂ − 𝜃)^T P(r)^{−1}(𝜃̂ − 𝜃) ≤ 𝛼}, where 𝛼 ∈ ℝ_{++}. Hence, A-optimality tries to
make the sum of the lengths of principal axes as small as possible, and E-optimality tries to make
the largest principal axis as small as possible. Regarding L-optimality this is a scaled version of
A-optimality, and we will discuss why this is the case, and what it means later on. Since the volume
of an ellipsoid is proportional to the product of the eigenvalues it follows that D-optimality tries to
make the volume of the confidence ellipsoid as small as possible. Notice that we can make each of
the criteria arbitrarily small by choosing r big enough, i.e. by using a high power in the signal uk .
In order to avoid this trivial and uninteresting solution, we will impose a bound on r1 ≤ L, where
L ∈ ℝ++ . Then the signal-to-noise ratio (SNR) will be L∕𝜎 2 .
and we realize that Ru (k) = 0 for k > n. Here it holds that c0 = 1. The Z-transform Φu ∶ ℂ → ℂ of
the covariance function Ru is given by
Φ_u(z) = ∑_{k=−n}^{n} R_u(k) z^{−k}.
A necessary and sufficient condition for Ru or equivalently r to be valid is that Φu (ei𝜔 ) ≥ 0 for all
𝜔 ∈ ℝ. It is possible to characterize these r with a convex set. To this end, let
A = [ 0  0 ; I_{n−1}  0 ],   B = [ 1 ; 0 ],   C = [ r_2 ⋯ r_{n+1} ].
Define the function Ψ ∶ ℂ → ℂ as
Ψ(z) = C(zI − A)^{−1}B + (1/2) r_1.
Since
(zI − A) [ z^{−1} ; ⋮ ; z^{−n} ] = B,
it follows that
Ψ(z) = (1/2) r_1 + r_2 z^{−1} + ⋯ + r_{n+1} z^{−n} = (1/2) R_u(0) + R_u(1)z^{−1} + ⋯ + R_u(n)z^{−n}.
We now have that
Φu (z) = Ψ(z) + Ψ(1∕z).
We will see that Φu (ei𝜔 ) = Ψ(ei𝜔 ) + Ψ(e−i𝜔 ) ≥ 0 for all 𝜔 ∈ ℝ is equivalent to the existence of a
Q ∈ 𝕊n such that the following constraint holds
K(Q, r) = [ Q − A^TQA     C^T − A^TQB
            C − B^TQA     r_1 − B^TQB ] ∈ 𝕊^{n+1}_+,
where K ∶ 𝕊n × ℝn+1 → 𝕊n+1 . This equivalence is known as the positive real lemma. The set of Q
and r that satisfies this constraint is convex since 𝕊+n+1 is a convex cone, and the matrix K is affine in
Q and r. We will show one direction of the proof of the positive real lemma. Assume that K(Q, r) ∈
𝕊+n+1 . Notice that
K(Q, r) = [ Q    C^T
            C    r_1 ] − [ A  B ]^T Q [ A  B ],
and that
[ A  B ] [ (zI − A)^{−1}B ; 1 ] = z(zI − A)^{−1}B.
From this, it is possible to show that
[ (z^{−1}I − A)^{−1}B ; 1 ]^T K(Q, r) [ (zI − A)^{−1}B ; 1 ] = Φ_u(z),
from which the result follows that Φu (ei𝜔 ) ≥ 0 for all 𝜔 ∈ ℝ. For a proof in the other direction see,
e.g. [92].
12.9.2 D-Optimality
We will first look at D-optimality. It holds that ln det P(r) = ln det R(r)−1 + 2(n + 1) ln 𝜎 is a con-
vex function of r. Hence, minimizing it is tractable. We get the following equivalent optimization
problem, where we have removed the constant terms in the objective function:
minimize − ln det R(r)
subject to r1 ≤ L
K(Q, r) ∈ 𝕊+n+1
with variables r ∈ ℝn+1 and Q ∈ 𝕊n . This is a conic optimization problem. We will see that the other
optimization problems related to the other optimality criteria can also be cast as conic optimization
problems.
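As an illustration, the following MATLAB sketch, which assumes that CVX is installed, solves the D-optimality problem above for a made-up problem size; R(r) and K(Q, r) are assembled directly from their definitions.

% Made-up problem size and power bound for D-optimal input design.
n = 4; N = 100; Lb = 1;
A = [zeros(1,n); eye(n-1), zeros(n-1,1)];  % matrices appearing in K(Q,r)
B = [1; zeros(n-1,1)];
cvx_begin sdp
    variable r(n+1)
    variable Q(n,n) symmetric
    expression R(n+1,n+1)
    for i = 1:n+1
        for j = 1:n+1
            R(i,j) = N*r(abs(i-j)+1);      % R(r): symmetric Toeplitz with first column N*r
        end
    end
    C = r(2:n+1)';                         % C = [r_2 ... r_{n+1}]
    K = [Q - A'*Q*A,  C' - A'*Q*B;
         C - B'*Q*A,  r(1) - B'*Q*B];      % positive real lemma matrix K(Q,r)
    maximize( log_det(R) )                 % same as minimizing -ln det R(r)
    subject to
        r(1) <= Lb;
        K >= 0;
cvx_end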
12.9.3 E-Optimality
Minimizing a function f ∶ ℝn → ℝ over some set D ⊂ ℝn is equivalent to minimizing t over the
epigraph of f . The minimal value of f (x) will be equal to the minimal value of t. This is called the
epigraph formulation of the optimization problem. Thus, minimizing 𝜆max (P(r)) with respect to r
is equivalent to minimizing t subject to 𝜆max (P(r)) ≤ t or, equivalently, subject to 𝜆min (R(r)) ≥ 1∕t.
The smallest eigenvalue is greater than 1∕t if and only if all eigenvalues are greater than 1∕t, and it
can be shown that these inequalities hold if and only if R(r) − 1∕tI ∈ 𝕊+n+1 . Also notice that mini-
mizing t is the same as maximizing 1∕t. We now introduce a new variable s = 1∕t. Then notice that
maximizing s is the same as minimizing −s. Hence, the optimization problem for E-optimality can
be stated as
minimize − s
subject to r1 ≤ L
R(r) − sI ∈ 𝕊+n+1
K(Q, r) ∈ 𝕊+n+1
12.9.4 A-Optimality
Similarly to the epigraph formulation, it holds that minimizing a sum of functions ∑_{i=1}^{m} f_i(x),
where f_i ∶ ℝ^n → ℝ, is equivalent to minimizing ∑_{i=1}^{m} t_i subject to f_i(x) ≤ t_i, where t_i ∈ ℝ and where
i ∈ ℕ_m, cf. Exercise 4.2. We now apply this to the A-optimality criterion and obtain the optimization
problem:
minimize   ∑_{i=1}^{n+1} t_i
subject to r_1 ≤ L
           ( 𝜎^2 R(r)^{−1} )_{ii} ≤ t_i,  i ∈ ℕ_{n+1}
           K(Q, r) ∈ 𝕊^{n+1}_+
with variables r ∈ ℝn+1 , t ∈ ℝn+1 and Q ∈ 𝕊n . With ei ∈ ℝn+1 , i ∈ ℕn+1 being the standard basis
vectors for ℝ^{n+1}, we can write the last constraints as
t_i − e_i^T ( 𝜎^2 R(r)^{−1} ) e_i ≥ 0,  i ∈ ℕ_{n+1},
and hence, it follows that the above optimization problem is also equivalent to a conic optimization
problem.
12.9.5 L-Optimality
For L-optimality, we factorize W as W = V −1 V −T which can always be done with the help of, e.g. a
Cholesky factorization, see (2.45). Now,
tr(WP(r)) = tr( V^{−T} 𝜎^2 R(r)^{−1} V^{−1} ) = tr( 𝜎^2 ( VR(r)V^T )^{−1} ),
and hence, we realize that we obtain the optimization problem for L-optimality by replacing R(r)
in the optimization problem for A-optimality with VR(r)V T . Hence, this is just a weighted version
of A-optimality with the weight matrix W.
We will discuss an application where L-optimality comes up naturally. It is not necessarily the
case that we are interested in good values only of the parameters, but we might be more interested
in obtaining a small Mean Square Error (MSE) when we use the model for prediction with new
data. Hence, we define X̄ similarly as X in the beginning of this section, but for the typical input
signal for which we would like to have a good prediction. Then it can be shown5 that the MSE for
the predictor is
𝜎^2 tr( X̄(X^TX)^{−1}X̄^T ) + 𝜎^2 tr I,
and hence, the choice of W = X̄^T X̄ in L-optimal experiment design will result in minimizing the
MSE for the predictor. We can interpret W∕N as an estimate of the covariance 𝔼[ Ū Ū^T ], where Ū
is the input for which we would like to have a small MSE.
Since Ru is an even function, we have that Φu (z) = Φu (1∕z). Hence, if zi is a zero or pole of Φu so is
1∕zi . Based on this the following spectral factorization follows, i.e. we may write
Φu (z) = 𝜅H(z)H(1∕z),
5 Let the new data be generated from Ȳ = X̄𝜃 + Ē, where the random vector Ē has the same distribution as E. Also,
let us estimate Ȳ with Ŷ = X̄𝜃̂, where 𝜃̂ solves the normal equations X^TX𝜃̂ = X^TY, where Y = X𝜃 + E. Then
𝔼Ŷ = X̄𝔼𝜃̂ = X̄𝜃. Also, Ŷ − 𝔼Ŷ = X̄(𝜃̂ − 𝔼𝜃̂) = X̄(X^TX)^{−1}X^TE. Hence, tr 𝔼[ (Ȳ − Ŷ)(Ȳ − Ŷ)^T ] = tr 𝔼[ (Ȳ − 𝔼Ŷ +
𝔼Ŷ − Ŷ)(Ȳ − 𝔼Ŷ + 𝔼Ŷ − Ŷ)^T ] = 𝜎^2 tr( X̄(X^TX)^{−1}X̄^T ) + 𝜎^2 tr I.
where the right-hand side will also be a polynomial. Let us now take pi = 0, i ∈ ℕn . With C ∶ ℂ → ℂ
defined as the polynomial
where C̃ ∶ ℂ → ℂ is defined as C̃(z) = z^n C(1∕z). It follows that C̃ will have all its zeros outside
the unit disk, since C has all its zeros inside the unit disk. Moreover, H(z) = C(z)∕zn . From this, it
follows that the MA-process defined as
Uk = Vk + c1 Vk−1 + · · · + cn Vk−n ,
with V_k independent of V_j for j ≠ k and with variance 𝜅, will have the covariance function R_u.
12.9.7 OE Model
It turns out that the solutions to the above optimization problems for optimal experiment design
will all be trivial except for L-optimality. The optimal input will be white noise. However, for more
general models than FIR models this is not the case. We will here consider a special case of an
ARMAX model obtained by letting Tc = Ta in (12.8), i.e.
Ta y = Tb u + Ta e.
Here we have assumed that 𝜉 = 0. This model is an OE model. Since Ta is invertible we may write
y = Ta−1 Tb u + e.
We now consider T_a to be a function of a = [ a_1 ⋯ a_{n_a} ]^T and T_b to be a function of b = [ b_0 ⋯ b_{n_b} ]^T.
We also introduce the vector 𝜃 = [ a^T  b^T ]^T, and we define the function T ∶ ℝ^n → ℝ^{N×N} by T(𝜃) =
T_a(a)^{−1}T_b(b), where n = n_a + n_b + 1. We realize that we have a nonlinear regression model
y = f(𝜃) + e,
6 The inverse of this covariance matrix is known as the Fisher information matrix. Under mild conditions, it holds
that this covariance, as N goes to infinity, is the smallest possible covariance, after normalization with N, that can be
obtained for any unbiased estimator, and this is called the Cramér–Rao lower bound.
We have that
∂f∕∂a_i = −T_a^{−1} S^i T_a^{−1} T_b u,  1 ≤ i ≤ n_a
∂f∕∂b_i = T_a^{−1} S^i u,  0 ≤ i ≤ n_b,
where S^0 = I. These expressions can be simplified since the shift matrix commutes with the
Toeplitz matrices and their inverses. Actually, it can be shown that
Y = T_y U_y,  Z = T_z U_z,
where T_y = −T_a^{−1}T_a^{−1}T_b ∈ ℝ^{N×N} and T_z = T_a^{−1} ∈ ℝ^{N×N} are lower triangular Toeplitz matrices, and
where
U_y = [ u  Su  ⋯  S^{n_a−1}u ],   U_z = [ u  Su  ⋯  S^{n_b}u ].
We realize that U_y and U_z are the first n_a and n_b + 1 columns of the lower triangular Toeplitz matrix
U = [ u  Su  ⋯  S^{N−1}u ] ∈ ℝ^{N×N}.
Since both T_y and T_z commute with this matrix, we have that
Y = U T̃_y,  Z = U T̃_z,
where T̃_y ∈ ℝ^{N×n_a} and T̃_z ∈ ℝ^{N×(n_b+1)} are the first columns of T_y and T_z, respectively. Since we
assume stability of the OE model, it follows that T_y and T_z are diagonally dominant, and hence,
we may approximate Y and Z as
Y = Ũ T̃_y,  Z = Ũ T̃_z,
where Ũ is a Toeplitz matrix with first column u and first row [ u(0)  u(−1)  ⋯  u(−N + 1) ].
Remember that X = [ Y  Z ], and hence,
X^TX = [ T̃_y  T̃_z ]^T Ũ^TŨ [ T̃_y  T̃_z ],
where
Ũ^TŨ ≈ N [ R_u(0)        ⋯   R_u(N − 1)
            ⋮             ⋱   ⋮
            R_u(N − 1)    ⋯   R_u(0) ].
Thus, we can model this covariance matrix as a symmetric Toeplitz function R̄ ∶ ℝ^N → 𝕊^N
defined as
R̄(r) = N [ r_1   ⋯   r_N
            ⋮     ⋱   ⋮
            r_N   ⋯   r_1 ],
and hence, all entries of X T X will be linear in r. We now realize a difference as compared to the
FIR-model case. There the vector r only had dimension n + 1, but here it has dimension N. We will
limit ourselves to a lower dimension by considering inputs u which are such that ri = 0 for i > m.
Hence, the function R̄ will be a banded matrix, and we will consider it to be a function of only
r ∈ ℝ^m. We now define R ∶ ℝ^m → 𝕊^n as R(r) = [ T̃_y  T̃_z ]^T R̄(r) [ T̃_y  T̃_z ], which is a linear function
of r. We finally define P ∶ ℝ^m → 𝕊^n_+ as
P(r) = 𝜎^2 R(r)^{−1},
similarly as for the FIR-model, and we can use the above optimization problem formulations to
compute optimal experiments.
One might think that it is very costly to set up the optimization problems, since there are several
matrix multiplications and inverses involved. However, it is the case that multiplication with both
a banded Toeplitz matrix and the inverse of a banded Toeplitz matrix is equivalent to a filtering, and
the multiplication with shift matrices is the same as delaying signals. Using sparse linear algebra
is just as efficient as the filtering approach. We will now look at an example.
Figure 12.4 Scatter-plots showing the estimated values of a and b. The left figure shows the results from
the preliminary nonoptimal experiment and the right figure shows the results from the optimal
experiments.
If one wants to use the model for prediction, then it is natural to try to make optimal experiments
for minimizing the MSE of the predictor for the intended input signal. It turns out that also other
application criteria related to the model can be interpreted as wanting to have a small MSE for a
certain input signal. One example is if one wants to have a good model of the transfer function in
a certain frequency band. Then this can be accomplished by using an input ū ∈ ℝN that has most
of its energy in this frequency band. We will model this with the covariance Σ̄ = 𝔼[ Ū Ū^T ], where
we assume that 𝔼Ū = 0. Here ū is the outcome of the random variable Ū. We will as an example
consider the OE-model in the previous section. The predictor is given by
Ŷ(𝜃̂) = T(𝜃̂)Ū,
where 𝜃̂ is the estimate of the model parameters 𝜃. Let 𝜇 = 𝔼𝜃̂ and Σ = 𝔼[ (𝜃̂ − 𝜇)(𝜃̂ − 𝜇)^T ] =
P(r), which are the mean and covariance of the ML estimate of the model parameters 𝜃. We make
a first-order Taylor series expansion of the ith column of T(𝜃̂):
T_i(𝜃̂) ≈ T_i(𝜇) + ( ∂T_i(𝜇)∕∂𝜃^T )(𝜃̂ − 𝜇),  i ∈ ℕ_n.
We introduce the notation
L_i = ∂T_i(𝜇)∕∂𝜃^T,  i ∈ ℕ_n,
for convenience. Since 𝜃̂ and Ū are assumed to be independent, it follows that
𝔼Ŷ(𝜃̂) ≈ T(𝜇)𝔼Ū + ∑_{i=1}^{n} L_i 𝔼[ (𝜃̂ − 𝜇)Ū_i ] = 0.
We also have
tr 𝔼[ Ŷ(𝜃̂)Ŷ(𝜃̂)^T ] ≈ A + 2B + C,
where
A = tr( T(𝜇) Σ̄ T^T(𝜇) )
B = tr 𝔼[ ∑_{i=1}^{n} L_i (𝜃̂ − 𝜇) Ū_i Ū^T T^T(𝜇) ] = 0
C = tr 𝔼[ ∑_{i=1}^{n} ∑_{j=1}^{n} L_i (𝜃̂ − 𝜇) Ū_i Ū_j (𝜃̂ − 𝜇)^T L_j^T ] = tr( Σ ∑_{i=1}^{n} ∑_{j=1}^{n} Σ̄_{ij} L_j^T L_i ).
We now define W = ∑_{i=1}^{n} ∑_{j=1}^{n} Σ̄_{ij} L_j^T L_i and remember that Σ = P(r), and we have shown that we
have obtained a criterion for L-optimality with the weight W.
Exercises
12.1 Show that the optimization problems in (12.11) and (12.12) are the same when Ta = Tc .
12.2 We will investigate the variable projection method in Section 6.7 when applied to (12.12).
We will take x = y_m and 𝛼 = (a, b, c). We will not consider 𝜉_0 as a variable to estimate, i.e. we
assume that it is zero, and we will generate data such that this holds true. We then write
F(x, 𝛼) = 𝛾( T_y y − T_u u ), where 𝛾 = ( det 𝒯_m^T 𝒯_m )^{1/(2n_o)}. With A(𝛼) = 𝛾𝒯_m and b(𝛼) = 𝛾(−𝒯_o y_o +
T_u u), it holds that F(x, 𝛼) = A(𝛼)x − b(𝛼).
(a) Show that P and x(𝛼) as defined in Section 6.7 are given by
P = I − 𝒯_m 𝒯_m^†
x(𝛼) = 𝒯_m^† ( −𝒯_o y_o + T_u u ),
search direction and the residual 𝛾e. For this purpose, you can use lsqnonlin in
MATLAB. The calling sequence is
OPTIONS = optimoptions('lsqnonlin','Algorithm',...
'levenberg-marquardt','SpecifyObjectiveGradient',true);
x = lsqnonlin(@(x) missing_ML_kaufman(x,par),...
zeros(3*n+1,1),[],[],OPTIONS);
You then need to write a function missing_ML_kaufman that as first output argu-
ment computes 𝛾e and as second output argument computes the Kaufman–Jacobian
approximation. These computations should then be repeated with new data for in total
100 times. Make a two-dimensional scatter plot of the estimated values of a and c. You
can then compare this with the results you obtain if you fix 𝛾 to one, i.e. solving (12.11).
This means that the residual is e and that the Kaufman–Jacobian approximation is P 𝜕e∕𝜕𝛼^T.
Y = 𝒪X + 𝒯U,   (12.16)
where
𝒪 = [ C
      CA
      ⋮
      CA^{r−1} ],   𝒯 = [ D           0           0    ⋯   0
                           CB          D           0    ⋯   0
                           CAB         CB          D         0
                           ⋮                       ⋱    ⋱
                           CA^{r−2}B   CA^{r−3}B   ⋯    CB   D ],
and where X = [ x_0 ⋯ x_{M−1} ].
(b) Assume that [ X ; U ] (X stacked on top of U) has full row rank and that 𝒪 has full column rank. Then use the results
and definitions from Exercise 2.14 to conclude that = L22 . Let
L_{22} = [ U_1  U_2 ] [ Σ  0 ; 0  0 ] [ V_1^T ; V_2^T ] = U_1 Σ V_1^T,
be an SVD. Show that there exists invertible state transformation T ∈ ℝn×n , c.f.
Exercise 2.12, such that C̄ = CT = Ū 1 , where Ū 1 contains the first n rows of U1 , and
that Ā = T −1 AT can be computed from U1 by solving a linear system of equations.
(c) Show how B̄ = T −1 B and D can be computed from (12.16) by solving a linear system of
equations assuming that C̄ and Ā are known.
12.4 Consider Example 12.2. You are asked to write a MATLAB code using the System Identi-
fication Toolbox that reproduces the result in the example. Play around with the value of
the bandwidth. What happens if you use a bandwidth of one instead of 0.9 as is used in the
example?
(a) Write a MATLAB function utilizing YALMIP that outputs the positive real lemma constraint with input variables Q and r. You should assume that the calling code
has declared the input variables as Q = sdpvar(n) and r = sdpvar(n+1,1),
respectively.
(b) Write a simple code to test your function by checking if r = [ 3  2  1 ]^T is valid. Also, try
r = [ 3  2  −1 ]^T.
12.6 Write a MATLAB function specfac that, given a vector r ∈ ℝ^{n+1} containing values
of a covariance function, computes the vector c ∈ ℝ^n containing the coefficients of
the C-polynomial and the variance 𝜅 such that the MA-process defined in (12.15) has
covariance as specified by r.
Appendix A
We define the maximum of f analogously. Consider the function ln(x), which has domain ℝ++ .
We have that the infimum over ℝ++ is −∞, which is not attained for an element of ℝ++ , and hence,
the function does not have a minimum over its domain. However, it has a minimum over the set
[1, ∞). The infimum is attained for x = 1, and hence, the minimum value is equal to zero.
For a differentiable function f ∶ 𝒳 → ℝ̄, where 𝒳 is an open subset of ℝ, we denote its derivative at x ∈ 𝒳 with
df(x)∕dx,  f′(x),  or  ḟ(x).
For a differentiable function f ∶ 𝒳 → ℝ̄, where 𝒳 is an open subset of ℝ^n, we denote its partial
derivative with respect to x_i at x = (x_1, … , x_n) ∈ 𝒳 with
∂f(x)∕∂x_i.
For an integrable function f ∶ 𝒳 → ℝ̄, we denote the definite integral over C ⊆ 𝒳 ⊆ ℝ^n with
∫_C f(x) dx.
When C = C_1 × ⋯ × C_n, we may also write
A.2 Software
Many of the problems in this book can be solved numerically with readily available software
packages. The landscape of software for optimization, learning, and control is vast, so rather than
attempting to provide a comprehensive list of software, we will provide only a brief overview of
select software packages that are related to problems in this book. We will focus on two types of
software: solvers and modeling tools. Generally speaking, solvers implement numerical methods
for learning or optimization of some class of problems, and modeling tools allow the user to specify
a wide range of problems in a solver-agnostic way using a high-level syntax.
Figure A.1 Modeling tools allow the users to specify their problems using a high-level syntax.
The problem is then transformed to a format that is accepted as some solver and, if possible, the
information returned from the solver is used to recover a solution to the problem provided by the user.
with variables 𝑤 ∈ ℝn , b ∈ ℝ, and u ∈ ℝm , problem data C ∈ ℝm×n and b ∈ {−1, 1}m , and regular-
ization parameter 𝛾 > 0. This problem is equivalent to an second-order cone program (SOCP), and
it can be solved using one of many different software packages for conic optimization. However, the
task of reformulating the problem so that it can be accepted by a given solver is often tedious and
error-prone. Moreover, it is often necessary to start from scratch if we wish to use another solver
that requires the problem to be specified differently. Modeling tools automate this process, as illus-
trated in Figure A.1. The problem specification in the figure is based on the CVX modeling package
for MATLAB, and it is clearly a self-explanatory and high-level representation of the mathemati-
cal description of the problem. Modeling tools also make it very easy to experiment with different
problem formulations, but the user often has little or no control over transformations, which can
make it difficult to exploit problem structure.
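For concreteness, the following CVX sketch specifies a soft-margin support vector machine of the kind alluded to above; since the problem statement itself is not repeated here, the exact formulation (a hinge loss with slack variables u and a norm penalty on 𝑤, which is SOCP-representable) and the random data are assumptions made only for this illustration.

% Made-up data: C holds the feature vectors as rows and b the labels in {-1, +1};
% the bias variable is called b0 to avoid clashing with the label vector b.
m = 100; n = 5; gamma = 1;
C = randn(m, n);
b = sign(randn(m, 1));
cvx_begin
    variables w(n) b0 u(m)
    minimize( norm(w) + gamma*sum(u) )
    subject to
        b.*(C*w + b0) >= 1 - u;            % margin constraints with slack u
        u >= 0;
cvx_end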
A.2.1.2 YALMIP
The MATLAB toolbox YALMIP [71] is a modeling package that can be used with a wide range of
solvers. YALMIP is an acronym for “Yet Another Linear Matrix Inequality Parser” and was ini-
tially developed for applications in control but has since grown into a full-fledged general-purpose
modeling package with support for many classes of both convex and nonconvex problems.
A.2.1.3 JuMP
The Julia modeling package JuMP [36] is a highly flexible modeling language for mathematical opti-
mization, and it supports a large number of solvers through a generic solver-independent interface.
version called TensorLayer, which provides some popular reinforcement learning modules that can
easily be customized and assembled. DeepMind Lab is a Google 3D platform with customization
for agent-based AI research. There is also the Reinforcement Learning Toolbox of MATLAB.
References
80 Y. E. Nesterov. A method for solving the convex programming problem with convergence rate
o(1∕k2 ). Doklady Akademii Nauk SSSR, 269:543–547, 1983.
81 Y. Nesterov. Lectures on Convex Optimization. Springer Optimization and its Applications.
Springer International Publishing, 2nd edition, 2018.
82 Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global
performance. Mathematical Programming, 108(1):177–205, April 2006. doi: 10.1007/
s10107-006-0706-8. URL https://doi.org/10.1007/s10107-006-0706-8.
83 B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and
homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):
1042–1068, June 2016. URL http://stanford.edu/boyd/papers/scs.html.
84 J. Omura. On the Viterbi decoding algorithm. IEEE Transactions on Information Theory,
15(1):177–179, 1969. doi: 10.1109/TIT.1969.1054239.
85 N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization,
1(3):123–231, 2013.
86 T. A. Parks. Reducible Nonlinear Programming Problems. Ph.D. thesis, Rice University, 1985.
87 A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N.
Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An Imperative Style,
High-Performance Deep Learning Library. Curran Associates Inc., Red Hook, NY, 2019.
88 G. Pillonetto, F. Dinuzzo, T. Chen, G. de Nicola, and L. Ljung. Kernel methods in system
identification, machine learning and function estimation: A survey. Automatica, 50:657–682,
2014.
89 B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, January 1964. doi:
10.1016/0041-5553(64)90137-5.
90 M. J. D. Powell. On search directions for minimization algorithms. Mathematical Program-
ming, 4(1):193–201, December 1973. doi: 10.1007/bf01584660. URL https://doi.org/10.1007/
bf01584660.
91 J. C. Pratt. Sequential minimal optimization: A fast algorithm for training support vector
machines. Technical report, Microsoft Research, April 1998.
92 A. Rantzer. On the Kalman–Yakubovich–Popov lemma. Systems & Control Letters, 28(1):7–10,
1996.
93 A. V. Rao. A survey of numerical methods for optimal control. Advances in the Astronautical
Sciences, 135(1):497–528, 2010.
94 C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press,
2006.
95 S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International
Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-
RZ.
96 H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical
Statistics, 22(3):400–407, 9 1951. doi: 10.1214/aoms/1177729586.
97 W. J. Rugh. Linear System Theory. Prentice Hall, Englewood Cliffs, NJ, 1996.
98 A. N. Shiryayev. Probability. Springer-Verlag, New York, 1984.
99 J. Sjöberg and M. Viberg. Separable non-linear least squares minimization – possible improve-
ments for neural net fitting. In IEEE Workshop on Neural Networks for Signal Processing,
1997.
100 J. F. Sturm. Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric
cones. Optimization Methods and Software, 11(1–4):625–653, January 1999. doi: 10.1080/
10556789908805766. URL https://doi.org/10.1080/10556789908805766.
101 M. Tawarmalani and N. V. Sahinidis. A polyhedral branch-and-cut approach to global opti-
mization. Mathematical Programming, 103:225–249, 2005.
102 TensorFlow. URL https://www.tensorflow.org/. Software available from tensorflow.org.
103 T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average
of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
104 K. C. Toh, M. J. Todd, and R. H. Tütüncü. SDPT3 — a Matlab software package for semidefi-
nite programming, version 1.3. Optimization Methods and Software, 11(1–4):545–581, January
1999. doi: 10.1080/10556789908805762. URL https://doi.org/10.1080/10556789908805762.
105 M. Udell, K. Mohan, D. Zeng, J. Hong, S. Diamond, and S. Boyd. Convex optimization in
Julia. In SC14 Workshop on High Performance Technical Computing in Dynamic Languages,
2014.
106 L. Vandenberghe and M. S. Andersen. Chordal graphs and semidefinite optimization.
Foundations and Trends® in Optimization, 1(4):241–433, 2015. ISSN 2167-3888. doi: 10.1561/
2400000006. URL http://dx.doi.org/10.1561/2400000006.
107 M. Verhaegen and V. Verdult. Filtering and System Identification. Cambridge University Press,
Cambridge, UK, 2007.
108 S. Vigerske and A. Gleixner. SCIP: Global optimization of mixed-integer nonlinear programs
in a branch-and-cut framework. Optimization Methods and Software, 33(3):563–593, 2018.
doi: 10.1080/10556788.2017.1335312.
109 A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decod-
ing algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967. doi:
10.1109/TIT.1967.1054010.
110 A. J. Viterbi and J. K. Omura. Principles of Digital Communication and Coding. McGraw-Hill
Book Company, New York, 1979.
111 A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search
algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57,
April 2005. doi: 10.1007/s10107-004-0559-y. URL https://doi.org/10.1007/s10107-004-0559-y.
112 M. J Wainwright and M. I. Jordan. Graphical models, exponential families, and variational
inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
113 R. Wallin and A. Hansson. Maximum likelihood estimation of linear SISO models subject
to missing output data and missing input data. International Journal of Control, 87(11):
2354–2364, 2014.
114 C.J.C.H. Watkins. Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University,
Cambridge, 1989.
115 S. J. Wright. Primal-Dual Interior-Point Methods. SIAM, 1997. ISBN 978-0-89871-382-4.
116 S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, March
2015. doi: 10.1007/s10107-015-0892-3. URL https://doi.org/10.1007/s10107-015-0892-3.
117 C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics,
11(1):95–103, 1983.
118 S. P. Wu, S. Boyd, and L. Vandenberghe. FIR filter design via spectral factorization and convex
optimization. In B. Datta, editor, Applied Computational Control, Signal and Communications.
Birkhauser, Boston, MA, 1997.
119 M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic
Discrete Methods, 2(1):77–79, March 1981. doi: 10.1137/0602010. URL https://doi.org/10.1137/
0602010.
120 C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning
requires rethinking generalization, 2017. https://doi.org/10.48550/arXiv.1611.03530
121 H. Zhang and W. W. Hager. A nonmonotone line search technique and its application to
unconstrained optimization. SIAM Journal on Optimization, 14(4):1043–1056, January 2004.
doi: 10.1137/s1052623403428208. URL https://doi.org/10.1137/s1052623403428208.